Breakpoint Graphs and Ancestral Genome Reconstructions


Max A. Alekseyev and Pavel A. Pevzner
Department of Computer Science and Engineering, University of California at San Diego, U.S.A.

Classification: Genome Rearrangements, Ancestral Genome Reconstruction, Molecular Evolution

Corresponding author: Pavel Pevzner
Mail address: 9500 Gilman Dr., La Jolla, CA , U.S.A.
Phone:
Fax:

Abstract

Recently completed whole genome sequencing projects marked the transition from gene-based phylogenetic studies to phylogenomic analysis of entire genomes. We developed an algorithm, MGRA, for reconstructing ancestral genomes and used it to study the rearrangement history of seven mammalian genomes: human, chimpanzee, macaque, mouse, rat, dog, and opossum. MGRA relies on the notion of multiple breakpoint graphs to overcome some limitations of the existing approaches to ancestral genome reconstruction. MGRA also generates rearrangement-based characters that guide the phylogenetic tree reconstruction when the phylogeny is unknown.

INTRODUCTION

The first attempts to reconstruct the genomic architecture of ancestral mammals predated the era of genomic sequencing and were based on cytogenetic approaches (Wienberg and Stanyon, 1997). Rearrangement-based phylogenomic studies were pioneered by Sankoff and co-authors (Sankoff et al., 1992; Sankoff and Blanchette, 1998; Blanchette et al., 1997) and were based on analyzing breakpoint distances. Moret et al. (2001) further optimized this approach and developed the popular GRAPPA software for rearrangement analysis. MGR, another genome rearrangement tool (Bourque and Pevzner, 2002), uses genomic distances instead of breakpoint distances for ancestral reconstructions. Since genomic distances lead to more accurate ancestral reconstructions (Moret et al., 2002; Tang and Moret, 2003), GRAPPA has been modified to use genomic distances as well. While MGR has been used in a number of phylogenomic studies (Bourque et al., 2005; Murphy et al., 2005; Pontius et al., 2007; Bulazel et al., 2007; Xia et al., 2007; Deuve et al., 2008; Cardone et al., 2008), both MGR and GRAPPA have limited ability to distinguish reliable from unreliable rearrangements and to address the weak associations problem in ancestral reconstructions (Bourque et al., 2004, 2005; Froenicke et al., 2006; Bourque et al., 2006).

Recently, Ma et al. (2006) made an important step towards reliable reconstruction of ancestral genomes. In contrast to MGR and GRAPPA (which analyze both reliable and unreliable rearrangements), they chose to focus on reliable breakpoint reconstruction in the ancestral genomes and to avoid assignments in the case of weak associations (complex breakpoints). This proved to be a valuable approach since, as it turned out, most breakpoints in the ancestral mammalian genomes can be reliably reconstructed. However, this approach has some limitations (discussed in Rocchi et al. (2006)) that it must overcome to scale to large sets of genomes.
First, while the infercars algorithm of Ma et al. (2006) assumes that the phylogeny is known, the phylogeny remains a subject of enduring debate even in the case of the primate-rodent-carnivore split (which is assumed to be resolved in Ma et al. (2006)). As the number of species increases, the reliability of the phylogeny will become an even bigger concern, which raises the question of devising an approach that does not assume a fixed phylogeny but instead uses rearrangements as new characters for constructing phylogenetic trees (see Chaisson et al. (2006)). While MGR does not assume a fixed phylogeny, its heuristically derived weak associations are less reliable. The challenge then is to integrate the reliability of infercars with the flexibility of MGR.

Another avenue for improving the infercars algorithm is to find out how to deal with complex breakpoints that create gaps in reconstructions. Note that the Ma et al. (2006) approach focuses on reliable ancestor reconstruction rather than on the specific rearrangements that happened in the course of evolution. These are related but different problems, and both can benefit from being incorporated into a single computational framework. Indeed, Ma et al. (2006) consider individual breakpoints and do not distinguish between the particular types of rearrangements that generated a breakpoint of interest. In reality, reversals and translocations operate on pairs of dependent breakpoints rather than on individual breakpoints. Some rearrangements (and synteny associations) cannot be inferred from the analysis of single breakpoints but become tractable via analyzing the breakpoint graph (see footnote 1). As a result, while MGR constructs provably optimal scenarios in the absence of breakpoint re-use, it is not clear whether the same result holds for infercars.
Recently, Zhao and Bourque (2007) developed the EMRAE algorithm, which reconstructs both reliable rearrangements and ancestors, thus addressing the shortcomings of both MGR (difficulty in distinguishing between reliable and putative rearrangement events) and infercars (ancestor reconstruction only). However, EMRAE (in contrast to MGR) does not attempt to reconstruct the phylogenetic tree and is limited to unichromosomal genomes.

Footnote 1: The breakpoint graphs represent a popular technique for the rearrangement analysis since they reveal pairs of breakpoints representing footprints of the rearrangement events. See Chapter 10 of Pevzner (2000) for background information on genome rearrangements and breakpoint graphs.

Figure 1: a) Unichromosomal genome $P = (+a\ +b\ -c)$ represented as a black-obverse cycle. b) Unichromosomal genome $Q = (+a\ -b\ +c)$ represented as a green-obverse cycle. c) The breakpoint graph $G(P, Q)$ with and without obverse edges.

Below we address some limitations of MGR, EMRAE and infercars by developing the Multiple Genome Rearrangements and Ancestors (MGRA) algorithm (available from ). In particular, MGRA constructs provably optimal scenarios even when there is some breakpoint re-use and when other tools do not guarantee optimality. MGRA is suitable for ancestral reconstructions of multichromosomal genomes (in contrast to EMRAE). MGRA is conceptually simpler and orders of magnitude faster than MGR. MGRA is not limited to reconstructing ancestral genomes in the case of known phylogeny (like infercars and EMRAE). Instead, it can guide the rearrangement-based reconstruction of phylogenetic trees. MGRA does not require prior information about the approximate lengths of the branches of the phylogenetic trees (in contrast to infercars).

To evaluate the performance of MGRA, we compared ancestral reconstructions generated by MGRA and infercars. Despite the fact that MGRA and infercars are very different algorithms, their reconstructions turned out to be remarkably similar (98.5% of synteny associations are identical). We further analyzed some differences between MGRA, infercars, and the cytogenetics approach.

METHODS

1 From Pairwise to Multiple Breakpoint Graphs

We start with the analysis of rearrangements in circular genomes (i.e., genomes consisting of circular chromosomes) and later extend it to genomes with linear chromosomes.
We assume that each genome is formed by the same set of synteny blocks, which are arranged differently in different genomes. We will find it convenient to represent a chromosome formed by synteny blocks $b_1, \dots, b_n$ as a cycle with $n$ directed labeled edges (corresponding to blocks) alternating with $n$ undirected unlabeled edges (connecting adjacent blocks). The directions of the edges correspond to the signs (strands) of the blocks. We label the tail and head of a directed edge $b_i$ as $b_i^t$ and $b_i^h$ respectively (Fig. 1) and represent a genome as a set of disjoint cycles (one for each chromosome). The edges in each cycle alternate between two colors: one color (e.g., black) is used for undirected edges, while the other color (traditionally called obverse) is used for directed edges.

Figure 2: a) A 2-break on edges $(x_1, x_2)$ and $(y_1, y_2)$ from the same chromosome corresponds to either a reversal or a fission. b) A 2-break on edges $(x_1, x_2)$ and $(y_1, y_2)$ from different chromosomes corresponds to a translocation/fusion. c) A 2-break on edges $(y_1, y_2)$ and $(x_1, \infty)$ of a linear chromosome corresponds to a reversal affecting a chromosome end $x_1$ and creating a new chromosome end $y_1$. d) A 2-break on edges $(x_1, \infty)$ and $(y_1, \infty)$ from different chromosomes models a fusion. Fissions can be modeled as 2-breaks operating on an irregular loop edge $(\infty, \infty)$ and an arbitrary regular edge in the genome.

Let $P$ be a genome represented as a collection of alternating black-obverse cycles (a cycle is alternating if the colors of its edges alternate). For any two black edges $(x_1, x_2)$ and $(y_1, y_2)$ in the genome (graph) $P$ we define a 2-break rearrangement (first introduced as DCJ rearrangement in Yancopoulos et al. (2005) and recently studied in Bergeron et al. (2006); Lin and Moret (2008)) as the replacement of these edges with either a pair of edges $(x_1, y_1)$, $(x_2, y_2)$, or a pair of edges $(x_1, y_2)$, $(x_2, y_1)$ (Fig. 2a,b). In the case of circular genomes, 2-breaks correspond to the standard rearrangement operations of reversals, fissions, or fusions/translocations (Fig. 2; see footnote 2).

Let $P$ and $Q$ be genomes on the same set of blocks $B$. The (pairwise) breakpoint graph $G(P, Q)$ is simply the superposition of genomes (graphs) $P$ and $Q$ (Fig. 1c).
Formally, the breakpoint graph $G(P, Q)$ is defined on the set of vertices $V = \{b^t, b^h \mid b \in B\}$ with edges of three colors: obverse (connecting vertices $b^t$ and $b^h$), black (connecting adjacent blocks in $P$), and green (connecting adjacent blocks in $Q$). The black and green edges form the black-green alternating cycles that play an important role in analyzing rearrangements (Bafna and Pevzner, 1996). From now on we will ignore the obverse edges in the breakpoint graph so that it becomes simply a collection of (black-green) cycles (Fig. 1).

The 2-break distance $d_2(P, Q)$ between genomes $P$ and $Q$ is defined as the minimum number of 2-breaks required to transform one genome into the other. In contrast to the Genomic Distance Problem (Hannenhalli and Pevzner, 1995; Tesler, 2002a; Ozery-Flato and Shamir, 2003) (for linear multichromosomal genomes), the 2-Break Distance Problem for circular multichromosomal genomes has a trivial solution (Yancopoulos et al., 2005; Alekseyev and Pevzner, 2007):

$$d_2(P, Q) = b(P, Q) - c(P, Q),$$

where $b(P, Q) = |B|$ is the number of synteny blocks in $P$ and $Q$, and $c(P, Q)$ is the number of black-green cycles in $G(P, Q)$.

Footnote 2: In this paper we use the term reversal (common in bioinformatics literature) instead of the term inversion (common in biology literature). For circular chromosomes, fusions and translocations are not distinguishable, i.e., every fusion of circular chromosomes can be viewed as a translocation, and vice versa.
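The cycle-counting formula for the 2-break distance can be illustrated with a minimal Python sketch for circular genomes. The data layout (a genome as a list of chromosomes, each a list of `(block, sign)` pairs) and all function names are assumptions made for this illustration and are not taken from MGRA. The example genomes are $P = (+a\ +b\ -c)$ and $Q = (+a\ -b\ +c)$ from Fig. 1, with the minus signs assumed as written here.

```python
from typing import Dict, List, Tuple

Chromosome = List[Tuple[str, int]]   # signed blocks, e.g. ("a", +1)
Adjacency = Dict[str, str]           # block end -> neighbouring block end


def adjacencies(genome: List[Chromosome]) -> Adjacency:
    """Undirected (non-obverse) edges of a circular genome as a symmetric map."""
    adj: Adjacency = {}
    for chrom in genome:
        for i, (name, sign) in enumerate(chrom):
            nxt_name, nxt_sign = chrom[(i + 1) % len(chrom)]
            # A block is left through its head if "+", through its tail if "-";
            # the next block is entered through its tail if "+", head if "-".
            left = name + ("h" if sign > 0 else "t")
            right = nxt_name + ("t" if nxt_sign > 0 else "h")
            adj[left], adj[right] = right, left
    return adj


def count_cycles(p_adj: Adjacency, q_adj: Adjacency) -> int:
    """Number of alternating black-green cycles in G(P, Q)."""
    seen = set()
    cycles = 0
    for start in p_adj:
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:
            seen.add(v)
            w = p_adj[v]      # follow a black edge
            seen.add(w)
            v = q_adj[w]      # follow a green edge
    return cycles


def two_break_distance(p_adj: Adjacency, q_adj: Adjacency) -> int:
    """d_2(P, Q) = |B| - c(P, Q); each block contributes two vertices."""
    return len(p_adj) // 2 - count_cycles(p_adj, q_adj)


def two_break(adj: Adjacency, x1: str, x2: str, y1: str, y2: str) -> Adjacency:
    """Replace edges (x1, x2), (y1, y2) with (x1, y1), (x2, y2)."""
    assert adj[x1] == x2 and adj[y1] == y2, "both edges must be present"
    new = dict(adj)
    new[x1], new[y1] = y1, x1
    new[x2], new[y2] = y2, x2
    return new


P = adjacencies([[("a", 1), ("b", 1), ("c", -1)]])   # P = (+a +b -c)
Q = adjacencies([[("a", 1), ("b", -1), ("c", 1)]])   # Q = (+a -b +c)
d_before = two_break_distance(P, Q)
P_next = two_break(P, "ah", "bt", "bh", "ch")         # reversal of block b
d_after = two_break_distance(P_next, Q)
print(d_before, d_after)                              # prints: 2 1
```

Here $G(P, Q)$ is a single cycle on six vertices, so the distance is $3 - 1 = 2$; the 2-break shown splits that cycle in two and lowers the distance by one.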

Figure 3: a) A phylogenetic tree $T$ with four linear genomes $P_1, P_2, P_3, P_4$ (represented as green, blue, red, and yellow graphs respectively) at the leaves. The obverse edges are not shown. b) The multiple breakpoint graph $G(P_1, P_2, P_3, P_4)$ is a superposition of graphs representing genomes $P_1, P_2, P_3, P_4$. The multidegrees of regular vertices vary from 1 (e.g., vertex $b^h$) to 3 (e.g., vertex $e^h$). c) The same phylogenetic tree $T$ with all intermediate genomes specified and a genome $X$ selected as a root. A $T$-consistent transformation of $X$ into $P_1, P_2, P_3, P_4$ can be viewed as a transformation of the quadruple $(X, X, X, X)$ into the quadruple $(P_1, P_2, P_3, P_4)$, where a rearrangement at each step is applied to some copies of the same genome in the quadruple. A particular such transformation takes the following steps:

$$(X, X, X, X) \xrightarrow{r_1} (X, X, Q_1, Q_1) \xrightarrow{r_2} (Q_3, Q_3, Q_1, Q_1) \xrightarrow{r_3} (Q_3, Q_3, Q_2, Q_2) \xrightarrow{r_4} (Q_3, Q_3, P_3, Q_2) \xrightarrow{r_5} (Q_3, Q_3, P_3, P_4) \xrightarrow{r_6} (P_1, Q_3, P_3, P_4) \xrightarrow{r_7} (P_1, P_2, P_3, P_4),$$

where $r_1$ is a reversal in two copies of $X$; $r_2$ is a fission in two copies of $X$; $r_3$ is a reversal in both copies of $Q_1$; $r_4$ is a fission in one copy of $Q_2$; $r_5$ is a reversal in the other copy of $Q_2$; $r_6$ is a reversal in one copy of $Q_3$; $r_7$ is a translocation in the other copy of $Q_3$.

Genomes shown in Figure 3: $P_1 = (+a\ -c\ -b)(+d\ +e\ +f)$, $P_2 = (+d\ +e\ +b\ +c)(+a\ +f)$, $P_3 = (+a\ -d)(-c\ -b\ +e\ -f)$, $P_4 = (+d\ -a\ -c\ -b\ +e\ -f)$, $Q_1 = (+a\ -d\ -c\ -b\ +e\ +f)$, $Q_2 = (+a\ -d\ -c\ -b\ +e\ -f)$, $Q_3 = (+a\ +b\ +c)(+d\ +e\ +f)$, $X = (+a\ +b\ +c\ +d\ +e\ +f)$.

A linear genome is a collection of linear chromosomes represented as sequences of signed synteny blocks.
Each linear chromosome on $n$ blocks is represented as a path of $n$ directed obverse edges (encoding blocks and their direction) alternating with $n - 1$ undirected black edges (connecting adjacent blocks). In addition, we introduce an extra vertex $\infty$ and connect it by an undirected (irregular) black edge with every vertex representing a chromosomal end (hence, the degree of the vertex $\infty$ is twice the number of linear chromosomes). A linear chromosome is an alternating path of black and obverse edges, starting and ending at the vertex $\infty$, and a linear genome is a collection of such paths. The 2-breaks involving irregular edges model the rearrangements affecting the chromosome ends (Fig. 2c,d).

Analyzing reversals, translocations, fusions, and fissions in linear genomes poses additional algorithmic challenges as compared to analyzing 2-breaks in circular genomes. However, rearrangement scenarios in linear genomes are well approximated by 2-break scenarios in circular genomes (Alekseyev, 2008). Hence, we use 2-breaks as a single substitute for reversals, translocations, fusions, and fissions, admitting that 2-breaks may violate linearity of the genomes by creating circular chromosomes.

While previous rearrangement studies (e.g., MGR) were limited to analyzing the pairwise breakpoint graphs, MGRA uses multiple breakpoint graphs (Caprara, 1999b), which simplify the rearrangement analysis. Let $P_1, \dots, P_k$ be genomes on the same set of synteny blocks $B$. Similarly to the pairwise breakpoint graph, the (multiple) breakpoint graph $G(P_1, \dots, P_k)$ is simply the superposition of genomes (graphs) $P_1, \dots, P_k$ on the same vertex set $V = \{b^t, b^h \mid b \in B\} \cup \{\infty\}$ (Fig. S20 and Fig. 3a,b). Fig. 4 shows the breakpoint graph on 1357 synteny blocks (see footnote 3) of six mammalian genomes: M (mouse), R (rat), D (dog), Q (macaque), H (human), and C (chimpanzee).

A vertex in the breakpoint graph is regular if it is different from $\infty$. Similarly, an edge is regular if both its endpoints are regular, and irregular otherwise. The edges of $G(P_1, \dots, P_k)$ are represented by undirected edges from the genomes $P_1, \dots, P_k$ of $k$ different colors (hence, the degree of each regular vertex is $k$). To simplify the notation, we will use $P_1, \dots, P_k$ also to refer to the colors of edges in the multiple breakpoint graph, and denote the set of all colors by $C = \{P_1, \dots, P_k\}$. Furthermore, any non-empty subset of $C$ is called a multicolor. All edges connecting vertices $x$ and $y$ in the (multiple) breakpoint graph form the multi-edge $(x, y)$ of the multicolor represented by the colors of these edges (e.g., the multi-edge $(e^h, f^h)$ in Fig. 3b has multicolor $\{P_3, P_4\}$, shown as red and yellow edges). The number of multi-edges incident to a vertex (also equal to the number of adjacent vertices) is called the multidegree (note that the multidegree of a vertex may be smaller than its degree, e.g., the vertex $e^h$ in Fig. 3b has degree 4 and multidegree 3). Multi-edges correspond to adjacent synteny blocks that are conserved across multiple species and thus represent valuable phylogenetic characters (Sankoff and Blanchette, 1998).

A breakpoint in the multiple breakpoint graph $G(P_1, P_2, \dots, P_k)$ is a vertex of multidegree greater than 1. A multiple breakpoint graph without breakpoints is an identity breakpoint graph $G(X, \dots, X)$ of some genome $X$.
Alternatively, the identity breakpoint graph can be characterized as a breakpoint graph consisting of complete multi-edges (i.e., multi-edges of the multicolor $C$) that correspond to the synteny block adjacencies in $X$.

2 Multiple Genome Rearrangement Problem

The key observation in studies of pairwise genome rearrangements is that every 2-break transformation of a black genome $P$ into a green genome $Q$ corresponds to a transformation of the breakpoint graph $G(P, Q)$ into the identity breakpoint graph $G(Q, Q)$ (Fig. S21) with 2-breaks on pairs of black edges (black 2-breaks). MGR (Bourque and Pevzner, 2002) implicitly applies a similar observation and attempts to come up with rearrangements that bring the multiple breakpoint graph $G(P_1, P_2, \dots, P_k)$ closer to the identity multiple breakpoint graph $G(P_i, P_i, \dots, P_i)$ for $i$ varying from 1 to $k$. However, this approach does not allow one to utilize the internal edges of the phylogenetic tree for finding reliable rearrangements.

Below we formalize the Multiple Genome Rearrangement Problem in terms of multiple breakpoint graphs. The key element of MGRA is finding a shortest transformation of the multiple breakpoint graph $G(P_1, P_2, \dots, P_k)$ into an arbitrary identity multiple breakpoint graph $G(X, X, \dots, X)$ for some a priori unknown genome $X$. We first illustrate this concept with pairwise breakpoint graphs. Let $G(P_1, P_2) \to G(X, X)$ be an $m$-step transformation of $G(P_1, P_2)$ into $G(X, X)$ by either black or green 2-breaks (in contrast to the standard breakpoint graph analysis based on black 2-breaks only; see footnote 4). It is easy to see that every such transformation corresponds to a transformation $P_1 \to X \to P_2$ that uses $m$ black 2-breaks. Therefore, instead of searching for a shortest transformation $G(P_1, P_2) \to G(P_2, P_2)$, one can search for a shortest transformation of $G(P_1, P_2)$ into any identity breakpoint graph $G(X, X)$ without knowing $X$ in advance.
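The target of this search, an identity breakpoint graph, is one in which no regular vertex is a breakpoint. The following Python sketch makes the definitions of multi-edge, multicolor, multidegree, and identity graph concrete for linear genomes. It is only an illustration: the string encoding of signed blocks, the name `INF` for the extra vertex $\infty$, and all function names are assumptions. The four genomes are those of Fig. 3, with minus signs written on the assumption that they match the rearrangements $r_1, \dots, r_7$ described in the Figure 3 caption.

```python
from collections import defaultdict

INF = "inf"  # hypothetical name for the extra vertex (infinity)


def linear_edges(genome):
    """Yield the undirected edges of a linear genome, irregular ones included."""
    for chrom in genome:
        ends = []
        for block in chrom:                       # e.g. "+a" or "-c"
            sign, name = block[0], block[1:]
            first, last = ("t", "h") if sign == "+" else ("h", "t")
            ends.append((name + first, name + last))
        yield (INF, ends[0][0])                   # chromosome start
        for (_, out), (inn, _) in zip(ends, ends[1:]):
            yield (out, inn)                      # adjacency between blocks
        yield (ends[-1][1], INF)                  # chromosome end


def multi_edges(genomes):
    """Map each unordered vertex pair to its multicolor (set of genome names)."""
    result = defaultdict(set)
    for color, genome in genomes.items():
        for u, v in linear_edges(genome):
            result[frozenset((u, v))].add(color)
    return result


def multidegree(medges, vertex):
    """Number of multi-edges incident to the vertex."""
    return sum(1 for edge in medges if vertex in edge)


def is_identity(medges):
    """True if no regular vertex is a breakpoint (multidegree > 1)."""
    regular = {v for edge in medges for v in edge if v != INF}
    return all(multidegree(medges, v) == 1 for v in regular)


genomes = {
    "P1": [["+a", "-c", "-b"], ["+d", "+e", "+f"]],
    "P2": [["+d", "+e", "+b", "+c"], ["+a", "+f"]],
    "P3": [["+a", "-d"], ["-c", "-b", "+e", "-f"]],
    "P4": [["+d", "-a", "-c", "-b", "+e", "-f"]],
}
G = multi_edges(genomes)
X = [["+a", "+b", "+c", "+d", "+e", "+f"]]
G_identity = multi_edges({name: X for name in genomes})

print(multidegree(G, "eh"), multidegree(G, "bh"))   # prints: 3 1
print(sorted(G[frozenset(("eh", "fh"))]))           # prints: ['P3', 'P4']
print(is_identity(G), is_identity(G_identity))      # prints: False True
```

The printed values agree with the Figure 3 caption: $e^h$ has multidegree 3, $b^h$ has multidegree 1, and the multi-edge $(e^h, f^h)$ has multicolor $\{P_3, P_4\}$.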
Footnote 3: The detailed information about synteny blocks and assembly builds is provided in the Supplementary File. Out of 1360 synteny blocks (kindly provided by Jian Ma), three synteny blocks represent intermixed segments of the chromosome X and other chromosomes (the mouse chromosome 7 and the rat chromosomes 15 and 20). Since these blocks are short (16, 47, and 17 KB respectively), we have discarded them to simplify the chromosome X analysis below. For better illustration of the breakpoint graphs, the vertex $\infty$ is shown in multiple copies as black dots, each connected by a single multi-edge to regular vertices.

Footnote 4: Switching from black rearrangements to a mixture of black and green rearrangements is a simple but powerful paradigm that proved to be useful in previous studies (Bafna and Pevzner, 1998; Tannier and Sagot, 2004).

Figure 4: The breakpoint graph $G(M, R, D, Q, H, C)$ (obverse edges are not shown) of six mammalian genomes: Mouse (red edges), Rat (blue edges), Dog (green edges), Macaque (violet edges), Human (orange edges), and Chimpanzee (yellow edges). The graph has $2 \cdot 1357 = 2714$ vertices labeled as nt or nh (where n is a synteny block number) and colored in 23 colors representing chromosomes in the human genome.

In the case of $k > 2$ genomes $P_1, P_2, \dots, P_k$, 2-breaks can be applied to multi-edges in the multiple breakpoint graph $G(P_1, P_2, \dots, P_k)$ of as many as $2^k - 2$ different multicolors formed by proper subsets of $C$. However, not every series of such 2-breaks makes sense in terms of ancestral genome reconstructions. A basic property of ancestral genome reconstructions is that 2-breaks on multi-edges of multicolor $Q \subset C$ can be applied only when all genomes corresponding to colors in $Q$ are merged into a single genome. We give an alternative definition of this property as follows: a transformation (series of 2-breaks) $S$ of the multiple breakpoint graph $G(P_1, P_2, \dots, P_k)$ is strict if for any 2-breaks $\rho_1, \rho_2 \in S$ operating on multi-edges of multicolors $Q_1 \subset Q_2$, $\rho_1$ precedes $\rho_2$ in $S$. The Multiple Genome Rearrangement Problem is reformulated as follows:

Multiple Genome Rearrangement Problem (MGRP). Given genomes $P_1, \dots, P_k$, find a shortest strict series of 2-breaks that transforms the breakpoint graph $G(P_1, \dots, P_k)$ into an identity breakpoint graph.

Figure 5: The phylogenetic tree $T$ of six mammalian genomes: Mouse (red), Rat (blue), Dog (green), Macaque (violet), Human (orange), and Chimpanzee (yellow) with a root $X$ on the MRD + QHC branch. The branches are directed towards $X$ and labeled with the corresponding pairs of complementary $T$-consistent multicolors. The $T$-consistent multicolor from each pair also labels the starting node of the corresponding directed branch. Note that the tree orientation may not necessarily correlate with the time scale and the root genome $X$ may not necessarily be a common ancestor of the leaf genomes. Branch labels shown in the figure: MRD+QHC, MR+DQHC, HC+MRDQ, M+RDQHC, R+MDQHC, D+MRQHC, Q+MRDHC, H+MRDQC, C+MRDQH.

Let $T$ be an (unrooted) phylogenetic tree of the genomes $P_1, \dots, P_k$ (Fig. 3a). The tree $T$ consists of $k$ leaf nodes (or simply leaves), $k - 2$ internal nodes, and $2k - 3$ branches connecting pairs of nodes, so that the degree of each leaf is 1 while the degree of each internal node is 3. Removing a branch from $T$ breaks it into two subtrees, each of which is induced by the set of its own leaves. A multicolor consisting of all colors (leaves) of either of these induced subtrees is called $T$-consistent. Let $G$ be the set of all $T$-consistent multicolors.
Note that if a multicolor $Q$ is T-consistent then its complement $\overline{Q} = C \setminus Q$ is also T-consistent. Therefore, there is a one-to-one correspondence between the pairs of complementary T-consistent multicolors and the branches of T (Fig. 5). When a phylogenetic tree is given, MGRA addresses a restricted version of MGRP where 2-breaks are applied only to multicolors consistent with the phylogenetic tree.

Tree-Consistent Multiple Genome Rearrangement Problem (TCMGRP). Given genomes $P_1, \ldots, P_k$ at the leaves of a phylogenetic tree T, find a shortest strict series of T-consistent 2-breaks transforming the breakpoint graph $G(P_1, \ldots, P_k)$ into an identity breakpoint graph.

Note that MGRP and TCMGRP in the case of three unichromosomal genomes correspond to the median problem, which is NP-complete (Caprara, 1999a; Tannier et al., 2008). While the existence of exact polynomial algorithms for solving MGRP and TCMGRP is unlikely, we describe a heuristic approach to eliminating breakpoints in $G(P_1, \ldots, P_k)$ that uses reliable rearrangements. In particular, MGRA optimally solves these problems in the case of semi-independent rearrangement scenarios with some breakpoint re-uses (see below).
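The definition of T-consistent multicolors can be illustrated with a minimal sketch that removes each branch of the tree in turn and collects the leaves on either side. The topology is read off the node labels of Fig. 5; the function name, the edge-list encoding, and the use of internal node labels as identifiers are assumptions made only for this illustration and are not part of MGRA.

```python
def t_consistent_multicolors(edges, leaves):
    """Return every T-consistent multicolor of an unrooted tree.

    edges  -- list of (node, node) branches
    leaves -- set of leaf names (one per genome color)
    """
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    multicolors = set()
    for u, v in edges:
        for start, blocked in ((u, v), (v, u)):
            # Leaves still reachable from `start` once branch (u, v) is removed.
            side, stack, seen = set(), [start], {start, blocked}
            while stack:
                node = stack.pop()
                if node in leaves:
                    side.add(node)
                for nxt in neighbours[node] - seen:
                    seen.add(nxt)
                    stack.append(nxt)
            multicolors.add(frozenset(side))
    return multicolors


# Six-genome tree of Fig. 5; internal node names are hypothetical identifiers.
FIG5_BRANCHES = [("M", "MR"), ("R", "MR"), ("MR", "MRD"), ("D", "MRD"),
                 ("MRD", "QHC"), ("Q", "QHC"), ("QHC", "HC"),
                 ("H", "HC"), ("C", "HC")]
LEAVES = set("MRDQHC")
consistent = t_consistent_multicolors(FIG5_BRANCHES, LEAVES)
```

With $k = 6$ leaves the tree has $2k - 3 = 9$ branches, so the sketch yields 18 multicolors, one complementary pair per branch.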

We will find it convenient to fix a branch X of the tree T and assume that this branch contains a root X (viewed as yet another node), the precise location of which is to be determined later. The choice of X defines directions towards X on all branches of the tree T, turning it into a directed tree $\vec{T}$ (Fig. 5). We label every leaf node $P_i$ of the directed tree $\vec{T}$ with the corresponding singleton multicolor $\{P_i\}$, and then recursively label each internal node with the union of the multicolors of the starting nodes of all incoming branches (e.g., in Fig. 5 the common endpoint of the branches coming from the leaf nodes M and R is labeled MR). The multicolors forming node labels of the tree $\vec{T}$ are called $\vec{T}$-consistent. Alternatively, $\vec{T}$-consistent multicolors can be defined as T-consistent multicolors whose induced subtrees do not contain X. Note that exactly one of the multicolors in each pair of complementary T-consistent multicolors is $\vec{T}$-consistent, and it labels the starting node of the corresponding directed branch in $\vec{T}$ (except for the two multicolors corresponding to the branch X, which are both $\vec{T}$-consistent).

MGRA transforms the genomes $P_1, \ldots, P_k$ into X along the directed branches of $\vec{T}$, using 2-breaks on $\vec{T}$-consistent multicolors ($\vec{T}$-consistent 2-breaks). In terms of breakpoint graphs, MGRA eliminates breakpoints in $G(P_1, P_2, \ldots, P_k)$ with $\vec{T}$-consistent 2-breaks and transforms it into the identity breakpoint graph $G(X, \ldots, X)$ (see footnote 5). This transformation defines a reverse transformation of the genome X into the genomes $P_1, \ldots, P_k$ by $\vec{T}$-consistent 2-breaks (such as in Fig. 3c).

MGRA keeps track of the rearrangements applied to the breakpoint graph $G(P_1, \ldots, P_k)$ during its transformation into an identity breakpoint graph $G(X, \ldots, X)$. The recorded rearrangements (in the reverse order) define a reverse transformation that passes through every internal node of the tree T and, thus, can be used to reconstruct the ancestral genomes at the internal nodes of T.
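The bookkeeping described above (apply 2-breaks, record them, replay them in reverse) can be sketched on a toy graph as follows. The per-color edge sets, the function names, the three-genome example, and the `is_strict` helper (a direct check of the strictness definition) are all hypothetical and chosen only for illustration; the sketch does not reproduce MGRA's actual data structures.

```python
def two_break(graph, multicolor, old_edges, new_edges, log):
    """Apply one 2-break in every color of `multicolor` and record it in `log`.

    graph -- dict: color -> set of frozenset({u, v}) edges of that genome
    """
    old = [frozenset(e) for e in old_edges]
    new = [frozenset(e) for e in new_edges]
    # A 2-break rewires the same four endpoints.
    assert sorted(v for e in old for v in e) == sorted(v for e in new for v in e)
    for color in multicolor:
        for e in old:
            graph[color].remove(e)      # KeyError if the edge is missing
        graph[color].update(new)
    log.append((frozenset(multicolor), old_edges, new_edges))


def reverse_transformation(log):
    """Recorded 2-breaks in reverse order with old and new edges swapped."""
    return [(mc, new, old) for mc, old, new in reversed(log)]


def is_strict(log):
    """True if no 2-break on Q1 follows a 2-break on Q2 with Q1 a proper subset of Q2."""
    return not any(later < earlier
                   for i, (later, _, _) in enumerate(log)
                   for earlier, _, _ in log[:i])


# Toy example: genomes M and R share adjacencies, D differs by one 2-break.
graph = {
    "M": {frozenset(("1h", "2t")), frozenset(("2h", "3t"))},
    "R": {frozenset(("1h", "2t")), frozenset(("2h", "3t"))},
    "D": {frozenset(("1h", "2h")), frozenset(("2t", "3t"))},
}
log = []
two_break(graph, "MR", (("1h", "2t"), ("2h", "3t")),
          (("1h", "2h"), ("2t", "3t")), log)
identity = graph["M"] == graph["R"] == graph["D"]   # identity breakpoint graph

# Replaying the reversed log restores the leaf genomes.
undo_log = []
for mc, old, new in reverse_transformation(log):
    two_break(graph, mc, old, new, undo_log)
```

In this toy case a single 2-break on the multicolor MR makes all three genomes identical, and replaying the reversed log recovers the original M and R.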
While the initial steps in the transformation of the breakpoint graph $G(P_1, \ldots, P_k)$ into an identity breakpoint graph usually correspond to reliable rearrangements, sooner or later one needs to employ less reliable heuristic arguments in order to complete the transformation. However, sometimes it is preferable to stop after reaching a certain level of reliability even if the transformation is not complete (and the TCMGRP is not solved). In this case we stop short of reconstructing the ancestral genomes since the transformation has not resulted in an identity breakpoint graph. In Supplement C we describe an alternative method (not requiring a solution of the TCMGRP) for reliable reconstruction of (parts of) ancestral genomes (similar to CARs from Ma et al. (2006)) at internal nodes of the phylogenetic tree.

Footnote 5: The use of $\vec{T}$-consistent 2-breaks here is motivated by an important property that every $\vec{T}$-consistent transformation can be turned into a strict $\vec{T}$-consistent transformation by changing the order of 2-breaks. Therefore, we do not directly address the strictness requirement in MGRA, which first produces a $\vec{T}$-consistent transformation of the genomes $P_1, P_2, \ldots, P_k$ into the genome X and then reorders it into a strict transformation.

RESULTS

3 MGRA Algorithm

Supplement A introduces the notion of independent (no breakpoint re-uses), semi-independent (breakpoint re-uses may occur only within single branches of the phylogenetic tree), and weakly-independent rearrangements (breakpoint re-uses are limited to adjacent branches of the phylogenetic tree). MGRA optimally solves the MGRP in the case of semi-independent 2-breaks and uses heuristics to move beyond the semi-independent assumption. Below we show that most 2-breaks in mammalian evolution are either independent, semi-independent, or weakly-independent, resulting in reliable ancestral reconstructions.

3.1 Cycles and paths in the breakpoint graph

Visual inspection of the rather complex breakpoint graph in Fig. 4 (the giant component contains 630 vertices) reveals a large number of cycles and simple paths that are characteristic of independent and semi-independent rearrangement scenarios. MGRA uses the cycles/paths in the breakpoint graphs as guidance for finding reliable ancestral reconstructions.

We note that the immediate result of a 2-break performed along a branch $Q + \overline{Q}$ in the phylogenetic tree T is a cycle of four multi-edges whose multicolors alternate between $Q$ and $\overline{Q}$. All vertices in this cycle have multidegree 2 and represent breakpoints that were not reused. Even if one of these multi-edges is used in later rearrangements, the remaining three multi-edges still form an alternating path that serves as a footprint of the 2-break. This observation motivates a search for alternating paths and cycles in the breakpoint graphs. We introduce the following definitions to analyze such cycles/paths.

We define a simple vertex as a regular vertex of multidegree 2 and a simple multi-edge as a multi-edge connecting two simple vertices. Simple multi-edges form simple cycles/paths in the breakpoint graphs, i.e., cycles/paths in which the multicolors of consecutive multi-edges alternate between $Q$ and $\overline{Q}$. Simple multi-edges/paths/cycles are called good if their multicolors are T-consistent.

Table 1 (the numeric entries were lost in transcription; only the column headers and row labels survive).
Columns: Multicolors | Multi-edges | Simple vertices | Simple multi-edges | Simple paths and cycles
Top table rows: R + MDQHC; MR + DQHC; D + MRQHC; M + RDQHC; MRD + QHC; Q + MRDHC; HC + MRDQ; C + MRDQH; H + MRDQC; QC + MRDH; MRQ + DHC; MD + RQHC; QH + MRDC; RQ + MDHC; DC + MRQH; DQ + MRHC; MRDQHC; MRC + DQH
Bottom table rows: MDQO; O + MDQ; M + DQO; Q + MDO; D + MQO; MO + DQ; MD + QO; MQ + DO

Table 1: Top table: The statistics of the breakpoint graph of the Mouse, Rat, Dog, macaque, Human, and Chimpanzee genomes. For every pair of complementary multicolors, we show the number of multi-edges of these multicolors, the number of simple vertices that are incident to such multi-edges, the number of simple multi-edges, and the number of simple paths and cycles. The T-consistent multicolors are shown in bold. Only 18 out of 32 possible multicolors are shown (the remaining 14 multicolors have zero corresponding multi-edges).
Bottom table: The statistics of the breakpoint graph of the Mouse, Dog, macaque, and Opossum genomes after MGRA Stages 1 and 2 on confident branches (Fig. S18, bottom).

Table 1 (top) describes the statistics of the breakpoint graph and illustrates how rearrangement analysis contributes to the construction of phylogenetic trees. Indeed, all three internal branches (correct tree partitions) are supported by large numbers of good paths/cycles and good multi-edges (86 and 305 for MR+DQHC, 37 and 111 for MRD+QHC, 30 and 87 for HC+MRDQ). Each of the 32 incorrect partitions (only 8 of them are shown in Table 1 (top)) has at most one simple path/cycle and at most six simple multi-edges, an order of magnitude fewer than the non-trivial correct partitions. This observation illustrates that reconstruction of the correct tree topology is a simple exercise in this case (see Chaisson et al. (2006)). This and other statistics produced by MGRA (see below) may be used to determine the phylogenetic tree rather than to assume that it is given. In contrast to Cannarozzi et al. (2007), MGRA provides a large number of certificates supporting the tree topology in Fig. 5. Below we show how MGRA reconstructs the ancestral genomes.

3.2 MGRA Stage 1: Processing good cycles and paths

Alternating cycles represent well-studied objects in the case of pairwise breakpoint graphs. Every such cycle of length $2m$ is formed by $m - 1$ 2-breaks (Alekseyev and Pevzner, 2008) in each most parsimonious scenario (see footnote 6). Therefore, there is little difference between alternating cycles in the pairwise breakpoint graphs and good cycles in the multiple breakpoint graphs: indeed, the good cycles with alternating multicolors $Q$ and $\overline{Q}$ in the breakpoint graph model the rearrangements separating the sets of genomes $Q$ and $\overline{Q}$ in exactly the same way as in the pairwise genome comparison. We therefore argue that such alternating cycles (and the corresponding rearrangements) can be reliably assigned to the branch $Q + \overline{Q}$ in the phylogenetic tree T. This operation generalizes the notion of good rearrangements in MGR by extending them from cycles alternating multicolors $P_i$ and $\overline{P_i} = \{P_1, \ldots, P_{i-1}, P_{i+1}, \ldots, P_k\}$ to cycles alternating any complementary T-consistent multicolors. While MGR attempts to find rearrangements bringing $P_i$ closer to all genomes from $\overline{P_i}$ (i.e., rearrangements on the leaf branches of the phylogenetic tree), MGRA processes reliable rearrangements on all (both leaf and internal) branches of the phylogenetic tree (compare to Zhao and Bourque (2007)).
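The count of $m - 1$ 2-breaks for an alternating cycle of length $2m$ can be checked with a small sketch. It is an illustration under assumptions, not MGRA's implementation: the cycle is given as a vertex list $x_1, \ldots, x_{2m}$, the edges $(x_1, x_2), (x_3, x_4), \ldots$ are taken to carry multicolor $Q$, the remaining edges carry $\overline{Q}$, and the function name is hypothetical. Each 2-break rewires two $Q$-edges so that one of them coincides with a $\overline{Q}$-edge, producing a complete multi-edge.

```python
def resolve_alternating_cycle(cycle):
    """Return the 2-breaks that turn an alternating cycle into complete multi-edges.

    cycle -- vertices x1..x2m; Q-edges are (x1,x2), (x3,x4), ...,
             complement edges are (x2,x3), ..., (x2m,x1).
    """
    m = len(cycle) // 2
    q_edges = {frozenset((cycle[2 * i], cycle[2 * i + 1])) for i in range(m)}
    target = {frozenset((cycle[2 * i + 1], cycle[(2 * i + 2) % (2 * m)]))
              for i in range(m)}
    breaks = []
    a = cycle[0]
    for i in range(m - 1):
        b = cycle[2 * i + 1]                      # current Q-neighbour of x1
        c, d = cycle[2 * i + 2], cycle[2 * i + 3]  # next Q-edge along the cycle
        q_edges -= {frozenset((a, b)), frozenset((c, d))}
        q_edges |= {frozenset((b, c)), frozenset((a, d))}
        breaks.append(((a, b), (c, d)))
    assert q_edges == target   # every Q-edge now coincides with a complement edge
    return breaks


breaks = resolve_alternating_cycle(["x1", "x2", "x3", "x4", "x5", "x6"])
```

For the six-vertex cycle this produces two 2-breaks, on $(x_1, x_2), (x_3, x_4)$ and then on $(x_1, x_4), (x_5, x_6)$, which is the same sequence as the one described in the caption of Fig. 6b.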
Similarly, good paths can also be assigned to branches of the phylogenetic tree by transforming them into good cycles first. Consider a good path $x_1, x_2, \ldots, x_m$ consisting of $m - 1$ multi-edges with T-consistent multicolors alternating between the multicolor $Q$ of the multi-edge $(x_1, x_2)$ and its complement $\overline{Q}$. We extend this path by vertices $x_0$ and $x_{m+1}$ incident to its first and last vertices respectively, resulting in the path $p = (x_0, x_1, x_2, \ldots, x_{m+1})$. If the first and the last multi-edges in this path have the same $\vec{T}$-consistent multicolor, we perform a 2-break over the multi-edges $(x_0, x_1)$ and $(x_m, x_{m+1})$ to transform $p$ into a good cycle $c = (x_1, x_2, \ldots, x_m)$ and a multi-edge $(x_0, x_{m+1})$ (Fig. 6a,c) (see footnote 7). If the first and/or last multi-edges of $p$ are of a non-$\vec{T}$-consistent multicolor, we remove it/them to obtain a path flanked by a $\vec{T}$-consistent multicolor that is processed (if it is longer than one edge) as above.

Note that processing good cycles/paths in the breakpoint graph can create new good cycles/paths. We therefore process the good cycles/paths in an iterative fashion until no more good cycles/paths remain (see footnote 8). Fig. 7 (top panel) shows the breakpoint graph after processing (i.e., removing) good cycles/paths and illustrates that it is significantly simplified as compared to Fig. 4. The size of the giant component is reduced from 630 to 193 vertices and the overall number of vertices (not counting vertices incident to complete multi-edges) is reduced tenfold from 2712 to 253. Fig. 7 (top panel) illustrates how MGRA improves upon MGR: MGR is able to reduce the same graph only three-fold to 924 vertices (414 vertices in the giant component) before it runs out of good rearrangements.

While MGRA Stage 1 greatly reduces the rearrangement distance between the analyzed genomes, Table S5 (center panel) illustrates that it still does not reveal the ancestral genomes MR, MRD, HC, and QHC.
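The step that closes a good path into a good cycle can be sketched as a single 2-break on the two flanking multi-edges. The dictionary encoding of multi-edges, the function name, and the toy vertices are assumptions for illustration only; the fusion case with $x_0 = x_{m+1}$ (Fig. 6d) is deliberately left out.

```python
def close_good_path(edges, path, x0, x_end):
    """Close a good path into a cycle by a 2-break on its flanking multi-edges.

    edges -- dict: frozenset({u, v}) -> multicolor (frozenset of colors)
    path  -- vertices x1..xm of the good path (m >= 2)
    x0, x_end -- vertices flanking the first and last path vertex
    """
    if len(path) < 2 or x0 == x_end:
        raise ValueError("single-vertex paths and the fusion case are not covered")
    first = frozenset((x0, path[0]))
    last = frozenset((path[-1], x_end))
    if edges[first] != edges[last]:
        raise ValueError("flanking multi-edges must share one multicolor")
    q = edges.pop(first)
    edges.pop(last)
    for new_edge in (frozenset((path[0], path[-1])), frozenset((x0, x_end))):
        # Merge with colors that already join the same two vertices, if any.
        edges[new_edge] = edges.get(new_edge, frozenset()) | q
    return edges


FLANK = frozenset("MR")      # multicolor of the flanking multi-edges
INNER = frozenset("DQHC")    # its complement, carried by (x1, x2) and (x3, x4)
edges = {
    frozenset(("x0", "x1")): FLANK,
    frozenset(("x1", "x2")): INNER,
    frozenset(("x2", "x3")): FLANK,
    frozenset(("x3", "x4")): INNER,
    frozenset(("x4", "x5")): FLANK,
}
close_good_path(edges, ["x1", "x2", "x3", "x4"], "x0", "x5")
```

After the call the vertices $x_1, \ldots, x_4$ form a four-edge alternating cycle and $(x_0, x_5)$ is a separate multi-edge of the flanking multicolor.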
Moreover, it is not clear how to derive these ancestors based on the rather complex topology of the breakpoint graph in Fig. 7 (top panel). MGRA Stage 2 introduces the notion of fair cycles/paths that allows one to reveal the rearrangements that violate the semi-independence assumption and to further simplify the graph in Fig. 7 (top panel).

The results of MGRA Stage 1 already reveal valuable insights about the ancestral genomes (even without MGRA Stage 2). To simplify the analysis of the Boreoeutherian ancestral reconstruction (see footnote 9) by MGRA Stage 1, we restrict the set of genomes to single representatives of rodents (mouse), carnivores (dog), and primates (macaque). The resulting breakpoint graph (with obverse edges shown) reveals many long unicolored paths formed by alternating obverse edges and complete multi-edges (Fig. S10). Such paths represent parts of different human chromosomes in the reconstructed ancestor genome. We compress every such path into a single rectangular vertex as shown in Fig. S11 (top panel), resulting in a rather small graph. We further show the chromosomal associations present in this graph in Fig. S12.

We emphasize that MGRA Stage 1 reveals some subtle but reliable adjacencies that other ancestral reconstruction algorithms may miss. In particular, it reveals two adjacencies that are absent in any of the extant genomes and many adjacencies that are present in only one of the extant genomes. The compressed breakpoint graph reveals only 5 complete multi-edges connecting vertices of different colors: [...], [...], [...], 4 + 8, and [...]. These are exactly the same 5 adjacencies 12a + 22a, 12b + 22b, [...], 4a + 8p, [...] revealed in Ma et al. (2006). It also reveals the CARs corresponding to the human chromosomes 2, 2, 5, 6, 7, 8, 9, 10, 10, 11, 17, 18, and X (represented as isolated boxes in Fig. S12), exactly the same as the ancestral chromosomes revealed by previous cytogenetic analysis (Froenicke et al., 2006) (2q, 2pq, 5, 6, 7a, 8q, 9, 10q, 11, 17, 18, X) with a single exception: the second segment from chromosome 10 is identified as an isolated chromosome by us and is tentatively assigned as 10p + 12a + 22a by Froenicke et al. (2006). However, Froenicke et al. (2006) acknowledged that the association of 10p and 12a is only weakly supported (indicated by a question mark in Froenicke et al. (2006)) (see footnote 10). Our analysis also rules out the associations [...], [...], [...], [...], and [...] suggested in Murphy et al. (2005) as weak associations and later criticized by Froenicke et al. (2006) as unreliable. Supplement D further focuses on the connected component of the breakpoint graph representing the human chromosomes 7, 16, and 19, where the cytogenetic approach disagrees with Ma et al. (2006).

Figure 6: Top panel: Processing good paths using a T-consistent red-blue multicolor. a) A good path on vertices $x_1, x_2, \ldots, x_6$ is transformed into a cycle on the same vertices by extending it into $x_0, x_1, x_2, \ldots, x_6, x_7$ and performing a 2-break on the multi-edges $(x_0, x_1)$ and $(x_6, x_7)$. b) Transformation of a good cycle on 6 vertices into complete multi-edges with a 2-break on the multi-edges $(x_1, x_2)$, $(x_3, x_4)$ followed by a 2-break on the multi-edges $(x_1, x_4)$, $(x_5, x_6)$. c) A 2-break on an irregular edge corresponds to a reversal involving chromosome ends. d) A 2-break on two irregular edges corresponds to a fusion. Bottom panel: Two ways of transforming a fair edge $(x, y)$ into a good edge: by a 2-break on yellow edges (top) or by a 2-break on green edges (bottom). In either case, the follow-up processing of the generated simple path results in the same graph with the complete multi-edge $(x, y)$.

Figure 7: The breakpoint graph $G(M, R, D, Q, H, C)$ (the complete multi-edges are not shown) after MGRA Stage 1 (top panel) and after MGRA Stages 1-2 (bottom panel). The edge colors represent Mouse (red), Rat (blue), Dog (green), macaque (violet), Human (orange), and Chimpanzee (yellow) genomes. Vertices are labeled and colored similarly to Fig. 4.

Footnote 6: While this representation is not unique, all these representations are equivalent (i.e., they produce the same final result). Fig. 6b illustrates the transformation of a simple cycle on six vertices into three complete multi-edges with two 2-breaks.

Footnote 7: In the special case when $x_0 = x_{m+1} = \infty$ and the flanking edges are of the same $\vec{T}$-consistent multicolor, we perform a fusion 2-break as shown in Fig. 6d. In the case of $m = 1$ (i.e., when $p$ contains a single simple multi-edge) $c$ represents a complete multi-edge rather than a cycle (Fig. 6c) and does not require further processing.

Footnote 8: One can prove that the topology of the resulting graph does not depend on the order in which good cycles/paths are processed.

Footnote 9: We use the MRD node of the phylogenetic tree in Fig. 5 to approximate the Boreoeutherian ancestor. While this paper focuses on the Boreoeutherian ancestor, MGRA reconstructs ancestral genomes for every node of the phylogenetic tree. We emphasize that while reconstruction starts with selection of the root branch (as in Fig. 5), the choice of this branch and the exact location of the root X on this branch are rather arbitrary and not correlated with a specific ancestral genome of interest (in contrast to the alternative root-driven approach described in Supplement C). As described in Section 3.4, the ancestral genomes are defined by the reverse transformation from the (whatever) root genome X to the leaf genomes. Ideally, different choices of the root branch and locations of the root X itself will result in the same set of ancestral genomes.

Footnote 10: We are not claiming that this association does not exist since it may be present in some of the 100+ genomes with available cytogenetic data. However, there is no support for this association in the six mammalian genomes. We remark that Ma et al. (2006) also did not find support for this association.

3.3 MGRA Stage 2: Processing fair cycles and paths

Table 2 (the matrix layout and most counts were lost in transcription). Row and column labels: M+, R+, MR+, D+, MRD+, Q+, HC+, H+, C+, DQ+, QH+, RD+.

Table 2: The statistics of composite multi-edges (non-zero counts only) in the breakpoint graph $G(M, R, D, Q, H, C)$ after MGRA Stage 1. Each pair of complementary multicolors is denoted by one of its representative multicolors (e.g., M+ stands for the complementary multicolors M + RDQHC). The bold row/column labels correspond to T-consistent (pairs of) multicolors. The grayed cell entries correspond to pairs of adjacent branches in the phylogenetic tree T and account for 87% of all composite multi-edges.

Figure 7 (top panel) reveals many pairs of vertices of multidegree three connected by a multi-edge. Each such multi-edge $(x, y)$ corresponds to six vertices $x, x_1, x_2, y, y_1, y_2$ and five multi-edges $(x, y)$, $(x, x_1)$, $(x, x_2)$, $(y, y_1)$, $(y, y_2)$ (including cases with $x_i = \infty$, $y_i = \infty$, or $x_i = y_j$ for some $1 \leq i, j \leq 2$). A multi-edge $(x, y)$ is called composite if the edges $(x, x_1)$ and $(y, y_1)$ have the same multicolor $Q_1$ and the edges $(x, x_2)$ and $(y, y_2)$ have the same multicolor $Q_2$. A composite multi-edge is called fair if $Q_1$ and $Q_2$ represent T-consistent multicolors (Fig. 6, bottom panel).

Table 2 shows the statistics of composite multi-edges (depending on the pairs of complementary multicolors $Q_1 + \overline{Q_1}$ and $Q_2 + \overline{Q_2}$) and reveals that (i) most composite multi-edges are fair and (ii) while some types of composite multi-edges are common (e.g., (M+, R+), (M+, MR+), (R+, MR+), (MR+, D+), (D+, QHC+), (MR+, QHC+)), others (e.g., (Q+, R+)) are either rare or absent. Table 2 illustrates the extremely biased statistics of composite multi-edges: the branches $Q_1 + \overline{Q_1}$ and $Q_2 + \overline{Q_2}$ corresponding to the multicolors $Q_1$ and $Q_2$ of a composite multi-edge are likely to be adjacent in the phylogenetic tree (compare to the weakly-independent rearrangements).

Table 2 provides yet another illustration of the utility of MGRA for deriving phylogenetic trees. Indeed, it reveals valuable information about the topology of the phylogenetic tree (incident edges) that can be combined with the information (valid partitions) in Table 1 to infer the trees.

Every fair multi-edge $(x, y)$ can be transformed into a good multi-edge by a 2-break (fair 2-break) either on the multi-edges $(x, x_1)$ and $(y, y_1)$ (of multicolor $Q_1$) or on the multi-edges $(x, x_2)$ and $(y, y_2)$ (of multicolor $Q_2$) (Fig. 6, bottom panel). In the former case, $(x, y)$ is transformed into a good multi-edge of color $Q_2$, while in the latter case it is transformed into a good multi-edge of color $Q_1$. The resulting good paths (formed by fair 2-breaks) can be further processed as described in MGRA Stage 1.
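The definitions of composite and fair multi-edges, and the fair 2-break itself, can be sketched as follows. The dictionary encoding, the function names, and the toy example are hypothetical; the set `allowed` stands for whichever family of tree-consistent multicolors is permitted, and the sketch assumes $x_1 \neq y_1$ and no irregular vertices.

```python
def incident(edges, v):
    """Multi-edges touching vertex v, as {edge: multicolor}."""
    return {e: mc for e, mc in edges.items() if v in e}


def is_fair(edges, x, y, allowed):
    """True if (x, y) is a composite multi-edge whose side multicolors are allowed."""
    xy = frozenset((x, y))
    at_x, at_y = incident(edges, x), incident(edges, y)
    if xy not in edges or len(at_x) != 3 or len(at_y) != 3:
        return False
    side_x = {mc for e, mc in at_x.items() if e != xy}
    side_y = {mc for e, mc in at_y.items() if e != xy}
    # Composite: the same two multicolors Q1, Q2 appear on both sides.
    return side_x == side_y and side_x <= allowed


def fair_two_break(edges, x, y, q):
    """2-break on the two multi-edges of multicolor q incident to x and to y."""
    (ex,) = [e for e, mc in incident(edges, x).items() if mc == q]
    (ey,) = [e for e, mc in incident(edges, y).items() if mc == q]
    (x1,) = ex - {x}
    (y1,) = ey - {y}
    del edges[ex], edges[ey]
    for new_edge in (frozenset((x, y)), frozenset((x1, y1))):
        edges[new_edge] = edges.get(new_edge, frozenset()) | q
    return edges


M, R = frozenset("M"), frozenset("R")
edges = {
    frozenset(("x", "y")): frozenset("DQHC"),
    frozenset(("x", "x1")): M, frozenset(("y", "y1")): M,
    frozenset(("x", "x2")): R, frozenset(("y", "y2")): R,
}
fair_before = is_fair(edges, "x", "y", allowed={M, R, frozenset("MR")})
fair_two_break(edges, "x", "y", M)
```

After the 2-break on $Q_1 = M$ the multi-edge $(x, y)$ carries MDQHC and both of its endpoints have multidegree 2, so it alternates with the remaining R edges as a good multi-edge of the pair R + MDQHC.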
An important observation is that the final result of processing a fair multi-edge does not depend on whether we start with a 2-break on the $Q_1$ or the $Q_2$ multicolor (see Fig. 6, bottom panel). A cycle/path in the breakpoint graph is called fair if (i) all its edges are either good or fair and (ii) it can be transformed into a good cycle/path by some fair 2-breaks. MGRA Stage 2 detects fair paths/cycles, transforms them into good paths/cycles by fair 2-breaks, and further processes the resulting good paths/cycles as in MGRA Stage 1. In some cases fair paths in Stage 2 should be chosen with caution since the choice of fair paths may influence ancestral reconstructions in some nodes (see Supplement E). Figure 7 (bottom panel) shows the breakpoint graph after processing fair cycles/paths and illustrates that it becomes so small that it can now be analyzed in a case-by-case fashion by brute-force analysis of every connected component.

3.4 Reconstructing Ancestral Genomes

After removing the vertex $\infty$, the breakpoint graph (after MGRA Stage 2) consists of only 9 connected components (Fig. 7, bottom panel). Five out of the 9 components contain vertices corresponding to both the start and the end of the same synteny block: 80, 610, 795, 1290, and 1300. This is surprising since generally the start and end of a synteny block are not expected to be present in the same (small) connected component unless this block was subject to a micro-inversion (Chaisson et al., 2006). Indeed, blocks 80, 610, 795, 1290, and 1300 turned out to be short (all under 500Kb), with blocks 610, 1290, and 1300 even shorter than 100Kb (block 1300 is 91Kb in human, 41Kb in mouse, and only 10Kb in the dog genome), which is near the threshold of 50Kb used in Ma et al. (2006) for generating reliable synteny blocks. The simplest way to deal with such short blocks is to remove them from the set of input synteny blocks (Supplement J).
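Spotting the synteny blocks whose two ends fall into one connected component amounts to a simple label check, sketched below. The function name and the toy component are assumptions for illustration; vertex labels follow the nt/nh convention of Fig. 4.

```python
def blocks_with_both_ends(component):
    """Synteny blocks whose tail (nt) and head (nh) both lie in one component."""
    tails = {v[:-1] for v in component if v.endswith("t")}
    heads = {v[:-1] for v in component if v.endswith("h")}
    return sorted(tails & heads, key=int)


# Toy component (not copied from Fig. 7): block 80 has both ends present.
component = {"80h", "84t", "83h", "80t", "81t"}
suspects = blocks_with_both_ends(component)
```

Blocks reported this way are candidates for micro-inversions and, under the approach described above, for removal from the input.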
Such removal will not significantly affect the architecture of the ancestral genomes (indeed, these blocks are well below the resolution of the cytogenetic approaches) while at the same time resolving 5 out of the 9 remaining components in the graph. Supplement F describes a different approach that attempts to find the positions and orientations of such short synteny blocks in the ancestors by processing complex breakpoints (MGRA Stage 3). We remark that processing at MGRA Stage 3 is viewed as less reliable and the resulting associations are not considered in the proposed ancestral reconstructions (see below).

Recall that a strict T-consistent rearrangement scenario uniquely defines ancestral genomes at all internal nodes of the phylogenetic tree T. However, because of the use of 2-breaks instead of reversals/translocations/fissions/fusions, the ancestral genomes initially obtained by MGRA may contain (a small number of) circular chromosomes. Whenever possible, MGRA linearizes them by rearranging 2-breaks in the transformation. While circular chromosomes may occasionally appear in the initial rearrangement scenario obtained by MGRA, their appearance is a result of either 2-breaks applied in the wrong order (which can be avoided by reordering the 2-breaks (see Pevzner (2000))), or a shortcut in processing hurdles that can be remedied by introducing additional 2-breaks (Hannenhalli and Pevzner, 1999). MGRA eliminates possible circular chromosomes in the reconstructed genomes at the post-processing stage. We emphasize that the outcome of MGRA is the set of ancestral (linear) genomes, while the 2-break rearrangement scenario produced by MGRA is considered only as a starting point for constructing the reversals/translocations/fusions/fissions scenario. An optimal linear rearrangement scenario can be found by applying GRIMM to the ancestral genomes reconstructed by MGRA.

Fig. S14 illustrates the results of ancestral genome reconstruction for the chromosome X for six mammalian genomes. Supplement H shows the pairwise rearrangement distances between the ancestral and leaf genomes, following the strict T-consistent transformation constructed by MGRA, and compares them to the genomic distances computed by GRIMM (Tesler, 2002b). The differences between these distances are rather small, suggesting that the T-consistent transformation found by MGRA is close to the most parsimonious.

4 Benchmarking MGRA

Benchmarking of ancestral genome reconstruction algorithms may be challenging since the architecture of ancestral genomes is not known. While MGR, GRAPPA, infercars, and MGRA showed excellent performance on simulated datasets, these benchmarks were mainly designed for rearrangements generated according to the Random Breakage Model (RBM).
Since MGRA improves on MGR and is guaranteed to produce optimal solutions for semi-independent scenarios, it is bound to provide even better results than MGR on such benchmarks. Supplement L compares MGRA and infercars on simulated data and illustrates that MGRA generates more accurate ancestral reconstructions for all choices of parameters. However, analyzing all these tools on simulated data may generate overoptimistic results since RBM does not reflect the realities of mammalian evolution (Murphy et al., 2005; van der Wind et al., 2004; Bailey et al., 2004; Zhao et al., 2004; Webber and Ponting, 2005; Hinsch and Hannenhalli, 2006; Ruiz-Herrera et al., 2006; Yue and Haaf, 2006; Kikuta et al., 2007; Mehan et al., 2007; Caceres et al., 2007; Gordon et al., 2007). We therefore decided to analyze the differences between the MGRA and infercars reconstructions and to further track evidence for each such difference in a case-by-case fashion.

MGRA and infercars produce highly consistent ancestral reconstructions. For illustration purposes, we have chosen to focus on the reconstruction of the MRD ancestral genome (Fig. 5), remarking that the results for the other ancestor genomes are similar. As an input to infercars we provided six mammalian genomes and the same phylogenetic tree as used in Ma et al. (2006). The MRD genomes reconstructed by MGRA and infercars consist of 25 and 30 chromosomes (CARs) respectively (see footnote 11). However, MGRA does not consider associations obtained at Stage 3 (Fig. 7, bottom panel) as reliable. Most of these associations correspond to micro-inversions and thus do not significantly affect the ancestral reconstructions.

Footnote 11: infercars reconstructions slightly differ from those reported in Ma et al. (2006) since we use the synteny blocks from the latest builds of mammalian genomes (provided by Jian Ma). Similarly to Ma et al. (2006) and Kemkemer et al. (2006), we ignore very short CARs blocks in both infercars and MGRA reconstructions to simplify the analysis (see Table S14).

4.1 Comparison of two infercars reconstructions and using MGRA to improve infercars ancestral reconstructions

We start by comparing infercars with itself on two inputs: the original 6 mammalian genomes M, R, D, Q, H, C and the genomes M', R', D', Q', H', C' produced by MGRA Stage 1 (Fig. 7, top panel). We denote the reconstructed MRD genomes as MRD_CARs and MRD'_CARs, respectively. Since MGRA Stage 1 processes only good cycles/paths that are unambiguously present in every optimal rearrangement scenario, one can safely assume that any optimal ancestral reconstruction should include the rearrangements performed at Stage 1. Therefore, running infercars on the M, R, D, Q, H, C genomes should ideally produce the same results as running infercars on the equivalent M', R', D', Q', H', C' genomes. However, since infercars makes some greedy decisions and does not claim optimality, it is not guaranteed to produce the same results on M, R, D, Q, H, C as on M', R', D', Q', H', C'. Any such inconsistency would point either to somewhat less reliable CARs reconstructed by infercars or to reliable adjacencies missed by infercars. Therefore, infercars reconstructions can potentially be improved if MGRA Stage 1 is run before infercars as a pre-processing step.

Comparison of the reconstructed genomes MRD_CARs and MRD'_CARs indicates that while they share the overwhelming majority (99.0%) of reconstructed adjacencies, there are 13 adjacencies present in MRD_CARs but absent in MRD'_CARs and 13 adjacencies absent in MRD_CARs but present in MRD'_CARs (out of the 1325 reconstructed adjacencies). Fig. 8 (top) displays the breakpoint graph between the corresponding MRD_CARs and MRD'_CARs reconstructions and reveals that the MRD'_CARs reconstruction is arguably more reliable than the MRD_CARs reconstruction. Indeed, Fig. 8 (top) reveals that while most of the adjacencies (12 out of 13) present in MRD_CARs but not in MRD'_CARs correspond to ambiguous CARs joins (in terms of Ma et al. (2006)), MRD'_CARs contains 4 reliable adjacencies (i.e., resolved by MGRA Stage 1) that are nevertheless absent in MRD_CARs. To resolve the conflicts between infercars results on equivalent inputs, we analyze each of these adjacencies ((658h, 652h), (871t, 873t), (770t, 771t), and (1014t, 1017h)) case by case. For example, in the case of the (658h, 652h) adjacency, infercars failed to connect them, since the vertices 658h and 652h represent breakpoints of multidegree 3 (Fig. 8, bottom panels) and it is not immediately clear how to process them using the local rules employed by infercars. infercars turns 658h into a CAR end in MRD_CARs although it is not a chromosome end in any of the six genomes. The breakpoint graph provides a clear view of the connection between 658h and 652h by revealing good paths connecting them (see Fig. 8, bottom panels).

4.2 Comparison of infercars and MGRA reconstructions

Fig. S19 displays the breakpoint graph of the three MRD reconstructions, MRD_MGRA, MRD_CARs, and MRD'_CARs, and illustrates that the number of differences between MRD_CARs and MRD'_CARs (we consider the latter reconstruction to be more reliable) is comparable to the number of differences between MRD_MGRA and MRD'_CARs. Indeed, MRD'_CARs differs from MRD_CARs by 30 adjacencies and differs from MRD_MGRA by 39 adjacencies. Since the large-scale architecture of MRD_CARs was shown to be largely consistent with previous cytogenetic reconstructions (Ma et al., 2006) and since MRD'_CARs (which is arguably even more reliable than MRD_CARs) and MRD_MGRA share at least 98.5% of all adjacencies, all these reconstructions can be viewed as largely consistent with the cytogenetics-based reconstructions. Remarkably, most differences between MGRA and infercars reconstructions are represented by ambiguous joins that MGRA labels as less reliable anyway (shown as dashed edges).
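The adjacency counts quoted above (shared percentage, adjacencies present in one reconstruction but absent in the other) amount to set operations on unordered pairs of block ends. The following is a minimal sketch under that assumption; the function name `compare_reconstructions`, the toy adjacency lists, and the choice to report the shared fraction relative to the first reconstruction are illustrative and are not taken from infercars or MGRA.

```python
def compare_reconstructions(adjs_a, adjs_b):
    """Compare two ancestral reconstructions given as lists of adjacencies.

    Each adjacency is a pair of block ends such as ("658h", "652h").
    Pairs are treated as unordered, so ("a", "b") equals ("b", "a").
    """
    a = {frozenset(adj) for adj in adjs_a}
    b = {frozenset(adj) for adj in adjs_b}
    shared = a & b
    return {
        "shared": shared,
        "only_a": a - b,  # present in the first reconstruction only
        "only_b": b - a,  # present in the second reconstruction only
        # Assumption: the fraction is taken relative to the first reconstruction.
        "shared_fraction_of_a": len(shared) / len(a) if a else 1.0,
    }


# Hypothetical toy data (not the mammalian data set discussed in the text).
toy_plain = [("1h", "2t"), ("2h", "3t"), ("3h", "4t")]
toy_preprocessed = [("2t", "1h"), ("2h", "3t"), ("3h", "5t")]

result = compare_reconstructions(toy_plain, toy_preprocessed)
print(len(result["shared"]), len(result["only_a"]), len(result["only_b"]))
```

With the figures reported in Section 4.1, 13 differing adjacencies out of 1325 leave 1312 shared ones, and 1312/1325 is approximately 0.990, which matches the quoted 99.0% if the percentage is computed this way.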
In particular, infercars reports eight less reliable adjacencies as unambiguous (complete multi-edges with dashed purple edges in Fig. S19). However, most of them correspond to microinversions and have minor effects on the large-scale ancestral architectures (see Supplement I for a detailed comparison of MGRA and infercars reconstructions). Table 3 shows the genomic distances from MRD_MGRA and MRD_CARs to each of the six leaf genomes and illustrates that MGRA results in a slightly more parsimonious scenario as compared to infercars (the total distance is 1503 for MGRA and 1518 for infercars).

Figure 8: Top panel: The breakpoint graph of the genomes MRD_CARs (cyan) and MRD'_CARs (orange) reconstructed by infercars (common adjacencies are not shown). Bold edges represent reliable adjacencies (resolved by MGRA Stage 1), while dashed edges represent ambiguous joins (see Ma et al. (2006)) made by infercars. Vertex colors are coded as in Fig. 4. Bottom panel: A most parsimonious transformation of one connected component (containing vertices 658h and 652h) of the breakpoint graph G(M, R, D, Q, H, C) from Fig. 4. The initial component (first panel) is transformed with a 2-break in primates (second panel), a 2-break in rodents (third panel), and two 2-breaks in dog resulting from processing of a good D + MRQCH path (fourth panel).

Figure 9: Left panel: The primate-rodent-carnivore controversy: an alternative between the primate-rodent (green tree) and the primate-carnivore clades (red tree). Right panel: The phylogenetic tree T of seven mammalian genomes: Mouse (red), Rat (blue), Dog (green), Macaque (violet), Human (orange), Chimpanzee (yellow), and Opossum (brown). Since the Opossum branch is subject to a controversy, the dashed branches represent possible variations while the solid branches are confident and do not depend on the Opossum branch.

Table 3 (rows: MRD_CARs, MRD_MGRA; columns: M, R, D, Q, H, C, Total): The genomic distances between the MRD reconstructions MRD_CARs and MRD_MGRA and the genomes M, R, D, Q, H, C.

4.3 The Primate-Rodent-Carnivore Split in Mammalian Evolution

Knowledge of the correct phylogeny is an important prerequisite for many comparative genomics approaches (Blanchette and Tompa, 2002; Kellis et al., 2003). However, even the basic features of the mammalian phylogeny (e.g., the primate-rodent-carnivore split) remain controversial (Fig. 9). While the morphology studies support the primate-rodent clade (Shoshani and McKenna, 1998), the early molecular studies supported the primate-carnivore clade (Graur, 1993; Kumar and Hedges, 1998; Reyes et al., 2000; Janke et al., 1994). Although starting from Murphy et al.
(2001) the phylogeny based on the primate-rodent clade (Madsen et al., 2001; Poux et al., 2002; Amrine-Madsen et al., 2003; Thomas et al., 2003; Reyes et al., 2004) has become widely accepted, the question is far from being settled: recent studies provided arguments against the primate-rodent clade (Jorgensen et al., 2005; Arnason et al., 2002; Misawa and Janke, 2003; Cannarozzi et al., 2007; Huttley et al., 2007; Niimura and Nei, 2007; Huerta-Cepas et al., 2007). Below we analyze some rearrangement-based characters supporting both the primate-carnivore and (to a smaller extent) the primate-rodent clade. Similarly to other approaches, the rearrangement analysis reveals some pros and cons for each alternative but does not definitively resolve the long-standing controversy. Chaisson et al. (2006) made an attempt to analyze mammalian phylogeny using micro-rearrangements in the CFTR region representing 0.06% of mammalian genomes. However, the small size of this region and ambiguities in revealing micro-rearrangements between distant mammals made it difficult to find micro-rearrangements that can certify the deep branches of the mammalian phylogenetic tree. Cannarozzi et al. (2007) made an attempt to analyze large-scale rearrangements (as opposed to micro-


More information

Mathematics of Evolution and Phylogeny. Edited by Olivier Gascuel

Mathematics of Evolution and Phylogeny. Edited by Olivier Gascuel Mathematics of Evolution and Phylogeny Edited by Olivier Gascuel CLARENDON PRESS. OXFORD 2004 iv CONTENTS 12 Reconstructing Phylogenies from Gene-Content and Gene-Order Data 1 12.1 Introduction: Phylogenies

More information

arxiv: v1 [cs.dm] 29 Oct 2012

arxiv: v1 [cs.dm] 29 Oct 2012 arxiv:1210.7684v1 [cs.dm] 29 Oct 2012 Square-Root Finding Problem In Graphs, A Complete Dichotomy Theorem. Babak Farzad 1 and Majid Karimi 2 Department of Mathematics Brock University, St. Catharines,

More information

Permutations and Combinations

Permutations and Combinations Permutations and Combinations Permutations Definition: Let S be a set with n elements A permutation of S is an ordered list (arrangement) of its elements For r = 1,..., n an r-permutation of S is an ordered

More information

Advances in Phylogeny Reconstruction from Gene Order and Content Data

Advances in Phylogeny Reconstruction from Gene Order and Content Data Advances in Phylogeny Reconstruction from Gene Order and Content Data Bernard M.E. Moret Department of Computer Science, University of New Mexico, Albuquerque NM 87131 Tandy Warnow Department of Computer

More information

Greedy Algorithms. CS 498 SS Saurabh Sinha

Greedy Algorithms. CS 498 SS Saurabh Sinha Greedy Algorithms CS 498 SS Saurabh Sinha Chapter 5.5 A greedy approach to the motif finding problem Given t sequences of length n each, to find a profile matrix of length l. Enumerative approach O(l n

More information

Codes on graphs. Chapter Elementary realizations of linear block codes

Codes on graphs. Chapter Elementary realizations of linear block codes Chapter 11 Codes on graphs In this chapter we will introduce the subject of codes on graphs. This subject forms an intellectual foundation for all known classes of capacity-approaching codes, including

More information

Algorithms Exam TIN093 /DIT602

Algorithms Exam TIN093 /DIT602 Algorithms Exam TIN093 /DIT602 Course: Algorithms Course code: TIN 093, TIN 092 (CTH), DIT 602 (GU) Date, time: 21st October 2017, 14:00 18:00 Building: SBM Responsible teacher: Peter Damaschke, Tel. 5405

More information

Preliminaries and Complexity Theory

Preliminaries and Complexity Theory Preliminaries and Complexity Theory Oleksandr Romanko CAS 746 - Advanced Topics in Combinatorial Optimization McMaster University, January 16, 2006 Introduction Book structure: 2 Part I Linear Algebra

More information

4-coloring P 6 -free graphs with no induced 5-cycles

4-coloring P 6 -free graphs with no induced 5-cycles 4-coloring P 6 -free graphs with no induced 5-cycles Maria Chudnovsky Department of Mathematics, Princeton University 68 Washington Rd, Princeton NJ 08544, USA mchudnov@math.princeton.edu Peter Maceli,

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S3 (box) Methods Methods Genome weighting The currently available collection of archaeal and bacterial genomes has a highly biased distribution of isolates across taxa. For example,

More information

Exact and Approximate Equilibria for Optimal Group Network Formation

Exact and Approximate Equilibria for Optimal Group Network Formation Exact and Approximate Equilibria for Optimal Group Network Formation Elliot Anshelevich and Bugra Caskurlu Computer Science Department, RPI, 110 8th Street, Troy, NY 12180 {eanshel,caskub}@cs.rpi.edu Abstract.

More information

Induced Subgraph Isomorphism on proper interval and bipartite permutation graphs

Induced Subgraph Isomorphism on proper interval and bipartite permutation graphs Induced Subgraph Isomorphism on proper interval and bipartite permutation graphs Pinar Heggernes Pim van t Hof Daniel Meister Yngve Villanger Abstract Given two graphs G and H as input, the Induced Subgraph

More information

Cleaning Interval Graphs

Cleaning Interval Graphs Cleaning Interval Graphs Dániel Marx and Ildikó Schlotter Department of Computer Science and Information Theory, Budapest University of Technology and Economics, H-1521 Budapest, Hungary. {dmarx,ildi}@cs.bme.hu

More information

1.3 Vertex Degrees. Vertex Degree for Undirected Graphs: Let G be an undirected. Vertex Degree for Digraphs: Let D be a digraph and y V (D).

1.3 Vertex Degrees. Vertex Degree for Undirected Graphs: Let G be an undirected. Vertex Degree for Digraphs: Let D be a digraph and y V (D). 1.3. VERTEX DEGREES 11 1.3 Vertex Degrees Vertex Degree for Undirected Graphs: Let G be an undirected graph and x V (G). The degree d G (x) of x in G: the number of edges incident with x, each loop counting

More information

Fast Phylogenetic Methods for the Analysis of Genome Rearrangement Data: An Empirical Study

Fast Phylogenetic Methods for the Analysis of Genome Rearrangement Data: An Empirical Study Fast Phylogenetic Methods for the Analysis of Genome Rearrangement Data: An Empirical Study Li-San Wang Robert K. Jansen Dept. of Computer Sciences Section of Integrative Biology University of Texas, Austin,

More information

Discrete Applied Mathematics

Discrete Applied Mathematics Discrete Applied Mathematics 159 (2011) 1641 1645 Contents lists available at ScienceDirect Discrete Applied Mathematics journal homepage: www.elsevier.com/locate/dam Note Girth of pancake graphs Phillip

More information

arxiv: v1 [cs.cc] 9 Oct 2014

arxiv: v1 [cs.cc] 9 Oct 2014 Satisfying ternary permutation constraints by multiple linear orders or phylogenetic trees Leo van Iersel, Steven Kelk, Nela Lekić, Simone Linz May 7, 08 arxiv:40.7v [cs.cc] 9 Oct 04 Abstract A ternary

More information

Genomic Midpoints: Computation. and Evolutionary Implications

Genomic Midpoints: Computation. and Evolutionary Implications Genomic Midpoints: Computation and Evolutionary Implications Richard Durrett* and Yannet Interian Dept of Mathematics, Cornell University, Ithaca NY 14853* Dept of Bioengineering, U. of California, Berkeley

More information

Recent Advances in Phylogeny Reconstruction

Recent Advances in Phylogeny Reconstruction Recent Advances in Phylogeny Reconstruction from Gene-Order Data Bernard M.E. Moret Department of Computer Science University of New Mexico Albuquerque, NM 87131 Department Colloqium p.1/41 Collaborators

More information

Reconstruction of Ancestral Genome subject to Whole Genome Duplication, Speciation, Rearrangement and Loss

Reconstruction of Ancestral Genome subject to Whole Genome Duplication, Speciation, Rearrangement and Loss Reconstruction of Ancestral Genome subject to Whole Genome Duplication, Speciation, Rearrangement and Loss Denis Bertrand, Yves Gagnon, Mathieu Blanchette 2, and Nadia El-Mabrouk DIRO, Université de Montréal,

More information

Tree sets. Reinhard Diestel

Tree sets. Reinhard Diestel 1 Tree sets Reinhard Diestel Abstract We study an abstract notion of tree structure which generalizes treedecompositions of graphs and matroids. Unlike tree-decompositions, which are too closely linked

More information

CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary

CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary CSCI1950 Z Computa4onal Methods for Biology Lecture 4 Ben Raphael February 2, 2009 hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary Parsimony Probabilis4c Method Input Output Sankoff s & Fitch

More information