Evolution s cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes

Size: px
Start display at page:

Download "Evolution s cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes"

Transcription

1 Evolution s cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes W. James Kent*, Robert Baertsch*, Angie Hinrichs*, Webb Miller, and David Haussler *Center for Biomolecular Science and Engineering and Howard Hughes Medical Institute, Department of Computer Science, University of California, Santa Cruz, CA 95064; and Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA Edited by Michael S. Waterman, University of Southern California, Los Angeles, CA, and approved July 11, 2003 (received for review April 9, 2003) This study examines genomic duplications, deletions, and rearrangements that have happened at scales ranging from a single base to complete chromosomes by comparing the mouse and human genomes. From whole-genome sequence alignments, 344 large (>100-kb) blocks of conserved synteny are evident, but these are further fragmented by smaller-scale evolutionary events. Excluding transposon insertions, on average in each megabase of genomic alignment we observe two inversions, 17 duplications (five tandem or nearly tandem), seven transpositions, and 200 deletions of 100 bases or more. This includes 160 inversions and 75 duplications or transpositions of length >100 kb. The frequencies of these smaller events are not substantially higher in finished portions in the assembly. Many of the smaller transpositions are processed pseudogenes; we define a syntenic subset of the alignments that excludes these and other small-scale transpositions. These alignments provide evidence that 2% of the genes in the human mouse common ancestor have been deleted or partially deleted in the mouse. There also appears to be slightly less nontransposon-induced genome duplication in the mouse than in the human lineage. Although some of the events we detect are possibly due to misassemblies or missing data in the current genome sequence or to the limitations of our methods, most are likely to represent genuine evolutionary events. To make these observations, we developed new alignment techniques that can handle large gaps in a robust fashion and discriminate between orthologous and paralogous alignments. comparative genomics cross-species alignments synteny chromosomal inversion breakpoints Evolution creates new forms and functions from the interplay of reproduction, variation, and selection. There are many types of variation; the most common and well studied is the substitution of one base for another. Small insertions and deletions are also quite common. Large-scale insertions usually involve the duplication of part of the genome. These duplications can be the starting point for the development of a new gene with a new function. The evolution of nonduplicated genes generally is quite constrained by selection, because the existing function of the gene must be maintained. After duplication, one copy is free to lose its original function and possibly assume a new function (1, 2). Deletion and rearrangement also play important roles in the long-term evolution of genomes. This study examines patterns of variation observed at all scales by comparing the human and mouse genomes to each other. Human and mouse are at an excellent distance for studying all types of variation. The genomes are still similar enough that it is possible to align the majority of orthologous sequence at the DNA level (3) yet distant enough that a great deal of variation has had the opportunity to accumulate. Chromosomal rearrangements of 1 megabase can be observed by comparing genetic maps between organisms (4) and by chromosome painting (5). Approximately 200 conserved blocks of synteny between human and mouse were discovered by gene order comparisons before the genome sequences became available, with recent estimates ranging from 98 (6) to 529 blocks (7), depending on details of definition and method. The length distribution of synteny blocks was found to be consistent with the theory of random breakage introduced by Nadeau and Taylor (8, 9) before significant gene order data became available. In recent comparisons of the human and mouse genomes, rearrangements of 100,000 bases were studied by comparing 558,000 highly conserved short sequence alignments (average length 340 bp) within 300-kb windows. An estimated 217 blocks of conserved synteny were found, formed from 342 conserved segments, with length distribution roughly consistent with the random breakage model (3). Subsequent analysis of these data found 281 conserved synteny blocks of size at least 1 megabase, with a few thousand further microrearrangements within these blocks, about one per megabase (10). The most common variations are single-base transitions, that is C T and G A substitutions (11, 12). Single-base insertions and deletions are also quite common, although they are rapidly selected out of coding regions. Substitutions and small ( 20- base) insertions and deletions can be studied in traditional nucleotide alignments of homologous genomic sequences. A traditional pairwise alignment consists of two segments of genomic DNA with gap characters put in to maximize the number of matching bases. A simple example is ACAGTAACTCGGGAG ACGTG---TCG-GAG. If the two sequences are derived from a common ancestor, then a mismatch can result from a substitution in either sequence relative to their common ancestor. Similarly, an alignment gap could be caused either by an insertion in one sequence or a deletion in the other. At the heart of the pairwise alignment process is a scoring function that assigns positive values to matching nucleotides and negative values to mismatches and gaps. Most modern programs use what is called affine gap scoring, where the first gap character in a gap incurs a substantial gap opening cost, and each subsequent gap character incurs a somewhat lesser gap extension cost. Because gaps are frequently more than a single base long, affine scoring schemes model the underlying biological processes much better than fixed gap scoring systems. Affine gap scores generally work fairly well for protein alignments, where gaps are rare and tend to be short but do not represent the frequency of longer gaps as well (13 15). Nucleotide alignments, particularly outside of coding regions, tend to require many more gaps than protein alignments, and some of the gaps can be too large to be found by traditional pairwise alignment programs (16). Furthermore, in traditional pairwise alignment programs, at any given location a gap can occur only on one sequence. Independent deletion events in each species that delete some but not all of the same ancestral sequence cannot be represented, even though these are quite common. In these instances, the alignment program will either break the alignment into two or This paper was submitted directly (Track II) to the PNAS office. To whom correspondence should be addressed. kent@biology.ucsc.edu by The National Academy of Sciences of the USA PNAS September 30, 2003 vol. 100 no cgi doi pnas

2 Fig. 1. Mouse human alignments at Actinin -3 before and after chaining and netting, as displayed at the genome browser at http: genome.ucsc.edu. The RefSeq genes track shows the exon intron structure of this human gene, which has an ortholog as well as several paralogs and pseudogenes in the mouse. The all BLASTZ Mouse track shows BLASTZ alignments colored by mouse chromosome. The orthologous gene is on mouse chromosome 19, which is colored purple. Although BLASTZ finds the homology in a very sensitive manner, it is fragmented. The chained BLASTZ track shows the alignments after chaining. The chaining links related fragments. The orthologous genes and paralogs are each in a single piece. The chaining also merges some redundant alignments and eliminates a few very low-scoring isolated alignments. The Mouse Human Alignment Net track is designed to show only the orthologous alignments. In this case, there has been no rearrangement other than moderate-sized insertions and deletions, so the net track is quite simple. Clicking on a chain or net track allows the user to open a new browser on the corresponding region in the other species. force nonhomologous bases to align. Traditional programs are also not able to accommodate inversions, translocations, or duplications. They can align only shorter segments of genomic DNA in which none of these events has occurred. Therefore, whereas considerable analysis of variation at the single base scale is available from traditional sequence alignments to complement the analysis of large-scale rearrangements, analyzing variation at the middle scales is still an interesting challenge. The role of segmental duplication of several thousand to several million bases in our own evolution (17) is one indication of the importance of variation at the middle scales. In this article, we describe automated methods for linking together traditional alignments into larger structures, chains and nets, that can effectively bridge the gap between chromosome painting and sequence alignment (Fig. 1). These methods can accommodate inversions, translocations, duplications, largescale deletions, and overlapping deletions. We apply these tools to BLASTZ-generated alignments (18) to investigate the patterns of variation that have occurred at all scales since the divergence of the mouse and human lineages. Methods The November 2002 freeze of the human genome and the February 2002 freeze of the mouse genome were taken from http: genome.ucsc.edu. The two genomes were aligned by the BLASTZ program as described in ref. 18, except that in addition to masking out transposon repeats [identified by REPEATMASKER (19)], simple repeats of period 12 or less found by TANDEM REPEAT FINDER (20) were masked out, and the artifact-prone chrun mouse sequence, a sequence that could not be assembled into sizeable contigs or effectively mapped to the mouse chromosomes, was excluded. The BLASTZ alignments were converted to axt format (see ref. 18) by using the program LAVTOAXT. By using the same nucleotide scoring matrix as BLASTZ but a novel piecewise linear gap scoring scheme, a new program, AXTCHAIN, formed maximally scoring chained alignments out of the gapless subsections of the input alignments (Fig. 1). A chained alignment (or chain ) between two species consists of an ordered sequence of traditional pairwise nucleotide alignments ( blocks ) separated by larger gaps, some of which may be simultaneous gaps in both species. The order of blocks within the chain must be consistent with the genomic sequence order in both species. Thus, a chain cannot have local inversions, translocations, or duplications among the parts of the DNA that it aligns. However, a chain is allowed to skip over segments of DNA in either or both species. In particular, intervening DNA in one species that does not align with the other because it is locally inverted or has been inserted in by lineage-specific translocation or duplication is skipped over during construction of the chain. Thus, the chain can represent widely scattered pieces of genomic DNA in the two contemporary species that are in fact descended from a single genomic segment in the common ancestor without rearrangement. To build chains efficiently, AXTCHAIN uses a variation of the k-dimensional tree (kd-tree) based algorithm described in ref. 21. To detect cases of overlapping deletions in both species, scoring is defined such that the alignment program will typically open a simultaneous gap rather than forcing alignments of nonhomologous regions, because penalties incurred by a run of mismatches exceed the simultaneous gap penalty. This strategy for allowing simultaneous gaps also helps cope with local inversions and missing sequence (blocks of Ns) in unfinished genomes. The resulting chains often span multiple megabases. For many purposes, one wants to select only a single best alignment for every region of the human genome. Previously (18), we developed a program, AXTBEST, for this purpose. However, the chains in many ways represent a better substrate for picking the best alignment than single BLASTZ alignments. We developed a new program, CHAINNET, to improve on this process. In this program, all bases in all chromosomes are initially marked as unused. The chains are then put into a list sorted with the highest-scoring chain first. The program goes into a loop, at each iteration taking the next chain off of the list, throwing out the parts of the chain that intersect with bases EVOLUTION Kent et al. PNAS September 30, 2003 vol. 100 no

3 Table 1. Comparison of BLASTZ alignments chains before and after processing with AXTCHAIN and after building the human net BLASTZ AXTCHAIN Number of alignments chains 8,560, ,445 Longest human span, bp 63, ,044,604 Average human span, bp ,830 Most aligning bases, bp 59,559 27,056,473 Average aligning bases, bp 574 7,062 Bases aligned in human genome, % already covered by previously taken chains, and then marking the bases that are left in the chain as covered. The program uses red black trees to keep track of which areas of a chromosome are already covered. If a chain covers bases that are in a gap in a previously taken chain, it is marked as a child of the previous chain. In this way, a hierarchy of chains is formed that we call a net. The CHAINNET program also keeps track of which bases are covered by more than one chain to help distinguish duplicated from nonduplicated regions. The resulting net files are further annotated by the program NETSYNTENY, which notes which chains in the net are inverted relative to their parents, displaced, or on different chromosomes. Additional annotations on repeats, including lineage-specific repeats and repeats that predate the mouse human split, are added by the NETCLASS program. Short chains embedded within a longer chain that come from regions distinct from that of the longer chain, as well as short chains between long chains at the top level of the net, are often processed pseudogenes. Because these can confound many types of analysis, we also prepared a subset of the chains that are judged to be syntenic. To be considered syntenic, a chain has to either have a very high score itself or be embedded in a larger chain, on the same chromosome, and come from the same region as the larger chain. Thus, inversions and tandem duplications are considered syntenic. The syntenic subset of the alignments was created with the NETFILTER program by using the syn flag. Although NETFILTER does not eliminate pseudogenes that arise from tandem duplication, it does eliminate processed pseudogenes, which are much more plentiful. Further details of these algorithms are described in the source code, which, along with Linux executables for LAVTOAXT, AXTCHAIN, CHAINNET, NETSYNTENY, NETCLASS, and NETFILTER, are available at kent. The chains and nets are displayed alongside other annotations at the genome browser at http: genome.ucsc.edu (22) and may also be downloaded in bulk from that site. Results The initial BLASTZ mouse alignments cover 35.9% of the human genome. This is less than the 39.9% reported in ref. 3, due to masking of the tandem repeats of period 12 and less and removal of the artifact-prone mouse sequence in ChrUn. The construction of longer chains from the initial BLASTZ alignments resulted in fewer and substantially longer alignments (Fig. 1). Chained alignment length can be measured in two different ways. We define the (human) span of a chain to be the distance in bases in the human genome from the first to the last human base in the chain, including gaps, and we define the size of the chain as the number of aligning bases in it, not including gaps. Both of these showed substantial increases due to chain construction, with the average human span increasing from 608 to 22,830 bp and the average size increasing from 574 to 7,062 bp. Because many smaller BLASTZ alignments were often merged into a single longer chain, there was also a substantial reduction in the total number of alignments involved in the human mouse comparison, from 8.5 million to 150,000. In some cases, small low-scoring BLASTZ alignments are discarded after chaining as well. This results in a decrease from 35.9% to 34.6% of the bases in human genome being aligned to mouse. These and other comparative statistics are listed in Table 1. The genomewide collection of these chains was used to study the length distribution of gaps induced by indels of all sizes since the divergence of these species, from single base to multikilobase-size indels. The affine gap score model conventionally used in sequence alignment programs corresponds to a statistical model where indel sizes follow a geometric distribution, in which the number of gaps of size N 1 would be a constant fraction of the number of gaps of size N for all N. The length distribution we observed violates this model. Fig. 2 shows a histogram of gap sizes observed in the set of chains we constructed, with frequency of occurrence for various indel lengths plotted on a logarithmic scale. A geometric distribution would appear as a straight line on such a plot. In actuality, the plot is fairly complex. In general, there are more short and long gaps than we would see in a geometric distribution, and there are sharp spikes in the numbers of gaps observed in the human sequence at 300 bases corresponding to ALU insertions (19). Fig. 2. (a) Small insertions and deletions. (b) Large insertions and deletions and transposons. Counts of genomewide insertions and deletions plotted vs. their size. The blue line shows a combination of human insertions and mouse deletions, whereas the red line shows mouse insertions and human deletions. The vertical scale is the natural logarithm of the number of insertions deletions of that size. In b, the counts are grouped in bins of 25. The green line shows the percentage of bases in mouse gaps of that size that are covered by human-specific transposons. The peaks in the insertion deletion graphs appear to be due to transposons cgi doi pnas Kent et al.

4 Fig. 3. (a) Log histogram of gap frequencies for gaps 50 bases long. (b) Log histogram of gap frequencies for gaps up to 1,500 bases long. (c) Log histogram of gap frequencies for gaps up to 50,000 bases long. Relative frequency of simultaneous and single gaps are shown in both sequences. The horizontal axis is used for gaps in the mouse sequence, which represent either insertions in human or deletions in mouse. The vertical axis is used for gaps in human. The log of the number of simultaneous gaps of a particular size range is converted into a level of gray to create a 2D histogram. The horizontal and vertical axes are not drawn in; the dark lines where the axis would be reflect the log frequencies of gaps that are in only one sequence. (a) Gaps of 50 bases. Gaps that are purely in mouse or purely in human are especially prominent here. (b) Gaps of 10 1,500 bases. The transposon-induced effects in Fig. 1 can also be seen here. Note also the relative concentration near the diagonal for inserts of 200 bases. This occurrence is mostly due to small inversions and locally divergent sequence. (c) Gaps of 1,000 50,000 bases. In this range, the log frequency of simultaneous gaps of a given combined (human and mouse) gap size differs roughly by a constant from the sum of the log frequencies of the individual one-sided gap sizes in each species. In this sense, these longer simultaneous gaps act as if they arise from independent gaps in each individual species. The chains also included a substantial number of simultaneous gaps in both species (Fig. 3). For smaller gap sizes, simultaneous gaps are quite rare, but the phenomenon becomes increasingly important with increasing gap size. The frequency of large simultaneous gaps is roughly consistent with a model in which they arise from two independent indel events, one in each species (Fig. 3c). The nested net structure of chains produced by the CHAIN- NET program was used to examine the rearrangements that have occurred since the divergence of human and mouse in the form of inversion (Fig. 4), deletion, transposition, retrotransposition, tandem duplication, and interspersed duplication. Because of various types of duplications, 52% of coding regions and 3.3% of the human genome as a whole are covered by more than one BLASTZ mouse alignment (3) even after excluding transposons. This can make it difficult to separate ortholog from paralog and gene from pseudogene. The net structure helps to disambiguate these. The net structure is not symmetric between species: the human net is constructed by allowing only the single best aligning DNA from the mouse genome to align to any single place in the human genome and the mouse net is constructed in the opposite manner. This one-sided best-in-genome requirement is not as strong as the requirement of reciprocal best matching, where only alignments that are best-in-genome for both species are allowed. The latter prohibits the investigation of lineage-specific duplications by cross-species alignments, because it allows at most one copy of the duplicated region to align to the other species, and often not even one copy can be fully aligned. With human and mouse, this results in a substantial reduction of the total amount of genomic DNA covered by cross-species alignments. We found that requiring reciprocal-best alignment reduces the coverage in the human net by 11%, and in the mouse net by 9%. These numbers are quite close, suggesting that a similar level of duplication has been occurring in both genomes since the common ancestor. In what follows, we will explore the human net further, constructed as described in Methods, without the reciprocal best requirement. On the basis of analysis of the human net, the frequency of various rearrangements on the draft mouse sequence as a whole and on the 48-million-base (2%) subset of the sequence that is finished is shown in Table 2. Overall rearrangement patterns were very similar in the finished subset of the mouse genome as in the genome as a whole. The same held for the finished subset of the human genome (data not shown). There were significantly more local duplications in the finished subset, likely reflecting the collapse of local duplications, a common artifact of the EVOLUTION Fig. 4. A 15,000-base inversion containing two transcripts and showing chr7: in the November 2002 assembly of the human genome. Numerous smaller rearrangements are also visible in the net track in this picture. In some cases, the smaller ones simply represent paralogous mouse regions filling in when the orthologous mouse region is not yet sequenced. Kent et al. PNAS September 30, 2003 vol. 100 no

5 Fig. 5. Distribution of the spans of all 147,445 chains in the human net. The distribution consists of a bimodal portion for short chains of span 10 5 and a flat tail for 579 long chains of size between 10 5 and whole-genome shotgun assembly techniques used in the mouse. In general, the most common type of rearrangement is a section of the genome being duplicated and inserted in a different chromosome (nonsyntenic duplication). Most of these appear to be processed pseudogenes. Nonsyntenic apparent transpositions were also surprisingly common. The distribution of the spans of the chains from the human net (Fig. 5) shows a roughly bimodal distribution for the chains that span 100,000 bases ( short chains ) and a long flat tail consisting of 579 chains that span between 100,000 and 115 million bases ( long chains, average length 983 kb). The long chains combined span 90.9% of the human genome (excluding gaps 100,000 bases in individual chains) and their aligned bases cover 32.9% of the bases in the human genome. In contrast, all chains together (including arbitrarily large gaps in individual chains) span 96.3% of the human genome and align to 34.6% of it. Thus the long chains alone (without their long gaps) span 94.4% of the bases spanned by all chains and include 95.1% of all aligned bases. Of the 579 long chains, 344 appear at the top level of the net and form large primary units of synteny with the mouse. The locations of these chains are largely in agreement with the regions of synteny with mouse identified in earlier studies (see Discussion). The remaining 235 long chains appear at lower levels of the net because they are embedded within the gaps of these primary synteny chains. Of these, 160 represent inversions, 29 represent duplications or translocations of nearby regions on the same chromosome, and 46 represent distant duplications or translocations at great distance on the same chromosome or between chromosomes. Some of these were also detected in earlier studies. In addition to the 344 long chains at the top level of the net, there are 19,800 short chains at the top level and many more at lower levels. The short chains at the top level often appear in long runs in regions where no significant synteny between the human and mouse genomes was previously found and constitute what appear to be hot spots for rearrangements or duplications. In these regions, the best alignment in the other species shifts rapidly from one chromosome to another. Many of these regions contain clusters of genes from families that have undergone recent lineage-specific expansions. They include many genes involved in the immune system, olfactory receptors (23), and Krüppel-associated box C2H2-type zinc fingers, which are known to have been highly duplicated and mobile in mammalian evolution (24). For a table of such regions, see Table 3, which is published as supporting information on the PNAS web site, Because the 344 large chains that are at the top level of the net are similar to the regions of synteny identified in earlier studies, the distribution of their spans agrees roughly with the simple random breakage model that has worked well in previous synteny (8). However, the presence of large numbers of small chains suggests that other processes may be at work not only within but also between the synteny blocks defined by the long chains. These intervening short chains cannot be explained as Table 2. Rearrangement statistics of the mouse genome relative to human Genomewide frequency (events per megabase) Finished frequency (events per megabase) Genome median size Finished median size Inversion Inversion local duplication Inversion local part duplication Local move Local duplication Local part duplication Syntenic move Syntenic duplication Syntenic part duplication Nonsyntenic move Nonsyntenic duplication Nonsyntenic part duplication Mouse 1 base gaps 1, , Mouse 10 base gaps Mouse gaps Double gaps H likely deletion The finished columns show rearrangements within the 96.3 megabases of mouse sequence that were finished in this assembly. The genome columns refer to the entire 2.47-gigabase mouse assembly. A move or inversion involves no duplicated sequence. Duplication means at least 80% of the aligning bases of the rearranged chain align to multiple places in the human genome. Part duplication means some sequence, but 80% is duplicated. The moves and duplications of 100,000 bases are considered local. Syntenic moves and duplications are on the same chromosome but 100,000 bases away and may be inverted as well. Nonsyntenic moves and duplications fill gaps in a chain with sequence from another chromosome. The single gaps 100 row shows gaps of 100 or more bases in mouse and 0 bases in human. The double gaps 100 row shows gaps of 100 or more in mouse and 0 bases in human. The h likely deletion 100 row counts gaps in the mouse genome that are not the result of human lineage-specific transposons or Ns, and that are at least 100 bases cgi doi pnas Kent et al.

6 additional blocks of synteny along with the longer blocks by using the parameters of a simple random breakage model. Finally, as described in Methods, we also derived a syntenic subset of the human net. The complete net covers 96.7% of bases in RefSeq ( ref. 25) coding regions, whereas the syntenic subset covers 93.0%. The complete net covers 71% of bases in pseudogenes on chromosome 22 annotated by the Sanger Centre (26), whereas the syntenic subset covers 13%. Thus, the syntenic subset should be useful in locating pseudogenes and in gene analysis applications where specificity is important, such as described in refs. 27 and 28. We examined in detail the coding regions covered by the complete net but not the syntenic net on chromosome 22. This includes part or all of 40 of the 450 RefSeq genes mapped to chromosome 22 in the November 2002 assembly at http: genome.ucsc.edu. Eighteen of these genes were in areas that seem to be hot spots for rearrangement and duplication, as discussed above. Twelve of the genes with parts present in the full net but not the syntenic net are caused by gaps in the mouse draft genome. In the complete net, paralogous regions are filled in for the missing orthologous regions. Eight genes appeared to be deleted or partially deleted in the mouse. Extrapolating this to the complete genome would imply that the mouse has deleted or partially deleted 2% of genes present in the common ancestor. Two of the RefSeq gene mappings missing from the syntenic net turned out, on closer examination, to be mappings to processed pseudogenes. As the human and mouse genomes become more fully finished, the syntenic subnet will be increasingly useful for identifying processed pseudogenes and studying gene evolution in the entire genome. Discussion The chaining and netting technique presented here is very useful in tracing the evolution of the genome. It has some major advantages over methods such as GRIMM (4) for studying genomic rearrangments. It can accommodate duplication and deletion as well as transposition and inversion, and it provides resolution down to the base level. However, the netting technique assumes that each rearrangement is independent, not overlapping other rearrangements. It can detect and correctly classify translocations only if they insert in the middle of a chromosome rather than at chromosome ends. GRIMM can deal with overlapping rearrangements and translocations at the ends of chromosomes. When we restrict our attention to blocks of 100,000 bases or more, we find 160 inversions, which is similar to the 149 inversions of 1 million bases or more reported with GRIMM. The level of transpositions 100,000 bases is lower than observed by GRIMM due to the limitations of our technique. The breakpoints we observe between large syntenic areas do correlate well with those observed with GRIMM and with windowing-based techniques such as reported in ref. 3. Comparisons of the chains from the human net and blocks of synteny found by other methods can be viewed as side-by-side tracks on the genome browser by means of the link at http: genome.ucsc.edu cauldron syntenies.html. It is clear that inversions and other rearrangements happen at all scales, and indeed that there are more inversions 1,000 bases that there are inversions 1,000 bases. Due to limits in our alignment techniques, we may not currently be observing many of the smallest rearrangements, particularly those 50 bases. The most common rearrangement we observe (excluding transposon insertion) is nonsyntenic duplication, most of which appears to be due to processed pseudogenes created by transposon machinery. However, there is a surprising amount of nonsyntenic nonduplicating transposition of small blocks. Some of these, as well as other events we record, may be artifacts due to the incompleteness of the mouse genome: second-best matches to the mouse genome are occasionally used in the net where the best match is missing because it falls in an unsequenced region in the mouse genome. However, this does not account for the majority of the events. We currently do not know of a mechanism that would generate movement of small pieces of the genome like this. Finally, this work lets us define orthologous mouse and human genes more clearly than we could in the past and has the potential to help eliminate false annotation of pseudogenes as true genes. We thank the International Human Genome Sequencing Consortium and the Mouse Genome Sequencing Consortium for providing the data. W.J.K., R.B., A.H., and D.H. were supported by National Human Genome Research Institute Grant 1P41HG D.H. was also supported by the Howard Hughes Medical Institute. W.M. is supported by National Human Genome Research Institute Grant HG Haldane, J. B. S. (1932) The Causes of Evolution (Longmans and Green, London). 2. Graur, D. & Li, W. H. (2000) Fundamentals of Molecular Evolution (Sinauer, Sunderland, MA). 3. The Mouse Sequencing Consortium (2002) Nature 420, Tesler, G. (2002) Bioinformatics 18, Yang, F., Alkalaeva, E. Z., Perelman, P. L., Pardini, A. T., Harrison, W. R., O Brien, P. C., Fu, B., Graphodatsky, A. S., Ferguson-Smith, M. A. & Robinson, T. J. (2003) Proc. Natl. Acad. Sci. USA 100, Housworth, E. A. & Postlethwait, J. (2002) Genetics 162, Kumar, S., Gadagkar, S. R., Filipski, A. & Gu, X. (2001) Genetics 157, Nadeau, J. H. & Taylor, B. A. (1984) Proc. Natl. Acad. Sci. USA 81, Nadeau, J. H. & Sankoff, D. (1998) Mamm. Genome 9, Pevzner, P. & Tesler, G. (2003) Genome Res. 13, Chiaromonte, F., Yap, V. B. & Miller, W. (2002) Pac. Symp. Biocomput Makalowski, W. & Boguski, M. S. (1998) Proc. Natl. Acad. Sci. USA 95, Qian, B. & Goldstein, R. A. (2001) Proteins 45, Ophir, R. & Graur, D. (1997) Gene 205, Gu, X. & Li, W. H. (1995) J. Mol. Evol. 40, Kent, W. J. & Zahler, A. M. (2000) Genome Res. 10, Bailey, J. A., Gu, Z., Clark, R. A., Reinert, K., Samonte, R. V., Schwartz, S., Adams, M. D., Myers, E. W., Li, P. W. & Eichler, E. E. (2002) Science 297, Schwartz, S., Kent, W. J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R. C., Haussler, D. & Miller, W. (2003) Genome Res. 13, Smit, A. F. (1999) Curr. Opin. Genet. Dev. 9, Benson, G. (1999) Nucleic Acids Res. 27, Zhang, Z., Raghavachari, B., Hardison, R. C. & Miller, W. (1994) J. Comput. Biol. 1, Kent, W. J., Sugnet, C. W., Furey, T. S., Roskin, K. M., Pringle, T. H., Zahler, A. M. & Haussler, D. (2002) Genome Res. 12, Young, J. M., Friedman, C., Williams, E. M., Ross, J. A., Tonnes-Priddy, L. & Trask, B. J. (2002) Hum. Mol. Genet. 11, Looman, C., Abrink, M., Mark, C. & Hellman, L. (2002) Mol. Biol. Evol. 19, Pruitt, K. D., Tatusova, T. & Maglott, D. R. (2003) Nucleic Acids Res. 31, Collins, J. E., Goward, M. E., Cole, C. G., Smink, L. J., Huckle, E. J., Knowles, S., Bye, J. M., Beare, D. M. & Dunham, I. (2003) Genome Res. 13, Couronne, O., Poliakov, A., Bray, N., Ishkhanov, T., Ryaboy, D., Rubin, E., Pachter, L. & Dubchak, I. (2003) Genome Res. 13, Alexandersson, M., Cawley, S. & Pachter, L. (2003) Genome Res. 13, EVOLUTION Kent et al. PNAS September 30, 2003 vol. 100 no

Handling Rearrangements in DNA Sequence Alignment

Handling Rearrangements in DNA Sequence Alignment Handling Rearrangements in DNA Sequence Alignment Maneesh Bhand 12/5/10 1 Introduction Sequence alignment is one of the core problems of bioinformatics, with a broad range of applications such as genome

More information

Multiple Alignment of Genomic Sequences

Multiple Alignment of Genomic Sequences Ross Metzger June 4, 2004 Biochemistry 218 Multiple Alignment of Genomic Sequences Genomic sequence is currently available from ENTREZ for more than 40 eukaryotic and 157 prokaryotic organisms. As part

More information

17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on:

17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on: 17 Non-collinear alignment This exposition is based on: 1. Darling, A.E., Mau, B., Perna, N.T. (2010) progressivemauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One 5(6):e11147.

More information

Whole Genome Alignments and Synteny Maps

Whole Genome Alignments and Synteny Maps Whole Genome Alignments and Synteny Maps IINTRODUCTION It was not until closely related organism genomes have been sequenced that people start to think about aligning genomes and chromosomes instead of

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Chromosomal rearrangements in mammalian genomes : characterising the breakpoints. Claire Lemaitre

Chromosomal rearrangements in mammalian genomes : characterising the breakpoints. Claire Lemaitre PhD defense Chromosomal rearrangements in mammalian genomes : characterising the breakpoints Claire Lemaitre Laboratoire de Biométrie et Biologie Évolutive Université Claude Bernard Lyon 1 6 novembre 2008

More information

BMI/CS 776 Lecture #20 Alignment of whole genomes. Colin Dewey (with slides adapted from those by Mark Craven)

BMI/CS 776 Lecture #20 Alignment of whole genomes. Colin Dewey (with slides adapted from those by Mark Craven) BMI/CS 776 Lecture #20 Alignment of whole genomes Colin Dewey (with slides adapted from those by Mark Craven) 2007.03.29 1 Multiple whole genome alignment Input set of whole genome sequences genomes diverged

More information

NIH Public Access Author Manuscript Pac Symp Biocomput. Author manuscript; available in PMC 2009 October 6.

NIH Public Access Author Manuscript Pac Symp Biocomput. Author manuscript; available in PMC 2009 October 6. NIH Public Access Author Manuscript Published in final edited form as: Pac Symp Biocomput. 2009 ; : 162 173. SIMULTANEOUS HISTORY RECONSTRUCTION FOR COMPLEX GENE CLUSTERS IN MULTIPLE SPECIES * Yu Zhang,

More information

Genome Rearrangements In Man and Mouse. Abhinav Tiwari Department of Bioengineering

Genome Rearrangements In Man and Mouse. Abhinav Tiwari Department of Bioengineering Genome Rearrangements In Man and Mouse Abhinav Tiwari Department of Bioengineering Genome Rearrangement Scrambling of the order of the genome during evolution Operations on chromosomes Reversal Translocation

More information

Evolution at the nucleotide level: the problem of multiple whole-genome alignment

Evolution at the nucleotide level: the problem of multiple whole-genome alignment Human Molecular Genetics, 2006, Vol. 15, Review Issue 1 doi:10.1093/hmg/ddl056 R51 R56 Evolution at the nucleotide level: the problem of multiple whole-genome alignment Colin N. Dewey 1, * and Lior Pachter

More information

Effective Cluster-Based Seed Design for Cross-Species Sequence Comparisons

Effective Cluster-Based Seed Design for Cross-Species Sequence Comparisons Effective Cluster-Based Seed Design for Cross-Species Sequence Comparisons Leming Zhou and Liliana Florea 1 Methods Supplementary Materials 1.1 Cluster-based seed design 1. Determine Homologous Genes.

More information

Pairwise & Multiple sequence alignments

Pairwise & Multiple sequence alignments Pairwise & Multiple sequence alignments Urmila Kulkarni-Kale Bioinformatics Centre 411 007 urmila@bioinfo.ernet.in Basis for Sequence comparison Theory of evolution: gene sequences have evolved/derived

More information

I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB

I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB I519 Introduction to Bioinformatics, 2015 Genome Comparison Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Whole genome comparison/alignment Build better phylogenies Identify polymorphism

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are:

Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are: Comparative genomics and proteomics Species available Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are: Vertebrates: human, chimpanzee, mouse, rat,

More information

Mathangi Thiagarajan Rice Genome Annotation Workshop May 23rd, 2007

Mathangi Thiagarajan Rice Genome Annotation Workshop May 23rd, 2007 -2 Transcript Alignment Assembly and Automated Gene Structure Improvements Using PASA-2 Mathangi Thiagarajan mathangi@jcvi.org Rice Genome Annotation Workshop May 23rd, 2007 About PASA PASA is an open

More information

A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS

A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS CRYSTAL L. KAHN and BENJAMIN J. RAPHAEL Box 1910, Brown University Department of Computer Science & Center for Computational Molecular Biology

More information

Comparative Genomics. Chapter for Human Genetics - Principles and Approaches - 4 th Edition

Comparative Genomics. Chapter for Human Genetics - Principles and Approaches - 4 th Edition Chapter for Human Genetics - Principles and Approaches - 4 th Edition Editors: Friedrich Vogel, Arno Motulsky, Stylianos Antonarakis, and Michael Speicher Comparative Genomics Ross C. Hardison Affiliations:

More information

Sequence analysis and Genomics

Sequence analysis and Genomics Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute

More information

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Bio 1B Lecture Outline (please print and bring along) Fall, 2007 Bio 1B Lecture Outline (please print and bring along) Fall, 2007 B.D. Mishler, Dept. of Integrative Biology 2-6810, bmishler@berkeley.edu Evolution lecture #5 -- Molecular genetics and molecular evolution

More information

Reconstructing contiguous regions of an ancestral genome

Reconstructing contiguous regions of an ancestral genome Reconstructing contiguous regions of an ancestral genome Jian Ma, Louxin Zhang, Bernard B. Suh, Brian J. Raney, Richard C. Burhans, W. James Kent, Mathieu Blanchette, David Haussler and Webb Miller Genome

More information

Effects of Gap Open and Gap Extension Penalties

Effects of Gap Open and Gap Extension Penalties Brigham Young University BYU ScholarsArchive All Faculty Publications 200-10-01 Effects of Gap Open and Gap Extension Penalties Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu See

More information

RGP finder: prediction of Genomic Islands

RGP finder: prediction of Genomic Islands Training courses on MicroScope platform RGP finder: prediction of Genomic Islands Dynamics of bacterial genomes Gene gain Horizontal gene transfer Gene loss Deletion of one or several genes Duplication

More information

Small RNA in rice genome

Small RNA in rice genome Vol. 45 No. 5 SCIENCE IN CHINA (Series C) October 2002 Small RNA in rice genome WANG Kai ( 1, ZHU Xiaopeng ( 2, ZHONG Lan ( 1,3 & CHEN Runsheng ( 1,2 1. Beijing Genomics Institute/Center of Genomics and

More information

Computational Genetics Winter 2013 Lecture 10. Eleazar Eskin University of California, Los Angeles

Computational Genetics Winter 2013 Lecture 10. Eleazar Eskin University of California, Los Angeles Computational Genetics Winter 2013 Lecture 10 Eleazar Eskin University of California, Los ngeles Pair End Sequencing Lecture 10. February 20th, 2013 (Slides from Ben Raphael) Chromosome Painting: Normal

More information

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT 3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode

More information

Single alignment: Substitution Matrix. 16 march 2017

Single alignment: Substitution Matrix. 16 march 2017 Single alignment: Substitution Matrix 16 march 2017 BLOSUM Matrix BLOSUM Matrix [2] (Blocks Amino Acid Substitution Matrices ) It is based on the amino acids substitutions observed in ~2000 conserved block

More information

Practical considerations of working with sequencing data

Practical considerations of working with sequencing data Practical considerations of working with sequencing data File Types Fastq ->aligner -> reference(genome) coordinates Coordinate files SAM/BAM most complete, contains all of the info in fastq and more!

More information

Chromosomal Breakpoint Reuse in Genome Sequence Rearrangement ABSTRACT

Chromosomal Breakpoint Reuse in Genome Sequence Rearrangement ABSTRACT JOURNAL OF COMPUTATIONAL BIOLOGY Volume 12, Number 6, 2005 Mary Ann Liebert, Inc. Pp. 812 821 Chromosomal Breakpoint Reuse in Genome Sequence Rearrangement DAVID SANKOFF 1 and PHIL TRINH 2 ABSTRACT In

More information

Comparative genomics: Overview & Tools + MUMmer algorithm

Comparative genomics: Overview & Tools + MUMmer algorithm Comparative genomics: Overview & Tools + MUMmer algorithm Urmila Kulkarni-Kale Bioinformatics Centre University of Pune, Pune 411 007. urmila@bioinfo.ernet.in Genome sequence: Fact file 1995: The first

More information

Comparing whole genomes

Comparing whole genomes BioNumerics Tutorial: Comparing whole genomes 1 Aim The Chromosome Comparison window in BioNumerics has been designed for large-scale comparison of sequences of unlimited length. In this tutorial you will

More information

Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON M5R 3G4 Canada

Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON M5R 3G4 Canada Multiple Whole Genome Alignments Without a Reference Organism Inna Dubchak 1,2, Alexander Poliakov 1, Andrey Kislyuk 3, Michael Brudno 4* 1 Genome Sciences Division, Lawrence Berkeley National Laboratory,

More information

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

More information

C3020 Molecular Evolution. Exercises #3: Phylogenetics

C3020 Molecular Evolution. Exercises #3: Phylogenetics C3020 Molecular Evolution Exercises #3: Phylogenetics Consider the following sequences for five taxa 1-5 and the known outgroup O, which has the ancestral states (note that sequence 3 has changed from

More information

Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory

Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory Title Automated whole-genome multiple alignment of rat, mouse, and human Permalink https://escholarship.org/uc/item/1z58c37n

More information

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018 CONCEPT OF SEQUENCE COMPARISON Natapol Pornputtapong 18 January 2018 SEQUENCE ANALYSIS - A ROSETTA STONE OF LIFE Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of

More information

Drosophila melanogaster and D. simulans, two fruit fly species that are nearly

Drosophila melanogaster and D. simulans, two fruit fly species that are nearly Comparative Genomics: Human versus chimpanzee 1. Introduction The chimpanzee is the closest living relative to humans. The two species are nearly identical in DNA sequence (>98% identity), yet vastly different

More information

Synteny Portal Documentation

Synteny Portal Documentation Synteny Portal Documentation Synteny Portal is a web application portal for visualizing, browsing, searching and building synteny blocks. Synteny Portal provides four main web applications: SynCircos,

More information

Computational approaches for functional genomics

Computational approaches for functional genomics Computational approaches for functional genomics Kalin Vetsigian October 31, 2001 The rapidly increasing number of completely sequenced genomes have stimulated the development of new methods for finding

More information

Comparative Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Comparative Gene Finding. BMI/CS 776  Spring 2015 Colin Dewey Comparative Gene Finding BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2015 Colin Dewey cdewey@biostat.wisc.edu Goals for Lecture the key concepts to understand are the following: using related genomes

More information

The Fragile Breakage versus Random Breakage Models of Chromosome Evolution

The Fragile Breakage versus Random Breakage Models of Chromosome Evolution The Fragile Breakage versus Random Breakage Models of Chromosome Evolution Qian Peng 1, Pavel A. Pevzner 1, Glenn Tesler 2* 1 Department of Computer Science and Engineering, University of California San

More information

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Biochemistry 324 Bioinformatics. Pairwise sequence alignment Biochemistry 324 Bioinformatics Pairwise sequence alignment How do we compare genes/proteins? When we have sequenced a genome, we try and identify the function of unknown genes by finding a similar gene

More information

Phylogenomic Resources at the UCSC Genome Browser

Phylogenomic Resources at the UCSC Genome Browser 9 Phylogenomic Resources at the UCSC Genome Browser Kate Rosenbloom, James Taylor, Stephen Schaeffer, Jim Kent, David Haussler, and Webb Miller Summary The UC Santa Cruz Genome Browser provides a number

More information

Genomes and Their Evolution

Genomes and Their Evolution Chapter 21 Genomes and Their Evolution PowerPoint Lecture Presentations for Biology Eighth Edition Neil Campbell and Jane Reece Lectures by Chris Romero, updated by Erin Barley with contributions from

More information

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships Chapter 26: Phylogeny and the Tree of Life You Must Know The taxonomic categories and how they indicate relatedness. How systematics is used to develop phylogenetic trees. How to construct a phylogenetic

More information

In-Depth Assessment of Local Sequence Alignment

In-Depth Assessment of Local Sequence Alignment 2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.

More information

Reading for Lecture 13 Release v10

Reading for Lecture 13 Release v10 Reading for Lecture 13 Release v10 Christopher Lee November 15, 2011 Contents 1 Evolutionary Trees i 1.1 Evolution as a Markov Process...................................... ii 1.2 Rooted vs. Unrooted Trees........................................

More information

Comparative genomics. Lucy Skrabanek ICB, WMC 6 May 2008

Comparative genomics. Lucy Skrabanek ICB, WMC 6 May 2008 Comparative genomics Lucy Skrabanek ICB, WMC 6 May 2008 What does it encompass? Genome conservation transfer knowledge gained from model organisms to non-model organisms Genome evolution understand how

More information

Is inversion symmetry of chromosomes a law of nature?

Is inversion symmetry of chromosomes a law of nature? Is inversion symmetry of chromosomes a law of nature? David Horn TAU Safra bioinformatics retreat, 28/6/2018 Lecture based on Inversion symmetry of DNA k-mer counts: validity and deviations. Shporer S,

More information

Reducing storage requirements for biological sequence comparison

Reducing storage requirements for biological sequence comparison Bioinformatics Advance Access published July 15, 2004 Bioinfor matics Oxford University Press 2004; all rights reserved. Reducing storage requirements for biological sequence comparison Michael Roberts,

More information

Sequence Alignment (chapter 6)

Sequence Alignment (chapter 6) Sequence lignment (chapter 6) he biological problem lobal alignment Local alignment Multiple alignment Introduction to bioinformatics, utumn 6 Background: comparative genomics Basic question in biology:

More information

Basic Local Alignment Search Tool

Basic Local Alignment Search Tool Basic Local Alignment Search Tool Alignments used to uncover homologies between sequences combined with phylogenetic studies o can determine orthologous and paralogous relationships Local Alignment uses

More information

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

Whole Genome Alignment. Adam Phillippy University of Maryland, Fall 2012

Whole Genome Alignment. Adam Phillippy University of Maryland, Fall 2012 Whole Genome Alignment Adam Phillippy University of Maryland, Fall 2012 Motivation cancergenome.nih.gov Breast cancer karyotypes www.path.cam.ac.uk Goal of whole-genome alignment } For two genomes, A and

More information

Contact 1 University of California, Davis, 2 Lawrence Berkeley National Laboratory, 3 Stanford University * Corresponding authors

Contact 1 University of California, Davis, 2 Lawrence Berkeley National Laboratory, 3 Stanford University * Corresponding authors Phylo-VISTA: Interactive Visualization of Multiple DNA Sequence Alignments Nameeta Shah 1,*, Olivier Couronne 2,*, Len A. Pennacchio 2, Michael Brudno 3, Serafim Batzoglou 3, E. Wes Bethel 2, Edward M.

More information

O 3 O 4 O 5. q 3. q 4. Transition

O 3 O 4 O 5. q 3. q 4. Transition Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in

More information

A Browser for Pig Genome Data

A Browser for Pig Genome Data A Browser for Pig Genome Data Thomas Mailund January 2, 2004 This report briefly describe the blast and alignment data available at http://www.daimi.au.dk/ mailund/pig-genome/ hits.html. The report describes

More information

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9 Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

More information

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

More information

SEPA: Approximate Non-Subjective Empirical p-value Estimation for Nucleotide Sequence Alignment

SEPA: Approximate Non-Subjective Empirical p-value Estimation for Nucleotide Sequence Alignment SEPA: Approximate Non-Subjective Empirical p-value Estimation for Nucleotide Sequence Alignment Ofer Gill and Bud Mishra Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street,

More information

BLAT The BLAST-Like Alignment Tool

BLAT The BLAST-Like Alignment Tool Resource BLAT The BLAST-Like Alignment Tool W. James Kent Department of Biology and Center for Molecular Biology of RNA, University of California, Santa Cruz, Santa Cruz, California 95064, USA Analyzing

More information

Quantifying sequence similarity

Quantifying sequence similarity Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity

More information

I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB

I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB I519 Introduction to Bioinformatics, 2011 Genome Comparison Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Whole genome comparison/alignment Build better phylogenies Identify polymorphism

More information

A Methodological Framework for the Reconstruction of Contiguous Regions of Ancestral Genomes and Its Application to Mammalian Genomes

A Methodological Framework for the Reconstruction of Contiguous Regions of Ancestral Genomes and Its Application to Mammalian Genomes A Methodological Framework for the Reconstruction of Contiguous Regions of Ancestral Genomes and Its Application to Mammalian Genomes Cedric Chauve 1, Eric Tannier 2,3,4,5 * 1 Department of Mathematics,

More information

8/23/2014. Phylogeny and the Tree of Life

8/23/2014. Phylogeny and the Tree of Life Phylogeny and the Tree of Life Chapter 26 Objectives Explain the following characteristics of the Linnaean system of classification: a. binomial nomenclature b. hierarchical classification List the major

More information

Multiple Genome Alignment by Clustering Pairwise Matches

Multiple Genome Alignment by Clustering Pairwise Matches Multiple Genome Alignment by Clustering Pairwise Matches Jeong-Hyeon Choi 1,3, Kwangmin Choi 1, Hwan-Gue Cho 3, and Sun Kim 1,2 1 School of Informatics, Indiana University, IN 47408, USA, {jeochoi,kwchoi,sunkim}@bio.informatics.indiana.edu

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

Introduction to Sequence Alignment. Manpreet S. Katari

Introduction to Sequence Alignment. Manpreet S. Katari Introduction to Sequence Alignment Manpreet S. Katari 1 Outline 1. Global vs. local approaches to aligning sequences 1. Dot Plots 2. BLAST 1. Dynamic Programming 3. Hash Tables 1. BLAT 4. BWT (Burrow Wheeler

More information

Linear-Space Alignment

Linear-Space Alignment Linear-Space Alignment Subsequences and Substrings Definition A string x is a substring of a string x, if x = ux v for some prefix string u and suffix string v (similarly, x = x i x j, for some 1 i j x

More information

INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA

INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA XIUFENG WAN xw6@cs.msstate.edu Department of Computer Science Box 9637 JOHN A. BOYLE jab@ra.msstate.edu Department of Biochemistry and Molecular Biology

More information

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Homology Modeling. Roberto Lins EPFL - summer semester 2005 Homology Modeling Roberto Lins EPFL - summer semester 2005 Disclaimer: course material is mainly taken from: P.E. Bourne & H Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton,

More information

Analysis of Gene Order Evolution beyond Single-Copy Genes

Analysis of Gene Order Evolution beyond Single-Copy Genes Analysis of Gene Order Evolution beyond Single-Copy Genes Nadia El-Mabrouk Département d Informatique et de Recherche Opérationnelle Université de Montréal mabrouk@iro.umontreal.ca David Sankoff Department

More information

Comparative Genomics II

Comparative Genomics II Comparative Genomics II Advances in Bioinformatics and Genomics GEN 240B Jason Stajich May 19 Comparative Genomics II Slide 1/31 Outline Introduction Gene Families Pairwise Methods Phylogenetic Methods

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S1 (box). Supplementary Methods description. Prokaryotic Genome Database Archaeal and bacterial genome sequences were downloaded from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)

More information

Haploid & diploid recombination and their evolutionary impact

Haploid & diploid recombination and their evolutionary impact Haploid & diploid recombination and their evolutionary impact W. Garrett Mitchener College of Charleston Mathematics Department MitchenerG@cofc.edu http://mitchenerg.people.cofc.edu Introduction The basis

More information

Genômica comparativa. João Carlos Setubal IQ-USP outubro /5/2012 J. C. Setubal

Genômica comparativa. João Carlos Setubal IQ-USP outubro /5/2012 J. C. Setubal Genômica comparativa João Carlos Setubal IQ-USP outubro 2012 11/5/2012 J. C. Setubal 1 Comparative genomics There are currently (out/2012) 2,230 completed sequenced microbial genomes publicly available

More information

Outline. Genome Evolution. Genome. Genome Architecture. Constraints on Genome Evolution. New Evolutionary Synthesis 11/1/18

Outline. Genome Evolution. Genome. Genome Architecture. Constraints on Genome Evolution. New Evolutionary Synthesis 11/1/18 Genome Evolution Outline 1. What: Patterns of Genome Evolution Carol Eunmi Lee Evolution 410 University of Wisconsin 2. Why? Evolution of Genome Complexity and the interaction between Natural Selection

More information

Sequence Database Search Techniques I: Blast and PatternHunter tools

Sequence Database Search Techniques I: Blast and PatternHunter tools Sequence Database Search Techniques I: Blast and PatternHunter tools Zhang Louxin National University of Singapore Outline. Database search 2. BLAST (and filtration technique) 3. PatternHunter (empowered

More information

Supplementary Information

Supplementary Information Supplementary Information Supplementary Figure 1. Schematic pipeline for single-cell genome assembly, cleaning and annotation. a. The assembly process was optimized to account for multiple cells putatively

More information

Motivating the need for optimal sequence alignments...

Motivating the need for optimal sequence alignments... 1 Motivating the need for optimal sequence alignments... 2 3 Note that this actually combines two objectives of optimal sequence alignments: (i) use the score of the alignment o infer homology; (ii) use

More information

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison 10-810: Advanced Algorithms and Models for Computational Biology microrna and Whole Genome Comparison Central Dogma: 90s Transcription factors DNA transcription mrna translation Proteins Central Dogma:

More information

Outline. Genome Evolution. Genome. Genome Architecture. Constraints on Genome Evolution. New Evolutionary Synthesis 11/8/16

Outline. Genome Evolution. Genome. Genome Architecture. Constraints on Genome Evolution. New Evolutionary Synthesis 11/8/16 Genome Evolution Outline 1. What: Patterns of Genome Evolution Carol Eunmi Lee Evolution 410 University of Wisconsin 2. Why? Evolution of Genome Complexity and the interaction between Natural Selection

More information

Bioinformatics and BLAST

Bioinformatics and BLAST Bioinformatics and BLAST Overview Recap of last time Similarity discussion Algorithms: Needleman-Wunsch Smith-Waterman BLAST Implementation issues and current research Recap from Last Time Genome consists

More information

Ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution

Ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/5923057 Ancestral reconstruction of segmental duplications reveals punctuated cores of human

More information

Supplementary text and figures: Comparative assessment of methods for aligning multiple genome sequences

Supplementary text and figures: Comparative assessment of methods for aligning multiple genome sequences Supplementary text and figures: Comparative assessment of methods for aligning multiple genome sequences Xiaoyu Chen Martin Tompa Department of Computer Science and Engineering Department of Genome Sciences

More information

Phylogenetic relationship among S. castellii, S. cerevisiae and C. glabrata.

Phylogenetic relationship among S. castellii, S. cerevisiae and C. glabrata. Supplementary Note S2 Phylogenetic relationship among S. castellii, S. cerevisiae and C. glabrata. Phylogenetic trees reconstructed by a variety of methods from either single-copy orthologous loci (Class

More information

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6)

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6) Sequence lignment (chapter ) he biological problem lobal alignment Local alignment Multiple alignment Background: comparative genomics Basic question in biology: what properties are shared among organisms?

More information

Comparative Genomics. Primer. Ross C. Hardison

Comparative Genomics. Primer. Ross C. Hardison Primer Comparative Genomics Ross C. Hardison A complete genome sequence of an organism can be considered to be the ultimate genetic map, in the sense that the heritable characteristics are encoded within

More information

Supplementary Information for: The genome of the extremophile crucifer Thellungiella parvula

Supplementary Information for: The genome of the extremophile crucifer Thellungiella parvula Supplementary Information for: The genome of the extremophile crucifer Thellungiella parvula Maheshi Dassanayake 1,9, Dong-Ha Oh 1,9, Jeffrey S. Haas 1,2, Alvaro Hernandez 3, Hyewon Hong 1,4, Shahjahan

More information

SWEEPFINDER2: Increased sensitivity, robustness, and flexibility

SWEEPFINDER2: Increased sensitivity, robustness, and flexibility SWEEPFINDER2: Increased sensitivity, robustness, and flexibility Michael DeGiorgio 1,*, Christian D. Huber 2, Melissa J. Hubisz 3, Ines Hellmann 4, and Rasmus Nielsen 5 1 Department of Biology, Pennsylvania

More information

TE content correlates positively with genome size

TE content correlates positively with genome size TE content correlates positively with genome size Mb 3000 Genomic DNA 2500 2000 1500 1000 TE DNA Protein-coding DNA 500 0 Feschotte & Pritham 2006 Transposable elements. Variation in gene numbers cannot

More information

GEP Annotation Report

GEP Annotation Report GEP Annotation Report Note: For each gene described in this annotation report, you should also prepare the corresponding GFF, transcript and peptide sequence files as part of your submission. Student name:

More information

The combinatorics and algorithmics of genomic rearrangements have been the subject of much

The combinatorics and algorithmics of genomic rearrangements have been the subject of much JOURNAL OF COMPUTATIONAL BIOLOGY Volume 22, Number 5, 2015 # Mary Ann Liebert, Inc. Pp. 425 435 DOI: 10.1089/cmb.2014.0096 An Exact Algorithm to Compute the Double-Cutand-Join Distance for Genomes with

More information

MiGA: The Microbial Genome Atlas

MiGA: The Microbial Genome Atlas December 12 th 2017 MiGA: The Microbial Genome Atlas Jim Cole Center for Microbial Ecology Dept. of Plant, Soil & Microbial Sciences Michigan State University East Lansing, Michigan U.S.A. Where I m From

More information

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs15.html Describing & Modeling Patterns

More information

BIOINFORMATICS. PILER: identification and classification of genomic repeats. Robert C. Edgar 1* and Eugene W. Myers 2 1 INTRODUCTION

BIOINFORMATICS. PILER: identification and classification of genomic repeats. Robert C. Edgar 1* and Eugene W. Myers 2 1 INTRODUCTION BIOINFORMATICS Vol. 1 no. 1 2003 Pages 1 1 PILER: identification and classification of genomic repeats Robert C. Edgar 1* and Eugene W. Myers 2 1 195 Roque Moraes Drive, Mill Valley, CA, U.S.A., bob@drive5.com.

More information

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Introduction to Bioinformatics online course : IBT Jonathan Kayondo Learning Objectives Understand

More information

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm Alignment scoring schemes and theory: substitution matrices and gap models 1 Local sequence alignments Local sequence alignments are necessary

More information

Computational Identification of Evolutionarily Conserved Exons

Computational Identification of Evolutionarily Conserved Exons Computational Identification of Evolutionarily Conserved Exons Adam Siepel Center for Biomolecular Science and Engr. University of California Santa Cruz, CA 95064, USA acs@soe.ucsc.edu David Haussler Howard

More information