Whole-Genome Alignments and Polytopes for Comparative Genomics

Size: px

Start display at page:

Download "Whole-Genome Alignments and Polytopes for Comparative Genomics"

Brian Black
6 years ago
Views:

Whole-Genome Alignments and Polytopes for Comparative Genomics Colin Noel Dewey Electrical Engineering and Computer Sciences University of

1 Whole-Genome Alignments and Polytopes for Comparative Genomics Colin Noel Dewey Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS August 3, 2006

2 Copyright 2006, by the author(s). All rights reserved. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

3 Whole-Genome Alignments and Polytopes for Comparative Genomics by Colin Noel Dewey B.S. (University of California, Berkeley) 2001 A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Engineering - Electrical Engineering and Computer Sciences and the Designated Emphasis in Computational and Genomic Biology in the GRADUATE DIVISION of the UNIVERSITY OF CALIFORNIA, BERKELEY Committee in charge: Professor Lior Pachter, Chair Professor Richard Karp Professor Michael Eisen Fall 2006

4 The dissertation of Colin Noel Dewey is approved: Chair Date Date Date University of California, Berkeley Fall 2006

6 1 Abstract Whole-Genome Alignments and Polytopes for Comparative Genomics by Colin Noel Dewey Doctor of Philosophy in Engineering - Electrical Engineering and Computer Sciences Designated Emphasis in Computational and Genomic Biology University of California, Berkeley Professor Lior Pachter, Chair Whole-genome sequencing of many species has presented us with the opportunity to deduce the evolutionary relationships between each and every nucleotide. The problem of determining all such relationships is that of multiple whole-genome alignment. Most previous work on whole-genome alignment has focused on the pairwise case and on the string pattern-matching aspect of the problem. However, to completely describe and determine the evolution of nucleotides in multiple genomes, refined definitions as well as algorithms that go beyond pattern matching are required. This thesis addresses these issues by introducing new evolutionary terms and describing novel methods for alignment at both the whole-genome and nucleotide levels. Precise definitions for the evolutionary relationships between nucleotides, presented at the beginning of this work, provide the framework within which our methods for genome alignment are described. The sensitivity of alignments to parameter values can be ascertained through the use of alignment polytopes, which are explained. For the problem of aligning multiple whole genomes, this work presents a method that constructs orthology maps, which are high-level mappings between genomes that can be used to guide nucleotide-level alignments. Combining our methods for orthology mapping and alignment polytope determination, we construct a parametric alignment of two whole fruit fly genomes, which describes the alignment of the two genomes for all possible parameter values. The usefulness of whole-genome and parametric alignments in comparative genomics is shown through studies of cis-regulatory element evolution and phylogenetic tree reconstruction.

7 Professor Lior Pachter Dissertation Committee Chair 2

8 To my family i

9 ii Contents List of Figures List of Tables v vi 1 Introduction Comparative genomics Nucleotide homology Refinements of nucleotide homology Primary refinements Secondary refinements Whole-genome alignment strategies Local alignment Hierarchical alignment Comparison of alignment strategies From alignments to biological discovery Nucleotide-level alignment: models and polytopes Parametric alignment Motivation Alignment summaries Alignment models Alignment polytopes Robustness cones Parametric alignment algorithms Statistical alignment Multiple whole-genome alignment: Mercator Motivation Related Work Pairwise maps Multiple maps Definitions Algorithms Overview

10 iii Segment Identification Comparative Scaffolding Breakpoint Identification Results Agreement with single pairwise alignments Inconsistency of multiple pairwise alignments Consistency of Mercator maps Comparative scaffolding of human, mouse, and rat Discussion Methods Pairwise alignments Consistency analysis Comparative scaffolding Parametric alignment of Drosophila Whole-genome parametric alignment Orthology mapping Polytope computation Computational results Biological results Assessment of the BLASTZ alignment Conservation of cis-regulatory elements The Jukes Cantor distance function Discussion Methods Parametric phylogeny Phylogeny in CLUSTAL W Pairwise distances from alignment polytopes Assigning posterior probabilities to polytope vertices Application to Drosophila introns Discussion Bibliography 70 A Aligning Multiple Whole Genomes with Mercator and MAVID 81 A.1 Materials A.2 Methods A.2.1 Obtaining Genome Sequences A.2.2 Preparing the Genome Sequences A.2.3 Obtaining Gene Annotations A.2.4 Generating Input for Mercator A.2.5 Constructing an Orthology Map with Mercator A.2.6 Comparatively Scaffolding Draft Genomes A.2.7 Refining the Map via Breakpoint Finding

11 A.2.8 Generating Input for MAVID A.2.9 Aligning Orthologous Segments with MAVID A.2.10 Extracting Subalignments A.3 Notes A.3.1 Obtaining Genome Sequences A.3.2 Preparing the Genome Sequences A.3.3 Obtaining Gene Annotations A.3.4 Generating Input for Mercator A.3.5 Constructing an Orthology Map with Mercator A.3.6 Comparatively Scaffolding Draft Genomes A.3.7 Refining the Map via Breakpoint Finding A.3.8 Generating Input for MAVID A.3.9 Aligning Orthologous Segments with MAVID A.3.10 Extracting Subalignments A.4 Concluding Remarks iv

12 v List of Figures 1.1 Orthologs, topoorthologs, and monotopoorthologs An example evolutionary scenario for nucleotide sequences The refinements of homology The local and hierarchical strategies for multiple whole-genome alignment The alignment polygon for two Drosophila introns The 3d alignment polytope for two Drosophila introns The state transition diagram for the PHMM of Durbin et al The state transition diagram for a PHMM with exact correspondence to the Needleman Wunsch algorithm The state transition diagram in Figure 2.4 with the silent state (S) removed A schematic of the Mercator method Identification of monotopoorthologous segments by Mercator Comparative scaffolding of genomes with Mercator Breakpoint identification with Mercator Comparison of nucleotide-level alignments with a map Leveraging pairwise homology maps to build a multiple homology map The Jukes Cantor distance function of two Drosophila genomes The three possible unrooted trees for the four Drosophila species The Minkowski sum of six pairwise alignments of a Drosophila intron

13 vi List of Tables 1.1 A comparison of two whole-genome alignment strategies The 8,362 optimal alignments for two Drosophila introns Three substitution matrices for parametric alignment Face numbers of the alignment polytopes for two Drosophila introns Agreement of pairwise alignments with Mercator pairwise maps Agreement of pairwise alignments with Mercator multiple map Results of the comparative scaffolding of human, mouse, and rat Observed running times for the computation of alignment polytopes Numbers of vertices and facets for Drosophila alignment polytopes Parametric analysis of cis-regulatory element conservation The ten most probable sets of optimal pairwise alignments A.1 Web sites of programs for the alignment of multiple whole genomes A.2 Web sites providing whole genome sequences A.3 File formats used by Mercator and MAVID

14 vii Acknowledgments I give my utmost thanks to my adviser, Lior Pachter, who has challenged me to achieve great things and supported me at every step of the way. As I can t imagine having chosen a better adviser, I thank my good friend Philip Sternberg for originally suggesting that I take a course from Lior in my first year. Parts of this work would never have been possible without my fantastic coauthors: Peter Huggins, Kevin Woods, Bernd Sturmfels, and Lior Pachter. In addition to his contributions as a collaborator, I thank Bernd Sturmfels for his great teachings and support of my upcoming career. For the great fun that kept me going through my graduate school years, I thank all of my Bay Area friends. In particular, I thank my El Cerrito housemates: Rahul Thakar, for his snarky and thoughtful remarks, Hayley Lam, for her delicious meals and uplifting laughter, Ryan Tang, for the amazing soup, and Arun Chawan, for his company during the World Cup games. For helping me to take a break every once and a while during my last year, I am grateful to my friends Jason Pittenger and Caryn Shield. Last, but not least, I thank my family. Without the constant support of my mother, Kay, my father, Jim, and my brother, Jamie, I could never have made it this far.

15 1 Chapter 1 Introduction With the genome sequences of numerous species at hand, we have the opportunity to discover how evolution has acted at each and every nucleotide in our genome. To this end, we must identify sets of nucleotides that have descended from a common ancestral nucleotide. The problem of identifying evolutionary related nucleotides is that of sequence alignment, which is central to the field of comparative genomics. When the sequences under consideration are entire genomes, we have the problem of multiple whole-genome alignment. In this introduction, a series of definitions for homology and its subrelations between single nucleotides will be stated. Within the framework of these definitions, we will describe how alignments specify such relations and review the current methods available for the alignment of multiple large genomes. We then describe a subset of tools that make biological inferences from multiple whole genome alignments. The majority of the material in this chapter comes from the review article [36]. 1.1 Comparative genomics Comparative genomics [76, 54] is the use of molecular evolution as a tool in the investigation of biological processes. Nucleotide sequences common to the genomes of several diverged species are indicative of shared biology, while differences in genomic sequence and structure may shed light on what makes species distinct. The identification of genomic elements that have been conserved over time allows biologists to focus their experiments on those parts of the genome that are fundamental to much of life. Thus, methods for the comparison of genomes and prediction of elements constrained by evolution have been

16 2 actively researched as of late. Often implicit in the discussion of conserved or common sequences is the concept of homology. Homology, famously defined by Richard Owen as the same organ in different animals under every variety of form and function, is accepted by most as common ancestry [51]. This important concept relies on the identification of evolutionary characters, distinct entities between which we may assign ancestral relationships. First used in reference to morphological characters, such as eye color or petal number, homology has since been used in reference to characters of all levels, from the molecular to the behavioral. Our recently acquired ability to identify single nucleotides in the genomes of different species allows us to specify homology at the smallest scale. The definition and identification of homology at this scale is the focus of our review. The prediction of homology between nucleotides relies on the fact that genomic positions derived from a common ancestral position are more likely to have the same state: one of A, C, G, or T. With only four states and, often, billions of genomic positions, we cannot simply use the coincidence of bases at two positions as a basis for assigning homology. Therefore, we must take advantage of context. Positions adjacent in an ancestral sequence are likely to be adjacent in the extant sequences. Predicting homology between genomic positions is thus the problem of identifying colinear segments having statistically significant numbers of similar states. Because we are faced with analyzing multiple large genomes, this task requires expertise from the fields of computer science, statistics, and mathematics. In these fields, the task of identifying related positions in sequences is the problem of alignment. Although alignment is based on the identification of similar sequences, similarity is not equivalent to homology. Similar, but unrelated sequences may arise simply by chance, or through convergent evolution. On the other hand, sequences may be homologous but not share a single similar character. In general, alignments may be used to specify relationships other than common ancestry, such as structural or functional similarities. Although identifying other classes of similarities between sequences is important, such similarities are best understood in the light of evolution. Therefore, we focus on the problem of evolutionary alignment, which aims to identify only homologous relationships between nucleotide positions.

17 3 1.2 Nucleotide homology When Watson and Crick noted that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material, [111] they alluded to the most fundamental level of ancestry. Although homology is used at many levels of biology, it is most directly defined with respect to nucleotide sequences. It is not clear from the literature that people have agreed on a precise definition of nucleotide homology. Given that the molecular mechanisms of nucleic acid replication are well known, it is important from an evolutionary theory standpoint that such definitions are established. Moreover, if we are to design and compare methods that predict homology between nucleotides, we must have concrete definitions of the problem at hand. These definitions, however, must be based on biology and not on what is possibly identified by our methods. Adhering to this ideology, we propose definitions for nucleotide homology. At the nucleic acid level, an evolutionary character is a position in single or doublestranded DNA or RNA. The copying mechanism for nucleic acids is a single-stranded phenomenon, and therefore we begin our definitions with the single-stranded case. For a singlestranded nucleic acid, a character x has two properties: its position, traditionally counted from the 5 end of the polymer, and its state, which is one of A, C, G, T, or U. A singlestranded character x is a copy of a character y if x was initially base-paired with y at the time when x was added to its polymer. In such cases, the process by which x is added to its polymer is called template-dependent synthesis [13] and y is called the template for x. Positions added to a polymer without a template (e.g., adenines added during poly(a) extension of mrnas) have no such relationships. In the double-stranded case, a character x comprises two single-stranded characters, x + and x, which are base-paired. Like a single-stranded character, double-stranded characters have a position (usually given as the position of x + ), and a state. The state of a double-stranded character depends on a third property, its orientation, which we indicate by one of + or. If x has an orientation of +, then its state is that of x + (the character on the forward strand), otherwise it is that of x (the character on the reverse strand). One of x + or x is usually a copy of the other, with exceptions occurring due to mechanisms such as replication slippage [13]. Given a double-stranded character x and a single-stranded character y, x is a copy of y, if one of x + or x is a copy of y. Conversely, y is copy of x if y is a copy of x + or x. If both x and y are double-stranded, then x is a copy of y if one

18 4 of x + or x is a copy of y + or y. We now address mutation, the second major mechanism in molecular evolution. Because characters are positions, point mutations of single-stranded characters do not change their copy relationships. For example, if x is a copy of y and a point mutation changes the state of x from A to G, then x is still a copy of y. However, in double-stranded DNA, repair mechanisms may use the template of an opposite strand or a homologous region to replace damaged positions. Whenever a position is excised and restored using a template, a new copy relationship is established. Having discussed the concepts of copying and mutation, we now define homology. For both types of characters, we say that x is derived from y if there is an ordered set of characters, x 1, x 2,..., x T, such that y = x 1, x = x T, and x t+1 is a copy of x t. The ordered set may include both single-stranded and double-stranded characters. A character x is homologous to a character y if there exists (or existed) a character z such that both x and y are derived from z. 1.3 Refinements of nucleotide homology Primary refinements Molecular homology has traditionally been divided into three subrelations: orthology, paralogy, and xenology [44]. Although these refinements have distinct biological implications [63], it is difficult to state unambiguous definitions for them in terms of biological mechanisms. Nevertheless, the distinctions made by orthology, paralogy, and xenology are important and the alignment methods we discuss distinguish between them. We therefore describe how orthology, paralogy, and xenology are applied at the nucleotide level. Homology is first refined by the relation of xenology. Consider two homologous nucleic acid positions, x and y, whose last common ancestor is z. These characters are xenologous if at least one is derived from a position w, derived from z, that was horizontally transferred. That is, the species to which w belonged changed during w s existence (excluding changes from a parent to a child species). If x and y are not xenologous, then they are either orthologous or paralogous, depending on the events undergone by z and its copies. Replication of genomic nucleic acids is a regular occurrence in cells, with copies of the same genetic material normally separating

19 5 from each other during cell division. When cell divisions (either through mitosis, meiosis, or binary fission) do not separate genomic copies, paralogous relationships are established. To make this more precise, suppose that z is copied, resulting in two characters z 1 and z 2 in the same cell, where x is derived from z 1 and y is derived from z 2. If z 1 and z 2 are not subsequently separated by cytokinesis, then x and y are paralogous. Otherwise, x and y are orthologous Secondary refinements When describing relationships between genomic characters, the notion of context is critical. As far as the fundamental terms of orthology and paralogy are concerned, genomes could be viewed as bags of genes, i.e., a set of genomic elements without an ordering or grouping on chromosomes. Therefore, when we take into consideration the context of genomic elements, more precise terms are required for describing the relationships between them, as we illustrate with the following example. Suppose that genes X and Y are orthologs and are the only members of a certain gene family in two genomes. If a mrna of Y subsequently becomes retrotransposed elsewhere in the genome, resulting in a gene Y (possibly non-functional), then X will be orthologous to both Y and Y. Orthology does not distinguish between the two copies of Y, even though they are quite different contextually and, likely, functionally. In an attempt to take context into account, researchers have often used the word synteny in describing genomic segments that have been left untouched by genomic rearrangements during evolution in several lineages. Unfortunately, this term often used incorrectly or ambiguously [84]. By itself, the word synteny neither implies homology nor colinearity of related elements. Therefore, we advocate discontinuing the use of such widespread phrases as synteny map and synteny block in favor of more precise terms. Although of similar etymology to synteny, we favor use of the word colinear, as famously used by [113], in combination with evolutionary terms to describe relationships between genomic characters. We introduce a series of definitions that allow us to describe both evolutionary and contextual relationships. The first two definitions distinguish between genomic duplication events and are the basis for the later concepts. In the following definitions, for some genetic material A and a genome G, we use G + A and G A to denote the result of the insertion

20 6 of A into G and the result of the removal of A from G, respectively. Definition 1 (Undirected duplication) A duplication event acting on a genome G, giving rise to a genome G, where G = G + A and A is a copy of A, is undirected if G A = G. Examples of undirected duplications are tandem duplications and whole- chromosomal or whole-genome duplications. The general characteristic of such duplications is that one can not distinguish between the two copies of the duplicated genetic material. The alternative to an undirected duplication is a directed duplication. Definition 2 (Directed duplication) A duplication event acting on a genome G, giving rise to a genome G, where G = G + A and A is a copy of A, is directed if G A G. In such events, A is termed the source, and A is termed the target. Unlike the undirected case, directed duplications involve distinct target and source genomic elements. Examples of events leading to directed duplications are segmental duplication and retrotransposition. Generally speaking, the source element remains in the ancestral position while the target element is placed elsewhere in the genome 1. Given this classification of duplication events, we define two subrelations of orthology. Definition 3 (Topoorthology) Characters x and x are topoorthologous if they are orthologous and neither is derived from the target of a directed duplication since the time of the last common ancestor of x and x. Thus, topoorthologous elements are orthologs that, in the absence of rearrangement events, retain the position of the ancestral element. Note that with the occurrence of undirected duplications, topoorthology is not a one-to-one relation. However, it is useful to define a subrelation that is one-to-one. Definition 4 (Monotopoorthology) Characters x and x are monotopoorthologous if they are topoorthologous and neither is derived from an undirected duplication since the time of the last common ancestor of x and x. 1 A confusing situation arises when a genomic segment is copied and inserted back into itself, in the same orientation. If the copying and insertion are done perfectly, then the duplication event is considered to be undirected by our definitions. However, one might argue that the two resulting copies are quite different. Nevertheless, the outcome of such a duplication is indistinguishable from that of two simultaneous tandem duplications, and thus must be considered undirected.

21 7 A X Y Z undirected duplication speciation XA YA ZA XB YB ZB XA YA1 YA2 ZA XB YB ZB XB2 directed duplication B Y X YA1 YA2 YB XA XB XB2 Figure 1.1: A hypothetical evolutionary scenario in which we distinguish between classes of orthologs. (A) After a speciation event, the genome of species A undergoes an undirected duplication and the genome of species B undergoes a directed duplication. (B) YA1 and YA2 are both topoorthologous to YB. XA and XB are monotopoorthologs and XB2 is only generally orthologous to XA. For directed duplications, we propose that in tree schematics, an arrow be used to point to the target (as in the arrow pointing to XB2, right tree). Monotopoorthology is similar to the concept of true exemplars [92], but defined in terms of evolutionary events. One important difference between the concepts is that in the case of a gene family consisting of two inparalogs that are the result of an undirected duplication, neither of the genes can be a monotopoortholog while Sankoff would pick one to be the true exemplar. Unlike the relations of panorthology [6], inparalogy, and outparalogy [99, 63], the relationships of topoorthology and monotopoorthology do not depend on the entire collection of species considered in a given analysis. Like orthology, paralogy, and xenology, the relations introduced here are defined only in terms of the evolutionary events that have occurred since the time of the last common ancestor and are independent of deletions. The concepts of topoorthology and monotopoorthology are important because they have func-

22 8 tional implications. Orthologous genes are likely to have more similar functions if they are topoorthologs because they have similar genomic contexts [81]. Monotopoorthologs are even more likely to have identical functions because there is less opportunity for subfunctionalization to occur when only one gene remains in the ancestral position. We illustrate these new definitions with several figures. Figure 1.1 gives a simple segment-level evolutionary scenario where these terms are applicable. Figure 1.2 gives an example of homologous relationships between copied nucleotide positions. Lastly, Figure 1.3 shows the division of homology into its subrelations. A Parent G C C T T C G G A A G C C C T A G C T C T T C G G G A T C G A G A Child 1 Child 2 U Figure 1.2: An example evolutionary scenario involving the replication of double-stranded DNA in a parent cell and division into two child cells. The two dotted arrows indicate the separation of the parent strands. Single and double-stranded copy relationships are indicated by single and double-edged arrows, respectively. Greyed double-stranded positions have participated in duplication events. Positions 3 and 4 are the result of an undirected duplication, while positions 10 and 12 are the result of a directed duplication (involving an RNA intermediate), with 12 as the source, and 10 as the target. Monotopoorthologous position pairs: (1,7), (2,8), (5,11), and (6,12). Topoorthologous position pairs: (3,9) and (4,9). Only orthologous position pairs: (6,10). Inparalogous position pairs: (3,4), (10,12).

23 9 Xenology Topoorthology Monotopoorthology Outparalogy Inparalogy Paralogy Orthology Homology Figure 1.3: Refinements of homology. 1.4 Whole-genome alignment strategies With the evolutionary relations that we wish to establish between genomic sequence defined, we review the tools available for this task. We focus on methods that take as input a set of three or more genomes and output alignments designating homology or its subrelations between individual genomic positions. There are two major strategies for aligning entire genomes: local alignment and hierarchical alignment. Figure 1.4 illustrates the main components of these strategies Local alignment The local alignment strategy is first to find all similarities between pairs of genomes and then to combine these pairwise alignments into multiple alignments. Pairwise local aligners are unaffected by genome rearrangements, as they effectively compare every position in one genome to every position in another. Local alignments between two genomes represent both orthologous and outparalogous relations (xenology is rarely a concern, unless prokaryotes are involved). When the reference and query genomes are the same, local aligners can additionally find inparalogous relationships. However, as we will describe, pair-

24 10 Multiple Whole Genome Sequences Pairwise Local Aligner (High Sensitivity) local alignments Local Aligner (Low Sensitivity) Alignment Aligner pairwise local alignments homologous segment sets Homology Mapper Multiple Global Aligner Multiple Whole Genome Alignment Figure 1.4: The local (left path) and hierarchical (right path) strategies for multiple wholegenome alignment. wise local alignments are typically filtered for orthology before they are joined into multiple alignments. Pairwise local alignment is a well-studied area [2]. Most local aligners use a seedand-extend strategy in which short exact or inexact matches are used to initiate potentially larger alignments. Although BLAST [1] could be used as a local aligner for whole genomes, many other methods have been developed with large comparisons in mind [95, 71, 14, 61, 65]. At the whole genome scale, the only method currently available for combining pairwise local alignments into multiple alignments is MULTIZ [8]. In the language of the authors of MULTIZ, a multiple whole-genome alignment is called a threaded blockset. A threaded blockset is defined as a set of multiple alignments (blocks) of colinear segments of the input sequences, where each position in the input sequences is included in exactly one block. Blocks are allowed to have just one sequence in cases where the sequence is not found to have any homologs. The purpose of MULTIZ is to join two threaded blocksets into one, given a local alignment of two of the input genomes. More precisely, given a threaded blockset containing species X and another containing species Y, the two threaded blocksets

25 11 are joined by a pairwise alignment between X and Y. The UCSC Genome Browser [59] currently provides MULTIZ genome alignments for vertebrates, insects, and yeast. For each of these alignments, the pairwise BLASTZ [95] alignments given to MULTIZ as input are first filtered with a best-in-genome criterion [62]. Given a pairwise alignment between a reference and a query genome, this filter keeps only the best alignment for each position in the reference genome. The filtered alignments are assumed to specify only orthologous relationships. Unless applied in a reciprocal manner, these filters give many-to-one orthology relationships between a reference and a query genome. Although not capturing all orthologous relationships, the resulting reference-based multiple alignments have the convenient property that every column has at most one position from each genome Hierarchical alignment A second strategy for multiple whole genome alignment combines homology mapping with efficient global alignment. Homology maps identify sets of large colinear homologous segments between multiple genomes, and are typically designed to find only monotopoorthologous relationships. For example, a homology map might specify that intervals 38,400,000-38,529,874 of human chromosome 17, 101,551, ,659,587 of mouse chromosome 11, and 90,483,833-90,585,675 of rat chromosome 10 (all intervals on the forward strand) contain monotopoorthologous and colinear positions (these intervals contain the BRCA1 gene). Genomic global alignment programs, which require colinearity, are run on segments (such as those just mentioned as an example) specified by a homology map to produce nucleotide level alignments. Methods for homology mapping typically take as input sets of pairwise local alignments and output sets of genomic segments containing significant numbers of local alignments that occur in the same order and orientation. After the sequencing of the third large genome, that of the rat [45], several methods were developed for the construction of multiple genome homology maps. GRIMM-Synteny [10], combines the output of a sensitive local aligner, such as PatternHunter [71], between all pairs of k genomes to first produce k-way anchors. Nearby and consistent k-way anchors are joined to produce a k-way orthology map. Mauve [30], a related method that uses multi-mum (multiple maximal unique match) local alignments [31] to construct orthology maps between multiple closely related

26 12 species, has been demonstrated to create maps between the human, mouse, and rat. Both Mauve and GRIMM-Synteny output one-to-one maps between genomes, which are indicative of monotopoorthology. PARAGON [94], another similar method that uses BLASTZ alignments as input, has been used to create orthology maps between more distant species. Another method used to align the human, mouse, and rat genomes [16] uses a progressive extension of a pairwise strategy engineered for aligning human to mouse [28]. Using the BLAT [61] local aligner, a mouse-rat orthology map was first constructed. The orthologous segments were aligned using the LAGAN [15] global aligner, mapped to the human genome using BLAT, and finally put into a multiple alignment using MLAGAN. The resulting maps represented all orthology relationships, although most genomic segments were found to be monotopoorthologous. A final method used for human, mouse, and rat orthology mapping used BAC end sequence comparisons as a basis for orthologous anchors [115]. In Chapter 3, a monotopoorthology mapping method called Mercator will be described. Here we provide a brief overview of Mercator so that it may be compared with the other methods described in this section. Mercator takes as input a set of non-overlapping landmarks in each genome and pairwise similarity scores between all landmarks. A graph is constructed with landmarks as vertices and hits between them as edges. Within this graph, Mercator identifies high-scoring cliques, i.e., sets of landmarks (in this graph, containing at most one landmark from each genome) where there is a significant hit between each pair. For example, if exons are used as landmarks, then the first exons of the human, mouse, and rat SHH gene would be identified as a high-scoring clique. Such cliques indicate orthologous relationships. Starting with the largest cliques (those in which we are most confident), adjacent and consistent cliques (such as those formed from each exon of SHH ) are joined into runs that represent orthologous segments. Edges not consistent with previously identified runs are discarded and smaller cliques are discovered in the graph and incorporated into runs. The algorithm iterates until cliques involving all possible combinations of genomes have been considered. Thus, unlike most other monotopoorthology mapping methods, Mercator produces maps comprising sets of segments that may be specific to any subset of the input genomes. Once colinear homologous segments have been identified, multiple global alignment programs are used to assign homologous relationships between individual positions. Global aligners create a one-to-one mapping between the positions of two sequences. Thus,

27 13 in the absence of recent tandem duplications, multiple global aligners will determine the monotopoorthologous positions in a a set of colinear monotopoorthologous segments. The only methods that have been run on whole large genomes thus far are MAVID [12], and MLAGAN [15]. Both rely on global chaining of short matches between pairs of sequences. A chain is simply an ordered set of locally aligned segments with the property that the coordinates of the segments of the ith local alignment in the chain are less than those of the segments of the jth local alignment, when i < j. To create a multiple alignment, both methods use a progressive strategy. MAVID and MLAGAN differ in their identification of local alignment anchors (exact vs. inexact) and the methods by which alignments are aligned at internal nodes of the phylogenetic tree (ancestral reconstruction vs. via sum-ofpairs). Other multiple genomic global aligners that have not been run on whole genomes but are comparable are given in [14, 8, 114] Comparison of alignment strategies Currently, all local and hierarchical multiple alignment methods focus on orthology. They either identify many-to-many, many-to-one, or one-to-one (monotopoorthologous) relationships. Hierarchical methods begin by using local alignments, but typically do not use local methods with their most sensitive parameter settings. This results in much faster running times at the expense of missing short and significantly diverged orthologous sequence. Although less sensitive at the genome-wide scale, the hierarchical strategy can afford to use more sensitive methods at a smaller scale, within the sets of orthologous segments identified by the map. An important difference between the two strategies is the treatment of genomic segments that have been inserted or deleted during evolution. Given a set of orthologous segments, global aligners will gap all positions that are not found to have orthologous relations. With recent insertions of mobile elements, these gaps can often be very large. Local alignments, on the other hand, are not extended through longer insertions and deletions. Segments that are not part of any local alignment may be interpreted in two ways. One way is to treat orthologous relationships to such segments as missing data. A second interpretation is that segments not part of any alignment are implicitly gapped, i.e., they are believed to have been inserted or deleted. The choice of alignment strategy and the treatment of gaps are issues that researchers must be aware of when using multiple whole

28 14 Strategy Local Hierarchical Programs BLASTZ, PatternHunter, GRIMM-Synteny, Mauve, MUMmer, MULTIZ, CHAIN- PARAGON, Mercator, MAVID, NET MLAGAN, TBA, MAP2 Relationships Most commonly many-to-one Most commonly monotopoorthology identified orthology Sensitivity (Genome-wide) High Moderate, depending on local alignments used for homology map construction Sensitivity Moderate High (Within homologous segments) Speed Slow, but often parallelizable Fast and parallelizable Short Indels Explicitly gapped Explicitly gapped Long Indels Implicitly gapped or interpreted as missing data Explicitly gapped Table 1.1: A comparison of the local and hierarchical multiple whole-genome alignment strategies genome alignments for biological inference. Table 1.1 summarizes the important differences between the two strategies. 1.5 From alignments to biological discovery Multiple whole genome alignments usually constitute only the first step of comparative genomics studies targeted at specific biological questions. We refer the reader to a number of excellent surveys on comparative genomics [76, 54] for examples of how multiple whole genome alignments have been utilized. However, we have selected for further discussion one key (unsolved) problem that is central to utilizing multiple alignments for functional genomics. A multiple whole genome alignment assigns homology between nucleotides, but it does not identify genomic positions that are under selection or evolving neutrally. The analysis of homologous nucleotides in a multiple alignment using an evolutionary model forms part of the emerging field of phylogenomics [39] and is essential for distinguishing functional elements from neutrally evolving regions in genomes. The term conserved nucleotide is used informally to describe nucleotides that appear to be mutating slower than suggested by a neutral model of evolution (usually based on

29 15 a continuous time Markov model for point mutation [41]). Groups of conserved nucleotides are called conserved elements. To our knowledge, there is no precise definition of conserved elements at this time. Software tools that have been developed for identifying conserved nucleotides and elements include GERP [27], PhastCons [96], BinCons [72], and Shadower [75]. Conserved elements can also be identified by examining insertions and deletions within multiple alignments. This has been described in [70, 98]. In [36], there is a discussion of how the choice of alignment affects the determination of conserved nucleotides, and estimation of evolutionary model parameters. The problem of identifying conservation within multiple alignments is inherently a statistics problem, but one that requires further advances by biologists in experimentally validating functional elements. Such advances are crucial for defining appropriate choices of evolutionary models, and will subsequently inform computational biologists on the best ways to predict new functional elements.

30 16 Chapter 2 Nucleotide-level alignment: models and polytopes The majority of methods for the alignment of nucleotide sequences include, as a component, the classic Needleman Wunsch global alignment algorithm [80]. In this chapter, we analyze this algorithm from both parametric and statistical points of view. Parametric alignment, which determines the dependencies of optimal alignments on parameter values, is described in terms of alignment polytopes. A statistical model with a direct correspondence to the Needleman Wunsch algorithm is also presented. The parametric alignment aspects of this chapter come from the paper [34]. 2.1 Parametric alignment Motivation Needleman Wunsch pairwise sequence alignment is known to be sensitive to parameter choices. To illustrate the problem, consider the 8th intron of the Drosophila melanogaster CG9935-RA gene (as annotated by FlyBase [37]) located on chr4:660, ,522 (April 2004 BDGP release 4). This intron, which is 61 base pairs long, has a 60 base pair ortholog in Drosophila pseudoobscura. The ortholog is located at Contig8094 Contig5509:4,876-4,935 in the August 2003, freeze 1 assembly, as produced by the Baylor Genome Sequencing Center. Using the basic 3-parameter scoring scheme (match M, mismatch X and space

31 17 penalty S), these two orthologous introns have the following optimal alignment when the parameters are set to M = 5, X = 5 and S = 5: mel GTAAGTTTGTTTAT-ATTTTTTTTTTTTTGAAGTGA-CAAATAGC-A-CTTATAAATATACTTAG pse GTTCGTTAACACATGAAATTCCATCGCCTGAT-TGTTCA-CTATCTAACTAACGAAT-T--TTAG ** *** ** * ** * *** ** ** ** * * ** * *** * **** However, if we change the parameters to M = 5, X = 6 and S = 4, then the following alignment is optimal: mel GTAAGTT------TGTTTATATTTTTTTT--T--TT-TTGAAGTGA-CAAATAGCACTTATA--A pse GTTCGTTAACACATG-A-A-ATTCCATCGCCTGATTGTT-CACT-ATC---TA--AC-TA-ACGA ** *** ** * *** * * ** ** * * * * ** ** ** * * mel ATATACTTAG pse AT-T--TTAG ** * **** Note that a relatively small change in the parameters produces a very different alignment of the introns. This problem is exacerbated with more complex scoring schemes, and is a central issue with whole-genome alignments produced by programs such as MAVID [11] or BLASTZ/MULTIZ [95]. Indeed, although whole genome alignment systems use many heuristics for rapidly identifying alignable regions and subsequently aligning them, they all rely on the Needleman Wunsch algorithm at some level. Dependence on parameters becomes an even more crucial issue in the multiple alignment of more than two sequences. Parametric alignment was introduced by Waterman, Eggert and Lander [109] and further developed by Gusfield et al. [48, 49] and Fernandez-Baca et al. [42] as an approach for overcoming the difficulties in selecting parameters for Needleman Wunsch alignment. See [43] for a review and [82, 83] for an algebraic perspective. Parametric alignment amounts to partitioning the space of parameters into regions. Parameters in the same region lead to the same optimal alignments. Enumerating all regions is a non-trivial problem of computational geometry. The following sections present our approach to solving this problem. In Chapter 4 we solve this problem on a whole genome scale for up to five free parameters Alignment summaries We present an approach to parametric alignment that rests on the idea that the score of an alignment is specified by a short list of numbers derived from the alignment. For instance, given the standard 3-parameter scoring scheme, we summarize each alignment by the number m of matches, the number x of mismatches, and the number s of spaces in the

32 18 alignment. The triple (m, x, s) is called the alignment summary. As an example consider the pair of orthologous Drosophila introns given in Section The first (shorter) alignment has the alignment summary (33, 23, 9) while the second (longer) alignment has the alignment summary (36, 10, 29). Remarkably, even though the number of all alignments of two sequences is very large, the number of alignment summaries that arise from Needleman Wunsch alignment is very small. Specifically, in the example above, where the two sequences have lengths 61 and 60, the total number of alignments is 1,511,912,317,060,120,757,519,610,968,109,962,170,434,175, There are only 13 alignment summaries that have the highest score for some choice of parameters M, X, S. For biologically reasonable choices, i.e., when we require M > X and 2S < X, only six of the 13 summaries are optimal. These six summaries account for a total of 8362 optimal alignments (Table 2.1). Note that the basic model discussed above has only d = 2 free parameters, because for a pair of sequences of lengths l, l all the summaries (m, x, s) satisfy 2m + 2x + s = l + l. (2.1) This relation holds with l + l = 121 for the six summaries in Table 2.1. Figure 2.1 shows the alignment polygon, as defined in Section 2.3, in the coordinates (x, s). alignment summary number of alignments with that summary A (25, 35, 1) 5 B (28, 31, 3) 15 C (32, 25, 7) 44 D (33, 23, 9) 78 E (34, 20, 13) 156 F (36, 10, 29) 8064 Table 2.1: The 8,362 optimal alignments for two Drosophila intron sequences. In general, for two DNA sequences of lengths l and l, the number of optimal alignment summaries is bounded from above by a polynomial in l + l of degree d(d 1)/(d + 1), where d is the number of free parameters in the model [43, 82]. For d = 2, this

33 19 A B C D E F A B C D E F Figure 2.1: The alignment polygon for our two introns is shown on the left. For each of the alignment summaries A, B,..., F in Table 2.1, the corresponding cone in the alignment fan is shown on the right. If the parameters (S, X) stay inside a particular cone, every optimal alignment has the same alignment summary. degree is 0.667, and so the number of optimal alignment summaries has sublinear growth relative to the sequence lengths. Even for d = 5, the growth exponent d(d 1)/(d + 1) is only This means that all optimal alignment summaries can be computed on a large scale for models with few parameters. The growth exponent d(d 1)/(d + 1) was derived by Gusfield et al. [48] for d = 2 and by Fernandez-Baca at al. [42] and Pachter Sturmfels [82] for general d. Table 2.1 can be computed using the software XPARAL [49]. This software works for d = 2 and d = 3, and it generates a representation of all optimal alignments with respect to all reasonable choices of parameters. Although XPARAL has a convenient graphical interface, it seems that this program has not been widely used by biologists, perhaps because it is not designed for high throughput data analysis and the number of free parameters is restricted to d Alignment models The basic model, discussed in Section 2.1.1, has three natural parameters, namely, M for match, X for mismatch and S for space. If the numbers M, X and S are fixed, then we seek to maximize M m + X x + S s, where (m, x, s) runs over the summaries of all alignments. In light of the relation (2.1), this model has only two free parameters and there is no loss of generality in assuming that the match score M is zero. From now on we set M = 0 and we take X and S as the free parameters. We define the 2d alignment summary to be the pair (x, s). Following the convention of [83, 2.2], we summarize a scoring scheme with a 5 5-

34 20 0 X X X S X 0 X X S X X 0 X S, X X X 0 S S S S S 0 X Y X S X 0 X Y S Y X 0 X S, X Y X 0 S S S S S 0 X Y Z S X 0 Z Y S Y Z 0 X S Z Y X 0 S S S S S Table 2.2: The Jukes Cantor matrix, the Kimura-2 matrix, and the Kimura-3 matrix. These three matrices correspond to JC69, K80 and K81 in the Felsenstein hierarchy [83, Figure 4.7] of probabilistic models for DNA sequence evolution. matrix w whose rows and columns are both indexed by A, C, G, T, and -. The matrix w for the basic model is the leftmost matrix in Table 2.2, and it corresponds to the Jukes Cantor model of DNA sequence evolution. Using such matrices, we describe three additional models that are commonly used and have dimensions 3, 4, and 5. The 3d model is the most commonly used scoring scheme for computing alignments. This model includes the number g of gaps. A gap is a complete block of spaces in one of the aligned sequences; it either begins at the start of the sequence or is immediately preceded by a nucleotide, and either follows the end of the sequence or is succeeded by a nucleotide. The 3d alignment summary is the triple (x, s, g). The score for a gap, G, is known as the affine gap penalty. If X, S and G are fixed, then the alignment problem is to maximize X x + S s + G g where (x, s, g) runs over all 3d alignment summaries. The parametric version is implemented in XPARAL. Introducing the gap score G does not affect the matrix w which is still the leftmost matrix in Table 2.2. The 4d model is derived from the Kimura-2 model of sequence evolution. The 4d alignment summary is the vector (x, y, s, g) where s and g are as above, x is the number of transversion mismatches (between a purine and a pyrimidine or vice versa) and y is the number of transition mismatches (between purines or between pyrimidines). The four parameters are X, Y, S, and G. The matrix w of scores, as specified in [83, (2.11)], is now the middle matrix in Table 2.2. The 5d scoring scheme is derived from the Kimura-3 model. Here the matrix w is the rightmost matrix in Table 2.2. The 5d alignment summary is the vector (x, y, z, s, g), where s counts spaces, g counts gaps, x is the number of mismatches A C, C A, G T or T G, y is the number of mismatches A G, G A, C T or T C, and z is the number of mismatches A T, T A, C G or G C. Thus,

35 21 s x g Figure 2.2: The 3d alignment polytope of our two Drosophila introns has 76 vertices. The marked vertex (x, s, g) = (30, 5, 2) represents the BLASTZ alignment. the 5d alignment summaries of the two Drosophila intron alignments at the beginning of Section are (4, 10, 9, 9, 8) and (3, 3, 4, 29, 17). Even the 5d model does not encompass all scoring schemes that are used in practice. See Section for a discussion of the BLASTZ scoring matrix [25] and its proximity to the Kimura-2 model Alignment polytopes The convex hull of a finite set S of points in R d is the smallest convex set containing these points. It is denoted conv(s) and called a convex polytope. There exists a unique smallest subset V S for which conv(s) = conv(v). The points in V are called the vertices of the convex polytope. The vertices lie in higher-dimensional faces on the boundary of the polytope. Faces include edges, which are one-dimensional, and facets, which are (d 1)- dimensional. Introductions to these concepts can be found in the textbooks [46, 88]. By computing the convex hull of a finite set S R d we mean identifying the vertices and the facets of conv(s) and, if possible, all faces of all dimensions.

36 22 Fix one of the four models discussed in Section The alignment polytope of two DNA sequences is the convex polytope conv(s) R d, where S is the set of alignment summaries of all alignments of these two sequences. For instance, the 3d alignment polytope of two DNA sequences is the convex polytope in R 3 that is formed by taking the convex hull of all alignment summaries (x, s, g). Figure 2.2 shows the 3d alignment polytope for the two sequences in Section Its projection onto the (x, s)-plane is the polygon depicted in Figure 2.1. It is a basic fact of convexity that the maximum of a linear function over a polytope is attained at a vertex. Thus, an alignment of two DNA sequences is optimal if and only if its summary is in the set V of vertices of the alignment polytope. The Needleman Wunsch algorithm efficiently solves the linear programming problem over this polytope. For instance, for the 3d model with fixed parameters, the alignment problem is the linear programming problem Maximize X x + S s + g G subject to (x, s, g) V. (2.2) For a numerical example consider the parameter values X = 200, S = 80 and G = 400, which represent an approximation of the BLASTZ scoring scheme (Section 3.1). The solution to (2.2) is attained at the vertex (x, s, g) = (30, 5, 2) which is the 3d summary of the following alignment of our two Drosophila introns mel GTAAGTTTGTTTATATTTTTTTTTTTTTGAAGTGACAAATAGC--ACTTATAAATATACTTAG pse GTTCGTTAACACATGAAATTCCATCGCCTGATTGTTCACTATCTAACTAACGAAT---TTTAG ** *** ** ** * * ** * ** * *** * *** **** The 3d summary of this alignment is the marked vertex in Figure 2.2. As demonstrated by our discussion of alignment polytopes, convexity is the organizing principle that reveals the needles in the haystack. In our running example of two Drosophila introns, the haystack consists of more than alignments, and the needles are the 8362 optimal alignments. Thus, parametric alignment of two DNA sequences relative to some chosen scoring scheme means constructing the alignment polytope of the two sequences Robustness cones Suppose we are given a specific alignment of two DNA sequences. Then the robustness cone of that alignment is the set of all parameter vectors that have the following

37 23 d # of vertices # of edges # of 2d faces # of 3d faces # of 4d faces Avg. # of edges per vertex Table 2.3: Face numbers of the alignment polytopes for the intron sequences from the beginning of the Introduction. The average number of edges containing a vertex is the average number of linear inequalities bounding a robustness cone. property: any other alignment that has a different alignment summary is given a lower score. As a mathematical object, the robustness cone is an open convex polyhedral cone in the space R d of free parameters. An alignment summary is said to be optimal, relative to a given model, if its robustness cone is not empty. Equivalently, an alignment summary is optimal if there exists a choice of parameters such that the Needleman Wunsch algorithm produces only that alignment summary. Such a parameter choice will be robust, in the sense that if we make a small enough change in the parameters then the optimal alignment summary will remain unchanged. Each robustness cone is specified by a finite list of linear inequalities in the model parameters. For example, consider the first alignment in the Section Its 2d alignment summary is the pair (x, s) = (23, 9), labeled D in Table 2.1. The robustness cone of this summary is the set of all points (X, S) such that the score 23X + 9S is larger than the score of all other alignments summaries other than (23, 9). This cone is specified by the two linear inequalities S > X and 4S < 3X. If we fix two DNA sequences, then the robustness cones of all the optimal alignments define a partition of the parameter space, R d. That partition is called the alignment fan of the two DNA sequences. Figure 2.1 shows the (biologically relevant part of the) alignment fan of two Drosophila introns in the 2d model. While this alignment fan has only 13 robustness cones, the alignment fan of the same introns has 76 cones for the 3d model, 932 cones for the 4d model, and 10,009 cones for the 5d model. These are the vertex numbers in Table 2.3.

38 Parametric alignment algorithms The problem of computing a parametric alignment is now specified precisely. The input consists of two DNA sequences. The output is the set of vertices and the set of facets of the alignment polytope conv(s). See also [83, Remark 2.29]. We note that the robustness cone of an optimal alignment summary is the normal cone of the polytope at that vertex. The alignment fan is the normal fan of the alignment polytope. See [83, p. 61] for definitions of these concepts. We now briefly outline two different methods for constructing alignment polytopes: polytope propagation and incremental convex hull. Polytope propagation for sequence alignment is the Needleman Wunsch algorithm with the standard operations of plus and max replaced by Minkowski sum and polytope merge (convex hull of union). The polytope propagation algorithm was introduced in [82, 83]. The incremental convex hull algorithm, on the other hand, gradually builds the alignment polytope by successively finding new optimal alignment summaries, the vertices of the polytope. In order to find the new optimal summaries, the algorithm repeatedly calls a Needleman Wunsch (NW) subroutine that is an efficient implementation of the classical Needleman Wunsch algorithm. For fixed values of the parameters, this subroutine returns an optimal alignment summary. For instance, for the 3d model, the input to the NW subroutine is a parameter vector (X, S, G) and the output is an optimal summary (x, s, g). Suppose we have already found a few optimal alignment summaries, by running the NW subroutine with various parameter values. We let P be the convex hull of the summaries in R d, and we assume that P is already d-dimensional. We maintain a list of all vertices and facets of P. Each facet is either tentative or confirmed, where being confirmed means that its affine span is already known to be a facet-defining hyperplane of the final alignment polytope. In each iteration, we pick a tentative facet of P and an outer normal vector U of that facet. We then call the NW subroutine with U as the input parameter. The output of the NW subroutine is an optimal summary v. If the optimal score U v equals the maximum of the linear function U w over all w in P then we declare the facet to be confirmed. Otherwise, the score U v is greater than the maximum and we replace P by the convex hull of P and v. This convex hull computation utilizes the beneath-beyond construction [88, 3.4.2] which erases some of the tentative facets of the old polytope and replaces them by new tentative facets.

39 25 The algorithm terminates when all facets are confirmed. The current polytope P at that iteration is the final alignment polytope. The number of iterations of this incremental convex hull algorithm equals the number of vertices plus the number of facets of the final polytope P. So for a given model, the running time of the incremental convex hull algorithm scales linearly in the size of the output. This was confirmed in practice by our computations described in Chapter 4 (see Table 4.1). Given an alignment polytope, there are various subsequent computations one may wish to perform. For instance, we may be interested in the robustness cones at the vertices. In order to get an irredundant inequality representation of a robustness cone, it suffices to know the edges emanating from the corresponding vertex. Thus it is useful to also compute the edge graph of each of our polytopes. 2.2 Statistical alignment We have seen that parametric alignment allows for the examination of the dependence of optimal alignments on model parameters. Thus far, parametric methods have only been described for alignment scoring schemes that are not statistically based. In this section, a statistical alignment model will be given such that every parameter setting is equivalent to a set of scores for classic Needleman Wunsch alignment. The problem of alignment can be formulated probabilistically in terms of a pair hidden Markov model (PHMM) [38]. For global alignment, as solved by the Needleman Wunsch algorithm, Durbin et al. [38] present the pair hidden Markov model shown in Figure 2.3. For this model, the optimal log-odds alignment can be found with a modified version of the Needleman Wunsch algorithm with parameters computed from the PHMM probabilities. However, this model has a some shortcomings. 1. The model does not allow for an insertion to follow a deletion, and vice versa. Such scenarios could definitely occur through evolution, particularly in mutable regions. 2. Two alignments with the same alignment summary (depending on the parameter scheme used) do not necessarily have the same probability, under this model. For example, an insertion of length n at the beginning of an alignment will contribute δɛ n 1 (1 ɛ τ) to the total probability (ignoring emissions), while an insertion at the end of an alignment will contribute δɛ n 1 τ. Although the alignment probabilities

40 26 ε δ I BEGIN δ 1-ε-τ τ END 1-2δ-τ H τ δ 1-ε-τ D τ δ ε τ Figure 2.3: The state transition diagram for the PHMM of [38]. in this example are likely to be very similar, it is desirable to have a model that is independent of the orientation of the sequences. That is, if we flip the alignment (perhaps by reverse-complementation in the case of DNA), the probability given by the model should be the same. 3. A consequence of not giving the same probability to alignments having the same summary is that the transformation from the PHMM to Needleman Wunsch involves some modification to the termination step. We present a new PHMM that overcomes these issues and, as a consequence, will be useful in combination with parametric analyses (Chapter 5). The state space and transitions for this PHMM are shown in Figure 2.4. This model has been factorized through the use of a silent state (S) in order to make the transitions sparser, and to clearly demonstrate the meaning of the parameters δ, ɛ, and τ. An equivalent model with the silent state removed

41 27 ε I 1 BEGIN H 1 1-δ-τ δ/2 1-ε S δ/2 1-ε τ END D ε Figure 2.4: The state transition diagram for a PHMM with exact correspondence to the Needleman Wunsch algorithm. is shown in Figure 2.5. With transitions directly between the I and D states, this PHMM clearly allows for adjacent insertions and deletions. We now show that the model assigns the same probability to alignments with the same summary. For simplicity, the PHMM with the three transition parameters shown in Figure 2.4 and a single emission parameter µ will be considered. For sequences over an alphabet Σ, the emission probabilities are: (1 µ)/ Σ if c 1 = c 2 P H (c 1, c 2 ) = µ/( Σ ( Σ 1)) otherwise P D (c) = P I (c) = 1/ Σ, for all c Σ With τ fixed, this parameter scheme for the PHMM corresponds to the 3d scoring scheme from Section Under this model, the probability of an alignment with m matches, x mismatches, g spaces, and s spaces is ( ) (1 µ)(1 δ τ) m ( ) µ(1 δ τ) x ( ) δ(1 ɛ) g ( ) ɛ s τ. (2.3) Σ Σ ( Σ 1) 2ɛ Σ

42 28 ε+(1-ε)δ/2 δ/2 I δ/2 (1-ε)(1-δ-τ) (1-ε)τ BEGIN 1-δ-τ H (1-ε)δ/2 (1-ε)δ/2 τ END 1-δ-τ (1-ε)(1-δ-τ) δ/2 (1-ε)τ δ/2 D ε+(1-ε)δ/2 τ Figure 2.5: The state transition diagram in Figure 2.4 with the silent state (S) removed. Therefore, a maximum a posteriori alignment of the PHMM is equivalent to an optimal Needleman Wunsch alignment with parameters ( ) (1 µ)(1 δ τ) M = log (2.4) Σ ( ) µ(1 δ τ) X = log (2.5) Σ ( Σ 1) ( ) δ(1 ɛ) G = log (2.6) 2ɛ ( ) ɛ S = log. (2.7) Σ The correspondence between the PHMM MAP estimates and optimal Needleman Wunsch alignments will be utilized in Chapter 5.

43 29 Chapter 3 Multiple whole-genome alignment: Mercator A central problem in the comparison of multiple whole genome sequences is homology mapping, which is the identification of sets of homologous segments among multiple genomes. We present a solution to this problem that focuses on monotopoorthology, a one-to-one subrelation of orthology introduced in Chapter 1. Our methods for identifying monotopoorthologous segments and locating evolutionary breakpoints are based on graph theoretical frameworks, including the formalism of undirected graphical models. By effectively making use of information from multiple genomes to identify monotopoorthologous segments and their bounding breakpoints, we are able to produce homology maps that improve on pairwise maps. In an analysis of four mammalian genomes, we show that existing pairwise alignment strategies map 3% of exons inconsistently, 24.8% of which we are able to correct using our homology mapping methods. An additional feature of our method is the ability to comparatively scaffold assemblies that are not yet mapped to chromosomes. We demonstrate this by comparatively assembling contigs from the human, mouse and rat genome with few incorrect joins and high coverage of the genomes. An implementation of our method, Mercator, is freely available and is fast enough to be run on a single workstation. It is currently being used to guide nucleotide-level multiple alignments of whole genomes and for genome evolution studies.

44 Motivation Since the completion of the first draft assemblies of the human genome [66, 108], technological advances and lower costs have resulted in the sequencing of many whole vertebrate genomes, as well as numerous non-vertebrate genomes. At the time of writing, whole genome sequence assemblies have been produced for 17 vertebrate species, 8 of which have assemblies into full chromosomes. Projects such as the Mammalian Genome Project [73] and the recent sequencing of 12 Drosophila species [26] will provide us with a large number of additional genomes to analyze using the tools of comparative genomics. An important first step in comparing all of these genomes is to determine the large-scale correspondences between them. At a low resolution, one can identify (with a microscope) related chromosomes from two species using chromosome painting techniques such as Zoo-FISH [93]. These techniques allow for the detection of conserved synteny: the state of elements on the same chromosome in one genome having their homologs occurring on the same chromosome in another genome. When markers, such as genes, have been mapped and ordered on multiple genomes, comparative genetic maps can be created. Such maps have resolutions that depend on the density of markers used and allow for the detection of conserved segments: segments in multiple genomes containing homologous markers that occur in the same order in each genome [79]. With the sequencing of whole genomes, we are now able to produce comparative maps at the nucleotide level. In addition to protein coding genes for which we have genomic coordinates, we can also use conserved non-coding sequence as markers. For a set of fully sequenced genomes, we would like to divide the genome sequences into coordinate-based segments and then determine evolutionary relationships among them. We call the result of such an analysis a homology map. When the segments are defined in such a way that homologous segments are colinear, the map is called a colinear homology map. Of particular interest are the construction of orthology maps which specify only a subset of evolutionary relationships between genomic segments. Homology maps are important for many kinds of downstream analyses. For example, if one would like to obtain a nucleotide-level alignment of a set of genomes, one simply feeds the sets of homologous segments established by the map to alignment programs such as DIALIGN [77], MAVID [12], MLAGAN [15], or TBA [8]. Another analysis that requires a colinear homology map, but not a nucleotide-level alignment, is that of determining the

45 31 history of genomic rearrangements [86, 78]. The combination of these two analyses gives a proposed evolutionary history for every position in every genome and allows for the inference of ancestral genome sequence [7]. In this chapter, we present a method for constructing colinear monotopoorthology maps between multiple whole genomes. We first introduce a series of molecular evolution definitions to explain what such a map represents. We then present the details of our algorithms for constructing maps, as they are implemented in the program Mercator. Lastly, we give some results that demonstrate the performance of Mercator on four large eukaryotic genomes. Mercator is shown to produce accurate maps, to assemble genomes comparatively, and to locate breakpoints between rearranged orthologous segments effectively. 3.2 Related Work Pairwise maps Much work has been done on the identification of homologous segments between pairs of genomes. In general, these methods comprise three steps: 1. identification of local alignments between genome sequences, 2. clustering of alignments by genomic position and orientation, and 3. filtering and classification of clusters to predict homologous relationships. Local alignments between genomes can be obtained through all-vs-all comparisons of protein coding regions using programs such as BLAST and BLAT [1, 61] or from nucleotide alignment programs that can be used at the whole-genome scale [61, 65, 68, 95]. Without using genomic context information (step two), several methods predict orthology and paralogy relations between the proteomes of multiple genomes [102, 67, 89]. Combined with genomic coordinates for the genes encoding the proteins given as input, these methods can be used to produce gene-based homology maps. Although one can predict homology relationships between proteins without utilizing the positions of their coding genes, genomic context can provide valuable information for inferring the evolutionary relationships between genomic segments. Since rearrangement events undergone by genomes during evolution are rare, neighboring elements in one

46 32 genome tend to have homologs that are neighbors in other genomes [105]. Other methods described here take advantage of genomic context in some fashion. A common approach used by genome sequencing projects (e.g., [100]) for determining orthology maps is to first identify best bidirectional hits, or SymBets [63], between the protein coding genes of two genomes. SymBets are assumed to be orthologs and additional orthology relationships are determined based on adjacencies to genes participating in SymBets. This strategy has been extended in various ways. For human and mouse, nucleotide-level syntenic anchors and function annotations are additionally used to predict orthologous genes [116]. For yeast, the SymBet strategy was extended to give subsets of orthologous proteins in cases where recent duplications have led to many-to-one or manyto-many orthologous relationships [60]. Of the orthology mapping methods that use local nucleotide alignments, the most widely-used are based on the chaining of local alignments [58, 62, 65, 86, 101]. Although all of these methods chain local alignments, they are quite different at all three steps of the orthology mapping process and give different types of output. For example, the nets produced by [62] make up a many-to-one orthology map between a reference and query genome, while the output of [58] is a one-to-one orthology map. We will distinguish between such orthology maps in Section 3.3. Methods that do not use chaining and that rely on nucleotide anchors unique to a pair of genomes are given in [112] and [69]. More recently, methods that take into account a parsimonious series of rearrangement events while predicting orthologous segments have been proposed [23, 24]. For a more detailed review of some of these methods, see [2, 36]. Thus far, we have mentioned methods that attempt to identify orthologous relationships between genomes. There are many others that predict homologous genomic segments without specifying precisely how the segments are related. The majority of these methods use gene comparison data. Some identify genomic segments containing a significant number of colinear or clustered homologous genes [18, 107, 53, 52]. Others are formulated to locate diagonals within gene dot plots [20, 50]. The diagonal locating ability of [20] has been used in [19] to construct maps that distinguish between orthologous and paralogous segments. Lastly, a unique method utilizing the formalism of Markov random fields and capable of using either gene or nucleotide-level alignments was developed by [21].

47 Multiple maps Although a large number of pairwise homology mapping programs are available, few methods have been developed for mapping between three or more genomes simultaneously. Simply combining pairwise maps is not sufficient because, just as in the multiple sequence alignment problem, pairwise maps are often not consistent. We demonstrate the inconsistency of whole-genome alignments based on pairwise maps in Section Section reviewed a number of multiple genome mapping methods that have been used on large genomes. Other mappers of note include a method that uses Gibbs-like sampler to detect orthologous segments in multiple diverged genomes [74], and a method that finds conserved gene orders in multiple genomes once orthology has been assigned [85]. Visualizing multiple homology maps presents other challenges. The K-Browser [22] and ENSEMBL s alignslice view [5] allow for the browsing of a one-to-one orthology mapping and nucleotide-level alignment. SyntenyVista [55] is a well-prototyped program for visualizing orthology between multiple genomes at a variety of scales. UCSC s Genome Browser also has tracks for visualizing the nets and chains produced by [62]. 3.3 Definitions The literature on homology mapping methods is full of ambiguous use of terminology. Methods are also frequently used to define the desired output making it difficult to ascertain the biological relevance of the maps that are produced. These problems are the result of a lack of terminology for describing the complex evolutionary relationships that form the basis of homology. In Section 1.3.2, we introduced a series of new terms that are helpful in describing the output of homology mapping methods, and in particular, the output of the method presented here. We wish to emphasize that the new terms are based on biological considerations and have not simply been invented to explain algorithmic output. We introduce two terms to define the correspondence that our method and others determine between a set of genomes. Definition 5 (Colinear monotopoorthology) Genomic segments S 1 and S 2 are colinear monotopoorthologous if they are monotopoorthologous and the monotopoorthologous nucleotides contained within the segments occur in the same order.

48 34 Definition 6 (Colinear monotopoorthology map) A colinear monotopoorthology map is a set of genomic segment sets, each consisting of pairwise colinear monotopoorthologous segments. For a set of multiple whole genomes, a colinear monotopoorthology map is what many have referred to as a synteny map, and is the primary output of the method that we present. 3.4 Algorithms Overview We now describe our algorithms for constructing a colinear monotopoorthology map between multiple whole genomes. These algorithms are implemented in the program Mercator. Mercator s primary input is a set of k genomes, and its primary output is a colinear monotopoorthology map. A map consists of a collection of segment sets, with all segments in a given set considered to be colinear monotopoorthologs of each other. Each segment set contains between two and k segments, with at most one segment from each genome. The strategy employed by our method is to consider only certain intervals, or anchors, in each genome in order to determine monotopoorthology relationships. Although an anchor may be any genomic interval, convenient intervals to use are genes, exons, or intervals that are part of maximally unique matches (MUMs) [65]. The only restriction on anchors is that they do not overlap. Anchors are compared in an all-vs-all fashion to produce a set of significant pairwise hits between anchor sequences. Using sets of input anchors and hits, Mercator first identifies colinear monotopoorthologous segments by identifying neighboring cliques of anchors. During the joining of neighboring cliques, Mercator is capable of comparatively assembling genomes whose contigs have not been assigned to chromosomal positions. The colinear monotopoorthology pre-map established by the first step is refined by locating good breakpoints in between adjacent segments. The location of breakpoints is formulated as finding a maximum a posteriori configuration of a certain undirected graphical model. A schematic specifying the inputs, outputs, and primary components of Mercator is shown in Figure 3.1.

35 Genome Sequences Assembled Genomes Anchor Intervals Segment Identification Pre-map Breakpoint Identification Map Pairwise Anchor Hits Mercator Figure 3.1: A schematic of the Mercator method. 3.4.

49 35 Genome Sequences Assembled Genomes Anchor Intervals Segment Identification Pre-map Breakpoint Identification Map Pairwise Anchor Hits Mercator Figure 3.1: A schematic of the Mercator method Segment Identification The input to Mercator consists of k genome sequences, a set of anchors for each genome, and a set of pairwise hits between anchors. An anchor is an oriented genomic interval. A hit is specified by a pair of anchors and a similarity score. Given this input, an undirected k-partite graph is constructed with anchors as vertices and hits as edges. Edges are weighted by the scores of their corresponding hits. To eliminate hits that are not likely to indicate monotopoorthology, a filter is used during the addition of edges to the graph (an idea incorporated from [60]). Let a i,j denote the jth anchor of genome i, best(a i,j, k) denote the highest scoring edge incident to a i,j from an anchor in genome k, and score(e) denote the score of the hit corresponding to edge e. Starting with an edge-less graph, edges are added one-by-one based on the similarity score of hits (in decreasing order). When an edge e = (a i,j, a k,l ) is considered for addition, it is added to the graph if and only if score(e)/score(best(a i,j, k)) > f p and score(e)/score(best(a k,l, i)) > f p for some fraction f p < 1 (the --prune-pct option to Mercator). If f p = 0.8, this results in a filtered graph similar, but not identical, to the one constructed by [60] for pairwise comparisons. Using this filtered graph, Mercator performs k 1 iterations of clique finding and run forming. At the start of each iteration, repetitive anchors are marked. Anchor a i,j is considered repetitive if for some genome k, the number of edges e = (a i,j, a k,l ) such that score(e)/score(best(a i,j, k)) > f r, where f r < 1 (--repeat-pct option), is at least two. Mercator then considers the best edge subgraph of the current graph. This subgraph consists of all symmetric best edges, i.e., all e = (a i,j, a k,l ) such that e = best(a i,j, k) = best(a k,l, i).

BMI/CS 776 Lecture #20 Alignment of whole genomes. Colin Dewey (with slides adapted from those by Mark Craven)

BMI/CS 776 Lecture #20 Alignment of whole genomes Colin Dewey (with slides adapted from those by Mark Craven) 2007.03.29 1 Multiple whole genome alignment Input set of whole genome sequences genomes diverged