Bridging the gap- Establishing homology by sequence alignment and optimization methods.


Rasmus Hovmöller, 2003

Contents

Introduction
Finding similarities
Homology in DNA data
Structural similarity and positional homology
Manual alignment
Algorithmic alignment
Local and global alignment
BLAST search
Shotgun assembly
Pairwise sequence alignment
An optimal path
Multiple sequence alignment
The Clustal algorithm
Parsimony alignment
Gaps as characters in static alignments
Optimization methods
Optimization alignment
Fixed state optimization
Sensitivity analysis
Congruence between datasets
Visualization of congruence by Navajo rugs
References
Appendix: Internet resources

Introduction

Finding similarities

Alignment is a collective term for methods used to find similar sequences in strings of data. Finding matching patterns in strings has applications not only in biology, but also in sociological analyses of behavior patterns (Wilson 1998) and in logistics (Schlich 2001). This essay will focus on the sequence alignment aspect of DNA homology: the basepair-to-basepair homology that is used in phylogenetic systematics.

Homology in DNA data

The concept of homology is central in phylogenetic systematics. From a morphological point of view, homology can be uncontroversial (e.g. the legs of a cow are homologous to the legs of a chicken, but not to the legs of a fly) or difficult (are the wings of insects homologous to the gills of crustaceans?). The methods for establishing primary (or putative) homology in a morphological context are comparative: where similar structures show variation, the variants are assigned character states. For example, the petals of flowering plants vary widely in shape, color, number and function, but it is safe to assume that the petals of one plant are homologous to those of another.

Structural similarity and positional homology

Structural similarity is not applicable to DNA data, as DNA has only four possible states (A, C, G and T). An A in one sequence is indistinguishable from any other A in any sequence. The only homology assessment that can be made for DNA data is positional homology, but this can be difficult because the sequences differ in length. There are, however, instances where positional similarity is applicable to sequence data. For protein-coding genes, the codon triplets that specify amino acids make a useful guide, since insertions and deletions also appear in threes. The indels of other kinds of molecular data can be of any length, and homologies have to be assigned by another criterion.

Another source of positional similarity is information from ribosomal secondary structures. Ribosomal DNA (rDNA) sequences are frequently used in molecular systematics (e.g. metazoans, Lipscomb et al. 1998; arthropods, Giribet and Ribera 2000; fungi, Tehler et al. 2000; angiosperms, Soltis et al. 2000). The positional homology of rDNA data can be argued from the secondary structure of the rRNA that makes up the ribosomes. The complementary nature of nucleotides that gives DNA its double-helix form also creates the secondary structure of the rRNA: rRNA can fold upon itself to create complementary regions (Figure 1).

Figure 1. rRNA forms loops and stems due to its self-complementary nature.
rRNA sequence (= primary structure): GAGUAAAGUUAAUACCUUUGCUC
Secondary structure: the stem pairs GAGUAAAG with CUCGUUUC; the bases in between form the loop.

The structures formed by complementary strands are called stems, while the single-stranded regions in between are known as loops. A database (Wuyts et al. 2001, Wuyts et al. 2002) of ribosomal secondary structures is available online (see "Internet resources" below). This is a collection of rDNA sequences gathered from GenBank, with added information from secondary structure. The secondary structure of E. coli ribosomes is used as a template to predict the secondary structure of new sequences. Prediction is based on assumptions such as that the sequences coding for the stem regions are more conserved than those coding for the loop regions, as a substitution in a loop would not affect the secondary structure.

Manual alignment

A very common method is alignment by eye. For protein-coding DNA, manual alignment is a reliable method due to triplets (see above), but when performed on non-protein-coding data, aesthetics is the only criterion used in assigning homology. Alignment by eye is often used a posteriori to algorithmic alignment. Why this is done, and how, strangely remains unanswered in most publications.
The human mind seeks pattern in everything, even DNA sequences, and can do a decent job of finding homologous regions in DNA. However, alignment by eye is arbitrary, unrepeatable and rarely consistent.

Algorithmic alignment

Alignments created by algorithms may not be pleasing to the eye, but they were created using a defined set of parameters. Input the same sequences, use the same parameters, and you get the same alignment. There are two kinds of algorithmic alignment that apply to phylogenetic systematics: local and global. The former is only mentioned briefly here, as it has some uses, but none for analyzing the relationships of organisms. The latter is described in greater detail, and the discussion is based more on ideas than on mathematics. Static alignment is compared to optimization methods, and the role of the gap as a placeholder or character state is discussed.

Local and global alignment

BLAST search

Local alignments are not used to find the optimal alignment between sequences, but to search for longer uninterrupted stretches of similarity. The BLAST (Altschul 1990) algorithm basically works like a spellchecker. A target sequence is compared to other sequences for maximum segment pairs (MSP), the longest sequences of identical basepairs. A cutoff value specifies the minimum value considered as a possible MSP. Consider this local alignment search with a target sequence and two other sequences.

Target: AAAAA AAAAA AAAAA
Seq. 1: AAAAC AAAAC AAAAC
Seq. 2: AAAAA CCCCC CCCCC

If the cutoff value is set at 5, the top scoring MSP would be in sequence 2. Although BLAST can be used to find all local MSPs that are higher than a specified cutoff value, local alignment is not useful for finding basepair-to-basepair homologies. Local alignment algorithms are very fast, and are more useful for mining large databases. There is an online version of BLAST that can check any submitted sequence against all of GenBank. When new genes are discovered from genomic data (see below), a BLAST search can be used to find similarities with known genes from known genomes such as Drosophila melanogaster or Caenorhabditis elegans. Similarity scores from a BLAST search can identify a gene and its associated function, or at least place the gene in a gene family (e.g. hemoglobins, cytochromes, immunoglobulins).

BLAST searches are also useful for sequences about to be used in phylogenetic systematics. The often-used universal primers are known to amplify contaminants as well as the target organism. This is especially important if the DNA is suspected to be degraded, as in older material from museum collections. A BLAST search on GenBank can quickly reveal whether you are sequencing the right material, or whether you are about to base your phylogenetic hypothesis on Aspergillus fungi.

Shotgun assembly

Another application of local alignment is the shotgun approach to sequence assembly (Sanger 1982), as implemented in PHRAP (online documentation, Green 1999) or the Staden Package (Staden 1996). Shotgun assembly is used when there are several partially overlapping fragments that together form a longer sequence. The fragments are checked against each other for MSPs, and when matches are found the fragments are joined in a contig. More fragments are checked for overlaps, and overlapping fragments are added to the contig. The shotgun approach was used in sequencing the human nuclear genome (Venter et al. 2001). This genome of 2.91 billion basepairs was assembled from randomly sequenced fragments with a mean length of only 543 basepairs.

Where local alignment tries to find the maximal overlap between two sequences, global alignment is about finding homologous regions throughout the alignment.

Pairwise sequence alignment


A pairwise global alignment can be visualized as a dot-plot matrix. One sequence on the horizontal axis is compared to another on the vertical axis, position for position. The matrix shows where the bases match between the sequences. Longer stretches of similarity appear as diagonal lines. Figure 2 shows a dot-plot between 18S rDNA from the dragonfly Sympetrum sanguineum and the stonefly Isoperla obscura. The disjunction in the diagonal corresponds to a 200 basepair insertion in the Isoperla sequence. The program Dotter (Sonnhammer and Durbin 1995) was used to create the dot-plot.

Figure 2. A dot-plot matrix of 18S rDNA. A filter was applied to show only longer correspondences. The horizontal axis represents the 18S rDNA sequence from Sympetrum sanguineum, and the vertical axis that from Isoperla obscura.

All possible alignments are contained in a dot-plot matrix, and any path that starts in the top left corner and ends in the lower right represents one possible alignment. In the example in Figure 3, the top left cell (0, 0) is empty, to allow for leading gaps. Cell (0, 1) corresponds to the first base of sequence 1, cell (1, 0) to the first base of sequence 2, etc. An alignment between these sequences can be made by moving through the matrix. Start at the top left cell. Moving down inserts a gap in sequence (1); moving a step to the right inserts a gap in sequence (2). The arrows show just one of the many paths through the alignment.
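The dot-plot matrix described above can be sketched in a few lines of Python. This is only an illustration, not how Dotter works: the function names `dot_plot` and `render` and the two short sequences are hypothetical, and no filtering of short matches is applied.

```python
def dot_plot(seq_h, seq_v):
    """Return a matrix (list of rows) that is True where the bases match.

    seq_h runs along the horizontal axis, seq_v along the vertical axis.
    """
    return [[h == v for h in seq_h] for v in seq_v]


def render(matrix):
    """Draw matches as '*' and mismatches as '.' for a quick visual check."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


# Hypothetical sequences, chosen only for illustration.
print(render(dot_plot("ACGTAC", "ACGAC")))
```

Stretches of shared bases show up as diagonal runs of `*`, and a break in the diagonal marks an insertion or deletion in one of the sequences.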

Figure 3. One path through the alignment. The arrows show the path representing the alignment:
1: A-GGGAGA
2: AAC---
The edit distance for this path is 4 indels and 4 substitutions.

An optimal path

Pairwise alignment is usually performed to find the minimum edit distance between the sequences, that is, the lowest number of substitutions, insertions and deletions required to change one sequence into another. The Needleman-Wunsch (often referred to as N-W) algorithm (1970) was originally devised to find a maximum similarity score between two protein amino-acid sequences. However, the method is very general, and has been modified to apply to other minimum edit distance problems. Phillips et al. (2000) describe the N-W algorithm adapted to DNA sequence alignment in their review paper.

The first step is laying out a matrix of the two sequences as in Figure 3. Then scores have to be set; these can be simple (gap cost/substitution cost) or complicated (step-matrices, gap opening/gap extension penalties, etc.). For this example, a simple model of gap cost = 10, substitution cost = 1 is used. The minimum cost of reaching each cell in the matrix (from the top-left cell) is calculated via a wave-front update. Cells can only be reached from their immediate neighbors on the left, above, or left-above. The wave-front update moves through the matrix from left to right, top to bottom. The cost of reaching the neighboring cells is calculated. The cost of moving is added to the accumulated cost of the path, and only the lowest-cost paths are kept. For the detailed description in Figure 4, only the first three bases of each of the sequences are shown.

Figure 4. The wave-front update. Accumulated costs of moving through the matrix. The cost of reaching each cell from its neighbors to the left, upper left and above is calculated. The wave-front update moves from left to right, top to bottom. Gap cost is 10, and substitution cost is 1.

Cell (1,1) is the first to have multiple paths leading to it. Reaching this cell from the cell to the left requires the insertion of a gap, with the associated cost (+10). The path from the cell above also necessitates a gap (+10). Reaching the cell via the diagonal involves no gaps, and matches the bases. The cost for this operation is 0. Only the optimal paths to each cell are kept. The wave-front continues from the updated cell, through the matrix, until the optimal path to reach every cell has been calculated (Figure 5).

Figure 5. The fully updated matrix, with only the optimal paths to each cell.

The next step is the traceback (Figure 6). This starts in the lower right corner and moves along the paths (but against the arrows) to the upper left corner. Thus, only the paths that show optimal full alignments are kept.

Figure 6. The traceback.

These paths contain 13 optimal alignments, all requiring 4 substitutions and 2 gaps, for a total cost of 24.

AGGGAGA  A--ACA
AGGGAGA  AA--CA
AGGGAGA  AA-C-A
AGGGAGA  AA-C-A
AGGGAGA  AAC--A
AGGGAGA  AA-C-A
AGGGAGA  AA-C-A
AGGGAGA  AA--CA
AGGGAGA  AA-C-A
AGGGAGA  AA-C-A
AGGGAGA  AA-C-A
AGGGAGA  AAC--A
AGGGAGA  AAC--A
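The wave-front update and traceback can be sketched as follows. This is a minimal illustration under the essay's simple cost model (gap cost 10, substitution cost 1); the function name `needleman_wunsch` and the test sequences are hypothetical, and the traceback returns only one of possibly several optimal alignments.

```python
def needleman_wunsch(seq1, seq2, gap=10, sub=1):
    """Return (minimum cost, aligned seq1, aligned seq2).

    seq1 runs along the columns and seq2 along the rows, so moving right
    puts a gap in seq2 and moving down puts a gap in seq1.
    """
    rows, cols = len(seq2) + 1, len(seq1) + 1
    cost = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):          # leading gaps in seq2
        cost[0][j] = j * gap
    for i in range(1, rows):          # leading gaps in seq1
        cost[i][0] = i * gap

    def step(i, j):
        """Cost of the diagonal move into cell (i, j)."""
        return 0 if seq1[j - 1] == seq2[i - 1] else sub

    # Wave-front update: left to right, top to bottom.
    for i in range(1, rows):
        for j in range(1, cols):
            cost[i][j] = min(
                cost[i - 1][j - 1] + step(i, j),  # match or substitution
                cost[i][j - 1] + gap,             # gap in seq2
                cost[i - 1][j] + gap,             # gap in seq1
            )

    # Traceback from the lower right corner, following one optimal path.
    a1, a2 = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + step(i, j):
            a1.append(seq1[j - 1]); a2.append(seq2[i - 1]); i -= 1; j -= 1
        elif j > 0 and cost[i][j] == cost[i][j - 1] + gap:
            a1.append(seq1[j - 1]); a2.append("-"); j -= 1
        else:
            a1.append("-"); a2.append(seq2[i - 1]); i -= 1
    return cost[-1][-1], "".join(reversed(a1)), "".join(reversed(a2))


# Hypothetical sequences, not those of the figures.
print(needleman_wunsch("ACGT", "AGT"))
```

Enumerating all optimal alignments, as in Figure 6, would require following every tied predecessor during the traceback rather than the first one found.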

Multiple sequence alignment

Theoretically, the Needleman-Wunsch algorithm can be expanded to an n-dimensional matrix to find all optimal alignments between n sequences. However, the amount of computation required to align more than a few sequences makes a multiple Needleman-Wunsch alignment unfeasible. One solution is to divide one big problem into several smaller ones: a multiple alignment can be rephrased as a series of pairwise alignments.

The Clustal algorithm

Clustal (Higgins et al. 1988, 1989; Thompson et al. 1994, 1997) is one of the most popular computer programs for performing multiple sequence alignment. The current version is ClustalX.

Figure 7. A schematic of the Clustal algorithm. Redrawn from Higgins and Sharp (1988).
1. Calculate distances by N-W pairwise alignment.
2. Construct a neighbor-joining dendrogram from the distances.
3. Pick the 2 closest sequences (or clusters of sequences).
4. Perform a pairwise alignment.
5. Compute a consensus with gaps.
6. Are there sequences or clusters left to align? If yes, return to step 3; if no, write the full alignment.

Clustal moves through two stages of alignment: pairwise and multiple. The pairwise alignment is performed to construct a guide tree for the multiple stage. Early versions of Clustal used UPGMA (Sneath and Sokal 1973) to construct the guide tree, but from Clustal W (Thompson et al. 1994) onward, neighbor-joining (Saitou and Nei 1987) is used instead. By comparing each sequence with all other sequences by pairwise N-W alignments, similarity scores are calculated for each possible pair of sequences. Neighbor-joining is then used to calculate a dendrogram. Clustal will only produce one guide tree, and it will always be perfectly bifurcating. An outline of the Clustal algorithm can be found in Figure 7.

An example of how the Clustal algorithm calculates a guide tree can be found in Figure 8. Using the same penalties as in the earlier example (mismatch 1, gap 10), a distance matrix can be calculated by a series of N-W alignments. The distances show that sequences 1 and 2 are the most similar. These two are grouped together in a cluster. The second most similar sequence (or cluster) is then added to form the dendrogram.

Figure 8. The sequences, the distances and the dendrogram.
1: ACAGC
2: ACGAACG
3: GGGGCG

The dendrogram determines the order in which the sequences are aligned in the multiple alignment stage. Starting with the two closest sequences, a consensus sequence is constructed from one optimal alignment. All sites with substitutions or gaps are replaced by placeholders for 'unknowns' in the consensus.

Seq 1:  ACAGC
Seq 2:  ACGAACG
Cons 1: AC?A??C?

The sequence closest to this pair is then aligned to the consensus:

Cons 1: AC?A??C?
Seq 3:  GGGGCG
Cons 2: ???????C?

The next step is the traceback, where the sequences are aligned to the closest consensus sequence. The consensus sequences can be thought of as ancestral states at the nodes in the guide tree.

The bottom node in the tree has Cons 2 as the ancestral states. Sequence 3 and consensus 1 (at the node of Seq 1 and 2) are the closest in the tree.

Cons 2:    ???????C?
Seq 3:     GGGGCG
Aligned 3: GGGGCG

Cons 2:         ???????C?
Cons 1:         AC?A??C?
Aligned cons 1: ??????-C?

The closest sequences to the aligned consensus 1 are Seq 1 and Seq 2.

Seq 1:          ACAGC
Aligned cons 1: ??????-C?
Aligned Seq 1:  ACAG-C-

Seq 2:          ACGAACG
Aligned cons 1: ??????-C?
Aligned Seq 2:  ACGAA-CG

All sequences are now aligned to the consensus. The final multiple alignment:

Aligned Seq 1: ACAG-C-
Aligned Seq 2: ACGAA-CG
Aligned Seq 3: GGGGCG

When using Clustal, it is important to remember that only a single guide tree is considered, and only a single multiple alignment is produced. Alternative paths are not evaluated, and the only way to discover the effect of alignment order is to manually edit and submit alternative guide trees to Clustal.
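The consensus step, in which every column with a substitution or a gap becomes an 'unknown', can be sketched as below. This illustrates the rule as described in this essay, not Clustal's actual code; the function name `consensus` and the example sequences are hypothetical, and the two inputs are assumed to be already aligned to equal length.

```python
def consensus(aligned_a, aligned_b, unknown="?"):
    """Build a consensus of two aligned sequences.

    A column keeps its base only when both sequences agree on a base;
    columns with a substitution or a gap become the 'unknown' placeholder.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        a if a == b and a != "-" else unknown
        for a, b in zip(aligned_a, aligned_b)
    )


# Hypothetical aligned pair: one gap column and one substitution column.
print(consensus("AC-GT", "ACAGA"))
```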

Parsimony alignment

A parsimony-based approach to multiple sequence alignment is implemented in MALIGN (Wheeler and Gladstein 1994). The current version is 2.7. Central to the MALIGN philosophy is the conviction that the same set of costs (for substitutions, gaps, etc.) should be used in both alignment and the subsequent search for the most parsimonious trees. According to the authors, this is the only way phylogenetic analysis of DNA sequences can be logically consistent. Using parsimony as an optimality criterion in alignment, the best alignments are those that give the shortest phylogenetic trees. MALIGN takes the sequence addition order from phylogenetic trees, and performs multiple alignment by a series of pairwise alignments. The difference from Clustal is that MALIGN keeps track of all costs (gaps, substitutions, etc.) associated with the alignment. The accumulated costs of the alignment will be identical to the tree length of the phylogenetic tree that provided the sequence addition order. MALIGN will then apply heuristic methods such as TBR, branch-swapping, etc., to create new guide trees and evaluate the associated costs. Shorter trees mean more parsimonious alignments. If several evaluated guide trees are found to be equally optimal, MALIGN will output a series of equally parsimonious aligned matrices. Figure 9 shows how the MALIGN algorithm evaluates guide trees to find the most parsimonious alignment.

Figure 9. The MALIGN algorithm. The alignment costs of several guide trees are evaluated. Gap cost = 10, substitution cost = 1.
Guide tree A: multiple alignment 1: -C-, 2: -ACG, 3: GCG. Gaps: 2, Subst: 2, Total: 22.
Guide tree B: multiple alignment 1: -C-, 2: A-CG, 3: GCG. Gaps: 2, Subst: 1, Total: 21.
Guide tree C: multiple alignment 1: -C-, 2: A-CG, 3: GCG. Gaps: 2, Subst: 1, Total: 21.

Guide trees B and C result in equally parsimonious (and in this case, identical) alignments. In the example, using only three terminals makes an exhaustive search possible. With larger datasets, heuristic methods (branch-swapping, TBR, etc.) are used to find shorter trees.

Gaps as characters in static alignments

Both Clustal and MALIGN produce static alignments, with characters arranged in columns and the character states for each taxon in rows. The trees that result from analysis of aligned data are highly dependent on the parameters used for alignment, and a suboptimal multiple alignment will result in suboptimal parsimony trees (Wheeler 2001). Static alignments are produced with a set cost for gaps and base substitutions, but these values are rarely discussed or given enough importance in publications.

In parsimony analyses of static alignments, gaps are usually treated as missing data. When indel information is used, gaps are mostly treated as a fifth nucleotide character state. This is a simplified model, and a potential source of artefacts. Though gaps stem from mutational events, just like base substitutions, the homologies of overlapping contiguous gaps are difficult to interpret. A gap found in several sequences with identical 5' and 3' ends could be the result of a single indel. If gaps are assigned a fifth state, this would result in a highly weighted, but reasonable synapomorphy. On the other hand, if the gaps were only partially overlapping, or one gap a subset of another, the gaps could not be the result of a single indel event. There is no reason to treat all the gap placeholders of such gaps as homologous states, as they cannot have a common history.

Simmons and Ochoterena (2000) seek to salvage the information in gaps of static alignments by devising new gap coding schemes: one simple, and one complex. In simple gap coding, a presence/absence matrix is constructed based on the gaps' placements in the aligned sequences. The following rules apply:

1. Identical gaps (matching 5' and 3' ends) are treated as absence/presence characters.
2. Gaps that are a subset of another gap become individual characters. For the taxa that have the larger gap, the state of the smaller gap character is 'inapplicable'.
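The two rules above can be sketched in Python as follows. This follows the rules only as summarized in this essay, not the full procedure of Simmons and Ochoterena (2000); the function names, the '1'/'0'/'-' state symbols and the three-taxon alignment are hypothetical, and leading or trailing gaps receive no special treatment.

```python
import re


def find_gaps(seq):
    """Return the set of (start, end) positions of contiguous gaps in seq."""
    return {(m.start(), m.end()) for m in re.finditer(r"-+", seq)}


def simple_gap_coding(alignment):
    """Code each distinct gap as a presence/absence character.

    '1' = the taxon has exactly this gap, '0' = it does not,
    '-' = inapplicable, because the taxon has a larger gap containing it.
    """
    gaps = {name: find_gaps(seq) for name, seq in alignment.items()}
    characters = sorted(set().union(*gaps.values()))
    coded = {}
    for name, own in gaps.items():
        row = []
        for start, end in characters:
            if (start, end) in own:
                row.append("1")
            elif any(s <= start and end <= e for s, e in own):
                row.append("-")
            else:
                row.append("0")
        coded[name] = "".join(row)
    return characters, coded


# Hypothetical alignment: the gap in t2 is a subset of the gap in t1.
alignment = {"t1": "AC----GT", "t2": "ACG--AGT", "t3": "ACGTTAGT"}
print(simple_gap_coding(alignment))
```

Each gap is identified by its start and end columns, so two gaps count as the same character only when both ends match.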

Consider the matrix in Figure 10, where gaps are the only informative characters. Numbers refer to characters in the additional presence/absence matrix.

Figure 10. An informative-gap matrix, and corresponding simple gap coding matrix.
1: AA-1-CG--3--GAC
2: AA-1-CG-4-GAC
3: AAGC-2--4-GAC
4: AAGC GAC
5: AAGCCCGGGGAC

Gap 1 appears in taxa 1 and 2. It overlaps with Gap 2, but since they share neither 5' nor 3' ends, they probably stem from different indel events. Gap 3 is found in taxa 1 and 4. Since Gap 4 is a complete subset of Gap 3, it is impossible to know whether Gap 3 is the result of a separate indel event, or of an additional deletion extending Gap 4. The Gap 4 character is therefore coded as 'inapplicable' for taxa 1 and 4.

Complex gap coding implements rules to incorporate more information from gaps with different 5' and 3' ends. This is done by creating step-matrices with asymmetrical characters. Consult the original publication (Simmons and Ochoterena 2000) for these rules, as describing the method would be too lengthy for this essay.

A problem in static alignment is how to deal with contiguous gaps. The default setting in Clustal, and an option in MALIGN, is to have one cost for initiating a gap, and a lower cost for extending it. The logic behind such reasoning is that a gap of any length is likely the result of one single deletion. Giribet and Wheeler (1999) caution against treating contiguous gaps as single characters. Of an example of a gap spanning seven positions they write: "That all seven gaps in a row were actually created by a single deletion of the whole series of bases might well be true but any analysis which creates such a tight dependency of costs among aligned positions will run afoul of the postulates of the phylogenetic analysis of characters- that they are at least logically independent."
Simmons and Ochoterena (2000) dispute this, countering that a gap spanning seven positions is likely to be the result of a single deletion event, and that contiguous gaps should thus be treated as single events whenever possible. They compare this with how morphological characters are coded: "five different petal-pubescence characters-one for each petal- would generally not be coded for a five-merous flower".

Gaps too have a history to tell. Although the mechanisms are different from those that create base substitutions, indel events deserve to be considered in phylogenetic analyses. The history of indel events may be harder to interpret, but there is no justifiable argument for discarding gap data ad hoc.

Optimization methods

Where multiple alignment seeks to maximize positional homologies in a static matrix, optimization methods focus on the character state changes that are implied in a phylogenetic tree. These are extensions of the standard cladistic character optimization methods (Farris 1970). With optimization methods, indels as well as base substitutions are treated as evolutionary events: "transformations linking ancestral and descendent nucleotide sequences" (Giribet and Wheeler 1998). A basepair will never be homologized with another, and no gaps are inserted as placeholders. Character optimization begins with a phylogenetic tree, with DNA sequences as the characters on the leaf nodes. Gap costs have to be explicit, as they are used to calculate the length of the trees.

Optimization alignment

Optimization alignment (Wheeler 1996) was invented by the author of MALIGN, and boldly presented as "The end of multiple sequence alignment in phylogenetics?". The procedure is implemented in the program POY (Wheeler and Gladstein 1997), available from ftp://ftp.amnh.org/pub/people/wheeler/poy/. POY uses phylogenetic trees to calculate the total transformation cost, given the sequences, but no multiple alignment is ever produced. The cost of transforming a sequence can be calculated with the N-W algorithm. DNA sequences of the hypothetical ancestors are constructed as the intersection of the sequences. In the example, the IUPAC codes for ambiguous states are used, e.g. W = A or T. A letter in parentheses means that a gap is a possible state, e.g. (T) = T or gap.

The down-pass (Figure 11) is made to calculate the tree length, and to construct putative ancestral sequences at the nodes. Starting at the apical node, we find the sequences TA and TAG. The ancestral state for these sequences could have been either one, so TA(G) is placed at the node. This requires one gap. Note that the reconstructed ancestral sequences are preliminary and unoptimized in the down-pass.

Figure 11. The down-pass step of POY. Terminals: TTT, TTG, TA and TAG. Nodes, from the top down: TA(G) (+1 gap), TWG (+1 substitution), TTK (+1 substitution).

Moving down, the next branch has TTG. Comparing this to TA(G), we find that the second position could be either an A or a T, so the union of A and T (= W) is used. This requires one substitution. For the third position, one sequence has G and the other has (G). This means that G is a possible state for both sequences, and it is thus assigned to the reconstructed ancestral sequence: TWG. Finding an intersection between two states (G and (G) have the intersection G, W and T have the intersection T, etc.) is not associated with either substitutions or gaps, and therefore carries no cost. The next branch below has the sequence TTT. Compared to the ancestral sequence at the node above (TWG), the second position has the intersection T, and the third position has the union K (= G or T). The total cost for this tree is one gap and two substitutions.

An up-pass (Figure 12) can reconstruct the ancestral states using information from both ancestors and descendants. This follows the basic cladistic optimization methods. If an upper node has a state in common with a lower node, this state is assigned to the upper node. If there is no state in common, the state that has the lower transformation cost is assigned to the upper node. The optimized nodes of the up-pass can never have any influence on the tree length determined in the down-pass.

Figure 12. The up-pass step of POY. Terminals: TTG, TA and TAG. Nodes, from the top down: TAG, TTG, TTK.

Beginning at the node above the root (the root has no ancestors, hence no ancestral state), this node has the sequence TWG (= TTG or TAG). When this is compared to the root node TTK (= TTT or TTG), we find that they have the sequence TTG in common. This is now assigned to the node above the root. The node above has TA(G) (= TAG or TA). There is no common sequence, so the more parsimonious transformation is chosen. TTG → TAG requires one substitution at cost 1, while TTG → TA requires one substitution and one gap. Thus TAG is assigned to this node.
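The intersection/union rule of the down-pass can be sketched per position as below. This is a deliberately simplified illustration, not POY's algorithm: it assumes the sequences are already padded to equal length with '-' (POY determines the correspondences itself), it assumes the terminal sequences of the walkthrough are TA, TAG, TTG and TTT, and the function names are hypothetical.

```python
def join_position(a, b):
    """Combine two sets of possible states at one position.

    A non-empty intersection costs nothing. Otherwise the union is used and
    one event is counted: a gap if one side can only be a gap, else a
    substitution.
    """
    common = a & b
    if common:
        return common, None
    event = "gap" if a == {"-"} or b == {"-"} else "substitution"
    return a | b, event


def down_pass_node(left, right):
    """Return the preliminary ancestral state sets and the events they cost."""
    states, events = [], []
    for a, b in zip(left, right):
        state, event = join_position(a, b)
        states.append(state)
        if event:
            events.append(event)
    return states, events


def as_sets(seq):
    """Turn a padded sequence into one single-state set per position."""
    return [{base} for base in seq]


# Assumed terminals; 'TA-' is TA padded by hand to the length of TAG.
node1, events1 = down_pass_node(as_sets("TA-"), as_sets("TAG"))
node2, events2 = down_pass_node(node1, as_sets("TTG"))
root, events3 = down_pass_node(node2, as_sets("TTT"))
print(events1 + events2 + events3)
```

In this representation the set {"A", "T"} corresponds to the IUPAC code W, {"G", "T"} to K, and {"G", "-"} to the bracketed (G).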

Fixed state optimization

Another optimization method by Wheeler (1999) is fixed state optimization. This is an elegant method that has a similar optimality criterion to optimization alignment, but removes the need for ancestral states to be reconstructed at any point. A reconstructed ancestral state from optimization alignment can be optimal for that node, but suboptimal for the whole tree. If globally suboptimal states are used at a node, the effect will spread throughout the tree. As a result, the tree may be longer than necessary, and the shortest trees may not be found. Fixed state optimization offers a way around this problem, as the pairwise alignments are never used to reconstruct ancestral states. Only the cost of transformation (which will be identical to the ones found by optimization alignment, given the same cost parameters) is used.

Figure 13. Fixed state optimization: calculating the step-matrix. Character states: 1: TTT, 2: TTG, 3: TA, 4: TAG. Gap:substitution = 2:1.

DNA sequences are treated as a single character with as many different character states as there are taxa in the matrix. A step-matrix is constructed from the transformation costs between all possible sequence pairs in the data. The transformation cost is the sum of the substitutions and gaps needed to change one sequence into another, as determined by a weighted (a set gap:substitution cost) N-W alignment. As the sequence is treated like a single character with multiple states, only the states (sequences) of the terminals are possible ancestral sequences at the nodes. The transformation step-matrix (Figure 13) shows all possible transformation costs. A gap has a cost of 2 and any substitution has a cost of 1. Tree length is calculated as for an ordinary multistate character.

Starting with the down-pass (Figure 14), the top terminals have character states 3 and 4. The ancestral state at this node can be either one. One transformation must have taken place, either 3 → 4 or 4 → 3. One gap (cost 2) is needed for this. Moving down to the node below, the state of the new terminal is 2. At this node, the most parsimonious transformation must be chosen. Possible states for this node are then 2, 3 or 4. However, state 3 can be discarded as a possible ancestral state, since 3 → 4 costs 4, and 2 → 4 costs one. During the down-pass, it is impossible to know whether 2 is the ancestral state and a transformation has taken place between this node and the one above, or whether 4 is the ancestral state and a transformation has taken place between the node and the terminal. The preliminary ancestral state for this node will then be 2 or 4. Continuing down the tree, the next encountered state is 1. Here, similar calculations take place: 2 → 1 costs 1, and 4 → 1 costs 4. State 4 is discarded at this node, and the possible ancestral states for the root node are 1 or 2.

Figure 14. The down-pass step in fixed state optimization. Nodes, from the top down: 3 or 4 (+2, gap); 2 or 4 (+1, substitution); 1 or 2 (+1, substitution). Total cost 4.

The total transformation costs are summed to give the tree length: 3 → 4 costs 2, 2 → 4 costs 1 and 1 → 2 costs 1, for a total tree length of 4.
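The step-matrix and the resulting tree length can be sketched as below. Several things here are assumptions rather than the author's procedure: the four state sequences (TTT, TTG, TA, TAG) are taken from the walkthrough, the tree shape (((3, 4), 2), 1) is read off the down-pass description, and the tree length is computed with a generic Sankoff-style dynamic programming pass over the step-matrix instead of the state-discarding narrative above. All function names are hypothetical.

```python
def nw_cost(a, b, gap=2, sub=1):
    """Minimum N-W transformation cost between two sequences."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            cur.append(min(
                prev[j - 1] + (sub if a[i - 1] != b[j - 1] else 0),
                prev[j] + gap,
                cur[j - 1] + gap,
            ))
        prev = cur
    return prev[-1]


def step_matrix(states, gap=2, sub=1):
    """Transformation cost between every pair of state labels."""
    return {
        (s, t): nw_cost(states[s], states[t], gap, sub)
        for s in states for t in states
    }


def tree_length(tree, matrix, states):
    """Sankoff-style down-pass; tree is nested 2-tuples of state labels."""
    def costs(node):
        if not isinstance(node, tuple):
            # A terminal can only have its own observed state.
            return {s: (0 if s == node else float("inf")) for s in states}
        left, right = (costs(child) for child in node)
        return {
            s: min(matrix[s, t] + left[t] for t in states)
            + min(matrix[s, t] + right[t] for t in states)
            for s in states
        }
    return min(costs(tree).values())


# Assumed state sequences and tree shape, for illustration only.
states = {1: "TTT", 2: "TTG", 3: "TA", 4: "TAG"}
matrix = step_matrix(states)
print(tree_length((((3, 4), 2), 1), matrix, states))
```

Because only the terminal sequences are allowed as ancestral states, the search at each node is over a fixed, small set of labels, which is what makes the method fast.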

An up-pass (Figure 15) can be performed to find exactly where the transformations take place. This starts at the root node and moves up through the tree. The basic rule is that if the upper node has a possible state in common with the lower node, this state is assigned to the upper node. If there are no possible states in common, the most parsimonious transformation is chosen. Here, the root node has the possible states 1, 2 and the node above 2, 4. State 2 is then assigned to the node above the root. Moving up, the next node has states 3, 4. Since 2 → 4 costs 1 and 2 → 3 costs 3, state 4 is assigned to this node. The up-pass is of course only used to find where transformations take place, and has no impact on tree length.

Figure 15. The up-pass in fixed state optimization. Total cost 4.

When performing optimization of sequence data, the trees are always rooted, as there will be asymmetric costs associated with gap transformations. There is no possibility of static gaps appearing in the ancestral states. Transforming a base to a gap has a cost, but a gap will never be replaced by a base.

Sensitivity analysis

Congruence between datasets

Since the optimal alignment weights for a phylogenetic dataset cannot be measured directly, an external measure must be used to choose between alignments. Wheeler (1995) suggests comparing trees from two datasets, and picking those alignments that minimize character incongruence (Mickevich and Farris 1981, Farris et al. 1994). The incongruence length difference (ILD) is the number of extra steps in a tree from a combined dataset compared to the sum of the individual tree lengths, expressed as a fraction of the combined length.

ILD = (Length combined - Sum of lengths of individual datasets) / Length combined

If there is no character incongruence between the datasets, no extra steps will be needed for the combined set. If the tree from the combined dataset is longer than the sum of the lengths of the individual datasets, this is due to incongruence.

In a sensitivity analysis, several alignments, or optimization alignments, are performed under different weighting schemes. Another, independent dataset for the same taxa (usually one not affected by alignment, such as morphology) is added to the first, and the tree lengths are compared to those stemming from the individual datasets.

If the trees produced from the datasets are of very different lengths, as would be expected if a large molecular dataset is compared to a smaller morphological one, another incongruence measure, the rescaled incongruence length difference (RILD; Wheeler and Hayashi 1998), can be used. The maximum length is the tree length of the least optimal tree, an unresolved bush.

RILD = (Length combined - Sum of lengths of individual datasets) / (Max length combined - Sum of lengths of individual datasets)
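The two measures above are simple ratios, and a short sketch makes the arithmetic explicit. The function names and the tree lengths used here are hypothetical, chosen only to show how the numbers combine; they are not taken from any dataset.

```python
def ild(length_combined, individual_lengths):
    """Incongruence length difference, as a fraction of the combined length."""
    extra_steps = length_combined - sum(individual_lengths)
    return extra_steps / length_combined


def rild(length_combined, individual_lengths, max_length_combined):
    """Rescaled ILD: extra steps relative to the maximum possible extra steps."""
    extra_steps = length_combined - sum(individual_lengths)
    return extra_steps / (max_length_combined - sum(individual_lengths))


# Hypothetical tree lengths: two datasets of 60 and 40 steps, a combined
# tree of 110 steps, and an unresolved bush of 200 steps.
example_ild = ild(110, [60, 40])
example_rild = rild(110, [60, 40], 200)
print(example_ild, example_rild)
```

In a sensitivity analysis these functions would be evaluated once per weighting scheme, and the scheme with the lowest value preferred.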

Visualization of congruence by Navajo rugs

If no independent dataset is available, a sensitivity analysis can still be performed by visualizing the effects of alignment weighting schemes in congruence plots (Wheeler 1995), informally known as 'Navajo rugs'. A Navajo rug represents the support for one group under different sets of parameters; it looks like a grid of squares, with one parameter on the X-axis and another on the Y-axis. An inventive use of Navajo rugs was published by Schulmeister et al. (2002), where Navajo rugs were placed at the nodes of a preferred (most congruent) phylogenetic tree to illustrate the parameter sensitivity of each monophyletic group.

To give an example of the usefulness of Navajo rugs, consider the imaginary group Snarkivora (Carroll). This taxon has four members: Brilligia, Toveis, Borogovia and Momeus. The traditional groups Slithyformes and Vorpaloidea contain Brilligia and Toveis, and Borogovia and Momeus, respectively. A new group, Outgrabia, containing Toveis and Momeus, is proposed. The noncoding rubroquinin spacer is sequenced for all taxa and the molecular data are analyzed under different parameters. The values 1, 2 and 4 are used for the gap:change and transition:transversion ratios, in a total of nine analyses. The different parameter sets give conflicting most parsimonious trees, as can be seen in Figure 16.

Figure 16. Different parameter settings give conflicting most parsimonious trees. [A 3 × 3 grid of trees for Brilligia, Toveis, Borogovia and Momeus; columns: gap:change; rows: transition:transversion.]

For a reader, the parameter sensitivity of a certain group is more quickly gathered from a Navajo rug than from comparing all the different trees. The information about the groups of Snarkivora is shown as Navajo rugs in Figure 17.

Figure 17. The information from the phylogenetic trees can be condensed into Navajo rugs. [Three rugs: Slithyformes, Vorpaloidea, Outgrabia.]

A black square indicates that the group is monophyletic, while a white square shows that the group is contradicted. If the tree is unresolved, so that the group is neither supported nor contradicted, a gray square is used. The Slithyformes rug shows that this group was monophyletic under 4 parameter sets, unresolved (but not contradicted) in 3, and contradicted in 2.

However, one dataset cannot validate itself, even if it is analyzed under different weighting schemes. Even if a group appears under a wide range of alignment parameters, this should not necessarily be interpreted as increased support for the group.

Sequence alignment and optimization methods are tools for extracting historical information from sequences. When methods and parameters are carefully evaluated by sensitivity analysis, the sequence transformations known as insertions and deletions become sources of phylogenetic information. In the spirit of total evidence analysis, all information should be used when searching for the most parsimonious explanation.
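As a small illustration of the Navajo rugs described above, the sketch below stores the status of one group under each of the nine parameter sets and prints it as a text grid. The assignment of statuses to individual cells is hypothetical: only the counts for Slithyformes (4 monophyletic, 3 unresolved, 2 contradicted) come from the text, and the placement in the grid is invented for the example. The names `render_rug` and `summarize` and the text symbols are likewise assumptions.

```python
from collections import Counter

GAP_CHANGE = (1, 2, 4)   # X-axis: gap:change
TS_TV = (1, 2, 4)        # Y-axis: transition:transversion

# Text stand-ins for the black, gray and white squares of a printed rug.
SYMBOL = {"monophyletic": "#", "unresolved": "+", "contradicted": "."}

# Hypothetical cell placement; only the 4/3/2 totals follow the text.
slithyformes = {
    (1, 1): "contradicted", (2, 1): "unresolved",   (4, 1): "monophyletic",
    (1, 2): "contradicted", (2, 2): "unresolved",   (4, 2): "monophyletic",
    (1, 4): "unresolved",   (2, 4): "monophyletic", (4, 4): "monophyletic",
}


def render_rug(status):
    """Return the rug as text, one row per transition:transversion value."""
    rows = []
    for ts_tv in TS_TV:
        rows.append(" ".join(SYMBOL[status[(gap, ts_tv)]] for gap in GAP_CHANGE))
    return "\n".join(rows)


def summarize(status):
    """Count how many parameter sets fall into each category."""
    return Counter(status.values())


rug_text = render_rug(slithyformes)
counts = summarize(slithyformes)
print(rug_text)
print(dict(counts))
```

The same dictionary layout could hold one rug per group, so that rugs for Vorpaloidea and Outgrabia are rendered by the same function.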

References

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., Lipman, D. J. (1990) Basic local alignment search tool. J. Mol. Biol. 215:
Carroll, L. (1871) Through the Looking-Glass, and What Alice Found There.
Farris, J. S. (1970) Methods of computing Wagner trees. Syst. Zool. 26:
Farris, J. S., Källersjö, M., Kluge, A. G., Bult, C. (1995) Testing significance of incongruence. Cladistics 10:
Giribet, G., Wheeler, W. C. (1999) On gaps. Mol. Phylogenet. Evol. 13(1):
Giribet, G., Ribera, C. (2000) A review of arthropod phylogeny: new data based on ribosomal DNA sequences and direct character optimization. Cladistics 16:
Gladstein, D. S., Wheeler, W. C. (1997) POY: the optimization of alignment characters. Program and documentation. American Museum of Natural History, New York. Current version (3.0) available from ftp://ftp.amnh.org/pub/people/wheeler/poy
Green, P. (1999) PHRAP documentation. Available online from
Higgins, D. G., Sharp, P. M. (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73:
Higgins, D. G., Sharp, P. M. (1989) Fast and sensitive multiple sequence alignment on a microcomputer. CABIOS 5(2):
Lipscomb, D. L., Farris, J. S., Källersjö, M., Tehler, A. (1998) Support, ribosomal sequences and the phylogeny of the eukaryotes. Cladistics 14:

Mickevich, M. F., Farris, J. S. (1981) The implications of congruence in Menidia. Syst. Zool. 30(3):
Needleman, S. B., Wunsch, C. D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:
Phillips, A., Janies, D., Wheeler, W. (2000) Multiple sequence alignment in phylogenetic analysis. Mol. Phylogenet. Evol. 16(3):
Saitou, N., Nei, M. (1987) The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 6:
Sanger, F., Coulson, A. R., Hong, G. F., Hill, D. F., Petersen, G. B. (1982) Nucleotide sequence of bacteriophage lambda DNA. J. Mol. Biol. 162(4):
Schlich, R. (2001) Analysing intrapersonal variability of travel behaviour using the sequence alignment method. European Transport Conference, Cambridge, September. Available online at
Schulmeister, S., Wheeler, W. C., Carpenter, J. M. (2002) Simultaneous analysis of the basal lineages of Hymenoptera (Insecta) using sensitivity analysis. Cladistics 18:
Simmons, M. P., Ochoterena, H. (2000) Gaps as characters in sequence-based phylogenetic analysis. Syst. Biol. 49(2):
Sneath, P. H. A., Sokal, R. R. (1973) Numerical Taxonomy. W. H. Freeman, San Francisco. pages
Soltis, D. E., Soltis, P. S., Chase, M. W., Mort, M. E., Albach, D. C., Zanis, M., Savolainen, V., Hahn, W. H., Hoot, S. B., Fay, M. F., Axtell, M., Swensen, S. M., Prince, L. M., Kress, J. W., Nixon, K. C., Farris, J. S. (2000) Angiosperm phylogeny inferred from 18S rDNA, rbcL, and atpB sequences. Bot. J. Linn. Soc. 133(4):

Sonnhammer, E. L. L., Durbin, R. (1995) A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene 167: GC1-10
Staden, R. (1996) The Staden sequence analysis package. Mol. Biotechnol. 5:
Tehler, A., Farris, J. S., Lipscomb, D. L., Källersjö, M. (2000) Phylogenetic analyses of the fungi based on large rDNA data sets. Mycologia 92(3):
Thompson, J. D., Higgins, D. G., Gibson, T. J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Res. 22(22):
Thompson, J. D., Gibson, T. J., Plewniak, F., Jeanmougin, F., Higgins, D. G. (1997) The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25(24):
Venter, J. C. et al. (2001) The sequence of the human genome. Science 291:
Wheeler, W. C., Gladstein, D. L. (1994) MALIGN. Program and documentation. American Museum of Natural History, New York. Current version 2.7 (2002) available from ftp://ftp.amnh.org/people/wheeler/malign/
Wheeler, W. C. (1995) Sequence alignment, parameter sensitivity, and the phylogenetic analysis of molecular data. Syst. Biol. 44(3):
Wheeler, W. (1996) Optimization alignment: The end of multiple sequence alignment in phylogenetics? Cladistics 12: 1-9
Wheeler, W. C., Hayashi, C. Y. (1998) The phylogeny of extant chelicerate orders. Cladistics 14:
Wheeler, W. (1999) Fixed character states and the optimization of molecular sequence data. Cladistics 15:

Wheeler, W. (2001) Homology and the optimization of DNA sequence data. Cladistics 17: S3-S11
Wilson, W. C. (1998) Environment and Planning A, volume 30, pages
Wuyts, J., Van de Peer, Y., Winkelmans, T., De Wachter, R. (2002) The European database on small subunit ribosomal RNA. Nucleic Acids Res. 30,
Wuyts, J., De Rijk, P., Van de Peer, Y., Winkelmans, T., De Wachter, R. (2001) The European Large Subunit Ribosomal RNA database. Nucleic Acids Res. 29(1):

Appendix: Internet resources

The Staden Package website
BLAST search at GenBank
European Ribosomal RNA database
Download ClustalW and ClustalX: ftp://ftp-igbmc.u-strasbg.fr/pub/clustalx/
Download POY and MALIGN: ftp://ftp.amnh.org/people/wheeler/malign/
Download Dotter
