Bridging the gap- Establishing homology by sequence alignment and optimization methods.


Rasmus Hovmöller, 2003


Contents

Introduction
Finding similarities
Homology in DNA data
Structural similarity and positional homology
Manual alignment
Algorithmic alignment
Local and global alignment
BLAST search
Shotgun assembly
Pairwise sequence alignment
An optimal path
Multiple sequence alignment
The Clustal algorithm
Parsimony alignment
Gaps as characters in static alignments
Optimization methods
Optimization alignment
Fixed state optimization
Sensitivity analysis
Congruence between datasets
Visualization of congruence by Navajo rugs
References
Appendix: Internet resources


Introduction

Finding similarities

Alignment is a collective term for methods used to find similar sequences in strings of data. Finding matching patterns in strings has applications not only in biology, but also in sociological analyses of behavior patterns (Wilson 1998) and in logistics (Schlich 2001). This essay will focus on the sequence alignment aspect of DNA homology, the basepair to basepair homology that is used in phylogenetic systematics.

Homology in DNA data

The concept of homology is central in phylogenetic systematics. From a morphological point of view, homology can be uncontroversial (e.g. the legs of a cow are homologous to the legs of a chicken, but not to the legs of a fly) or difficult (are the wings of insects homologous to the gills of crustaceans?). The methods for establishing primary (or putative) homology in a morphological context are comparative methods: where similar structures show variation, the variants are assigned character states. For example, the petals of flowering plants show a lot of variation in shape, color, number and function, but it is safe to assume that the petals in one plant are homologous to those of another.

Structural similarity and positional homology

Structural similarity is not applicable for DNA data, as DNA only has four possible states (A, C, G and T). An A in one sequence is indistinguishable from any other A in any sequence. The only homology assessment that can be made for DNA data is positional homology, but this can be difficult due to length differences of the sequences. There are, however, instances where positional similarity is applicable for sequence data. For protein coding genes, the nucleotide triplets (codons) that code for amino acids make a useful guide, since insertions and deletions also appear in threes. The indels of other kinds of molecular data can be of any length, and homologies have to be assigned by another criterion.

Another source of positional similarity is information from ribosomal secondary structures. Ribosomal coding DNA (rDNA) sequences are frequently used in molecular systematics (e.g. metazoans, Lipscomb et al. 1998; arthropods, Giribet and Ribera 2000; fungi, Tehler et al. 2000; angiosperms, Soltis et al. 2000). The positional homology of rDNA data can be argued from the secondary structure of the rRNA that makes up the ribosomes. The complementary nature of nucleotides that gives DNA its double-helix form also creates the secondary structure of the rRNA: rRNA can fold upon itself to create complementary regions (Figure 1).

Figure 1. rRNA forms loops and stems due to its self-complementary nature.
rRNA sequence (= primary structure): GAGUAAAGUUAAUACCUUUGCUC
Secondary structure: the strands GAGUAAAG and CUCGUUUC pair to form the stem; the bases between them form the loop.

The structures formed by complementary strands are called stems, while the single-stranded regions in between are known as loops. A database (Wuyts et al. 2001, Wuyts et al. 2002) of ribosomal secondary structures is available online (see "Internet resources" below). This is a collection of rDNA sequences gathered from GenBank, with added information from secondary structure. The secondary structure of E. coli ribosomes is used as a template to predict the secondary structure of new sequences. Prediction is based on assumptions such as that the sequences coding for the stem regions are more conserved than those coding for the loop regions, as a substitution in a loop would not affect the secondary structure.

Manual alignment

A very common method is alignment by eye. For protein coding DNA, manual alignment is a reliable method due to triplets (see above), but when performed on non protein-coding data, aesthetics is the only criterion used in assigning homology. Alignment by eye is often used a posteriori to algorithmic alignment. Why this is done, and how, strangely remains unanswered in most publications.
The human mind seeks pattern in everything, even DNA sequences, and can do a decent job of finding homologous regions in DNA. However, alignment by eye is arbitrary, unrepeatable and rarely consistent.

Algorithmic alignment

Alignments created by algorithms may not be pleasing to the eye, but they were created using a defined set of parameters. Input the same sequences, use the same parameters and you get the same alignment. There are two different kinds of algorithmic alignment that apply to phylogenetic systematics, local and global. The former is only mentioned briefly here, as it has some uses, but none for analyzing the relationships of organisms. The latter is described in greater detail, and the discussion is based more on ideas than on mathematics. Static alignment is compared to optimization methods, and the role of the gap as a placeholder or character state is discussed.

Local and global alignment

BLAST search

Local alignments are not used to find the optimal alignment between sequences, but to search for longer uninterrupted stretches of similarity. The BLAST (Altschul 1990) algorithm basically works like a spellchecker. A target sequence is compared to other sequences for maximum segment pairs (MSP), the longest sequence of identical basepairs. A cutoff value specifies the minimum value considered as a possible MSP. Consider this local alignment search with a target sequence and two other sequences.

Target: AAAAA AAAAA AAAAA
Seq. 1: AAAAC AAAAC AAAAC
Seq. 2: AAAAA CCCCC CCCCC

If the cutoff value is set at 5, the top scoring MSP would be in sequence 2. Although BLAST can be used to find all local MSPs that are higher than a specified cutoff value, local alignment is not useful for finding basepair to basepair homologies. Local alignment algorithms are very fast, and are more useful for mining large databases. There is an online version of BLAST that can check any submitted sequence against all of GenBank. When new genes are discovered from genomic data (see below), a BLAST search can be used to find similarities with known genes from known genomes such as Drosophila melanogaster or Caenorhabditis elegans. Similarity scores from a BLAST search can identify a gene and its associated function, or at least place the gene in a gene family (e.g. hemoglobins, cytochromes, immunoglobulins etc). BLAST searches are also useful for sequences about to be used in phylogenetic systematics. The often used universal primers are known to amplify contaminants as well as the target organism. This is especially important if the DNA is suspected to be degraded, as in older material from museum collections. A BLAST search on GenBank can quickly reveal if you are sequencing the right material, or if you are about to base your phylogenetic hypothesis on Aspergillus fungi.

Shotgun assembly

Another application of local alignment is the shotgun approach to sequence assembly (Sanger 1982), as implemented in PHRAP (online documentation, Green 1999) or the Staden Package (Staden 1996). Shotgun assembly is used when there are several partially overlapping fragments that together form a longer sequence. The fragments are checked against each other for MSPs, and when matches are found the fragments are joined in a contig. More fragments are checked for overlaps, and any overlapping fragments found are added to the contig. The shotgun approach was used in sequencing the human nuclear genome (Venter et al. 2001). This genome of 2.91 billion basepairs was assembled from randomly sequenced fragments of a mean length of only 543 basepairs.

Where local alignment tries to find the maximal overlap between two sequences, global alignment is about finding homologous regions throughout the alignment.

Pairwise sequence alignment

A pairwise global alignment can be visualized as a dot-plot matrix. One sequence on the horizontal axis is compared to another on the vertical axis, position for position. The matrix shows where the bases match between the sequences. Longer stretches of similarity appear as diagonal lines. Figure 2 shows a dot-plot between 18S rDNA from the dragonfly Sympetrum sanguineum and the stonefly Isoperla obscura. The disjunction in the diagonal corresponds to a 200 basepair insertion in the Isoperla sequence. The program Dotter (Sonnhammer and Durbin 1995) was used to create the dot-plot.

Figure 2. A dot-plot matrix of 18S rDNA. A filter was applied to only show longer correspondences. The horizontal axis represents the 18S rDNA sequence from Sympetrum sanguineum, and the vertical axis Isoperla obscura.

All possible alignments are contained in a dot-plot matrix, and any path that starts in the top left corner and ends up in the lower right represents one possible alignment. In the example in Figure 3, the top left cell (0, 0) is empty, to allow for leading gaps. Cell (0, 1) corresponds to the first base of sequence 1, cell (1, 0) to the first base of sequence 2, etc. An alignment between these sequences can be made by moving through the matrix. Start at the top left cell. Moving down inserts a gap in sequence 1, and moving a step to the right inserts a gap in sequence 2. The arrows show just one of the many paths through the alignment.
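The dot-plot idea described above can be sketched in a few lines of Python. This is a minimal illustration, not how Dotter works internally: the function names `dot_plot` and `render` and the two short sequences are assumptions made for this example, and no filtering of short matches is applied.

```python
def dot_plot(horizontal, vertical):
    """Return rows of booleans: cell [i][j] is True when vertical[i] == horizontal[j]."""
    return [[v == h for h in horizontal] for v in vertical]


def render(matrix):
    """Draw matches as '*' and mismatches as '.' so that diagonals are easy to spot."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)


# Illustrative sequences only; the vertical one lacks one base of the horizontal one.
plot = dot_plot("GATTACA", "GATACA")
print(render(plot))
```

Runs of matches show up as diagonals of asterisks, and a length difference between the sequences shifts the diagonal sideways, which is the kind of disjunction described for Figure 2.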

Figure 3. One path through the alignment. The arrows show the path representing the alignment:
1: A-GGGAGA
2: AAC---
The edit path distance for this path is 4 indels and 4 substitutions.

An optimal path

Pairwise alignment is usually performed to find the minimum edit distance between the sequences, that is, the lowest number of substitutions, insertions and deletions required to change one sequence into another. The Needleman-Wunsch (often referred to as N-W) algorithm (1970) was originally devised to find a maximum similarity score between two protein amino-acid sequences. However, the method described is very general, and has been modified to apply to other minimum edit distance problems. Phillips et al. describe the N-W algorithm adapted to DNA sequence alignment in their review paper (2000). The first step is laying out a matrix of the two sequences as in Figure 3. Then scores have to be set; these can be simple (gap cost/substitution cost) or complicated (step matrices, gap opening/gap extension penalties etc). For this example, a simple model of gap cost = 10, substitution cost = 1 is used. The minimum cost of reaching each cell in the matrix (from the top-left cell) is calculated via a wave-front update. Cells can only be reached from their immediate neighbors on the left, above, or left-above. The wave-front update moves through the matrix from left to right, top to bottom. The cost of reaching the neighboring cells is calculated. The cost of moving is added to the accumulated cost of the path, and only the lowest costing paths are kept. For the detailed description in Figure 4, only the first three bases of each of the sequences are shown.

Figure 4. The wave-front update. Accumulated costs of moving through the matrix. The cost of reaching each cell from its neighbors to the left, upper left and above is calculated. The wave-front update moves from left to right, top to bottom. Gap cost is 10, and substitution cost is 1.

Cell (1,1) is the first to have multiple paths leading to it. Reaching this cell from the cell to the left requires the insertion of a gap with the associated cost (+10). The path from the cell above also necessitates a gap (+10). Reaching the cell via the diagonal involves no gaps, and matches the bases. The cost for this operation is 0. Only the optimal paths to each cell are kept. The wave-front continues from the updated cell, through the matrix, until the optimal path to reach every cell has been calculated (Figure 5).

Figure 5. The fully updated matrix, with only the optimal paths to each cell.
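The wave-front update can be written as a small dynamic program. The sketch below is a minimal version under the simple cost model used in the text (gap cost 10, substitution cost 1); the function name `nw_cost_matrix` and the example sequences are assumptions made for illustration, and only accumulated costs are stored, not the arrows.

```python
def nw_cost_matrix(seq1, seq2, gap=10, subst=1):
    """Fill the Needleman-Wunsch cost matrix; seq1 runs along columns, seq2 along rows."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    cost = [[0] * cols for _ in range(rows)]
    # The first row and column can only be reached by leading gaps.
    for j in range(1, cols):
        cost[0][j] = j * gap
    for i in range(1, rows):
        cost[i][0] = i * gap
    # Wave-front: left to right, top to bottom.
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = cost[i - 1][j - 1] + (0 if seq1[j - 1] == seq2[i - 1] else subst)
            from_above = cost[i - 1][j] + gap
            from_left = cost[i][j - 1] + gap
            cost[i][j] = min(diagonal, from_above, from_left)
    return cost


# Illustrative sequences; the lower right cell holds the minimum edit cost.
matrix = nw_cost_matrix("AGGGAGA", "AACA")
print(matrix[-1][-1])
```

The lower right cell holds the cost of the cheapest full alignment. Here the length difference forces three gaps, and the C cannot match any base of the longer sequence.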

The next step is the traceback (Figure 6). This starts in the lower right corner and moves along the paths (but against the arrows) to the upper left corner. Thus, only the paths that show optimal full alignments are kept.

Figure 6. The traceback.

These paths contain 13 optimal alignments, all requiring 4 substitutions and 2 gaps (total cost 24).

AGGGAGA  A--ACA
AGGGAGA  AA--CA
AGGGAGA  AA-C-A
AGGGAGA  AA-C-A
AGGGAGA  AAC--A
AGGGAGA  AA-C-A
AGGGAGA  AA-C-A
AGGGAGA  AA--CA
AGGGAGA  AA-C-A
AGGGAGA  AA-C-A
AGGGAGA  AA-C-A
AGGGAGA  AAC--A
AGGGAGA  AAC--A
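A traceback that collects every equally optimal alignment can be sketched as follows. This is an illustrative implementation under the same simple cost model (gap 10, substitution 1); the function name `optimal_alignments` and the tiny example are assumptions, and instead of storing arrows the sketch rechecks which neighbors could have produced each cell's cost.

```python
def optimal_alignments(seq1, seq2, gap=10, subst=1):
    """Return (minimum cost, list of all optimal alignments as (row1, row2) pairs)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    cost = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        cost[0][j] = j * gap
    for i in range(1, rows):
        cost[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = 0 if seq1[j - 1] == seq2[i - 1] else subst
            cost[i][j] = min(cost[i - 1][j - 1] + step,
                             cost[i - 1][j] + gap,
                             cost[i][j - 1] + gap)

    results = []

    def walk(i, j, top, bottom):
        # top and bottom are built back to front and reversed at the end.
        if i == 0 and j == 0:
            results.append((top[::-1], bottom[::-1]))
            return
        if i > 0 and j > 0:
            step = 0 if seq1[j - 1] == seq2[i - 1] else subst
            if cost[i][j] == cost[i - 1][j - 1] + step:
                walk(i - 1, j - 1, top + seq1[j - 1], bottom + seq2[i - 1])
        if j > 0 and cost[i][j] == cost[i][j - 1] + gap:
            walk(i, j - 1, top + seq1[j - 1], bottom + "-")  # gap in seq2
        if i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            walk(i - 1, j, top + "-", bottom + seq2[i - 1])  # gap in seq1

    walk(rows - 1, cols - 1, "", "")
    return cost[-1][-1], results


best, alignments = optimal_alignments("AA", "A")
for top, bottom in alignments:
    print(top, bottom)
```

Even this two-base example has two equally cheap alignments, which illustrates why a single reported alignment is only one of possibly many optimal ones.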

Multiple sequence alignment

Theoretically, the Needleman-Wunsch algorithm can be expanded to an n-dimensional matrix to find all optimal alignments between n sequences. The amount of computation required to align more than a few sequences makes a multiple Needleman-Wunsch alignment unfeasible. One solution is dividing one big problem into several smaller ones: one multiple alignment can be re-phrased as a series of pairwise alignments.

The Clustal algorithm

Clustal (Higgins et al. 1988, 1989; Thompson et al. 1994, 1997) is one of the most popular computer programs for performing multiple sequence alignment. The current version is ClustalX.

Figure 7. A schematic of the Clustal algorithm. Redrawn from Higgins and Sharp (1988).
1. Calculate distances by N-W pairwise alignment.
2. Construct a neighbor-joining dendrogram from the distances.
3. Pick the 2 closest sequences (or clusters of sequences).
4. Pairwise alignment.
5. Compute a consensus with gaps.
6. Are there sequences or clusters left to align? If yes, return to step 3. If no, write the full alignment.

Clustal moves through two stages of alignment: pairwise and multiple. The pairwise alignment is performed to construct a guide tree for the multiple stage. Early versions of Clustal used UPGMA (Sneath and Sokal 1973) to construct the guide tree, but from Clustal W (Thompson et al. 1994) onward, neighbor-joining (Saitou and Nei 1987) is used instead. By comparing each sequence with all other sequences by pairwise N-W alignments, similarity scores are calculated for each possible pair of sequences. Neighbor-joining is then used to calculate a dendrogram. Clustal will only produce one guide tree, and it will always be perfectly bifurcating. An outline of the Clustal algorithm can be found in Figure 7.

An example of how the Clustal algorithm calculates a guide tree can be found in Figure 8. Using the same penalties as in the earlier example (mismatch 1, gap 10), a distance matrix can be calculated by a series of N-W alignments. The distances tell that sequences 1 and 2 are the most similar. These two are grouped together in a cluster. The second most similar sequence (or cluster) is then added to form the dendrogram.

Figure 8. The sequences, the distances and the dendrogram.
1: ACAGC
2: ACGAACG
3: GGGGCG

The dendrogram determines the order in which the sequences are aligned in the multiple alignment stage. Starting with the two closest sequences, a consensus sequence is constructed from one optimal alignment. All sites with substitutions or gaps are replaced by placeholders for 'unknowns' in the consensus.

Seq 1:  ACAGC
Seq 2:  ACGAACG
Cons 1: AC?A??C?

The sequence closest to this pair is then aligned to the consensus:

Cons 1: AC?A??C?
Seq 3:  GGGGCG
Cons 2: ???????C?

The next step is the traceback, where the sequences are aligned to the closest consensus sequence. The consensus sequences can be thought of as ancestral states at the nodes in the guide tree.

The bottom node in the tree has Cons 2 as the ancestral states. Sequence 3 and consensus 1 (at the node of Seq 1 and 2) are the closest in the tree.

Cons 2:         ???????C?
Seq 3:          GGGGCG
Aligned Seq 3:  GGGGCG

Cons 2:         ???????C?
Cons 1:         AC?A??C?
Aligned cons 1: ??????-C?

The closest sequences to the aligned consensus 1 are Seq 1 and Seq 2.

Seq 1:          ACAGC
Aligned cons 1: ??????-C?
Aligned Seq 1:  ACAG-C-

Seq 2:          ACGAACG
Aligned cons 1: ??????-C?
Aligned Seq 2:  ACGAA-CG

All sequences are now aligned to the consensus. The final multiple alignment:

Aligned Seq 1: ACAG-C-
Aligned Seq 2: ACGAA-CG
Aligned Seq 3: GGGGCG

When using Clustal, it is important to remember that only a single guide tree is considered, and only a single multiple alignment is produced. Alternative paths are not evaluated, and the only method of discovering the effect of alignment order is to manually edit and submit alternative guide trees to Clustal.
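The consensus step of the walkthrough, where every site with a substitution or a gap becomes an 'unknown', is simple enough to sketch directly. This is an illustration of that single step only, not of Clustal's actual profile handling; the function name `consensus` and the two aligned strings are assumptions made for the example.

```python
def consensus(aligned_a, aligned_b, unknown="?"):
    """Keep a base where both aligned sequences agree; otherwise emit the placeholder."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        a if a == b and a != "-" else unknown
        for a, b in zip(aligned_a, aligned_b)
    )


# Hypothetical pairwise alignment with one gap and one substitution.
print(consensus("AC-AGC", "ACGAAC"))
```

Because the placeholder discards which bases disagreed, every later sequence is aligned against less information, which is one reason the order given by the guide tree matters.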

Parsimony alignment

A parsimony based approach to multiple sequence alignment is implemented in MALIGN (Wheeler and Gladstein 1994). The current version is 2.7. Central in the MALIGN philosophy is the conviction that the same set of costs (for substitutions, gaps etc.) should be used in both alignment and subsequent searches for the most parsimonious trees. According to the authors, this is the only way phylogenetic analysis of DNA sequences can be logically consistent. Using parsimony as an optimality criterion in alignment, the best alignments are those that give the shortest phylogenetic trees. MALIGN takes the sequence addition order from phylogenetic trees, and performs multiple alignment by a series of pairwise alignments. The difference from Clustal is that MALIGN will keep track of all costs (gaps, substitutions etc.) associated with the alignment. The accumulated costs of the alignment will be identical to the tree length of the phylogenetic tree that provided the sequence addition order. MALIGN will then perform heuristic methods such as TBR, branch-swapping etc. to create new guide trees and evaluate the associated costs. Shorter trees mean more parsimonious alignments. If several evaluated guide trees are found to be equally optimal, MALIGN will output a series of equally parsimonious aligned matrices. Figure 9 shows how the MALIGN algorithm evaluates guide trees to find the most parsimonious alignment.

Figure 9. The MALIGN algorithm. The alignment costs of several guide trees are evaluated. Gap cost = 10, substitution cost = 1.
Guide tree A. Multiple alignment: 1: -C-, 2: -ACG, 3: GCG. Gaps: 2, Subst: 2, Total: 22
Guide tree B. Multiple alignment: 1: -C-, 2: A-CG, 3: GCG. Gaps: 2, Subst: 1, Total: 21
Guide tree C. Multiple alignment: 1: -C-, 2: A-CG, 3: GCG. Gaps: 2, Subst: 1, Total: 21

Guide trees B and C will result in equally parsimonious (and in this case, identical) alignments. In the example, using only three terminals makes an exhaustive search possible. With larger datasets, heuristic methods (branch-swapping, TBR etc.) are used to find shorter trees.

Gaps as characters in static alignments

Both Clustal and MALIGN produce static alignments, with characters arranged in columns and character states for each taxon in rows. The trees that result from analysis of aligned data are highly dependent on the parameters used for alignment, and a suboptimal multiple alignment will result in suboptimal parsimony trees (Wheeler 2001). Static alignments are produced with a set cost for gaps and base substitutions, but these values are rarely discussed or given enough importance in publications. In parsimony analyses of static alignments, gaps are usually treated as missing data. When indel information is used, gaps are mostly treated as a fifth nucleotide character state. This is a simplified model, and a potential source of artefacts. Though gaps stem from mutational events, just like base substitutions, the homologies of overlapping contiguous gaps are difficult to interpret. A gap found in several sequences with identical 5' and 3' ends could be the result of a single indel. If gaps are assigned a fifth state, this would result in a highly weighted, but reasonable synapomorphy. On the other hand, if the gaps were only partially overlapping, or one gap a subset of another, the gaps could not be the result of a single indel event. There is no reason to treat all the gap placeholders of such gaps as homologous states, as they cannot have a common history. Simmons and Ochoterena (2000) seek to salvage the information in gaps of static alignments by devising new gap coding schemes: one simple, and one complex. In simple gap coding, a presence/absence matrix is constructed based on the gaps' placements in the aligned sequences. The following rules apply:
1. Identical gaps (matching 5' and 3' ends) are treated as absence/presence characters.
2. Gaps that are a subset of another gap become individual characters. For the taxa that have the larger gap, the state of the smaller gap character is 'inapplicable'.
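The two rules above can be sketched as a small routine that turns an alignment into a presence/absence matrix. This is a simplified reading of the rules as listed here, not a full implementation of Simmons and Ochoterena's method: leading and trailing gaps get no special treatment, and the function names, the taxon labels and the five aligned sequences are all hypothetical.

```python
import re


def gap_intervals(aligned):
    """Return (start, end) for every run of '-' in an aligned sequence."""
    return [(m.start(), m.end()) for m in re.finditer(r"-+", aligned)]


def simple_gap_coding(alignment):
    """Code each distinct gap as a character: 1 present, 0 absent, - inapplicable."""
    gaps = {name: gap_intervals(seq) for name, seq in alignment.items()}
    characters = sorted({g for own in gaps.values() for g in own})
    matrix = {}
    for name, own in gaps.items():
        row = []
        for start, end in characters:
            if (start, end) in own:
                row.append("1")  # rule 1: identical 5' and 3' ends
            elif any(s <= start and end <= e for s, e in own):
                row.append("-")  # rule 2: taxon has a larger gap covering this one
            else:
                row.append("0")
        matrix[name] = "".join(row)
    return characters, matrix


# Hypothetical alignment in which gap (7, 9) is a subset of gap (6, 10).
example = {
    "t1": "AA--CG----GAC",
    "t2": "AA--CGG--GGAC",
    "t3": "AAGCCGG--GGAC",
    "t4": "AAGCCG----GAC",
    "t5": "AAGCCGGGGGGAC",
}
characters, coded = simple_gap_coding(example)
for taxon, row in coded.items():
    print(taxon, row)
```

The resulting rows can be appended to the nucleotide matrix as extra characters, with the gap positions themselves then scored as missing data.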

Consider the matrix in Figure 10, where gaps are the only informative characters. Numbers refer to characters in the additional presence/absence matrix.

Figure 10. An informative-gap matrix, and corresponding simple gap coding matrix (characters Gap 1 to Gap 4).
1: AA-1-CG--3--GAC
2: AA-1-CG-4-GAC
3: AAGC-2--4-GAC
4: AAGC GAC
5: AAGCCCGGGGAC

Gap 1 appears in taxa 1 and 2. It overlaps with Gap 2, but since they share neither 5' nor 3' ends, they probably stem from different indel events. Gap 3 is found in taxa 1 and 4. Since Gap 4 is a complete subset of Gap 3, it is impossible to know if Gap 3 is the result of a separate indel event, or of an additional deletion extending Gap 4. The Gap 4 character is therefore coded as 'inapplicable' for taxa 1 and 4.

Complex gap coding implements rules to incorporate more information from gaps with different 5' and 3' ends. This is done by creating step matrices with asymmetrical characters. Consult the original publication (Simmons and Ochoterena 2000) about these rules, as describing the method would be too lengthy for this essay.

A problem in static alignment is how to deal with contiguous gaps. The default setting in Clustal, and an option in MALIGN, is to have one cost for initiating a gap, and a lower cost for extending a gap. The logic behind such reasoning is that a gap of any length is likely the result of one single deletion. Giribet and Wheeler (1999) caution against treating contiguous gaps as single characters. From an example of a gap spanning seven positions they write: "That all seven gaps in a row were actually created by a single deletion of the whole series of bases might well be true but any analysis which creates such a tight dependency of costs among aligned positions will run afoul of the postulates of the phylogenetic analysis of characters- that they are at least logically independent."
Simmons and Ochoterena (2000) dispute this and counter that a gap spanning seven positions is likely to be the result of a single deletion event, and thus contiguous gaps should be treated as single events whenever possible. They compare this with how morphological characters are coded: "five different petal-pubescence characters-one for each petal- would generally not be coded for a five-merous flower".

Gaps too have a history to tell. Although the mechanisms are different from those that create base substitutions, indel events deserve to be considered in phylogenetic analyses. The history of indel events may be harder to interpret, but there is no justifiable argument to discard gap data ad hoc.

Optimization methods

Where multiple alignment seeks to maximize positional homologies in a static matrix, optimization methods focus on the character state changes that are implied in a phylogenetic tree. These are extensions of the standard cladistic character optimization methods (Farris 1970). With optimization methods, indels as well as base substitutions are treated as evolutionary events: "transformations linking ancestral and descendent nucleotide sequences" (Giribet and Wheeler 1998). A basepair will never be homologized with another, and no gaps are inserted as placeholders. Character optimization begins with a phylogenetic tree, with DNA sequences as the characters on the leaf nodes. Gap costs have to be explicit, as they are used to calculate the length of the trees.

Optimization alignment

Optimization alignment (Wheeler 1996) was invented by the author of MALIGN, and boldly presented as "The end of multiple sequence alignment in phylogenetics?". The procedure is implemented in the program POY (Wheeler and Gladstein 1997), available from ftp://ftp.amnh.org/pub/people/wheeler/poy/. POY uses phylogenetic trees to calculate the total transformation cost, given the sequences, but no multiple alignment is ever produced. The cost of transforming a sequence can be calculated with the N-W algorithm. DNA sequences of the hypothetical ancestors are constructed as the intersection of the sequences. In the example, the IUPAC codes for ambiguous states are used, e.g. W = A or T. A letter in parentheses means that a gap is a possible state, e.g. (T) = T or gap.

The down-pass (Figure 11) is made to calculate the tree length, and to construct putative ancestral sequences at the nodes. Starting at the apical node, we find the sequences TA and TAG. The ancestral state for these sequences could have been either one, so TA(G) is placed at the node. This requires one gap. Note that the reconstructed ancestral sequences are preliminary and unoptimized in the down-pass.

Figure 11. The down-pass step of POY. Terminals: TTT, TTG, TA, TAG. Nodes, from the top: TA(G) (+1 gap), TWG (+1 substitution), TTK (+1 substitution).

Moving down, the next branch has TTG. Comparing this to TA(G), we find that the second position could be either an A or a T, so the union of A and T (= W) is used. This requires one substitution. For the third position, one sequence has G and the other has (G). This means that G is a possible state for both sequences, and it is thus assigned to the reconstructed ancestral sequence: TWG. Finding an intersection between two states (G and (G) have the intersection G, W and T have the intersection T, etc.) is not associated with either substitutions or gaps, and therefore carries no cost. The next branch below has the sequence TTT. Compared to the ancestral sequence at the node above (TWG), the second position has the intersection T, and the third position has the union K (= G or T). The total cost for this tree is one gap and two substitutions.

An up-pass (Figure 12) can reconstruct the ancestral states using information from both ancestors and descendants. This follows the basic cladistic optimization methods. If an upper node has a state in common with a lower node, this state is assigned to the upper node. If there is no state in common, the state that has the lower transformation cost is assigned to the upper node. The optimized nodes of the up-pass can never have any influence on the tree length determined on the down-pass.

Figure 12. The up-pass step of POY. Terminals: TTT, TTG, TA, TAG. Nodes, from the top: TAG, TTG, TTK.

Beginning at the node above the root (the root has no ancestors, hence no ancestral state), this node has the sequence TWG (= TTG or TAG). When this is compared to the root node TTK (= TTT or TTG) we find that they have the sequence TTG in common. This is now assigned to the node above the root. The node above has TA(G) (= TAG or TA). There is no common sequence, so the more parsimonious transformation is chosen. TTG -> TAG requires one substitution at cost 1, while TTG -> TA requires one substitution and one gap. Thus TAG is assigned to this node.
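The down-pass rule (keep the intersection when one exists, otherwise take the union and count one event) can be sketched for a single position. This is a deliberately simplified, Fitch-style illustration and not POY's procedure: it assumes the positions already correspond between sequences, charges the same unit cost for gaps and substitutions, and uses hypothetical state sets. The function name `fitch_down` is an assumption made here.

```python
def fitch_down(tree):
    """Return (possible_states, cost) for one position on a rooted binary tree.

    Leaves are sets of states; internal nodes are 2-tuples of subtrees.
    """
    if isinstance(tree, (set, frozenset)):
        return set(tree), 0
    (left, left_cost), (right, right_cost) = (fitch_down(child) for child in tree)
    shared = left & right
    if shared:
        # An intersection exists, so no substitution or gap is needed here.
        return shared, left_cost + right_cost
    # No shared state: keep the union and pay for one event.
    return left | right, left_cost + right_cost + 1


# One position in four hypothetical terminals: gap, G, G and T, added from the top down.
third_position = ((({"-"}, {"G"}), {"G"}), {"T"})
states, cost = fitch_down(third_position)
print(sorted(states), cost)
```

The set returned for the root plays the role of an ambiguity code such as K, and the running cost is what accumulates into the tree length.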

Fixed state optimization

Another optimization method by Wheeler (1999) is fixed state optimization. This is an elegant method that has a similar optimality criterion to optimization alignment, but removes the need for ancestral states to be reconstructed at any point. A reconstructed ancestral state from optimization alignment can be optimal for that node, but suboptimal for the whole tree. If globally suboptimal states are used at a node, the effect will spread throughout the tree. This can result in a tree length that is longer than necessary, and in the shortest trees not being found. Fixed state optimization offers a way around this problem, as the pairwise alignments are never used to reconstruct ancestral states. Only the cost of transformation (which will be identical to the ones found by optimization alignment, given the same cost parameters) is used.

Figure 13. Fixed state optimization. Calculating the step matrix.
Character states: 1: TTT, 2: TTG, 3: TA, 4: TAG
Gap:substitution = 2:1

DNA sequences are treated as a single character with as many different character states as there are taxa in the matrix. A step matrix is constructed from the transformation costs between all possible sequence pairs in the data. The transformation cost is the sum of the substitutions and gaps needed to change one sequence into another, as determined by a weighted (a set gap:substitution cost) N-W alignment. As the sequence is treated like a single character with multiple states, only the states (sequences) of the terminals are possible ancestral sequences at the nodes. The transformation step matrix (Figure 13) shows all possible transformation costs. A gap has a cost of 2 and any substitution has a cost of 1. Tree length is calculated as for an ordinary multistate character.
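Building the step matrix amounts to running one weighted N-W alignment per pair of terminal sequences and keeping only the cost. The sketch below assumes, for illustration, that the four terminal sequences are TTT, TTG, TA and TAG, with gap cost 2 and substitution cost 1; the function names are hypothetical.

```python
def transformation_cost(a, b, gap=2, subst=1):
    """Minimum N-W cost of changing sequence a into b, keeping one matrix row at a time."""
    previous = [j * gap for j in range(len(a) + 1)]
    for i in range(1, len(b) + 1):
        current = [i * gap]
        for j in range(1, len(a) + 1):
            diagonal = previous[j - 1] + (0 if a[j - 1] == b[i - 1] else subst)
            current.append(min(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


def step_matrix(states, gap=2, subst=1):
    """Pairwise transformation costs between all states (sequences)."""
    return [[transformation_cost(a, b, gap, subst) for b in states] for a in states]


# Assumed terminal sequences, numbered 1 to 4 in this order.
states = ["TTT", "TTG", "TA", "TAG"]
for row in step_matrix(states):
    print(row)
```

With a single gap cost the matrix comes out symmetric; a cost model that distinguishes insertions from deletions would make it asymmetric.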

Starting with the down-pass (Figure 14), the top terminals have character states 3 and 4. The ancestral state at this node can be either one. One transformation must have taken place, either 3 -> 4 or 4 -> 3. One gap (cost 2) is needed for this. Moving down to the node below, the state of the new terminal is 2. At this node, the most parsimonious transformation must be chosen. Possible states for this node are then 2, 3 or 4. However, state 3 can be discarded as a possible ancestral state, since 2 -> 3 costs 3, and 2 -> 4 costs one. During the down-pass, it is impossible to know if 2 is the ancestral state, and a transformation has taken place between this node and the one above, or if 4 is the ancestral state and a transformation has taken place between the node and the terminal. The preliminary ancestral state for this node will then be 2 or 4. Continuing down the tree, the next encountered state is 1. Here, similar calculations take place. 2 -> 1 costs 1, and 4 -> 1 costs 4. State 4 is discarded at this node, and the possible ancestral states for the root node are 1 or 2.

Figure 14. The down-pass step in fixed state optimization. Nodes, from the top: 3 or 4 (+2, gap), 2 or 4 (+1, substitution), 1 or 2 (+1, substitution). Total cost 4.

The total transformation costs are summed to give the tree length. 3 -> 4 costs 2, 2 -> 4 costs 1 and 1 -> 2 costs 1, for a total tree length of 4.
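The length of a tree for a single step-matrix character can be checked with a Sankoff-style dynamic program. This is a standard technique for step-matrix characters, used here as a stand-in and not necessarily how POY computes it. The step matrix values below are assumptions (gap cost 2, substitution cost 1, applied to four assumed terminal sequences), and states 1 to 4 are written as indices 0 to 3.

```python
import math

# Assumed symmetric step matrix between states 1-4 (indices 0-3).
STEP = [
    [0, 1, 3, 2],
    [1, 0, 3, 1],
    [3, 3, 0, 2],
    [2, 1, 2, 0],
]


def sankoff(tree, step):
    """Return, for each state, the minimum cost of the subtree if its root has that state."""
    n = len(step)
    if isinstance(tree, int):
        # A terminal is fixed to its observed state.
        return [0 if state == tree else math.inf for state in range(n)]
    left, right = (sankoff(child, step) for child in tree)
    return [
        min(step[x][y] + left[y] for y in range(n))
        + min(step[x][y] + right[y] for y in range(n))
        for x in range(n)
    ]


def tree_length(tree, step):
    """Tree length is the cheapest cost over all possible root states."""
    return min(sankoff(tree, step))


# Terminals 3 and 4 at the top, then 2, then 1, as in the down-pass described above.
tree = (((2, 3), 1), 0)
print(tree_length(tree, STEP))
```

Because only the terminal states are allowed at internal nodes, no ancestral sequence ever has to be constructed, which is the point of fixed state optimization.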

An up-pass (Figure 15) can be performed to find exactly where the transformations take place. This starts at the root node and moves up through the tree. The basic rule is that if the upper node has a possible state in common with the lower node, this state is assigned to the upper node. If there are no possible states in common, the most parsimonious transformation is chosen. Here, the root node has the possible states 1 and 2, and the node above 2 and 4. State 2 is then assigned to the node above the root. Moving up, the next node has states 3 and 4. Since 2 -> 4 costs 1 and 2 -> 3 costs 3, state 4 is assigned to this node. The up-pass is of course only used to find where transformations take place, and has no impact on tree length.

Figure 15. The up-pass in fixed state optimization. Root: 1 or 2. Total cost 4.

When performing optimization of sequence data, the trees are always rooted, as there will be asymmetric costs associated with gap transformations. There is no possibility of static gaps appearing in the ancestral states. Transforming a base to a gap has a cost, but a gap will never be replaced by a base.

Sensitivity analysis

Congruence between datasets

Since the optimal alignment weights for a phylogenetic dataset cannot be measured, an external measure must be used to choose between alignments. Wheeler (1995) suggests comparing trees from two datasets, and picking those that minimize character incongruence (Mickevich and Farris 1981, Farris et al. 1994). The incongruence (ILD = incongruence length difference) is the number of extra steps in a tree from a combined dataset compared to the sum of the individual tree lengths.

ILD = (Length combined - Sum of lengths of individual datasets) / Length combined

If there is no character incongruence between the datasets, no extra steps will be needed for the combined set. If the tree from the combined dataset is longer than the sum of the lengths of the individual datasets, this is due to incongruence. In performing a sensitivity analysis, several alignments, or optimization alignments, are performed under different weighting schemes. Another independent dataset (usually one not affected by alignment, such as morphology) for the same taxa is added to the first, and tree lengths are compared to those stemming from the individual datasets. If the trees produced from the datasets are of very different lengths, as would be expected if a large molecular dataset is compared to a smaller morphological one, another incongruence measure, RILD = rescaled incongruence length difference (Wheeler and Hayashi 1998), can be used. The max length is the tree length of the least optimal tree, an unresolved bush.

RILD = (Length combined - Sum of lengths of individual datasets) / (Max length combined - Sum of lengths of individual datasets)
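Both measures are simple ratios of tree lengths, so comparing parameter sets is a matter of computing them for each set and picking the smallest value. The sketch below follows the two formulas as given; the function names and the tree lengths in the example are made up for illustration.

```python
def ild(combined_length, individual_lengths):
    """Incongruence length difference: extra steps as a fraction of the combined length."""
    return (combined_length - sum(individual_lengths)) / combined_length


def rild(combined_length, individual_lengths, max_combined_length):
    """Rescaled ILD: extra steps relative to the most extra steps possible (unresolved bush)."""
    total = sum(individual_lengths)
    return (combined_length - total) / (max_combined_length - total)


# Hypothetical lengths: molecular tree 60 steps, morphological tree 40, combined 110.
print(ild(110, [60, 40]))
print(rild(110, [60, 40], 150))
```

In a sensitivity analysis these functions would be evaluated once per weighting scheme, and the scheme with the lowest incongruence would be preferred.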

Visualization of congruence by Navajo rugs

If no independent dataset is available, a sensitivity analysis can still be performed by visualizing the effects of alignment weighting schemes in congruence plots (Wheeler 1995), informally known as 'Navajo rugs'. A Navajo rug represents the support of one group under different sets of parameters; it looks like a grid of squares, with one parameter on the X-axis and another on the Y-axis. An inventive use of Navajo rugs was published by Schulmeister et al. (2002), where Navajo rugs were placed at the nodes of a preferred (most congruent) phylogenetic tree to illustrate the parameter sensitivity of each monophyletic group.

To give an example of the usefulness of Navajo rugs, consider the imaginary group Snarkivora (Carroll). This taxon has four members: Brilligia, Toveis, Borogovia and Momeus. The traditional groups Slithyformes and Vorpaloidea contain Brilligia and Toveis, and Borogovia and Momeus, respectively. A new group, Outgrabia, containing Toveis and Momeus, is proposed. The noncoding rubroquinin spacer is sequenced for all taxa and the molecular data are analyzed under different parameters. The values 1, 2 and 4 are used for the gap:change and transition:transversion ratios in a total of nine analyses. The different parameter sets give conflicting most parsimonious trees, as can be seen in Figure 16.

Figure 16. Different parameter settings give conflicting most parsimonious trees. [Figure not reproduced: a 3 x 3 grid of trees for Brilligia, Toveis, Borogovia and Momeus, with gap:change on one axis and transition:transversion on the other, each taking the values 1, 2 and 4.]

For a reader, the parameter sensitivity of a certain group is gathered more quickly from a Navajo rug than from comparing all the different trees. The information about the groups of Snarkivora is shown as Navajo rugs in Figure 17.

Figure 17. The information from the phylogenetic trees can be condensed into Navajo rugs. [Figure not reproduced: one rug each for Slithyformes, Vorpaloidea and Outgrabia.]

A black square indicates that the group is monophyletic, while a white square shows that the group is contradicted. If the tree is unresolved, so that the group is neither supported nor contradicted, a gray square is used. The Slithyformes rug shows that this group was monophyletic under 4 parameter sets, unresolved (but not contradicted) in 3, and contradicted in 2.

However, one dataset cannot validate itself, even if it is analyzed under different weighting schemes. Even if a group appears under a wide range of alignment parameters, this should not necessarily be interpreted as increased support for the group.

Sequence alignment and optimization methods are tools for extracting historical information from sequences. When methods and parameters are carefully evaluated by sensitivity analysis, the sequence transformations known as insertions and deletions become sources of phylogenetic information. In the spirit of total evidence analysis, all information should be used when searching for the most parsimonious explanation.
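The colouring rule for a Navajo rug can be sketched in Python. Everything here is an illustrative assumption: the names `classify` and `render_rug` are hypothetical, each tree is reduced to the set of clades it contains, and the placement of the nine trees in the grid is invented (the layout of the original figure is not recoverable). Only the counts for Slithyformes (4 black, 3 gray, 2 white) follow the text.

```python
from collections import Counter


def classify(group, clades):
    """Return the square colour for one group in one tree.

    clades is the set of clades (frozensets of taxon names) present in the tree.
    """
    group = frozenset(group)
    if group in clades:
        return "black"  # the group is monophyletic
    for clade in clades:
        # A clade that partly overlaps the group contradicts it.
        if clade & group and not clade <= group and not group <= clade:
            return "white"
    return "gray"  # unresolved: neither supported nor contradicted


SYMBOL = {"black": "#", "gray": "+", "white": "."}


def render_rug(group, trees, values=(1, 2, 4)):
    """Draw a text rug: columns are gap:change, rows are transition:transversion."""
    rows = []
    for tt in values:
        rows.append(" ".join(SYMBOL[classify(group, trees[(gc, tt)])] for gc in values))
    return "\n".join(rows)


# Three hypothetical tree shapes, written as sets of clades.
traditional = {frozenset({"Brilligia", "Toveis"}), frozenset({"Borogovia", "Momeus"})}
outgrabia = {frozenset({"Toveis", "Momeus"})}
bush = set()

# Hypothetical placement of the nine analyses in the grid.
layout = [
    [outgrabia, bush, traditional],
    [outgrabia, bush, traditional],
    [bush, traditional, traditional],
]
values = (1, 2, 4)
trees = {
    (gc, tt): layout[i][j]
    for i, tt in enumerate(values)
    for j, gc in enumerate(values)
}

slithyformes = {"Brilligia", "Toveis"}
print(render_rug(slithyformes, trees))
counts = Counter(classify(slithyformes, t) for t in trees.values())
```

Reducing each tree to its set of clades keeps the test for contradiction simple: a group is contradicted exactly when some clade in the tree contains part of the group together with taxa from outside it.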

References

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., Lipman, D. J. (1990) Basic local alignment search tool. J. Mol. Biol. 215:
Carroll, L. (1871) Through the looking glass and what Alice found there.
Farris, J. S. (1970) Methods of computing Wagner trees. Syst. Zool. 26:
Farris, J. S., Källersjö, M., Kluge, A. G., Bult, C. (1995) Testing significance of incongruence. Cladistics 10:
Giribet, G., Wheeler, W. C. (1999) On gaps. Mol. Phylogenet. Evol. 13(1):
Giribet, G., Ribera, C. (2000) A review of arthropod phylogeny: new data based on ribosomal DNA sequences and direct character optimization. Cladistics 16:
Gladstein, D. S., Wheeler, W. C. (1997) POY: the optimization of alignment characters. Program and documentation. American Museum of Natural History, New York. Current version (3.0) available from ftp://ftp.amnh.org/pub/people/wheeler/poy
Green, P. (1999) PHRAP documentation. Available online from
Higgins, D. G., Sharp, P. M. (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73:
Higgins, D. G., Sharp, P. M. (1989) Fast and sensitive multiple sequence alignment on a microcomputer. CABIOS 5(2):
Lipscomb, D. L., Farris, J. S., Källersjö, M., Tehler, A. (1998) Support, ribosomal sequences and the phylogeny of the eukaryotes. Cladistics 14:

Mickevich, M. F., Farris, J. S. (1981) The implications of congruence in Menidia. Syst. Zool. 30(3):
Needleman, S. B., Wunsch, C. D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:
Phillips, A., Janies, D., Wheeler, W. (2000) Multiple sequence alignment in phylogenetic analysis. Mol. Phylogenet. Evol. 16(3):
Saitou, N., Nei, M. (1987) The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 6:
Sanger, F., Coulson, A. R., Hong, G. F., Hill, D. F., Petersen, G. B. (1982) Nucleotide sequence of bacteriophage lambda DNA. J. Mol. Biol. 162(4):
Schlich, R. (2001) Analysing intrapersonal variability of travel behaviour using the sequence alignment method. European Transport Conference, Cambridge, September. Available online at
Schulmeister, S., Wheeler, W. C., Carpenter, J. M. (2002) Simultaneous analysis of the basal lineages of Hymenoptera (Insecta) using sensitivity analysis. Cladistics 18:
Simmons, M. P., Ochoterena, H. (2000) Gaps as characters in sequence-based phylogenetic analysis. Syst. Biol. 49(2):
Sneath, P. H. A., Sokal, R. R. (1973) Numerical Taxonomy. W. H. Freeman, San Francisco. pages
Soltis, D. E., Soltis, P. S., Chase, M. W., Mort, M. E., Albach, D. C., Zanis, M., Savolainen, V., Hahn, W. H., Hoot, S. B., Fay, M. F., Axtell, M., Swensen, S. M., Prince, L. M., Kress, J. W., Nixon, K. C., Farris, J. S. (2000) Angiosperm phylogeny inferred from 18S rDNA, rbcL, and atpB sequences. Bot. J. Linn. Soc. 133(4):

Sonnhammer, E. L. L., Durbin, R. (1995) A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene 167: GC1-10
Staden, R. (1996) The Staden sequence analysis package. Mol. Biotechnol. 5:
Tehler, A., Farris, J. S., Lipscomb, D. L., Källersjö, M. (2000) Phylogenetic analyses of the fungi based on large rDNA data sets. Mycologia 92(3):
Thompson, J. D., Higgins, D. G., Gibson, T. J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Res. 22(22):
Thompson, J. D., Gibson, T. J., Plewniak, F., Jeanmougin, F., Higgins, D. G. (1997) The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25(24):
Venter, J. C. et al. (2001) The sequence of the human genome. Science 291:
Wheeler, W. C., Gladstein, D. L. (1994) MALIGN. Program and documentation. American Museum of Natural History, New York. Current version 2.7 (2002) available from ftp://ftp.amnh.org/people/wheeler/malign/
Wheeler, W. C. (1995) Sequence alignment, parameter sensitivity, and the phylogenetic analysis of molecular data. Syst. Biol. 44(3):
Wheeler, W. (1996) Optimization alignment: The end of multiple sequence alignment in phylogenetics? Cladistics 12: 1-9
Wheeler, W. C., Hayashi, C. Y. (1998) The phylogeny of extant chelicerate orders. Cladistics 14:
Wheeler, W. (1999) Fixed character states and the optimization of molecular sequence data. Cladistics 15:

Wheeler, W. (2001) Homology and the optimization of DNA sequence data. Cladistics 17: S3-S11
Wilson, W. C. (1998) Environment and Planning A 30:
Wuyts, J., Van de Peer, Y., Winkelmans, T., De Wachter, R. (2002) The European database on small subunit ribosomal RNA. Nucleic Acids Res. 30:
Wuyts, J., De Rijk, P., Van de Peer, Y., Winkelmans, T., De Wachter, R. (2001) The European Large Subunit Ribosomal RNA database. Nucleic Acids Res. 29(1):

Appendix: Internet resources

The Staden Package Website
Blast search at GenBank
European Ribosomal RNA database
Download ClustalW and ClustalX: ftp://ftp-igbmc.u-strasbg.fr/pub/clustalx/
Download POY and MALIGN: ftp://ftp.amnh.org/people/wheeler/malign/
Download Dotter

Effects of Gap Open and Gap Extension Penalties

Effects of Gap Open and Gap Extension Penalties Brigham Young University BYU ScholarsArchive All Faculty Publications 200-10-01 Effects of Gap Open and Gap Extension Penalties Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu See

More information

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches Int. J. Bioinformatics Research and Applications, Vol. x, No. x, xxxx Phylogenies Scores for Exhaustive Maximum Likelihood and s Searches Hyrum D. Carroll, Perry G. Ridge, Mark J. Clement, Quinn O. Snell

More information

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9 Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

More information

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Phylogenetic analyses. Kirsi Kostamo

Phylogenetic analyses. Kirsi Kostamo Phylogenetic analyses Kirsi Kostamo The aim: To construct a visual representation (a tree) to describe the assumed evolution occurring between and among different groups (individuals, populations, species,

More information

Large Grain Size Stochastic Optimization Alignment

Large Grain Size Stochastic Optimization Alignment Brigham Young University BYU ScholarsArchive All Faculty Publications 2006-10-01 Large Grain Size Stochastic Optimization Alignment Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu

More information

BINF6201/8201. Molecular phylogenetic methods

BINF6201/8201. Molecular phylogenetic methods BINF60/80 Molecular phylogenetic methods 0-7-06 Phylogenetics Ø According to the evolutionary theory, all life forms on this planet are related to one another by descent. Ø Traditionally, phylogenetics

More information

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics - in deriving a phylogeny our goal is simply to reconstruct the historical relationships between a group of taxa. - before we review the

More information

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT 3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode

More information

C3020 Molecular Evolution. Exercises #3: Phylogenetics

C3020 Molecular Evolution. Exercises #3: Phylogenetics C3020 Molecular Evolution Exercises #3: Phylogenetics Consider the following sequences for five taxa 1-5 and the known outgroup O, which has the ancestral states (note that sequence 3 has changed from

More information

Phylogenetic Tree Reconstruction

Phylogenetic Tree Reconstruction I519 Introduction to Bioinformatics, 2011 Phylogenetic Tree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Evolution theory Speciation Evolution of new organisms is driven

More information

(Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise

(Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise Bot 421/521 PHYLOGENETIC ANALYSIS I. Origins A. Hennig 1950 (German edition) Phylogenetic Systematics 1966 B. Zimmerman (Germany, 1930 s) C. Wagner (Michigan, 1920-2000) II. Characters and character states

More information

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value

More information

In-Depth Assessment of Local Sequence Alignment

In-Depth Assessment of Local Sequence Alignment 2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

Comparing whole genomes

Comparing whole genomes BioNumerics Tutorial: Comparing whole genomes 1 Aim The Chromosome Comparison window in BioNumerics has been designed for large-scale comparison of sequences of unlimited length. In this tutorial you will

More information

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Bioinformatics tools for phylogeny and visualization. Yanbin Yin Bioinformatics tools for phylogeny and visualization Yanbin Yin 1 Homework assignment 5 1. Take the MAFFT alignment http://cys.bios.niu.edu/yyin/teach/pbb/purdue.cellwall.list.lignin.f a.aln as input and

More information

PHYLOGENY AND SYSTEMATICS

PHYLOGENY AND SYSTEMATICS AP BIOLOGY EVOLUTION/HEREDITY UNIT Unit 1 Part 11 Chapter 26 Activity #15 NAME DATE PERIOD PHYLOGENY AND SYSTEMATICS PHYLOGENY Evolutionary history of species or group of related species SYSTEMATICS Study

More information

Sequence analysis and Genomics

Sequence analysis and Genomics Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute

More information

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

CHAPTERS 24-25: Evidence for Evolution and Phylogeny CHAPTERS 24-25: Evidence for Evolution and Phylogeny 1. For each of the following, indicate how it is used as evidence of evolution by natural selection or shown as an evolutionary trend: a. Paleontology

More information

BLAST. Varieties of BLAST

BLAST. Varieties of BLAST BLAST Basic Local Alignment Search Tool (1990) Altschul, Gish, Miller, Myers, & Lipman Uses short-cuts or heuristics to improve search speed Like speed-reading, does not examine every nucleotide of database

More information

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Homology Modeling. Roberto Lins EPFL - summer semester 2005 Homology Modeling Roberto Lins EPFL - summer semester 2005 Disclaimer: course material is mainly taken from: P.E. Bourne & H Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton,

More information

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline

More information

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Introduction to Bioinformatics online course : IBT Jonathan Kayondo Learning Objectives Understand

More information

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics Bioinformatics 1 Biology, Sequences, Phylogenetics Part 4 Sepp Hochreiter Klausur Mo. 30.01.2011 Zeit: 15:30 17:00 Raum: HS14 Anmeldung Kusss Contents Methods and Bootstrapping of Maximum Methods Methods

More information

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

More information

How to read and make phylogenetic trees Zuzana Starostová

How to read and make phylogenetic trees Zuzana Starostová How to read and make phylogenetic trees Zuzana Starostová How to make phylogenetic trees? Workflow: obtain DNA sequence quality check sequence alignment calculating genetic distances phylogeny estimation

More information

Phylogenetic hypotheses and the utility of multiple sequence alignment

Phylogenetic hypotheses and the utility of multiple sequence alignment Phylogenetic hypotheses and the utility of multiple sequence alignment Ward C. Wheeler 1 and Gonzalo Giribet 2 1 Division of Invertebrate Zoology, American Museum of Natural History Central Park West at

More information

Evolutionary Tree Analysis. Overview

Evolutionary Tree Analysis. Overview CSI/BINF 5330 Evolutionary Tree Analysis Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Backgrounds Distance-Based Evolutionary Tree Reconstruction Character-Based

More information

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree) I9 Introduction to Bioinformatics, 0 Phylogenetic ree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & omputing, IUB Evolution theory Speciation Evolution of new organisms is driven by

More information

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010 BLAST Database Searching BME 110: CompBio Tools Todd Lowe April 8, 2010 Admin Reading: Read chapter 7, and the NCBI Blast Guide and tutorial http://www.ncbi.nlm.nih.gov/blast/why.shtml Read Chapter 8 for

More information

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

More information

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018 CONCEPT OF SEQUENCE COMPARISON Natapol Pornputtapong 18 January 2018 SEQUENCE ANALYSIS - A ROSETTA STONE OF LIFE Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of

More information

A Phylogenetic Network Construction due to Constrained Recombination

A Phylogenetic Network Construction due to Constrained Recombination A Phylogenetic Network Construction due to Constrained Recombination Mohd. Abdul Hai Zahid Research Scholar Research Supervisors: Dr. R.C. Joshi Dr. Ankush Mittal Department of Electronics and Computer

More information

Thanks to Paul Lewis, Jeff Thorne, and Joe Felsenstein for the use of slides

Thanks to Paul Lewis, Jeff Thorne, and Joe Felsenstein for the use of slides hanks to Paul Lewis, Jeff horne, and Joe Felsenstein for the use of slides Hennigian logic reconstructs the tree if we know polarity of characters and there is no homoplasy UPM infers a tree from a distance

More information

Phylogenetic inference

Phylogenetic inference Phylogenetic inference Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 7 th 016 After this lecture, you can discuss (dis-) advantages of different information types

More information

Copyright 2000 N. AYDIN. All rights reserved. 1

Copyright 2000 N. AYDIN. All rights reserved. 1 Introduction to Bioinformatics Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr Multiple Sequence Alignment Outline Multiple sequence alignment introduction to msa methods of msa progressive global alignment

More information

Single alignment: Substitution Matrix. 16 march 2017

Single alignment: Substitution Matrix. 16 march 2017 Single alignment: Substitution Matrix 16 march 2017 BLOSUM Matrix BLOSUM Matrix [2] (Blocks Amino Acid Substitution Matrices ) It is based on the amino acids substitutions observed in ~2000 conserved block

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Distance Methods Character Methods

More information

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega BLAST Multiple Sequence Alignments: Clustal Omega What does basic BLAST do (e.g. what is input sequence and how does BLAST look for matches?) Susan Parrish McDaniel College Multiple Sequence Alignments

More information

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz Phylogenetic Trees What They Are Why We Do It & How To Do It Presented by Amy Harris Dr Brad Morantz Overview What is a phylogenetic tree Why do we do it How do we do it Methods and programs Parallels

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S1 (box). Supplementary Methods description. Prokaryotic Genome Database Archaeal and bacterial genome sequences were downloaded from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)

More information

Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST

Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST Introduction Bioinformatics is a powerful tool which can be used to determine evolutionary relationships and

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

1 ATGGGTCTC 2 ATGAGTCTC

1 ATGGGTCTC 2 ATGAGTCTC We need an optimality criterion to choose a best estimate (tree) Other optimality criteria used to choose a best estimate (tree) Parsimony: begins with the assumption that the simplest hypothesis that

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

More information

Sequence Alignment Techniques and Their Uses

Sequence Alignment Techniques and Their Uses Sequence Alignment Techniques and Their Uses Sarah Fiorentino Since rapid sequencing technology and whole genomes sequencing, the amount of sequence information has grown exponentially. With all of this

More information

Cladistics. The deterministic effects of alignment bias in phylogenetic inference. Mark P. Simmons a, *, Kai F. Mu ller b and Colleen T.

Cladistics. The deterministic effects of alignment bias in phylogenetic inference. Mark P. Simmons a, *, Kai F. Mu ller b and Colleen T. Cladistics Cladistics 27 (2) 42 46./j.96-3.2.333.x The deterministic effects of alignment bias in phylogenetic inference Mark P. Simmons a, *, Kai F. Mu ller b and Colleen T. Webb a a Department of Biology,

More information

Bioinformatics and BLAST

Bioinformatics and BLAST Bioinformatics and BLAST Overview Recap of last time Similarity discussion Algorithms: Needleman-Wunsch Smith-Waterman BLAST Implementation issues and current research Recap from Last Time Genome consists

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

Quantifying sequence similarity

Quantifying sequence similarity Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity

More information

A (short) introduction to phylogenetics

A (short) introduction to phylogenetics A (short) introduction to phylogenetics Thibaut Jombart, Marie-Pauline Beugin MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis with PR Statistics, Millport Field

More information

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5. Five Sami Khuri Department of Computer Science San José State University San José, California, USA sami.khuri@sjsu.edu v Distance Methods v Character Methods v Molecular Clock v UPGMA v Maximum Parsimony

More information

Ch. 9 Multiple Sequence Alignment (MSA)

Ch. 9 Multiple Sequence Alignment (MSA) Ch. 9 Multiple Sequence Alignment (MSA) - gather seqs. to make MSA - doing MSA with ClustalW - doing MSA with Tcoffee - comparing seqs. that cannot align Introduction - from pairwise alignment to MSA -

More information

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline Page Evolutionary Trees Russ. ltman MI S 7 Outline. Why build evolutionary trees?. istance-based vs. character-based methods. istance-based: Ultrametric Trees dditive Trees. haracter-based: Perfect phylogeny

More information

Consensus Methods. * You are only responsible for the first two

Consensus Methods. * You are only responsible for the first two Consensus Trees * consensus trees reconcile clades from different trees * consensus is a conservative estimate of phylogeny that emphasizes points of agreement * philosophy: agreement among data sets is

More information

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT 5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT.03.239 03.10.2012 ALIGNMENT Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity. Homology:

More information

UoN, CAS, DBSC BIOL102 lecture notes by: Dr. Mustafa A. Mansi. The Phylogenetic Systematics (Phylogeny and Systematics)

UoN, CAS, DBSC BIOL102 lecture notes by: Dr. Mustafa A. Mansi. The Phylogenetic Systematics (Phylogeny and Systematics) - Phylogeny? - Systematics? The Phylogenetic Systematics (Phylogeny and Systematics) - Phylogenetic systematics? Connection between phylogeny and classification. - Phylogenetic systematics informs the

More information

Tree of Life iological Sequence nalysis Chapter http://tolweb.org/tree/ Phylogenetic Prediction ll organisms on Earth have a common ancestor. ll species are related. The relationship is called a phylogeny

More information

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2009 University of California, Berkeley

PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION Integrative Biology 200B Spring 2009 University of California, Berkeley "PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2009 University of California, Berkeley B.D. Mishler Jan. 22, 2009. Trees I. Summary of previous lecture: Hennigian

More information

Supplementary Materials for

Supplementary Materials for advances.sciencemag.org/cgi/content/full/1/8/e1500527/dc1 Supplementary Materials for A phylogenomic data-driven exploration of viral origins and evolution The PDF file includes: Arshan Nasir and Gustavo

More information
