Thanks to Paul Lewis, Jeff Thorne, and Joe Felsenstein for the use of slides


Hennigian logic reconstructs the tree if we know the polarity of characters and there is no homoplasy.
UPGMA infers a tree from a distance matrix:
  groups taxa based on similarity;
  fails to give the correct tree if rates of character evolution vary much.
Modern distance-based approaches:
  find trees and branch lengths such that patristic distances approximate the distances computed from character data;
  do not use all of the information in the data.
Parsimony: prefer the tree that requires the fewest character state changes, i.e. minimize the number of times you invoke homoplasy to explain the data.
  can work well even if homoplasy is not rare;
  fails if homoplasy is very common or is concentrated on certain parts of the tree.
Maximum likelihood:
  computes the probability of the data given a model (tree and branch lengths);
  computationally expensive.

Review: Tree Searching
Hennigian logic builds a tree directly from the characters.
UPGMA builds a tree from distances.
Parsimony, maximum likelihood, and modern distance methods are optimality criteria. We still have to search for the best tree.
There are too many trees to enumerate them exhaustively.
We rely on hill-climbing heuristics.

Even if we find the optimal tree, we do not know that it is the true tree. How do we assess statistical support?

The bootstrap
[Figure: an (unknown) true distribution with an (unknown) true value of θ; the empirical distribution of the sample, which gives the estimate of θ; and bootstrap replicates drawn from the empirical distribution, which give the distribution of estimates of parameters.]
(Week 7: Bayesian inference, Testing trees, Bootstraps, p. 33/54)

The bootstrap for phylogenies
[Figure: the original data matrix (sequences by sites) gives the estimate of the tree. Bootstrap sample #1 is drawn by sampling the same number of sites, with replacement, and gives bootstrap estimate of the tree #1. Bootstrap sample #2 is drawn the same way and gives bootstrap estimate of the tree #2 (and so on).]

Bootstrapping: first step
[Figure: data matrix with taxa 1 to 4 as rows and sites 1, 2, 3, ..., k as columns, next to the tree for taxa 1 to 4 estimated from it]
From the original data, estimate a tree using, say, parsimony (NJ, LS, ML, etc. could be used instead).
Copyright 2007 Paul O. Lewis

Bootstrapping: first replicate
site     1  2  3  4  5  6  7  ...  k
weights  1  2  0  0  1  3  1  ...  2
[Figure: data matrix for taxa 1 to 4 with one weight above each site, next to the tree estimated from the bootstrap dataset]
The sum of weights equals k (i.e., each bootstrap dataset has the same number of sites as the original).
From the bootstrap dataset, estimate the tree using the same method you used for the original dataset.
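The site weights on this slide can be produced by sampling site indices with replacement and counting how often each site is drawn. The following is a minimal sketch of that idea; the function name `bootstrap_weights` and the choice of seed are illustrative assumptions, not part of the original slides.

```python
import random
from collections import Counter


def bootstrap_weights(k, rng):
    """Return one weight per site for a single bootstrap replicate.

    k   -- number of sites in the original alignment
    rng -- a random.Random instance (passed in so results are repeatable)
    """
    # Draw k site indices with replacement from 0..k-1.
    draws = [rng.randrange(k) for _ in range(k)]
    counts = Counter(draws)
    # A site that was never drawn gets weight 0.
    return [counts[site] for site in range(k)]


if __name__ == "__main__":
    weights = bootstrap_weights(10, random.Random(2007))
    print(weights, "sum =", sum(weights))
```

Whatever the draw, the weights always sum to k, which is why every bootstrap dataset has the same number of sites as the original.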

Bootstrapping: second replicate
site     1  2  3  4  5  6  7  ...  k
weights  0  1  1  1  1  3  0  ...  0
[Figure: data matrix for taxa 1 to 4 with one weight above each site, next to the tree estimated from the bootstrap dataset]
Note that the weights are different this time, reflecting the random sampling with replacement used to generate the weights.
This time the tree that is estimated is different from the one estimated using the original dataset.

Bootstrapping: 20 replicates
1234   Freq
----------
-*-*   75.0
-**-   15.0
--**   10.0
[Figure: the 20 trees estimated from the 20 bootstrap replicates]
Note: usually at least 100 replicates are performed, and 500 is better.
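A frequency table like the one above can be tallied by recording, for every replicate tree, which taxa fall on each side of its internal branch. The sketch below assumes four taxa, one internal split per tree, and an asterisk pattern that marks the side not containing taxon 1. The replicate list is hypothetical and was chosen only so that the percentages match the table.

```python
from collections import Counter


def split_pattern(side, n_taxa=4):
    """Encode one side of a split as a string of '-' and '*' (taxon 1 is always '-')."""
    side = set(side)
    if 1 in side:
        # Use the complementary side so that equal splits get equal patterns.
        side = set(range(1, n_taxa + 1)) - side
    return "".join("*" if taxon in side else "-" for taxon in range(1, n_taxa + 1))


def split_frequencies(replicate_splits, n_taxa=4):
    """Return the percentage of replicates that contain each split."""
    counts = Counter(split_pattern(s, n_taxa) for s in replicate_splits)
    total = len(replicate_splits)
    return {pattern: 100.0 * count / total for pattern, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical outcome of 20 replicates: 15, 3 and 2 trees per split.
    replicates = [{2, 4}] * 15 + [{2, 3}] * 3 + [{3, 4}] * 2
    for pattern, freq in split_frequencies(replicates).items():
        print(pattern, freq)
```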


[Figure: tree with branch lengths given as percent sequence divergence (20%, 10%, 5%, 4.5%, 0.5%); one node is calibrated by a 200 million year old fossil]

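The "Clock Idea"
20% sequence divergence in 200 million years means 1% divergence per 10 million years.
[Figure: the same tree (branch lengths 20%, 10%, 5%, 4.5%, 0.5%) with the node calibrated by the 200 million year old fossil; the other nodes are dated to 10 million, 100 million and 400 million years]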
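The dating arithmetic on this slide can be written out as a short calculation. The sketch below assumes, as a reading of the figure, that the divergence between two tips is twice the depth of their common ancestor, and that the fossil-calibrated node sits at a depth of 10%. The function name and parameter names are illustrative.

```python
def node_age_my(depth_percent, calib_depth_percent=10.0, calib_age_my=200.0):
    """Date a node under a strict clock from one fossil-calibrated node.

    depth_percent       -- branch length from the node to the tips (percent)
    calib_depth_percent -- depth of the calibrated node (assumed 10%)
    calib_age_my        -- fossil age of the calibrated node (million years)
    """
    divergence = 2.0 * depth_percent              # tip-to-tip divergence
    calib_divergence = 2.0 * calib_depth_percent  # 20% in 200 million years
    # Strict clock: age is proportional to divergence.
    return divergence * calib_age_my / calib_divergence


if __name__ == "__main__":
    for depth in (0.5, 5.0, 10.0, 20.0):
        print(f"depth {depth}% -> {node_age_my(depth)} million years")
```

Under these assumptions the depths 0.5%, 5% and 20% correspond to the 10, 100 and 400 million year dates shown in the figure.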

"A comparison of the structures of homologous proteins... from different species is important, therefore, for two reasons. First, the similarities found give a measure of the minimum structure for biological function. Second, the differences found may give us important clues to the rate at which successful mutations have occurred throughout evolutionary time and may also serve as an additional basis for establishing phylogenetic relationships." From p. 143 of The Molecular Basis of Evolution by Dr. Christian B. Anfinsen (Wiley, 1959)

A problem with the "Clock Idea": rates of molecular evolution change over time!!
[Figure: the dated tree again, with branch lengths 20%, 10%, 5%, 4.5%, 0.5% and node dates of 10, 100, 200 (fossil) and 400 million years]

Ernst Mayr recalled at this meeting that there are two distinct aspects to phylogeny: the splitting of lines, and what happens to the lines subsequently by divergence. He emphasized that, after splitting, the resulting lines may evolve at very different rates... How can one then expect a given type of protein to display constant rates of evolutionary modification along different lines of descent? (Evolving Genes and Proteins. Zuckerkandl and Pauling, 1965, p. 138).

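Molecular Clock vs. No Clock
[Figure: two trees for taxa A to E, one drawn under a molecular clock and one without a clock; the horizontal axis is the amount of evolution (substitutions per site)]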

Assuming a Strict Molecular Clock
No Clock: lnL = -10623
Clock: lnL = -10739
LR test statistic = 232
n = 15 taxa, n - 2 = 13 d.f.
Null (clock) hypothesis rejected.
Langley, C. H., and W. M. Fitch. 1974. An estimation of the constancy of the rate of molecular evolution. Journal of Molecular Evolution 3:161-177.
Felsenstein, J. 1983. Statistical inference of phylogenies. Journal of the Royal Statistical Society A 146:246-272.
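The test statistic on this slide is twice the difference between the two log-likelihoods, and the degrees of freedom are n - 2. A minimal sketch of the calculation follows. The critical value 22.36 is the standard tabulated 5% point of the chi-square distribution with 13 degrees of freedom; it is hard-coded here as an assumption rather than computed, and it applies only when the test has 13 degrees of freedom.

```python
def clock_lrt(lnl_no_clock, lnl_clock, n_taxa):
    """Likelihood ratio test of the strict clock (null) against no clock."""
    statistic = 2.0 * (lnl_no_clock - lnl_clock)
    # Unrooted tree: 2n - 3 branch lengths; clock tree: n - 1 node heights.
    degrees_of_freedom = (2 * n_taxa - 3) - (n_taxa - 1)
    return statistic, degrees_of_freedom


if __name__ == "__main__":
    CHI2_CRIT_13DF_5PCT = 22.36  # tabulated value, valid for 13 d.f. only
    stat, df = clock_lrt(-10623.0, -10739.0, n_taxa=15)
    print("LR statistic:", stat, "d.f.:", df)
    print("clock rejected at 5%:", stat > CHI2_CRIT_13DF_5PCT)
```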

Reasons that the clock might be rejected
1. Rates of evolution vary across lineages and can vary over time:
(a) mutation rates can vary (mutations per cell cycle, mutations per time, number of cell cycles per generation, generation time);
(b) strength and targets of selection can vary;
(c) population sizes can vary.
2. Incorrect models of sequence evolution lead to errors in the estimation of rates:
(a) almost any error in the model can lead to biases (or higher than needed variance) in detecting multiple hits;
(b) the assumption of a Poisson clock can be wrong: even if we correctly count the number of changes, if we don't account for over-dispersion (higher than Poisson variance in the number of substitutions) then we can falsely reject the clock (Cutler, 2000).

Penalized likelihood: penalize rates that vary too much.
Bayesian approaches: model the rate of evolution of the rate of evolution; incorporate prior knowledge of which rate combinations are most likely.

Molecular sequence data
protein and (later) DNA sequences
clearly not environmental or plastic
Kimura's neutral theory implies that homoplasy due to functional convergence should be rare

Homo sap.   Pan trog.   Gorilla gor.   Pongo pyg.
The sequences cannot be character states in a Hennigian analysis. No two are shared!

Homo sap.   Pan trog.   Gorilla gor.   Pongo pyg.
We could treat columns ("sites") as characters and the bases as states. This requires an alignment.

Insertions and deletions (indels) of nucleotides occur during evolution, so we cannot count on the 5th position in every sequence being descended from the same ancestral base.
Alignment: adding gap characters ("-") to sequences.
The goal of alignment is to make homologous sites occur in the same column.
Multiple sequence alignment is a very difficult problem compared to pairwise alignment.

Uses of multiple sequence alignment
Correspondence: we often want to know which parts do the same thing or have the same structure.
Profiles: we can create profiles that summarize the characteristics of a protein family.
Genome assembly: alignment is a part of the creation of contig maps of genomic fragments such as ESTs.
Phylogenetics: the vast majority of phylogenetic methods require aligned data.

Current standard operating procedure for tree reconstruction from molecular sequence data
1. Collect sequences
2. Align the sequences (usually with clustalw or clustalx)
3. Remove/recode regions of uncertain alignment
4. Infer phylogenetic trees

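human  KRSV
chimp  KRV
orang  KPRV

[Figure: tree relating human (KRSV), chimp (KRV) and orangutan (KPRV), with the changes del S, S->R and P->R mapped onto its branches and the ancestral sequences KRSV and KPSV at its internal nodes]

human    KRSV
chimp    KRV
gorilla  KSV
orang    KPRV
How should we align these sequences?

human    KRSV         human    KRSV
chimp    KR-V    OR   chimp    K-RV
gorilla  KS-V         gorilla  K-SV
orang    KPRV         orang    KPRV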

Pairwise alignment
Gap penalties and a substitution matrix imply a score for any alignment.
Pairwise alignment involves finding the alignment that maximizes this score.
Substitution matrices assign positive values to matches or similar substitutions (for example, Leucine to Isoleucine).
Unlikely substitutions receive negative scores.
Gaps are rare and are heavily penalized (given large negative values).

Scoring an alignment. Simplest case
Costs: Match 1, Mismatch 0, Gap -5
Alignment:
Pongo    V D E V E L R L F V V P Q
Gorilla  V E V D L R L L I V Y P S R
Score    1 0 0 0 1 0 0 0 0 0 1 0 1 0 1 0 0
Total score = 5

Scoring a different alignment. Simplest case
Costs: Match 1, Mismatch 0, Gap -5
Pongo    V D E V E L R L - F V V P Q
Gorilla  V - E V D L R L L I V Y P S R
Score    1 -5 1 1 0 1 0 1 1 1 1 -5 0 1 0 1 0 0
Total score = 0
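The simplest-case scoring rule used on these two slides (match 1, mismatch 0, gap -5) can be expressed as a short function. This is a minimal sketch; the function name and the short example sequences are illustrative and are not the Pongo and Gorilla sequences from the slides.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-5):
    """Sum the column scores of a pairwise alignment given as two gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap        # a column containing a gap character
        elif x == y:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total


if __name__ == "__main__":
    print(score_alignment("VDEV", "V-EV"))
    print(score_alignment("VDEV", "VEVD"))
```

Replacing the match and mismatch constants with a lookup in a substitution matrix gives the matrix-based scoring used on the later slides.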

BLOSUM 62 Substitution matrix
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A    4
R   -1   5
N   -2   0   6
D   -2  -2   1   6
C    0  -3  -3  -3   9
Q   -1   1   0   0  -3   5
E   -1   0   0   2  -4   2   5
G    0  -2   0  -1  -3  -2  -2   6
H   -2   0   1  -1  -3   0   0  -2   8
I   -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L   -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K   -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M   -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F   -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P   -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S    1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T    0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W   -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y   -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V    0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4

Scoring an alignment with the BLOSUM 62 matrix
Pongo    V D E V E L R L F V V P Q
Gorilla  V E V D L R L L I V Y P S R
Score    4 2 -2 0 6 -6 -3 -4 -2 -2 4 0 4 -1 7 4 1
The score for the alignment is $D_{ij} = \sum_k d_{ij}^{(k)}$.
If i indicates Pongo and j indicates Gorilla, $D_{ij} = 12$.

Scoring an alignment with gaps
If the GP is -8:
Pongo    V D E V E L R L - F V V P Q
Gorilla  V - E V D L R L L I V Y P S R
Score    4 -8 5 5 0 6 2 4 6 5 4 -8 0 4 -1 7 4 1
By introducing gaps we have improved the score: $D_{ij} = 40$

Gap Penalties
Gaps are penalized more heavily than substitutions to avoid alignments like this:
Pongo    VDEVE-LRLFVVPQ
Gorilla  VDEV-WLRLFVVPQ

Gap Penalties
Because multiple residues are often inserted or deleted at the same time, affine gap penalties are often used:
$GP = GO + l \cdot GE$
where:
GP is the gap penalty,
GO is the gap-opening penalty,
GE is the gap-extension penalty,
l is the length of the gap.
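The effect of the affine formula is easiest to see by comparing one long gap with several short ones. In the sketch below the penalty values GO = -10 and GE = -1 are hypothetical and chosen only for illustration.

```python
def affine_gap_penalty(length, gap_open, gap_extend):
    """Return GP = GO + l * GE for a single gap of the given length."""
    if length < 1:
        raise ValueError("a gap must have length 1 or more")
    return gap_open + length * gap_extend


if __name__ == "__main__":
    GO, GE = -10, -1  # hypothetical penalties, not taken from the slides
    one_long_gap = affine_gap_penalty(3, GO, GE)
    three_short_gaps = 3 * affine_gap_penalty(1, GO, GE)
    print("one gap of length 3:   ", one_long_gap)
    print("three gaps of length 1:", three_short_gaps)
```

With these values a single three-residue gap costs far less than three separate one-residue gaps, which is the behaviour the slide motivates.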

Finding an optimal alignment
[Figure: alignment grid with the Gorilla sequence along one axis and the Pongo sequence along the other]

Aligning two sequences, each with length = 1: D and E

Alignment 1
D-
-E

Alignment 2
D
E

Alignment 3
-D
E-

Longer sequences: up to 2 amino acids! (VD and VE)
[Figures: Alignments 1 to 13, the thirteen possible alignments of VD with VE, each drawn as a path through the alignment grid]

[Figure: the ungapped Pongo/Gorilla alignment (column scores 4 2 -2 0 6 -6 -3 -4 -2 -2 4 0 4 -1 7 4 1) drawn as a path through the alignment grid]

[Figure: the gapped Pongo/Gorilla alignment (column scores 4 -8 5 5 0 6 2 4 6 5 4 -8 0 4 -1 7 4 1) drawn as a path through the alignment grid]

length Seq #1   length Seq #2   # alignments
1               1               3
2               2               13
3               3               63
4               4               321
5               5               1,683
6               6               8,989
7               7               48,639
8               8               265,729
9               9               1,462,563
...
17              17              1,425,834,724,419
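The counts in this table follow from a simple recurrence: the last column of any alignment is either a pair of residues, a residue over a gap, or a gap over a residue. The following sketch applies that recurrence; the function name is illustrative.

```python
def count_alignments(n, m):
    """Count the possible alignments of two sequences of lengths n and m."""
    # table[i][j] = number of alignments of a length-i prefix with a length-j prefix.
    # Aligning anything with an empty sequence can be done in exactly one way.
    table = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = (table[i - 1][j - 1]   # last column pairs two residues
                           + table[i - 1][j]     # last column is a residue over a gap
                           + table[i][j - 1])    # last column is a gap over a residue
    return table[n][m]


if __name__ == "__main__":
    for length in (1, 2, 3, 9, 17):
        print(length, count_alignments(length, length))
```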

Needleman-Wunsch algorithm (paraphrased)
Work from the top left (beginning of both sequences).
For each cell, store the highest score possible for that cell and a back pointer that points to the previous step in the best path.
When you reach the lower right corner, you know the optimal score, and the back pointers tell you the alignment.
The highest-score calculation at each cell depends only on the cell's three possible previous neighbors.
If one sequence has length N and the other has length M, then Needleman-Wunsch takes only about 6NM calculations, even though the number of possible alignments is much larger.
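The paraphrase above translates fairly directly into code. The following is a minimal sketch rather than a reference implementation. It assumes the simplest-case costs (match 1, mismatch 0, gap -5) instead of a substitution matrix, a linear rather than affine gap penalty, and a tie-breaking rule that prefers the diagonal move; all names are illustrative.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-5):
    """Return (best score, aligned a, aligned b) for a global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    # First column and first row: a prefix aligned entirely against gaps.
    for i in range(1, n + 1):
        score[i][0], back[i][0] = i * gap, "up"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = j * gap, "left"
    # Each cell depends only on its three previous neighbors.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + pair, "diag"),
                (score[i - 1][j] + gap, "up"),
                (score[i][j - 1] + gap, "left"),
            ]
            score[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
    # Follow the back pointers from the lower right corner.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif move == "up":
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    best, row1, row2 = needleman_wunsch("VDEV", "VEV")
    print(best)
    print(row1)
    print(row2)
```

The two nested loops visit each of the N x M cells once and do a constant amount of work per cell, which is where the count proportional to NM comes from.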

[Figures: the score matrix is filled in one cell at a time, starting from 0 in the top left corner. Pongo residues label the columns, Gorilla residues label the rows, and each gap costs -5.]

Completed scores for the first nine columns of the matrix (the first row and first column correspond to aligning against gaps only):
    0   -5  -10  -15  -20  -25  -30  -35  -40
   -5    4   -1   -6  -11  -16  -21  -26  -31
  -10   -1    6    4   -1   -6  -11  -16  -21
  -15   -6    1    4    8    3   -2   -7  -12
  -20  -11   -4    0    4    8    3   -2   -7
  -25  -16   -9   -5   -1   10   14    9    4
  -30  -21  -10   -7   -6    5    9   16   11
  -35  -26  -15  -12   -6    0    4   11   20
  -40  -31  -20  -17  -11    0    6    6   15
  -45  -36  -25  -20  -16   -5    1    6   10
  -50  -41  -30  -25  -19  -10   -4    1   10
  -55  -46  -35  -30  -24  -15   -9   -4    5
  -60  -51  -40  -35  -27  -20  -14   -9    0
  -65  -56  -45  -40  -31  -25  -19  -14   -5
  -70  -61  -50  -45  -36  -30  -24  -19  -10
  -75  -66  -55  -50  -41  -35  -29  -24  -15
  -80  -71  -60  -55  -46  -40  -34  -29  -20

Aligning multiple sequences
[Figure: tree relating sequences A, B, C, D and E]

Progressive alignment
Devised by Feng and Doolittle 1987 and Higgins and Sharp, 1988.
An approximate method for producing multiple sequence alignments using a guide tree.
Perform pairwise alignments to produce a distance matrix.
Produce a guide tree from the distances.
Use the guide tree to specify the ordering used for aligning sequences, closest to furthest.

Sequences:
A  PEEKSVLWKVNVDEV
B  EEKVLLWDKVNEEEV
C  PDKNVKWKVHEY
D  DKNVKWSKVHEY
E  EHEWQLVLHVWKVEDVHQ

Pairwise alignment gives a distance matrix:
     A     B     C     D     E
A    -
B   .17    -
C   .59   .60    -
D   .59   .59   .13    -
E   .77   .77   .75   .75    -

Tree inference gives a guide tree for A, B, C, D and E.

Alignment stage:
A  PEEKSVLWKVN--VDEV
B  EEKVLLWDKVN--EEEV
C  PDKNVKWKVHEY
D  DKNVKWSKVHEY
E  EHEWQLVLHVWKVEDVHQ
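The step from a distance matrix to an alignment order can be sketched with a simple clustering pass. The code below uses average linkage (UPGMA-style) clustering, which is an assumption made for illustration; alignment programs may build their guide trees with other methods, such as neighbor joining. The distances are those in the matrix above, with the taxon labels A to E read as shown there, and all function names are illustrative.

```python
from itertools import combinations


def average_linkage_order(names, dist):
    """Return the merges (cluster1, cluster2, mean distance) in the order they occur."""
    def d(x, y):
        return dist[frozenset((x, y))]

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for c1, c2 in combinations(clusters, 2):
            # Mean of all pairwise distances between members of the two clusters.
            mean = sum(d(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))
            if best is None or mean < best[0]:
                best = (mean, c1, c2)
        mean, c1, c2 = best
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2, mean))
    return merges


if __name__ == "__main__":
    pairs = {
        ("A", "B"): 0.17,
        ("A", "C"): 0.59, ("B", "C"): 0.60,
        ("A", "D"): 0.59, ("B", "D"): 0.59, ("C", "D"): 0.13,
        ("A", "E"): 0.77, ("B", "E"): 0.77, ("C", "E"): 0.75, ("D", "E"): 0.75,
    }
    dist = {frozenset(k): v for k, v in pairs.items()}
    for c1, c2, mean in average_linkage_order("ABCDE", dist):
        print(c1, "+", c2, "at", round(mean, 4))
```

Under these assumptions the closest pairs are joined first, the two pairs are then joined to each other, and E is added last, which is the closest-to-furthest ordering the progressive method calls for.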

Alignment stage of progressive alignments
Sequences of clades become grouped into "profiles" as the algorithm descends the tree.
The next youngest internal node is selected at each step to create a new profile.
Alignment at each step involves one of:
  Sequence-Sequence
  Sequence-Profile
  Profile-Profile

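Aligning multiple sequences
[Figure: guide tree for A, B, C, D and E with branch lengths; each internal node is labelled with the kind of alignment performed there: Seq-Seq, Seq-Seq, Seq-Group and Group-Group]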

Profile to Profile alignment V E V D L R L L I Y P S R V E D E V L M R L F V P Q L D D E V - V R L F V P Q V E I D L - - L L L Y P R V V E V E L - - L L L Y P K I

Profile to profile alignments
Adding a gap to a profile means that every member of that group of sequences gets a gap at that position of the sequence.
Usually the scores for each edge in the Needleman-Wunsch graph are calculated using a "sum of pairs" scoring system.
Clustal W (Thompson, Higgins, and Gibson, Nuc. Acids Res. 1994) uses weights assigned to each sequence in a profile group to downweight closely related sequences so that they are not overrepresented.

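Profile 1                          Profile 2
taxon  Seq weight  residue         taxon  Seq weight  residue
A      0.3         V               B      0.15        V
C      0.24                        D      0.25        M
E      0.19        I

$$D_{P_1,P_2} = \frac{\sum_i \sum_j w_i w_j d_{ij}}{n_i n_j}$$

$$= \frac{1}{6}\left[ d(V,V)\,w_A w_B + d(V,M)\,w_A w_D + d(x_C,V)\,w_C w_B + d(x_C,M)\,w_C w_D + d(I,V)\,w_E w_B + d(I,M)\,w_E w_D \right]$$

$$= \frac{1}{6}\left( 4 \times 0.3 \times 0.15 + 1 \times 0.3 \times 0.25 + 0 \times 0.24 \times 0.15 - 1 \times 0.24 \times 0.25 + 3 \times 0.19 \times 0.15 + 1 \times 0.19 \times 0.25 \right) \approx 0.0547$$

Here $x_C$ is the residue of taxon C at this position, $n_i = 3$ and $n_j = 2$ are the numbers of sequences in the two profiles, and $d$ is the substitution score.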
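The weighted sum-of-pairs score for one pair of profile columns can be sketched as below. The example columns, weights and the two substitution scores (V/V = 4 and I/V = 3, as in BLOSUM 62) are a small hypothetical case, not the profiles from the slide, and the function names are illustrative.

```python
def weighted_sum_of_pairs(column1, column2, score):
    """Average of w_i * w_j * d(residue_i, residue_j) over all cross-profile pairs.

    column1, column2 -- lists of (residue, sequence weight) tuples, one per sequence
    score            -- function returning the substitution score of two residues
    """
    total = 0.0
    for residue_i, w_i in column1:
        for residue_j, w_j in column2:
            total += w_i * w_j * score(residue_i, residue_j)
    return total / (len(column1) * len(column2))


if __name__ == "__main__":
    # Hypothetical two-entry score table, stored with residues in sorted order.
    table = {("V", "V"): 4, ("I", "V"): 3}

    def lookup(x, y):
        return table[tuple(sorted((x, y)))]

    profile1_column = [("V", 0.5), ("I", 0.5)]  # hypothetical weights
    profile2_column = [("V", 1.0)]
    print(weighted_sum_of_pairs(profile1_column, profile2_column, lookup))
```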

Dealing with alignment ambiguity: elision
[Figure, from M. S. Y. Lee, TREE, 2001 (Trends in Ecology & Evolution Vol. 16 No. 12, December 2001, p. 682): alternative alignments of regions X, Y and Z for the outgroup and taxa A to E. In the elision method, a range of plausible alignments is generated and, instead of being analysed separately, the alignments are combined ("concatenated") into a single data set and evaluated in a single analysis.]

Dealing with alignment ambiguity: deletion
[Figure, from M. S. Y. Lee, TREE, 2001: the same alignments of regions X, Y and Z for the outgroup and taxa A to E, with the ambiguously aligned region removed from the analysis.]

Dealing with alignment ambiguity
Elision method (Wheeler, 1995) involves simply concatenating matrices.
[Figure, from M. S. Y. Lee, TREE, 2001: alignments of regions X, Y and Z for the outgroup and taxa A to E, together with a step matrix of costs from state to state.]

Simultaneous tree inference and alignment
Ideally we would address uncertainty in both types of inference at the same time.
Allows for application of statistical models to improve inference and assessments of reliability.
Just now becoming feasible: POY (Wheeler, Gladstein, De Laet, 2002), Handel (Holmes and Bruno, 2001), BAli-Phy (Redelings and Suchard, 2005), and BEAST (Lunter et al., 2005, Drummond and Rambaut, 2003). SATe (Liu et al 2009; Yu and Holder software).