Thanks to Paul Lewis, Jeff Thorne, and Joe Felsenstein for the use of slides

- Hennigian logic reconstructs the tree if we know the polarity of characters and there is no homoplasy.
- UPGMA infers a tree from a distance matrix:
  - groups based on similarity
  - fails to give the correct tree if rates of character evolution vary much
- Modern distance-based approaches:
  - find trees and branch lengths whose patristic distances approximate the distances computed from character data
  - do not use all of the information in the data
- Parsimony: prefer the tree that requires the fewest character state changes. Minimize the number of times you invoke homoplasy to explain the data.
  - can work well even if homoplasy is not rare
  - fails if homoplasy is very common or is concentrated on certain parts of the tree
- Maximum likelihood computes the probability of the data given a model (tree and branch lengths).
  - computationally expensive

Review: Tree Searching
- Hennigian logic builds a tree directly from the characters.
- UPGMA builds a tree from distances.
- Parsimony, maximum likelihood, and modern distance methods are optimality criteria. We still have to search for the best tree.
  - Too many trees to enumerate them exhaustively
  - We rely on hill-climbing heuristics

Even if we find the optimal tree, we do not know that it is the true tree. How do we assess statistical support?

The bootstrap
[Figure: the (unknown) true distribution with the (unknown) true value of θ; the empirical distribution of the sample with the estimate of θ; bootstrap replicates drawn from the sample; and the resulting distribution of estimates of parameters.]
Week 7: Bayesian inference, Testing trees, Bootstraps p.33/54

The bootstrap for phylogenies
[Figure: the original data (sequences by sites) give the estimate of the tree. Bootstrap sample #1 and bootstrap sample #2 each sample the same number of sites, with replacement, and give bootstrap estimates of the tree #1 and #2 (and so on).]

Bootstrapping: first step
[Figure: data matrix with rows for sequences 1-4 and columns for sites 1, 2, 3, ..., k, next to the tree for taxa 1-4 estimated from it.]
From the original data, estimate a tree using, say, parsimony (could use NJ, LS, ML, etc., however).
Copyright 2007 Paul O. Lewis

Bootstrapping: first replicate
site:    1  2  3  4  5  6  7  ...  k
weight:  1  2  0  0  1  3  1  ...  2
Sum of weights equals k (i.e., each bootstrap dataset has the same number of sites as the original).
From the bootstrap dataset, estimate the tree using the same method you used for the original dataset.
[Figure: the tree estimated for taxa 1-4.]

Bootstrapping: second replicate
site:    1  2  3  4  5  6  7  ...  k
weight:  0  1  1  1  1  3  0  ...  0
Note that the weights are different this time, reflecting the random sampling with replacement used to generate the weights.
This time the tree that is estimated is different from the one estimated using the original dataset.
[Figure: the estimated tree, with leaves in the order 1, 3, 2, 4.]

Bootstrapping: 20 replicates
[Figure: the 20 trees for taxa 1-4 estimated from the 20 bootstrap replicates.]
1234    Freq
------------
-*-*    75.0
-**-    15.0
--**    10.0
Note: usually at least 100 replicates are performed, and 500 is better.
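The resampling step described on the bootstrap slides can be sketched in a few lines. This is a minimal illustration, not the code behind the slides: the function names, the toy alignment, and the seed are all assumptions made for the example, and the tree-estimation step (parsimony, NJ, ML, ...) is deliberately left out.

```python
import random
from collections import Counter


def bootstrap_weights(num_sites, rng):
    """Draw num_sites site indices with replacement; return how often each site was drawn."""
    draws = [rng.randrange(num_sites) for _ in range(num_sites)]
    counts = Counter(draws)
    return [counts[site] for site in range(num_sites)]


def bootstrap_alignment(alignment, rng):
    """Build one bootstrap dataset by resampling whole columns (sites) with replacement."""
    num_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(num_sites) for _ in range(num_sites)]
    return {taxon: "".join(seq[c] for c in columns) for taxon, seq in alignment.items()}


# Hypothetical toy alignment: 4 taxa, k = 8 sites (made up for illustration).
toy = {"1": "ACGTACGT", "2": "ACGTACGA", "3": "ACGAACGA", "4": "TCGAACGA"}
rng = random.Random(1)  # fixed seed only so the example is repeatable

weights = bootstrap_weights(8, rng)       # one weight per original site
replicate = bootstrap_alignment(toy, rng)  # same number of sites as the original
```

Each replicate would then be passed to the same tree-estimation method used for the original data, and the frequency of each grouping across replicates gives a table like the one on the "20 replicates" slide.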

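[Figure: a tree with branches labeled by percent sequence divergence (20%, 10%, 0.5%, 5%, 0.5%, 4.5%, 5%, 10%) and one node marked by a 200 Million Year Old Fossil.]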

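The "Clock Idea"
20% Sequence Divergence in 200 Mill. Years means 1% divergence per 10 Mill. Years
[Figure: the same tree with branches labeled 20%, 10%, 0.5%, 5%, 4.5%, 0.5%, 5%, 10%; one node is calibrated by the 200 Million Year Old Fossil and the other nodes are dated at 10 Million, 100 Million, and 400 Million years.]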
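The arithmetic behind the clock calibration can be written out as a short sketch. The calibration (20% divergence in 200 million years) is from the slide; the function names are hypothetical, and the three input divergences (1%, 10%, 40%) are illustrative values chosen because they reproduce the 10, 100, and 400 million year dates shown in the figure.

```python
def clock_rate(divergence_percent, time_my):
    """Rate in percent divergence per million years, from one calibration point."""
    return divergence_percent / time_my


def date_from_divergence(divergence_percent, rate):
    """Age in millions of years implied by a divergence under a strict clock."""
    return divergence_percent / rate


# Calibration from the slide: 20% divergence corresponds to the 200 My old fossil.
rate = clock_rate(20.0, 200.0)  # 0.1% per My, i.e. 1% per 10 My

# Illustrative divergences (assumed inputs, in percent).
ages = {d: date_from_divergence(d, rate) for d in (1.0, 10.0, 40.0)}
```

The calculation is only as good as the assumption that the rate is the same on every branch.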

"A comparison of the structures of homologous proteins... from different species is important, therefore, for two reasons. First, the similarities found give a measure of the minimum structure for biological function. Second, the differences found may give us important clues to the rate at which successful mutations have occurred throughout evolutionary time and may also serve as an additional basis for establishing phylogenetic relationships." From p. 143 of The Molecular Basis of Evolution by Dr. Christian B. Anfinsen (Wiley, 1959)

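A problem with the "Clock Idea": Rates of Molecular Evolution Change Over Time!!
[Figure: the dated tree again, with branches labeled 20%, 10%, 0.5%, 5%, 4.5%, 0.5%, 5%, 10%, node dates of 10 Million, 100 Million, and 400 Million years, and the 200 Million Year Old Fossil.]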

Ernst Mayr recalled at this meeting that there are two distinct aspects to phylogeny: the splitting of lines, and what happens to the lines subsequently by divergence. He emphasized that, after splitting, the resulting lines may evolve at very different rates... How can one then expect a given type of protein to display constant rates of evolutionary modification along different lines of descent? (Evolving Genes and Proteins. Zuckerkandl and Pauling, 1965, p. 138).

[Figure: two trees for taxa A-E, one labeled "Molecular Clock" and one labeled "No Clock"; the axis is the amount of evolution (substitutions per site).]

Assuming a Strict Molecular Clock
No Clock: lnL = -10623
Clock: lnL = -10739
LR test statistic = 232
n = 15 taxa, n - 2 = 13 d.f.
Null (clock) hypothesis rejected
Langley, C. H., and W. M. Fitch. 1974. An estimation of the constancy of the rate of molecular evolution. Journal of Molecular Evolution 3:161-177.
Felsenstein, J. 1983. Statistical inference of phylogenies. Journal of the Royal Statistical Society 146:246-272.
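The likelihood ratio test on this slide can be checked with a short sketch. The log-likelihoods, the number of taxa, and the n - 2 degrees of freedom are from the slide; the function name is hypothetical, and the critical value is the standard upper 5% point of a chi-square distribution with 13 degrees of freedom, hard-coded here as an assumption so that no statistics library is needed.

```python
def clock_lrt(lnl_no_clock, lnl_clock, n_taxa):
    """Likelihood ratio statistic and degrees of freedom for clock vs. no clock."""
    statistic = 2.0 * (lnl_no_clock - lnl_clock)
    degrees_of_freedom = n_taxa - 2
    return statistic, degrees_of_freedom


# Upper 5% critical value of chi-square with 13 d.f. (from standard tables).
CHI2_CRIT_05_DF13 = 22.362

stat, df = clock_lrt(-10623.0, -10739.0, 15)
reject_clock = stat > CHI2_CRIT_05_DF13
```

With a statistic of 232 against a critical value near 22, the clock is rejected by a wide margin, which matches the conclusion on the slide.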

Reasons that the clock might be rejected
1. Rates of evolution vary across lineages and can vary over time:
   (a) mutation rates can vary (mutations per cell cycle, mutations per time, number of cell cycles per generation, generation time).
   (b) strength and targets of selection can vary
   (c) population sizes can vary
2. Incorrect models of sequence evolution lead to errors in the estimation of rates:
   (a) Almost any error in the model can lead to biases (or higher than needed variance) in detecting multiple hits
   (b) Assumption of a Poisson clock can be wrong: even if we correctly count the number of changes, if we don't account for over-dispersion (higher than Poisson variance in the # of substitutions) then we can falsely reject the clock (Cutler, 2000)

- Penalized likelihood (penalize rates that vary too much)
- Bayesian approaches: model the rate of evolution of the rate of evolution. This incorporates prior knowledge of which combinations of rates are most likely.

Molecular sequence data
- protein and (later) DNA sequences
- clearly not environmental or plastic
- Kimura's neutral theory implies that homoplasy due to functional convergence should be rare

[Figure: unaligned sequences for Homo sap., Pan trog., Gorilla gor., and Pongo pyg.]
The sequences cannot be character states in a Hennigian analysis. No two are shared!

[Figure: the same sequences for Homo sap., Pan trog., Gorilla gor., and Pongo pyg.]
We could treat columns ("sites") as characters and the bases as states. This requires an alignment.

- Insertions and deletions (indels) of nucleotides occur during evolution.
- So, we cannot count on the 5th position in every sequence as being descended from the same ancestral base.
- Alignment: adding gap characters ("-") to sequences.
- The goal of alignment is to make homologous sites occur in the same column.
- Multiple sequence alignment is a very difficult problem compared to pairwise alignment.

Uses of multiple sequence alignment
- Correspondence: We often want to know which parts "do the same thing" or have the same structure.
- Profiles: we can create profiles that summarize the characteristics of a protein family.
- Genome assembly: alignment is a part of the creation of contig maps of genomic fragments such as ESTs.
- Phylogenetics: The vast majority of phylogenetic methods require aligned data.

Current standard operating procedure for tree reconstruction from molecular sequence data
1. Collect sequences
2. Align the sequences (usually with clustalw or clustalx)
3. Remove/recode regions of uncertain alignment
4. Infer phylogenetic trees

human  KRSV
chimp  KRV
orang  KPRV

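human      KRSV
chimp      KRV
orangutan  KPRV
[Figure: a tree for the three species, annotated with the sequences KRSV and KPSV at internal nodes and with the changes "del S", "S->R", and "P->R" on branches.]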

human    KRSV
chimp    KRV
gorilla  KSV
orang    KPRV
How should we align these sequences?
human    KRSV          human    KRSV
chimp    KR-V    OR    chimp    K-RV
gorilla  KS-V          gorilla  K-SV
orang    KPRV          orang    KPRV

Pairwise alignment
Gap penalties and a substitution matrix imply a score for any alignment. Pairwise alignment involves finding the alignment that maximizes this score.
- Substitution matrices assign positive values to matches or similar substitutions (for example, Leucine to Isoleucine).
- Unlikely substitutions receive negative scores.
- Gaps are rare and are heavily penalized (given large negative values).

Scoring an alignment. Simplest case
Costs: Match 1, Mismatch 0, Gap -5
Alignment (17 columns; some residue letters are not legible):
Pongo    V D E V E L R L F V V P Q
Gorilla  V E V D L R L L I V Y P S R
Score    1 0 0 0 1 0 0 0 0 0 1 0 1 0 1 0 0
Total score = 5

Scoring a different alignment. Simplest case
Costs: Match 1, Mismatch 0, Gap -5
Alignment (18 columns; some residue letters are not legible):
Pongo    V D E V E L R L - F V V P Q
Gorilla  V - E V D L R L L I V Y P S R
Score    1 -5 1 1 0 1 0 1 1 1 1 -5 0 1 0 1 0 0
Total score = 0
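The simplest-case scoring used on these two slides (match 1, mismatch 0, gap -5) can be sketched as a small function. The function name and the two short example alignments are hypothetical; the example rows are made up for illustration and are not the Pongo and Gorilla sequences from the slides.

```python
def score_alignment(row_a, row_b, match=1, mismatch=0, gap=-5):
    """Sum per-column scores for two aligned rows of equal length ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap       # any column containing a gap character
        elif x == y:
            total += match     # identical residues
        else:
            total += mismatch  # different residues
    return total


# Hypothetical toy alignments of the same two sequences.
ungapped = score_alignment("VDEV", "VEVA")   # columns score 1, 0, 0, 0
gapped = score_alignment("VDEV-", "V-EVA")   # columns score 1, -5, 1, 1, -5
```

Here the gapped alignment recovers two extra matches but pays for two gaps, so it scores lower, the same trade-off that the slide illustrates with a total of 0 versus 5.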

BLOSUM 62 Substitution matrix
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   4
R  -1  5
N  -2  0  6
D  -2 -2  1  6
C   0 -3 -3 -3  9
Q  -1  1  0  0 -3  5
E  -1  0  0  2 -4  2  5
G   0 -2  0 -1 -3 -2 -2  6
H  -2  0  1 -1 -3  0  0 -2  8
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4

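Scoring an alignment with the BLOSUM 62 matrix
Alignment (17 columns; some residue letters are not legible):
Pongo    V D E V E L R L F V V P Q
Gorilla  V E V D L R L L I V Y P S R
Score    4 2 -2 0 6 -6 -3 -4 -2 -2 4 0 4 -1 7 4 1
The score for the alignment is $D_{ij} = \sum_k d_{ij}^{(k)}$, where $d_{ij}^{(k)}$ is the score of column $k$.
If i indicates Pongo and j indicates Gorilla, $D_{ij} = 12$.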

Scoring an alignment with gaps
If the GP (gap penalty) is -8:
Alignment (18 columns; some residue letters are not legible):
Pongo    V D E V E L R L - F V V P Q
Gorilla  V - E V D L R L L I V Y P S R
Score    4 -8 5 5 0 6 2 4 6 5 4 -8 0 4 -1 7 4 1
By introducing gaps we have improved the score: $D_{ij} = 40$

Gap Penalties
Gaps are penalized more heavily than substitutions to avoid alignments like this:
Pongo    VDEVE-LRLFVVPQ
Gorilla  VDEV-WLRLFVVPQ

Gap Penalties
Because multiple residues are often inserted or deleted at the same time, affine gap penalties are often used:
$GP = GO + l \cdot GE$
where:
- GP is the gap penalty
- GO is the gap-opening penalty
- GE is the gap-extension penalty
- l is the length of the gap
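The affine penalty above is simple enough to state directly in code. The sketch below follows the slide's formula (opening penalty plus length times extension penalty); the function name and the numeric values of the opening and extension penalties are assumptions chosen only for illustration.

```python
def affine_gap_penalty(length, gap_open, gap_extend):
    """Penalty for one gap of the given length: GO + l * GE (no gap, no penalty)."""
    if length <= 0:
        return 0
    return gap_open + length * gap_extend


# Hypothetical penalties: opening a gap is costly, extending it is cheap.
GO, GE = -10, -1

one_long_gap = affine_gap_penalty(3, GO, GE)           # a single gap of length 3
three_short_gaps = 3 * affine_gap_penalty(1, GO, GE)   # three separate gaps of length 1
```

With these values one gap of length 3 costs -13 while three scattered single-residue gaps cost -33, so the scheme favors a few long indels over many short ones.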

Finding an optimal alignment
[Figure: an alignment grid with the Gorilla residues along one axis and the Pongo residues along the other.]

Aligning two sequences, each with length = 1: D and E

Alignment 1    Alignment 2    Alignment 3
D-             D              -D
-E             E              E-

Longer sequences: up to 2 amino acids! (VD and VE)
Alignments 1 through 13
[Figures: each of the 13 possible alignments of VD with VE, shown one per slide.]

[Figure: the ungapped Pongo and Gorilla alignment (column scores 4 2 -2 0 6 -6 -3 -4 -2 -2 4 0 4 -1 7 4 1) shown next to the alignment grid, with Gorilla along one axis and Pongo along the other.]

[Figure: the gapped Pongo and Gorilla alignment (column scores 4 -8 5 5 0 6 2 4 6 5 4 -8 0 4 -1 7 4 1) shown next to the alignment grid, with Gorilla along one axis and Pongo along the other.]

length Seq #1    length Seq #2    # alignments
1                1                3
2                2                13
3                3                63
4                4                321
5                5                1,683
6                6                8,989
7                7                48,639
8                8                265,729
9                9                1,462,563
...
17               17               1,425,834,724,419
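The counts in this table can be reproduced with a short recurrence. The sketch assumes the counting convention used on the earlier slides, where the three alignments of two length-1 sequences are all distinct: every alignment ends in a column that pairs two residues, or puts a residue of the first sequence against a gap, or puts a residue of the second sequence against a gap. The function name is hypothetical.

```python
def count_alignments(n, m):
    """Number of distinct alignments of sequences of length n and m."""
    # table[i][j] = number of alignments of a length-i prefix with a length-j prefix.
    # Aligning anything against an empty prefix can be done in exactly one way.
    table = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = (
                table[i - 1][j - 1]  # last column pairs two residues
                + table[i - 1][j]    # last column: residue of sequence 1 against a gap
                + table[i][j - 1]    # last column: residue of sequence 2 against a gap
            )
    return table[n][m]


counts = {n: count_alignments(n, n) for n in (1, 2, 3, 9, 17)}
```

The growth is fast enough that listing every alignment is hopeless for sequences of realistic length, which is the motivation for the dynamic programming approach that follows.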

Needleman-Wunsch algorithm (paraphrased)
- Work from the top left (the beginning of both sequences).
- For each cell, store the highest score possible for that cell and a back pointer to the previous step in the best path.
- When you reach the lower right corner, you know the optimal score, and the back pointers tell you the alignment.
- The highest-score calculation at each cell depends only on the cell's three possible previous neighbors.
- If one sequence is length N, and the other is length M, then Needleman-Wunsch takes only about 6NM calculations, even though there is a much larger number of possible alignments.

[Figures: the score matrix filled in one cell at a time. Columns correspond to Pongo residues (V D E V ...) and rows to Gorilla residues (V E V ...). The top-left cell is 0, and each step along the first row or first column adds a gap penalty of -5.]

The matrix after the first eight Pongo columns have been filled in (residue labels omitted because some are not legible):

    0   -5  -10  -15  -20  -25  -30  -35  -40
   -5    4   -1   -6  -11  -16  -21  -26  -31
  -10   -1    6    4   -1   -6  -11  -16  -21
  -15   -6    1    4    8    3   -2   -7  -12
  -20  -11   -4    0    4    8    3   -2   -7
  -25  -16   -9   -5   -1   10   14    9    4
  -30  -21  -10   -7   -6    5    9   16   11
  -35  -26  -15  -12   -6    0    4   11   20
  -40  -31  -20  -17  -11    0    6    6   15
  -45  -36  -25  -20  -16   -5    1    6   10
  -50  -41  -30  -25  -19  -10   -4    1   10
  -55  -46  -35  -30  -24  -15   -9   -4    5
  -60  -51  -40  -35  -27  -20  -14   -9    0
  -65  -56  -45  -40  -31  -25  -19  -14   -5
  -70  -61  -50  -45  -36  -30  -24  -19  -10
  -75  -66  -55  -50  -41  -35  -29  -24  -15
  -80  -71  -60  -55  -46  -40  -34  -29  -20
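The fill-and-traceback procedure paraphrased above can be sketched as follows. This is a minimal illustration with a linear gap penalty of -5 per position, as in the worked matrix. The function names are hypothetical, and the scoring table is a hand-copied subset of BLOSUM 62 covering only the residues V, D, and E, so that the example reproduces the first cells of the matrix (4 for V against V, then 6 and 4 in the next row).

```python
# Subset of BLOSUM 62 scores, keyed by alphabetically sorted residue pairs.
SCORES = {
    ("V", "V"): 4, ("D", "D"): 6, ("E", "E"): 5,
    ("D", "V"): -3, ("E", "V"): -2, ("D", "E"): 2,
}


def pair_score(x, y):
    """Look up the substitution score for two residues (order does not matter)."""
    return SCORES[tuple(sorted((x, y)))]


def needleman_wunsch(a, b, score, gap):
    """Return (best score, aligned a, aligned b) for a global alignment."""
    n, m = len(a), len(b)
    best = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: residues of a against gaps
        best[i][0], back[i][0] = i * gap, "up"
    for j in range(1, m + 1):          # first row: residues of b against gaps
        best[0][j], back[0][j] = j * gap, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            choices = [
                (best[i - 1][j - 1] + score(a[i - 1], b[j - 1]), "diag"),
                (best[i - 1][j] + gap, "up"),
                (best[i][j - 1] + gap, "left"),
            ]
            best[i][j], back[i][j] = max(choices, key=lambda c: c[0])
    # Follow the back pointers from the last cell to the first.
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif move == "up":
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# First three Pongo residues against the first two Gorilla residues.
result = needleman_wunsch("VDE", "VE", pair_score, gap=-5)
```

For this prefix the best alignment puts a gap opposite the D, giving `VDE` over `V-E` with a score of 4 - 5 + 5 = 4. Ties between moves are broken in favor of the diagonal here, which is an arbitrary choice.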

Aligning multiple sequences
[Figure: five sequences labeled A, B, C, D, and E.]

Progressive alignment
Devised by Feng and Doolittle, 1987, and Higgins and Sharp, 1988.
An approximate method for producing multiple sequence alignments using a guide tree.
1. Perform pairwise alignments to produce a distance matrix.
2. Produce a guide tree from the distances.
3. Use the guide tree to specify the ordering used for aligning sequences, closest to furthest.

Sequences (some residue letters are not legible):
A  PEEKSVLWKVNVDEV
B  EEKVLLWDKVNEEEV
C  PDKNVKWKVHEY
D  DKNVKWSKVHEY
E  EHEWQLVLHVWKVEDVHQ

Pairwise alignment gives a distance matrix:
     A     B     C     D     E
A    -
B   .17    -
C   .59   .60    -
D   .59   .59   .13    -
E   .77   .77   .75   .75    -

Tree inference gives a guide tree.
[Figure: guide tree for A, B, C, D, and E.]

Alignment stage:
A  PEEKSVLWKVN--VDEV
B  EEKVLLWDKVN--EEEV
C  PDKNVKWKVHEY
D  DKNVKWSKVHEY
E  EHEWQLVLHVWKVEDVHQ
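How the distance matrix determines the order of alignment ("closest to furthest") can be sketched with a simple clustering pass. The distances are the ones on the slide, with the taxon labels taken to be A to E. The tree method is an assumption: the slides do not say which one is used, so UPGMA-style average linkage stands in here, and the function name is hypothetical.

```python
from itertools import combinations

# Pairwise distances from the slide, keyed by unordered taxon pairs.
DIST = {
    frozenset("AB"): 0.17,
    frozenset("AC"): 0.59, frozenset("BC"): 0.60,
    frozenset("AD"): 0.59, frozenset("BD"): 0.59, frozenset("CD"): 0.13,
    frozenset("AE"): 0.77, frozenset("BE"): 0.77,
    frozenset("CE"): 0.75, frozenset("DE"): 0.75,
}


def guide_tree_join_order(names, dist):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = [(name,) for name in names]

    def average_distance(c1, c2):
        # Mean of all pairwise distances between members of the two clusters.
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    order = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average_distance(*p))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        order.append((c1, c2))
    return order


joins = guide_tree_join_order("ABCDE", DIST)
```

Under these assumptions the joins are C with D, then A with B, then the two pairs with each other, and finally E, so the alignment stage would run two sequence-to-sequence alignments, one group-to-group alignment, and one sequence-to-group alignment.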

Alignment stage of progressive alignments
- Sequences of clades become grouped into "profiles" as the algorithm descends the tree.
- The next youngest internal node is selected at each step to create a new profile.
- Alignment at each step involves one of:
  - Sequence-Sequence
  - Sequence-Profile
  - Profile-Profile

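Aligning multiple sequences
[Figure: guide tree for sequences A-E with branch lengths between 0.09 and 0.27. Each internal node is labeled with the kind of alignment performed there: Seq-Seq, Seq-Seq, Seq-Group, and Group-Group.]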

Profile to Profile alignment
[Figure: two profiles being aligned; some residue letters are not legible.]
V E V D L R L L I Y P S R
V E D E V L M R L F V P Q
L D D E V - V R L F V P Q
V E I D L - - L L L Y P R V
V E V E L - - L L L Y P K I

Profile to profile alignments
- Adding a gap to a profile means that every member of that group of sequences gets a gap at that position of the sequence.
- Usually the scores for each edge in the Needleman-Wunsch graph are calculated using a "sum of pairs" scoring system.
- ClustalW (Thompson, Higgins, and Gibson, Nucleic Acids Res., 1994) uses weights assigned to each sequence in a profile group to downweight closely related sequences so that they are not overrepresented.

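Profile 1
taxon    Seq weight    residue
A        0.3           V
C        0.24          (not legible)
E        0.19          I

Profile 2
taxon    Seq weight    residue
B        0.15          V
D        0.25          M

$$D_{P_1,P_2} = \sum_i \sum_j \frac{w_i w_j d_{ij}}{n_i n_j}$$

$$= \frac{1}{6}\left[ d(V,V)\,w_A w_B + d(V,M)\,w_A w_D + d(x_C,V)\,w_C w_B + d(x_C,M)\,w_C w_D + d(I,V)\,w_E w_B + d(I,M)\,w_E w_D \right]$$

$$= \frac{1}{6}\left( 4 \times 0.3 \times 0.15 + 1 \times 0.3 \times 0.25 + 0 \times 0.24 \times 0.15 + 1 \times 0.24 \times 0.25 + 3 \times 0.19 \times 0.15 + 1 \times 0.19 \times 0.25 \right)$$

$$= 1.46225$$

where $x_C$ is the residue of taxon C, $n_i = 3$ is the number of sequences in Profile 1, and $n_j = 2$ is the number in Profile 2.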

Dealing with alignment ambiguity: Elision
[Figure from M. S. Y. Lee, TREE, 2001 (Opinion, TRENDS in Ecology & Evolution Vol.16 No.12 December 2001, p. 682): panels (a)-(e) show alternative alignments of regions X, Y, and Z (positions 1-12) for the Outgroup and Taxa A-E, and the alignments combined ("concatenated") into a single data set.]
In the elision method, a range of plausible alignments is generated and, instead of being analysed separately, they are combined and evaluated in a single analysis.

Dealing with alignment ambiguity: Deletion
[Figure from M. S. Y. Lee, TREE, 2001: panels (a)-(e) show alignments of regions X, Y, and Z (positions 1-12) for the Outgroup and Taxa A-E; in one panel some cells are shown as "?".]

Dealing with alignment ambiguity
Elision method (Wheeler, 1995) involves simply concatenating matrices.
[Figure from M. S. Y. Lee, TREE, 2001: further panels showing alignments of regions X, Y, and Z for the Outgroup and Taxa A-E. Panel (g) is a matrix of costs for changes between states 1, 2, and 3:]
From state    To state 1    To state 2    To state 3
1             -             4             3
2             4             -             3
3             3             3             -

Simultaneous tree inference and alignment
- Ideally we would address uncertainty in both types of inference at the same time.
- Allows for application of statistical models to improve inference and assessments of reliability.
- Just now becoming feasible: POY (Wheeler, Gladstein, Laet, 2002), Handel (Holmes and Bruno, 2001), BAli-Phy (Redelings and Suchard, 2005), and BEAST (Lunter et al., 2005, Drummond and Rambaut, 2003). SATé (Liu et al., 2009; Yu and Holder software).