Lecture 10: Phylogeny

Similar documents
Phylogeny Jan 5, 2016

Phylogenetic Tree Reconstruction

Evolutionary Tree Analysis. Overview

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)


Molecular Evolution and Phylogenetic Tree Reconstruction

Algorithms in Bioinformatics

CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary

Constructing Evolutionary/Phylogenetic Trees

EVOLUTIONARY DISTANCES

Phylogenetic trees 07/10/13

BINF6201/8201. Molecular phylogenetic methods

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

CSCI1950 Z Computa4onal Methods for Biology Lecture 5

Phylogenetics. BIOL 7711 Computational Bioscience

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Theory of Evolution Charles Darwin

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Phylogeny and Evolution. Gina Cannarozzi ETH Zurich Institute of Computational Science

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Hierarchical Clustering

Phylogenetics: Building Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees

Dr. Amira A. AL-Hosary

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Phylogeny: building the tree of life

What is Phylogenetics

CS5238 Combinatorial methods in bioinformatics 2003/2004 Semester 1. Lecture 8: Phylogenetic Tree Reconstruction: Distance Based - October 10, 2003

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Phylogenetic inference

Lecture Notes: Markov chains

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

BMI/CS 776 Lecture 4. Colin Dewey

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Inferring Phylogenetic Trees. Distance Approaches. Representing distances. in rooted and unrooted trees. The distance approach to phylogenies

Chapter 3: Phylogenetics

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Phylogeny: traditional and Bayesian approaches

Phylogeny Tree Algorithms

Michael Yaffe Lecture #5 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D

Molecular Phylogenetics (part 1 of 2) Computational Biology Course João André Carriço

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

A (short) introduction to phylogenetics

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Plan: Evolutionary trees, characters. Perfect phylogeny Methods: NJ, parsimony, max likelihood, Quartet method

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

Phylogenetic Networks, Trees, and Clusters

Substitution = Mutation followed. by Fixation. Common Ancestor ACGATC 1:A G 2:C A GAGATC 3:G A 6:C T 5:T C 4:A C GAAATT 1:G A

Effects of Gap Open and Gap Extension Penalties

Tools and Algorithms in Bioinformatics

DNA Phylogeny. Signals and Systems in Biology Kushal EE, IIT Delhi

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline

MOLECULAR EVOLUTION AND PHYLOGENETICS SERGEI L KOSAKOVSKY POND CSE/BIMM/BENG 181 MAY 27, 2011

Letter to the Editor. Department of Biology, Arizona State University

Lecture 4. Models of DNA and protein change. Likelihood methods

Molecular phylogeny - Using molecular sequences to infer evolutionary relationships. Tore Samuelsson Feb 2016

Consistency Index (CI)

Theory of Evolution. Charles Darwin

Reconstruire le passé biologique modèles, méthodes, performances, limites

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

AN EXACT SOLVER FOR THE DCJ MEDIAN PROBLEM

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Reading for Lecture 13 Release v10

Early History up to Schedule. Proteins DNA & RNA Schwann and Schleiden Cell Theory Charles Darwin publishes Origin of Species

Phylogenetics: Parsimony and Likelihood. COMP Spring 2016 Luay Nakhleh, Rice University

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

molecular evolution and phylogenetics

The Generalized Neighbor Joining method

A Phylogenetic Network Construction due to Constrained Recombination

How to read and make phylogenetic trees Zuzana Starostová

Algorithmic Methods Well-defined methodology Tree reconstruction those that are well-defined enough to be carried out by a computer. Felsenstein 2004,

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

A Minimum Spanning Tree Framework for Inferring Phylogenies

Recent Advances in Phylogeny Reconstruction

Inferring phylogeny. Constructing phylogenetic trees. Tõnu Margus. Bioinformatics MTAT

(Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

66 Bioinformatics I, WS 09-10, D. Huson, December 1, Evolutionary tree of organisms, Ernst Haeckel, 1866

X X (2) X Pr(X = x θ) (3)

arxiv: v1 [q-bio.pe] 1 Jun 2014

CS5263 Bioinformatics. Guest Lecture Part II Phylogenetics

Copyright 2000 N. AYDIN. All rights reserved. 1

Properties of normal phylogenetic networks

Phylogenetics: Likelihood

Phylogenetic analyses. Kirsi Kostamo

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS

C3020 Molecular Evolution. Exercises #3: Phylogenetics

arxiv: v5 [q-bio.pe] 24 Oct 2016

Bioinformatics 1 -- lecture 9. Phylogenetic trees Distance-based tree building Parsimony

Thanks to Paul Lewis and Joe Felsenstein for the use of slides

Probabilistic modeling and molecular phylogeny

Transcription:

Computational Genomics Prof. Ron Shamir & Prof. Roded Sharan School of Computer Science, Tel Aviv University גנומיקה חישובית פרופ' רון שמיר ופרופ' רודד שרן ביה"ס למדעי המחשב,אוניברסיטת תל אביב Lecture 10: Phylogeny 25,27/12/12

Phylogeny Slides: Adi Akavia Nir Friedman s slides at HUJI (based on ALGMB 98) Anders Gorm Pedersen,Technical University of Denmark Sources: Joe Felsenstein Inferring Phylogenies (2004) 1

Phylogeny Phylogeny: the ancestral relationship of a set of species. Represented by a phylogenetic tree leaf branch?? Internal node??? Leaves - contemporary Internal nodes - ancestral Branch length - distance between sequences 2

3

4

5

6

Phylogeny schools classical Classical vs. Modern: Classical - morphological characters Modern - molecular sequences. 7

Trees and Models rooted / unrooted topology / distance binary / general molecular clock 8

To root or not to root? Unrooted tree: phylogeny without direction. 9

Rooting an Unrooted Tree We can estimate the position of the root by introducing an outgroup: a species that is definitely most distant from all the species of interest Proposed root Falcon Aardvark Bison Chimp Dog Elephant 10

A Scally et al. Nature 483, 169-175 (2012) doi:10.1038/nature10842

HOW DO WE FIGURE OUT THESE TREES? TIMES? 12

Dangers of Paralogs Right species distance: (1,(2,3)) Gene Duplication Sequence Homology Caused By: Orthologs - speciation, Paralogs - duplication Xenologs - horizontal (e.g., by virus) Speciation events 1A 2A 3A 3B 2B 1B 13

Dangers of Paralogs Right species distance: (1,(2,3)) If we only consider 1A, 2B, and 3A: ((1,3),2) Gene Duplication Speciation events 1A 2A 3A 3B 2B 1B 14

Type of Data Distance-based Input: matrix of distances between species Distance can be fraction of residue they disagree on, alignment score between them, Character-based Examine each character (e.g., residue) separately 15

Distance Based Methods 16

Tree based distances d(i,j)=sum of arc lengths on the path i j Given d, can we find an exactly matching tree? An approximately matching tree? 17

The Problem Input: matrix d of distances between species Goal: Find a tree with leaves=chars and edge distances, matching d best. Quality measure: sum of squares: SSQ( T ) i j i t ij : distance in the tree = w ( d ij ij t ij 2 ) w ij : pair weighting. Options: (1) 1 (2) 1/d ij (3) 1/d ij 2 NP-hard (Day 86). We ll describe common heuristics 18

UPGMA Clustering (Sokal & Michener 1958) (Unweighted pair-group method with arithmetic mean) Approach: Form a tree; closer species according to input distances should be closer in the tree Build the tree bottom up, each time merging two smaller trees All leaves are at same distance from the root 19

Efficiency lemma Approach: gradually form clusters: sets of species Repeatedly identify two clusters and merge them. For clusters C i C j, define the distance between them to be the average dist betw their members: 1 d ( C = i, C j ) d ( p, q) C C i p C i q C j Lemma: If C k is formed by merging C i and C j then for every other cluster C l d(c k,c l ) = ( C i *d(c i,c l ) + C j *d(c j,c l )) / ( C i + C j ) Can update distances between clusters in time prop. to the number of clusters. j 20

UPGMA algorithm Initialize: each node is a cluster Ci={i}. d(ci,cj)=d(i,j) set height(i) = 0 i Iterate: Find Ci,Cj with smallest d(ci,cj) Introduce a new cluster node Ck that replaces Ci and Cj //Ck represents all the leaves in clusters Ci and Cj Introduce a new tree node Aij with height(aij) = d(ci,cj)/2 // d(ci,cj) is the average dist among leaves of Ci and Cj Connect the corresponding tree nodes Ci,Cj to Aij with length(ci, Aij) = height(aij) - height(ci) length(cj,aij) = height(aij) - height(cj) For all other Cl: d(ck,cl) = ( Ci *d(ci,cl) + Cj *d(cj,cl)) / ( Ci + Cj ) //dist to any old cluster is the ave dist between its leaves and leaves in Ci, Cj Time: Naïve: O(n 3 ); Can show O(n 2 logn) (ex.) and O(n 2 ) (harder ex.) 21

UPGMA alg (2) Orange nodes represent the groups of nodes that they replaced, and maintain the average dist of the set from other leaf nodes/clusters A 12345 h 123 A 123 h 12 A 12 h 12 A 45 1 2 3 4 5 3 C 12 C 45 C 123 C 45 1, 09 22 22

UPGMA Algorithm 3 5 4 1 2 1 2 3 5 4 23

UPGMA Algorithm 3 5 4 1 2 1 2 3 5 4 24

UPGMA Algorithm 3 5 4 1 2 1 2 3 5 4 25

UPGMA Algorithm 3 5 4 1 2 1 2 3 5 4 26

UPGMA Algorithm 3 5 4 1 2 1 2 3 5 4 27

Molecular Clock UPGMA implicitly assumes the tree is ultrametric: equal leaf-root distances => common uniform clock 2 3 2 3 4 1 1 4 Works reasonably well for nearby species 28

29 Additivity A weaker additivity assumption: distances between species are the sum of distances between intermediate nodes (even if the tree is not ultrametric) a b c i j k c b k j d c a k i d b a j i d + = + = + = ), ( ), ( ), (

30 Consequences of Additivity Suppose input distances are additive For any three leaves Thus a b c i j k c b k j d c a k i d b a j i d + = + = + = ), ( ), ( ), ( m )), ( ), ( ), ( ( ), ( j i d k j d k i d 2 1 m k d + =

31 Suppose input distances are additive For any four leaves Induced tree wt: W Largest sum must appear twice Consequences of Additivity II b W l j d k i d b W k j d l i d b W l k d j i d = + + = + + = + ), ( ), ( ), ( ), ( ), ( ), ( a b c i j k l d e

Consequences of Additivity III If we can identify neighbor leaves, then can use pairwise distances to reconstruct the tree: Remove neighbors i, j from the leaf set Add k i Set d km = (d im + d jm d ij )/2 m But can we find neighbor leaves? k j b 32

Closest pairs may not be neighbors! Closest pair: kj even though they are not neighbors k 1 1 1 j 4 4 i l 33

Neighbor Joining (Saitou-Nei 87) Let where D i, j ) = d ( i, j ) ( r i + r ) ( j 1 r = i d( i, k) L 2 k Corrected average distance of i from all other nodes Theorem: if D(i,j) is minimum among all pairs of leaves, then i and j are neighbors in the tree 34

35 Neighbor Joining Set L to contain all leaves Iteration: Choose i,j such that D(i,j) is minimum Create new node k, and set remove i,j from L, and add k Update r, D Termination: when L =2 connect the two nodes 2 )) /, ( ), ( ), ( ( ), ( 2 ) / ), ( ( ), ( 2 ) / ), ( ( ), ( j i d m j d m i d m k d r r j i d k j d r r j i d k i d i j j i + = + = + = Thm: Opt tree guaranteed if distances match a tree Does not assume a clock Time: O(n 3 ) = k i k i d L r ), ( 2 1 ) ( ), ( ), ( j r i r j i d j i D + = j m i k

Neighbor Joining Algorithm CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS A B C D A - 17 21 27 B - 12 18 C - 14 D -

Neighbor Joining Algorithm CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS A B C D A - 17 21 27 B - 12 18 C - 14 D - i u i A ( 17+21+27)/2=32. 5 B ( 17+12+18)/2=23. 5 C ( 21+12+14)/2=23. 5 D ( 27+18+14)/2=29. 5

Neighbor Joining Algorithm CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS A B C D A - 17 21 27 B - 12 18 C - 14 D - A B C D A - - 39-35 - 35 B - - 35-35 C - - 39 D - i u i A ( 17+21+27)/2=32. 5 B ( 17+12+18)/2=23. 5 C ( 21+12+14)/2=23. 5 D ( 27+18+14)/2=29. 5 D ij - u i - u j

Neighbor Joining Algorithm CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS A B C D A - 17 21 27 B - 12 18 C - 14 D - A B C D A - - 39-35 - 35 B - - 35-35 C - - 39 D - i u i A ( 17+21+27)/2=32. 5 B ( 17+12+18)/2=23. 5 C ( 21+12+14)/2=23. 5 D ( 27+18+14)/2=29. 5 D ij - u i - u j

Neighbor Joining Algorithm CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS A B C D A - 17 21 27 B - 12 18 C - 14 D - A B C D A - - 39-35 - 35 B - - 35-35 C - - 39 D - i u i A ( 17+21+27)/2=32. 5 B ( 17+12+18)/2=23. 5 C ( 21+12+14)/2=23. 5 D ( 27+18+14)/2=29. 5 C D X D ij - u i - u j

Neighbor Joining Algorithm CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS A B C D A - 17 21 27 B - 12 18 C - 14 D - A B C D A - - 39-35 - 35 B - - 35-35 C - - 39 D - D ij - u i - u j i u i A ( 17+21+27)/2=32. 5 B ( 17+12+18)/2=23. 5 C ( 21+12+14)/2=23. 5 D ( 27+18+14)/2=29. 5 C 4 10 X D v C = 0. 5 x 14 + 0. 5 x ( 23. 5-29. 5) = 4 v D = 0. 5 x 14 + 0. 5 x ( 29. 5-23. 5) = 10

Neighbor Joining Algorithm CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS A B C D X A - 17 21 27 B - 12 18 C - 14 D - X - C 4 10 X D

Neighbor Joining Algorithm CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS A B C D X A - 17 21 27 B - 12 18 C - 14 D - X - D XA = ( D CA + D DA - D CD )/2 = ( 21 + 27-14)/2 = 17 D XB = ( D CB + D DB - D CD )/2 = ( 12 + 18-14)/2 = 8 C 4 10 X D

Neighbor Joining Algorithm CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS A B C D X A - 17 21 27 17 B - 12 18 8 C - 14 D - X - D XA = ( D CA + D DA - D CD )/2 = ( 21 + 27-14)/2 = 17 D XB = ( D CB + D DB - D CD )/2 = ( 12 + 18-14)/2 = 8 C 4 10 X D

Neighbor Joining Algorithm CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS A B X A - 17 17 B - 8 X - D XA = ( D CA + D DA - D CD )/2 = ( 21 + 27-14)/2 = 17 D XB = ( D CB + D DB - D CD )/2 = ( 12 + 18-14)/2 = 8 C 4 10 X D

Neighbor Joining Algorithm CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS A B X A - 17 17 B - 8 X - i C 4 10 X D u i A ( 17+17)/1 = 34 B ( 17+8) /1 = 25 X ( 17+8) /1 = 25

Neighbor Joining Algorithm CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS A B X A - 17 17 B - 8 X - A B X A - - 42-28 B - - 28 X - i C 4 10 X D u i A ( 17+17)/1 = 34 B ( 17+8) /1 = 25 X ( 17+8) /1 = 25 D ij - u i - u j

Neighbor Joining Algorithm CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS A B X A - 17 17 B - 8 X - A B X A - - 42-28 B - - 28 X - i C 4 10 X D u i A ( 17+17)/1 = 34 B ( 17+8) /1 = 25 X ( 17+8) /1 = 25 D ij - u i - u j

Neighbor Joining Algorithm CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS A B X A - 17 17 B - 8 X - A B X A - - 42-28 B - - 28 X - D ij - u i - u j i C 4 10 X D u i A ( 17+17)/1 = 34 B ( 17+8) /1 = 25 X ( 17+8) /1 = 25 v A = 0. 5 x 17 + 0. 5 x ( 34-25) = 13 v D = 0. 5 x 17 + 0. 5 x ( 25-34) = 4 A 13 Y 4 B

Neighbor Joining Algorithm A B X Y CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS A - 17 17 B - 8 X - Y C 4 10 X D A 13 Y 4 B

Neighbor Joining Algorithm CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS A B X Y A - 17 17 B - 8 X - 4 Y D YX = ( D AX + D BX - D AB )/2 = ( 17 + 8-17)/2 = 4 C 4 10 X D A 13 Y 4 B

Neighbor Joining Algorithm CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS X Y X - 4 Y - D YX = ( D AX + D BX - D AB )/2 = ( 17 + 8-17)/2 = 4 C 4 10 X D A 13 Y 4 B

Neighbor Joining Algorithm CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS X Y X - 4 Y - D YX = ( D AX + D BX - D AB )/2 = ( 17 + 8-17)/2 = 4 C 4 10 D A 13 4 B 4

Neighbor Joining Algorithm CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS A B C D A - 17 21 27 B - 12 18 C - 14 D - D 10 C 4 4 4 B 13 A

Division of Population Genetics, National Institute of Genetics & Department of Genetics, Graduate University for Advanced Studies (Sokendai) Mishima, 411-8540, Japan Naruya Saitou

Probabilistic approaches 56

Likelihood of a Tree Given: n aligned sequences M= X 1,,X n A tree T, leaves labeled with X 1,,X n reconstruction t: labeling of internal nodes branch lengths Goal: Find optimal reconstruction t* : One maximizing the likelihood P(M T,t*) 57

Likelihood (2) We need a model for computing P(M T,t*) Assumptions: Each character is independent The branching is a Markov process: The probability that a node x has a specific label is only a function of the parent node y and the branch length t between them. The probabilities P(x y,t) are known 58

59 Example x 1 x 2 x 3 x 4 x 5 t 1 t 2 t 3 t 5 ) ( ), ( ), ( ), ( ), ( *),,..., ( 5 4 5 4 3 5 3 2 4 2 1 4 1 5 1 x P t x x P t x x P t x x P t x x P t T x x P =

= Calculating the Likelihood Assume that the branch lengths t uv are known. P(M T, character t*) = character j reconstrcuction P( root) P( M j j reconstrcuction R edge u v R, R T, t*) p u v ( t uv ) Independence of sites Markov property independence of each branch 60

Additional Assumed Properties Additivity: P ( a b, t) P( b c, s) = P( a c, s + t) b Reversibility (symmetry): q P( a b, t) q P( b a, t) b = Allows one to freely move the root a Provable under broad and reasonable assumptions 61

Jukes-Cantor Model (J-K 69) Assumptions: Each base in sequence has equal chance of changing Changes to other 3 bases with equal probability Characteristics Each base appears with equal frequency in DNA The quantity r is the rate of change NOTE: r is not the rate of mutation, it is the rate of substitution event of a nucleotide throughout a population. During each infinitesimal time t a substitution occurs with probability 3r t 62

a=r 63

Jukes-Cantor Model P A(t) : prob of A in the character at time t Discrete case: Prob. change A B in one time unit is r P A(t+1) = (1-3r) P A(t) + r (1- P A(t) ) P A(t+1) - P A(t) = -3r P A(t) + r (1- P A(t) ) Continuous case: P A(t) = -3r P A(t) + r (1- P A(t) )= -4r P A(t) + r Solution to the differential equation: P A(t) = 1 4 rt (1 + 3e 4 ) 64

prob. that the nucleotide remains unchanged 1 rt over t time units: P 3 4 same = + e 4 4 Probability of specific change: Probability of change: Note: For t P change 3 4 P A B = 1 1 4rt 4 e 4 3 3 4rt P change = e 4 4 65

Other Models Kimura s 2-parameter model: A,G - purines; C,T - pyrmidines Two different rates purine-purine or pyrmidine-pyrimidine (transitions) purine-pyrmidine or pyrmidine-purine (transversions) Felsenstein 84 and Yano, Hasegawa & Kishino 85 extend the Kimura model to asymmetric base frequencies. C T A G 66

Charles Cantor Boston University Professor, Biomedical Engineering Professor of Pharmacology, School of Medicine Center for Advanced Biotechnology Ph.D., Biophysical Chemistry, University of California, Berkeley Dr. Cantor is currently Chief Scientific Officer at Sequenom, Inc in San Diego, California. 67

Efficient Likelihood Calculation (Felsenstein 73) Use dynamic programming DefineS j (a,v) = Pr(subtree rooted in v v j = a) Initialization: leafv set S j (a,v) = 1 if v is labeled by a, else S j (a,v) = 0 Recursion: Traverse the tree in postorder: for each node v with children u and w, for each state x S j ( x, v) Final Soln: = S j ( y, u) px y ( tvu ) S y y j ( y, w) p L = S j ( x, root) P( x) j x x y ( t vw ) Complexity: O(nmk 2 ) n species, m chars, k states 68

69

Finding Optimal Branch lengths Optimizing the length of a single root-child branch r-v can be done using standard optimization techniques log L m = log j= 1 x, y p( x) S" j ( x, r) p x y ( t rv ) S j ( y, v) Under the symmetry assumption, each node can be made (temporarily) the root To heuristically optimize all the branch lengths: repeatedly optimize one branch at a time No guaranteed convergence, but often works 70

Character Based Methods 71

Inferring a Phylogenetic Tree Generic problem: Optimal Phylogenetic Tree: Input: n species, set of characters, for each species, the state of each of the characters. (parameters) Goal: find a fully-labeled phylogenetic tree that best explains the data. (maximizes a target function). Assumptions: characters - mutually independent species evolve independently A: CAGGTA B: CAGACA C: CGGGTA D: TGCACT E: TGCGTA A B C D E 72

A Simple Example Five species, three have C and two T at a specified position. A minimal tree has one evolutionary change: C T C C T C C T T C 73

Inferring a Phylogenetic Tree Naive Solution - Enumeration: No. of non-isomorphic, labeled, binary, unrooted trees, containing n leaves: (2n -5)!! = Π i=3 n (2i-5) Rooted: x(2n-3) for n=20 this is 10 21! n leaves n-2 internal nodes 2n-3 edges Pf: induction on n a b Each new species adds 2 new edges a b c a c b b c a Adding 3rd species 74

Parsimony Goal: explain data with min. no. of evolutionary changes ( mutations, or mismatches) Parsimony: S(T) Σ j Σ {v, u} E(T) {j: v j u j } Small parsimony problem : Input: leaf sequences + a leaf-labeled tree T Goal: Find ancestral sequences implying minimum no. of changes (most parsimonious) Large parsimony problem : Input: leaf sequences Gaol: Find a most parsimonious tree (topology, leaf labeling and internal seqs.) 75

Algorithm for the Small Parsimony Problem (Fitch `71) Initialization: scan T in post-order, assign: leaf vertex m: S m ={state at node m} internal node m with children l, r : S m = S l S r if S l S r =φ (i) S l S r o/w (ii) Solution Construction: scan tree in preorder, choose: for the root choose x S root at node m with child k: pick same state as k if possible; o/w - pick arbitrarily cost = no. of case (i) events. time: O(n k) (k = #states s) 76

Fitch s Alg T T,C T >C C A,T T >Α T C C A T T T 77

Walter Fitch May 21, 1929 - March 10, 2011 One of the most influential evolutionary biologists in the world, who established a new scientific field: molecular phylogenetics. He was a member in the National Academy of Sciences, the American Academy of Arts and Sciences and the American Philosophical Society. He co-founded and was the first president of the Society for Molecular Biology, which established the annually awarded Fitch Prize. Additionally, he was a founding editor of Molecular Biology and Evolution. Fitch was at the University of California, Irvine, until his death, preceded by three years at the University of Southern California and 24 years University of Wisconsin Madison. 78

Weighted Parsimony k states S 1,, S k C ij = cost of changing from state i to j Algorithm (Sankoff-Cedergren 88): Need: S k (x) best cost for the subtree rooted at x if state at x is k For leaf x, S k (x) = 0 if state of x is k o/w Scan T in postorder. At node a with children l, r S k (a) = min m (S m (l)+ C mk ) + min m (S m (r)+ C mk ) Opt=min m (S m (root)) time: O(n k 2 ) 79

C G C C 80

David Sankoff Over the past 30 years, Sankoff formulated and contributed to many of the fundamental problems in computational biology. In sequence comparison, he introduced the quadratic version of the Needleman-Wunsch algorithm, developed the first statistical test for alignments, initiated the study of the limit behavior of random sequences with Vaclav Chvatal and described the multiple alignment problem, based on minimum evolution over a phylogenetic tree. In the study of RNA secondary structure, he developed algorithms based on general energy functions for multiple loops and for simultaneous folding and alignment, and performed the earliest studies of parametric folding and automated phylogenetic filtering. Sankoff and Robert Cedergren collaborated on the first studies of the evolution of the genetic code based on trna sequences. His contributions to phylogenetics include early models for horizontal transfer, a general approach for optimizing the nodes of a given tree, a method for rapid bootstrap calculations, a generalization of the nearest neighbor interchange heuristic, various constraint, consensus and supertree problems, the computational complexity of several phylogeny problems with William Day, and a general technique for phylogenetic invariants with Vincent Ferretti. Over the last fifteen years he has focused on the evolution of genomes as the result of chromosomal rearrangement processes. Here he introduced the computational analysis of genomic edit distances, including parametric versions, the distribution of gene numbers in conserved segments in a random model with Joseph Nadeau, phylogeny based on gene order with Mathieu Blanchette and David Bryant, generalizations to include multigene families, including algorithms for analyzing genome duplication and hybridization with Nadia El-Mabrouk, and the statistical analysis of gene clusters with Dannie Durand. Sankoff is also well known in linguistics for his methods of studying grammatical variation and 81 change in speech communities, the quantification of discourse analysis and production models of bilingual speech.

Large Parsimony Problem Input: n x m matrix M: M ij = state of j th character of species i. M i = label of i (all labels are distinct) Goal: Construct a phylogenetic tree T with n leaves and a label for each node, s.t. 1-1 correspondence of leaves and labels cost of tree is minimum. NP-hard 82

Branch & Bound (Hendy-Penny 89) enumerate all unrooted trees with increasing no. of leaves Note: cost of tree with all leaves cost of subtree with some leaves pruned (and same labeling) => If cost of subtree best cost for full tree so far, then: can prune (ignore) all refinements of the subtree. enumeration & pruning can be done in O(1) time per visited subtree. 83

Branch swapping C D C D A B B A Each internal edge defines 4 sub-trees: Can swap two such non-adjacent sub-trees 84

Nearest Neighbor Interchanges handles n-labeled trees T and T are neighbors if one can get T by following operation on T: S R S R U V S U V U R V SV RU SU RV SR UV Use the neighborhood structure on the set of solutions (all trees) via hill climbing, annealing, other heuristics... 85

NN interchange graph 86

Wilson & Cann, Sci Amer 92 87

FIN 88