EVOLUTIONARY DISTANCES FROM STRINGS TO TREES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 14 th November 2007
OUTLINE 1 STRINGS: DISTANCES AND EVOLUTION 2 EVOLUTIONARY MODELS Examples 3 INFERRING PHYLOGENIES
OUTLINE 1 STRINGS: DISTANCES AND EVOLUTION 2 EVOLUTIONARY MODELS Examples 3 INFERRING PHYLOGENIES
DIGITAL MOLECULES DNA DNA can be considered as a very long string over an alphabet of 4 bases (A, C, G, T ). This encyclopedia stores genetic information in volumes (chromosomes), with interesting chapters (genes), reading instructions (regulatory elements) and less interesting material (junk DNA).
GENES ENCODE PROTEINS The same gene can be present in different organisms, but with variations: the same chapter can be written in French, in Italian, in English,... HOW CAN WE MEASURE THE DISTANCE BETWEEN TWO GENES? Genes are strings of DNA: we can count the differences (Hamming distance). A C C T G T T A G C A A C T G G T A C C Actually, we should use edit distance and construct an alignment between strings.
GENES ENCODE PROTEINS The same gene can be present in different organisms, but with variations: the same chapter can be written in French, in Italian, in English,... HOW CAN WE MEASURE THE DISTANCE BETWEEN TWO GENES? Genes are strings of DNA: we can count the differences (Hamming distance). A C C T G T T A G C A A C T G G T A C C Actually, we should use edit distance and construct an alignment between strings.
HOW EVOLUTION ACTS ON DNA? EVOLUTIONARY EVENTS Evolution can modify DNA in several ways: Local pointwise mutations can substitute, delete or insert a base somewhere. Entire DNA fragments can be deleted or duplicated, possibly reversed in their order. Bigger pieces of DNA can be swapped or inverted. Entire genomes can be duplicated. MUTATIONS HAPPEN RANDOMLY!!! OUR FOCUS For simplicity, we focus simply on pointwise substitution events.
OUR SCENARIO The scenario is the following: consider two species (human and chimp) evolved from a common ancestor (some old primate). As the ancestor evolved to human or chimp, his DNA mutated pointwise in some positions, chosen randomly. evolutionary distance = number of mutations
DOES HAMMING DISTANCE COUNT THE NUMBER OF MUTATIONS? Consider the following situation: A C; A G C; A C G C The same observation A, C can corresponds to different evolutionary histories. Hamming distance ignores multiple substitutions in a site. Moreover: A G A; A C G A Hamming distance ignores back-mutation! It underestimates the number of mutations. CORRECTING DISTANCES The strategy is to develop a stochastic model of DNA evolution, and use it to correct the observed distance to account for multiple substitutions in a site.
OUTLINE 1 STRINGS: DISTANCES AND EVOLUTION 2 EVOLUTIONARY MODELS Examples 3 INFERRING PHYLOGENIES
A SIMPLE MODEL OF NUCLEOTIDE EVOLUTION HYPOTHESIS Time evolves continuously; Each site can be substituted independently; the rate of substitutions (expected frequency per unit of time) does not change in time (homogeneity); The rate of change from base i to base j does not depend on the mutation history of the site (memoryless property). CONSEQUENCES Happening time of a single mutation event is modeled by an exponential distribution. Number of mutations is modeled by a Poisson process.
MARKOV PROCESSES If we consider all possible mutations (from A to C, G, T and so on), we end up with a matrix of rates and with a time-homogeneous continuous time Markov Chain. FURTHER SIMPLIFYING HYPOTHESIS Frequencies are in equilibrium: π A, π C, π G, π T (stationary chain). The process is time reversible: π i P ij (t) = π j P ji (t). RATE MATRIX Under the previous hypothesis, the Q-matrix decomposes in q ij = R ij π j R is a symmetric matrix π are the stationary frequencies (solution of Qπ = 0) For nucleotide substitution models, we have 6+3 parameters to set.
MARKOV PROCESSES If we consider all possible mutations (from A to C, G, T and so on), we end up with a matrix of rates and with a time-homogeneous continuous time Markov Chain. FURTHER SIMPLIFYING HYPOTHESIS Frequencies are in equilibrium: π A, π C, π G, π T (stationary chain). The process is time reversible: π i P ij (t) = π j P ji (t). RATE MATRIX Under the previous hypothesis, the Q-matrix decomposes in q ij = R ij π j R is a symmetric matrix π are the stationary frequencies (solution of Qπ = 0) For nucleotide substitution models, we have 6+3 parameters to set.
EXPECTATIONS TOTAL RATE OF CHANGE µ = i q ii π i EXPECTED NUMBER OF CHANGES AFTER TIME t d = µt PROBABILITY OF OBSERVING A SUBSTITUTION AFTER TIME t p = 1 i π i P ii (t) p is also the expected number of observed substitutions per site.
CORRECTING HAMMING DISTANCE 1 Estimate p as ˆp = Hamming distance total length 2 From d = µt and p = 1 i π ip ii (t) deduce p = 1 i π i P ii ( d µ ). 3 Solve the previous formula for d and use the estimate ˆp of p to compute the estimate ˆd.
DIFFERENT EVOLUTIONARY MODELS There are 6 parameters to fix the rate matrix R and 3 to fix the equilibrium frequencies π.
THE JUKES-CANTOR MODEL The Jukes-Cantor model has been published in 1969. It is the simplest model of evolution, Q = assuming R ij = 1 and π i = 1 4. SOLUTION FOR P P(t) = 1 4 Qe t. 3 1 1 4 4 4 1 4 3 1 4 4 1 1 4 4 3 4 1 1 4 4 1 4 1 4 1 4 1 4 3 4 CORRECTION FOR THE DISTANCE d = 3 4 ln ( 1 4 3 ˆp )
OUTLINE 1 STRINGS: DISTANCES AND EVOLUTION 2 EVOLUTIONARY MODELS Examples 3 INFERRING PHYLOGENIES
RECONSTRUCTING HISTORY OF LIFE WHAT MEANS PHYLOGENETIC INFERENCE? All species on Earth come from a common ancestor. If we have data from a pool of species, we wish to reconstruct the history of speciation events that lead to their emergence: We want to find the phylogenetic tree giving this information! This is an hard task, because data is often incomplete (we lack information about most of the ancestor species) and noisy.
METHODS TO INFER PHYLOGENY APPROACHES TO PHYLOGENY Distance-based methods Parsimony methods Likelihood methods Bayesian inference methods DISTANCE-BASED METHODS Given a matrix of pairwise distances, find the tree that explains it better. Several algorithms: UPGMA (clustering methods) Neighbor Joining Fitch-Margolias (sum of squares methods)
AN EXAMPLE: PRIMATES DNA FROM PRIMATES Tarsius Lemur Homo Sapiens Chimp Gorilla Pongo Hylobates Macaco Fuscata AAGTTTCATTGGAGCCACCACTCTTATAATTGCCCATGGCCTCACCTCCT... AAGCTTCATAGGAGCAACCATTCTAATAATCGCACATGGCCTTACATCAT... AAGCTTCACCGGCGCAGTCATTCTCATAATCGCCCACGGGCTTACATCCT... AAGCTTCACCGGCGCAATTATCCTCATAATCGCCCACGGACTTACATCCT... AAGCTTCACCGGCGCAGTTGTTCTTATAATTGCCCACGGACTTACATCAT... AAGCTTCACCGGCGCAACCACCCTCATGATTGCCCATGGACTCACATCCT... AAGCTTTACAGGTGCAACCGTCCTCATAATCGCCCACGGACTAACCTCTT... AAGCTTTTCCGGCGCAACCATCCTTATGATCGCTCACGGACTCACCTCTT... DISTANCE MATRIX 0.00 0.29 0.40 0.39 0.38 0.34 0.38 0.37 0.29 0.00 0.37 0.38 0.35 0.33 0.36 0.34 0.40 0.37 0.00 0.10 0.11 0.15 0.21 0.24 0.39 0.38 0.10 0.00 0.12 0.17 0.21 0.24 0.38 0.35 0.11 0.12 0.00 0.16 0.21 0.26 0.34 0.33 0.15 0.17 0.16 0.00 0.22 0.24 0.38 0.36 0.21 0.21 0.21 0.22 0.00 0.26 0.37 0.34 0.24 0.24 0.26 0.24 0.26 0.00 Tarsius Lemur Homo Sapiens Chimp Gorilla Pongo Hylobates Macaco Fuscata
LEAST SQUARE METHOD We have our observed distance matrix D ij and a tree T with branch lengths predicting an additive distance matrix d ij. TARGET Find the tree T minimizing the error between d ij and D ij, i.e. the tree minimizing the weighted least square sum S(T ) = i,j w ij (D ij d ij ) 2 Given a tree topology, the best branch lengths for S can be computed by solving a linear system. A least square algorithms needs to search the tree space for the best tree T : this is an N P-hard problem. The search for the best tree can use branch and bound methods or heuristic state space explorations. This method gives the best explanation of the data
FITCH-MARGOLIAS ALGORITHM Letting w ij = 1 in S(T ) = Dij 2 i,j w ij(d ij d ij ) 2, we obtain the method of Fitch-Margoliash. Lemur M fuscata Hylobates Pongo Gorilla Pan Homo sap Tarsius M fuscata Hylobates Lemur Pongo Gorilla Pan Homo sap Tarsius
HEURISTIC METHODS: UPGMA HIERARCHICAL CLUSTERING Hierarchical clustering works by iteratively merging the two closest clusters (sets of elements) in the current collection of clusters. It requires a matrix of distances among singletons. Different ways of computing intercluster distances give rise to different HC-algorithms. UPGMA UPGMA (Unweighted Pair Group Method with Arithmetic mean) computes the distance between two clusters as d(a, B) = 1 A B i A,j B d ij. When two clusters A and B are merged, their union is represented by their ancestor node in the tree. The distance between A and B is evenly split between the two branches entering in A and B
UPGMA - II HYPOTHESIS UPGMA reconstructs correctly the tree if the input distance is an ultrametric (molecular clock). Tarsius Lemur Homo sap Pan Gorilla Pongo Hylobates M fuscata
HEURISTIC METHODS: NEIGHBOR-JOINING NEIGHBOR-JOINING Neighbor-Joining works similarly to UPGMA, but it merges together the two clusters minimizing D ij = d ij r i r j, where r i = 1 C 2 k d ik is the average distance of i from all other nodes. When i and j are merged, their new ancestor x has distances from another node k equal to d xk = 1 2 (d ik + d jk d ij ) The branch lengths are d ix = 1 2 (d ij + r i r j ) and d jx = 1 2 (d ij + r j r i ). NJ reconstructs the correct tree if the input distance is additive. Lemur M fuscata Hylobates Hylobates Pongo Gorilla Homo sap Pongo M fuscata Pan Gorilla Homo sap Pan Tarsius Lemur Tarsius
NOT ONLY DNA EVOLVE...
THE END Thanks for the attention!