EVOLUTIONARY DISTANCES


1 EVOLUTIONARY DISTANCES: FROM STRINGS TO TREES
Luca Bortolussi, Dipartimento di Matematica ed Informatica, Università degli Studi di Trieste
Trieste, 14th November 2007
2 OUTLINE
1. STRINGS: DISTANCES AND EVOLUTION
2. EVOLUTIONARY MODELS
   Examples
3. INFERRING PHYLOGENIES
4 DIGITAL MOLECULES: DNA
DNA can be considered as a very long string over an alphabet of 4 bases (A, C, G, T). This encyclopedia stores genetic information in volumes (chromosomes), with interesting chapters (genes), reading instructions (regulatory elements) and less interesting material (junk DNA).
5 GENES ENCODE PROTEINS
The same gene can be present in different organisms, but with variations: the same chapter can be written in French, in Italian, in English, ...
HOW CAN WE MEASURE THE DISTANCE BETWEEN TWO GENES?
Genes are strings of DNA: we can count the positions at which they differ (Hamming distance).
A C C T G T T A G C
A A C T G G T A C C
Actually, we should use edit distance and construct an alignment between the strings.
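The two distances named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function names are hypothetical, the edit distance shown is the plain Levenshtein distance with unit costs, and the example assumes the 20 bases on the slide are two rows of 10.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # match or substitute
        prev = curr
    return prev[-1]


# Assumed reading of the slide: two rows of 10 bases each.
s1, s2 = "ACCTGTTAGC", "AACTGGTACC"
print(hamming(s1, s2), edit_distance(s1, s2))
```

For strings of equal length the edit distance can never exceed the Hamming distance, because substituting every mismatching position is always one admissible edit script.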
7 HOW DOES EVOLUTION ACT ON DNA?
EVOLUTIONARY EVENTS
Evolution can modify DNA in several ways:
Local pointwise mutations can substitute, delete or insert a base somewhere.
Entire DNA fragments can be deleted or duplicated, possibly reversed in their order.
Bigger pieces of DNA can be swapped or inverted.
Entire genomes can be duplicated.
MUTATIONS HAPPEN RANDOMLY!
OUR FOCUS
For simplicity, we focus only on pointwise substitution events.
8 OUR SCENARIO
The scenario is the following: consider two species (human and chimp) evolved from a common ancestor (some old primate). As the ancestor evolved into human or chimp, its DNA mutated pointwise in some positions, chosen randomly.
evolutionary distance = number of mutations
9 DOES HAMMING DISTANCE COUNT THE NUMBER OF MUTATIONS?
Consider the following situation: A → C; A → G → C; A → C → G → C.
The same observation (A, C) can correspond to different evolutionary histories. Hamming distance ignores multiple substitutions at a site.
Moreover: A → G → A; A → C → G → A.
Hamming distance ignores back mutations! It underestimates the number of mutations.
CORRECTING DISTANCES
The strategy is to develop a stochastic model of DNA evolution, and use it to correct the observed distance to account for multiple substitutions at a site.
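A small simulation makes the underestimation concrete. The sketch below is an illustration under stated assumptions, not part of the original slides: substitutions hit uniformly random sites and replace the base with a uniformly chosen different base, and the names `mutate` and `hamming` are hypothetical helpers.

```python
import random

BASES = "ACGT"


def mutate(seq: str, n_mutations: int, rng: random.Random) -> str:
    """Apply n_mutations pointwise substitutions at uniformly random sites."""
    s = list(seq)
    for _ in range(n_mutations):
        site = rng.randrange(len(s))
        # Always replace the current base with a different one.
        s[site] = rng.choice([b for b in BASES if b != s[site]])
    return "".join(s)


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)  # fixed seed so the run is repeatable
ancestor = "".join(rng.choice(BASES) for _ in range(200))
for n in (10, 50, 200, 400):
    descendant = mutate(ancestor, n, rng)
    # Observed differences never exceed the true number of mutations.
    print(n, hamming(ancestor, descendant))
```

Each substitution changes one site, so the observed count is bounded by both the number of mutations and the sequence length; the gap grows as sites are hit repeatedly or mutate back.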
11 A SIMPLE MODEL OF NUCLEOTIDE EVOLUTION
HYPOTHESES
Time evolves continuously.
Each site can be substituted independently.
The rate of substitution (expected frequency per unit of time) does not change in time (homogeneity).
The rate of change from base i to base j does not depend on the mutation history of the site (memoryless property).
CONSEQUENCES
The waiting time until a single mutation event is modeled by an exponential distribution.
The number of mutations is modeled by a Poisson process.
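The link between exponential waiting times and Poisson counts can be checked numerically. This is a rough sketch with made-up values: the rate 0.5 and the time span 4.0 are arbitrary, and `count_events` is a hypothetical helper name.

```python
import random


def count_events(rate: float, t: float, rng: random.Random) -> int:
    """Count events in [0, t] when waiting times are Exponential(rate)."""
    elapsed, n = rng.expovariate(rate), 0
    while elapsed <= t:
        n += 1
        elapsed += rng.expovariate(rate)
    return n


rng = random.Random(42)
rate, t = 0.5, 4.0  # arbitrary illustrative values
samples = [count_events(rate, t, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
# For a Poisson process the expected count is rate * t.
print(mean, rate * t)
```

The sample mean should land close to rate * t, which is the same relation that later appears as d = µt.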
12 MARKOV PROCESSES
If we consider all possible mutations (from A to C, G, T and so on), we end up with a matrix of rates and with a time-homogeneous continuous-time Markov chain.
FURTHER SIMPLIFYING HYPOTHESES
Frequencies are in equilibrium: $\pi_A, \pi_C, \pi_G, \pi_T$ (stationary chain).
The process is time reversible: $\pi_i P_{ij}(t) = \pi_j P_{ji}(t)$.
RATE MATRIX
Under the previous hypotheses, the Q-matrix decomposes as $q_{ij} = R_{ij}\,\pi_j$ for $i \neq j$, where
R is a symmetric matrix;
$\pi$ are the stationary frequencies (solution of $\pi Q = 0$, with $\pi$ a row vector summing to 1).
For nucleotide substitution models, we have 6+3 parameters to set.
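The decomposition of the rate matrix can be illustrated with a short sketch. The numbers for R and π below are hypothetical, chosen only so the checks are easy to follow, and `rate_matrix` is an assumed helper name. Stationarity is written with π as a row vector, and the diagonal is set so that every row of Q sums to zero.

```python
def rate_matrix(R, pi):
    """Build Q with q_ij = R_ij * pi_j for i != j and rows summing to zero."""
    n = len(pi)
    Q = [[R[i][j] * pi[j] if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        Q[i][i] = -sum(Q[i])  # diagonal balances the off-diagonal rates
    return Q


# Hypothetical values for illustration only (base order A, C, G, T).
pi = [0.1, 0.2, 0.3, 0.4]
R = [[0, 1, 2, 3],
     [1, 0, 4, 5],
     [2, 4, 0, 6],
     [3, 5, 6, 0]]  # symmetric: 6 free off-diagonal parameters
Q = rate_matrix(R, pi)

# Reversibility at the level of rates: pi_i q_ij = pi_j q_ji.
reversible = all(abs(pi[i] * Q[i][j] - pi[j] * Q[j][i]) < 1e-12
                 for i in range(4) for j in range(4))
# Stationarity: the row vector pi satisfies pi Q = 0.
pi_Q = [sum(pi[i] * Q[i][j] for i in range(4)) for j in range(4)]
print(reversible, pi_Q)
```

The symmetry of R is what makes both checks pass: swapping i and j leaves R_ij π_i π_j unchanged.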
14 EXPECTATIONS
TOTAL RATE OF CHANGE: $\mu = -\sum_i q_{ii}\,\pi_i$
EXPECTED NUMBER OF CHANGES AFTER TIME t: $d = \mu t$
PROBABILITY OF OBSERVING A SUBSTITUTION AFTER TIME t: $p = 1 - \sum_i \pi_i P_{ii}(t)$
p is also the expected number of observed substitutions per site.
15 CORRECTING HAMMING DISTANCE
1. Estimate p as $\hat{p} = \frac{\text{Hamming distance}}{\text{total length}}$.
2. From $d = \mu t$ and $p = 1 - \sum_i \pi_i P_{ii}(t)$ deduce $p = 1 - \sum_i \pi_i P_{ii}(d/\mu)$.
3. Solve the previous formula for d and use the estimate $\hat{p}$ of p to compute the estimate $\hat{d}$.
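When no closed form is available, step 3 can be carried out numerically. The sketch below is one possible approach, not the method of the slides: it computes P(t) = exp(Qt) with a scaled Taylor series and inverts p(d) by bisection, relying on the assumption that p grows with d (which holds for reversible chains). The helper names, the search bound `d_max`, and the uniform example matrix are all assumptions made for illustration.

```python
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm(Q, t, terms=30, squarings=10):
    """exp(Q t): Taylor series on Q t / 2**squarings, then repeated squaring."""
    n = len(Q)
    scale = t / 2 ** squarings
    A = [[Q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, A)]  # A^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def expected_p(Q, pi, d):
    """p = 1 - sum_i pi_i P_ii(d / mu), with mu = -sum_i q_ii pi_i."""
    n = len(pi)
    mu = -sum(pi[i] * Q[i][i] for i in range(n))
    P = expm(Q, d / mu)
    return 1 - sum(pi[i] * P[i][i] for i in range(n))


def corrected_distance(Q, pi, p_hat, d_max=10.0, tol=1e-9):
    """Bisection for the d in [0, d_max] with expected_p(d) = p_hat."""
    if p_hat >= expected_p(Q, pi, d_max):
        raise ValueError("p_hat too large: the distance is saturated")
    lo, hi = 0.0, d_max
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_p(Q, pi, mid) < p_hat:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Illustrative input: uniform frequencies and equal rates between all bases.
pi = [0.25] * 4
Q = [[-0.75 if i == j else 0.25 for j in range(4)] for i in range(4)]
d_hat = corrected_distance(Q, pi, p_hat=0.3)
print(d_hat)
```

The corrected estimate is always at least as large as the observed fraction of differing sites, since hidden substitutions can only add to the count.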
16 DIFFERENT EVOLUTIONARY MODELS There are 6 parameters to fix the rate matrix R and 3 to fix the equilibrium frequencies π.
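17 THE JUKES-CANTOR MODEL
The Jukes-Cantor model is the simplest model of evolution, assuming $R_{ij} = 1$ and $\pi_i = \frac{1}{4}$, so that
$Q = \frac{1}{4}\begin{pmatrix} -3 & 1 & 1 & 1 \\ 1 & -3 & 1 & 1 \\ 1 & 1 & -3 & 1 \\ 1 & 1 & 1 & -3 \end{pmatrix}$
SOLUTION FOR P
$P(t) = \frac{1}{4}\mathbf{1} - Q\,e^{-t}$, where $\mathbf{1}$ denotes the 4x4 matrix of ones.
CORRECTION FOR THE DISTANCE
Here $\mu = \frac{3}{4}$ and $p = \frac{3}{4}\left(1 - e^{-t}\right)$, hence
$\hat{d} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}\hat{p}\right)$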
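The Jukes-Cantor correction is short enough to write out directly. This is a minimal sketch with hypothetical function names; it assumes the usual form of the correction, d = -(3/4) ln(1 - (4/3) p), together with its inverse p = (3/4)(1 - exp(-4d/3)), which follows from d = µt with µ = 3/4.

```python
import math


def jukes_cantor_distance(p_hat: float) -> float:
    """Corrected distance d = -(3/4) ln(1 - (4/3) p_hat)."""
    if not 0 <= p_hat < 0.75:
        raise ValueError("p_hat must lie in [0, 0.75) for the logarithm to exist")
    return -0.75 * math.log(1 - 4 * p_hat / 3)


def jukes_cantor_p(d: float) -> float:
    """Expected fraction of differing sites after d substitutions per site."""
    return 0.75 * (1 - math.exp(-4 * d / 3))


for p_hat in (0.01, 0.1, 0.3, 0.6):
    # The correction is small for close sequences and grows quickly near 0.75.
    print(p_hat, jukes_cantor_distance(p_hat))
```

The bound at 0.75 reflects saturation: two unrelated random sequences with uniform base frequencies already differ at three sites out of four on average.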
19 RECONSTRUCTING THE HISTORY OF LIFE
WHAT DOES PHYLOGENETIC INFERENCE MEAN?
All species on Earth come from a common ancestor. If we have data from a pool of species, we wish to reconstruct the history of speciation events that led to their emergence: we want to find the phylogenetic tree giving this information!
This is a hard task, because data is often incomplete (we lack information about most of the ancestral species) and noisy.
20 METHODS TO INFER PHYLOGENY
APPROACHES TO PHYLOGENY
Distance-based methods
Parsimony methods
Likelihood methods
Bayesian inference methods
DISTANCE-BASED METHODS
Given a matrix of pairwise distances, find the tree that explains it best. Several algorithms:
UPGMA (clustering methods)
Neighbor-Joining
Fitch-Margoliash (sum of squares methods)
21 AN EXAMPLE: PRIMATES
DNA FROM PRIMATES
Tarsius         AAGTTTCATTGGAGCCACCACTCTTATAATTGCCCATGGCCTCACCTCCT...
Lemur           AAGCTTCATAGGAGCAACCATTCTAATAATCGCACATGGCCTTACATCAT...
Homo sapiens    AAGCTTCACCGGCGCAGTCATTCTCATAATCGCCCACGGGCTTACATCCT...
Chimp           AAGCTTCACCGGCGCAATTATCCTCATAATCGCCCACGGACTTACATCCT...
Gorilla         AAGCTTCACCGGCGCAGTTGTTCTTATAATTGCCCACGGACTTACATCAT...
Pongo           AAGCTTCACCGGCGCAACCACCCTCATGATTGCCCATGGACTCACATCCT...
Hylobates       AAGCTTTACAGGTGCAACCGTCCTCATAATCGCCCACGGACTAACCTCTT...
Macaca fuscata  AAGCTTTTCCGGCGCAACCATCCTTATGATCGCTCACGGACTCACCTCTT...
DISTANCE MATRIX
[Table: pairwise distances among Tarsius, Lemur, Homo sapiens, Chimp, Gorilla, Pongo, Hylobates and Macaca fuscata; the values are not preserved in this transcript]
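A distance matrix of the kind shown on this slide can be rebuilt, at least roughly, from the visible sequence prefixes. The sketch below is only an illustration: it uses the 50 bases shown for each species (the full sequences are truncated in the slide, so the numbers will not match the original matrix), it reports the uncorrected fraction of differing sites, and `p_distance` is a hypothetical helper name.

```python
# The 50-base prefixes visible on the slide; the full sequences are longer.
SEQS = {
    "Tarsius":        "AAGTTTCATTGGAGCCACCACTCTTATAATTGCCCATGGCCTCACCTCCT",
    "Lemur":          "AAGCTTCATAGGAGCAACCATTCTAATAATCGCACATGGCCTTACATCAT",
    "Homo sapiens":   "AAGCTTCACCGGCGCAGTCATTCTCATAATCGCCCACGGGCTTACATCCT",
    "Chimp":          "AAGCTTCACCGGCGCAATTATCCTCATAATCGCCCACGGACTTACATCCT",
    "Gorilla":        "AAGCTTCACCGGCGCAGTTGTTCTTATAATTGCCCACGGACTTACATCAT",
    "Pongo":          "AAGCTTCACCGGCGCAACCACCCTCATGATTGCCCATGGACTCACATCCT",
    "Hylobates":      "AAGCTTTACAGGTGCAACCGTCCTCATAATCGCCCACGGACTAACCTCTT",
    "Macaca fuscata": "AAGCTTTTCCGGCGCAACCATCCTTATGATCGCTCACGGACTCACCTCTT",
}


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned sites at which the two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


names = list(SEQS)
matrix = [[p_distance(SEQS[a], SEQS[b]) for b in names] for a in names]
for name, row in zip(names, matrix):
    print(f"{name:<15}" + " ".join(f"{x:.2f}" for x in row))
```

Each entry could then be passed through a correction such as Jukes-Cantor before being handed to a tree-building method.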
22 LEAST SQUARES METHOD
We have our observed distance matrix $D_{ij}$ and a tree T with branch lengths predicting an additive distance matrix $d_{ij}$.
TARGET
Find the tree T minimizing the error between $d_{ij}$ and $D_{ij}$, i.e. the tree minimizing the weighted least squares sum
$S(T) = \sum_{i,j} w_{ij}\,(D_{ij} - d_{ij})^2$
Given a tree topology, the best branch lengths for S can be computed by solving a linear system.
A least squares algorithm needs to search the tree space for the best tree T: this is an NP-hard problem.
The search for the best tree can use branch and bound methods or heuristic state space explorations.
This method gives the best explanation of the data in the least squares sense.
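For a fixed topology, the linear system mentioned above is the set of normal equations of a weighted least squares problem. The sketch below works this out for the smallest interesting case, a four-leaf tree ((A,B),(C,D)) with pendant branches a, b, c, d and internal branch e. The topology, the path table `PATHS` and the function names are assumptions made for the example, and the solver does not forbid negative branch lengths.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


# Rows: leaf pairs AB, AC, AD, BC, BD, CD. Columns: branches a, b, c, d, e.
# An entry is 1 when the branch lies on the path between the two leaves.
PATHS = [
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
]


def fit_branch_lengths(D, w=None):
    """Return (branch lengths, S) minimizing sum_r w_r (D_r - d_r)^2."""
    w = w or [1.0] * len(D)
    k, rows = len(PATHS[0]), range(len(D))
    XtWX = [[sum(w[r] * PATHS[r][i] * PATHS[r][j] for r in rows)
             for j in range(k)] for i in range(k)]
    XtWD = [sum(w[r] * PATHS[r][i] * D[r] for r in rows) for i in range(k)]
    lengths = solve(XtWX, XtWD)
    predicted = [sum(x * l for x, l in zip(row, lengths)) for row in PATHS]
    S = sum(wr * (Dr - dr) ** 2 for wr, Dr, dr in zip(w, D, predicted))
    return lengths, S


# Distances generated from a = 1, b = 2, c = 3, d = 4, e = 5 (exactly additive).
D = [3.0, 9.0, 10.0, 10.0, 11.0, 7.0]
lengths, S = fit_branch_lengths(D)
print(lengths, S)
```

Passing `w = [1 / x**2 for x in D]` gives the weighting used by the Fitch-Margoliash method; a full least squares search would repeat this fit for every candidate topology and keep the one with the smallest S.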
23 FITCH-MARGOLIASH ALGORITHM
Letting $w_{ij} = \frac{1}{D_{ij}^2}$ in $S(T) = \sum_{i,j} w_{ij}\,(D_{ij} - d_{ij})^2$, we obtain the method of Fitch-Margoliash.
[Figure: trees on the leaves Lemur, M. fuscata, Hylobates, Pongo, Gorilla, Pan, Homo sap., Tarsius]
24 HEURISTIC METHODS: UPGMA
HIERARCHICAL CLUSTERING
Hierarchical clustering works by iteratively merging the two closest clusters (sets of elements) in the current collection of clusters. It requires a matrix of distances among singletons. Different ways of computing inter-cluster distances give rise to different HC algorithms.
UPGMA
UPGMA (Unweighted Pair Group Method with Arithmetic mean) computes the distance between two clusters as
$d(A, B) = \frac{1}{|A|\,|B|} \sum_{i \in A,\, j \in B} d_{ij}$.
When two clusters A and B are merged, their union is represented by their ancestor node in the tree. The distance between A and B is evenly split between the two branches entering A and B, i.e. the ancestor is placed at height $d(A, B)/2$.
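The merging loop described above fits in a short function. This is a simple sketch rather than an efficient implementation: it recomputes every average from the original matrix at each step, returns the tree as a Newick string, and uses the hypothetical name `upgma`; the three-leaf input is made up for illustration.

```python
from itertools import combinations


def upgma(names, D):
    """Build a UPGMA tree and return it as a Newick string with branch lengths."""
    # Each cluster id maps to (Newick subtree, member indices, node height).
    clusters = {i: (name, [i], 0.0) for i, name in enumerate(names)}
    next_id = len(names)

    def average_distance(a, b):
        A, B = clusters[a][1], clusters[b][1]
        return sum(D[i][j] for i in A for j in B) / (len(A) * len(B))

    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_distance(*pair))
        height = average_distance(a, b) / 2  # ancestor sits halfway
        tree_a, members_a, height_a = clusters.pop(a)
        tree_b, members_b, height_b = clusters.pop(b)
        newick = (f"({tree_a}:{height - height_a:g},"
                  f"{tree_b}:{height - height_b:g})")
        clusters[next_id] = (newick, members_a + members_b, height)
        next_id += 1
    return next(iter(clusters.values()))[0] + ";"


# Made-up ultrametric distances: A and B are close, C is equally far from both.
names = ["A", "B", "C"]
D = [[0, 2, 6],
     [2, 0, 6],
     [6, 6, 0]]
print(upgma(names, D))
```

Branch lengths are differences of node heights, so every leaf ends up at the same distance from the root, which is the molecular clock assumption built into the method.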
25 UPGMA II
HYPOTHESIS
UPGMA correctly reconstructs the tree if the input distance is an ultrametric (molecular clock).
[Figure: UPGMA tree on the leaves Tarsius, Lemur, Homo sap., Pan, Gorilla, Pongo, Hylobates, M. fuscata]
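Whether a given distance matrix is an ultrametric can be tested with the three-point condition: in every triple of elements, the two largest of the three pairwise distances must be equal. The check below is a small sketch with a hypothetical function name and made-up matrices; the tolerance is there only to absorb floating-point noise.

```python
from itertools import combinations


def is_ultrametric(D, tol=1e-9):
    """Three-point condition: in each triple the two largest distances agree."""
    for i, j, k in combinations(range(len(D)), 3):
        _, middle, largest = sorted((D[i][j], D[i][k], D[j][k]))
        if largest - middle > tol:
            return False
    return True


clock_like = [[0, 2, 6],
              [2, 0, 6],
              [6, 6, 0]]
not_clock_like = [[0, 2, 5],
                  [2, 0, 6],
                  [5, 6, 0]]
print(is_ultrametric(clock_like), is_ultrametric(not_clock_like))
```

Real data rarely satisfy the condition exactly, so in practice the size of the violation is more informative than the yes/no answer.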
26 HEURISTIC METHODS: NEIGHBOR-JOINING
NEIGHBOR-JOINING
Neighbor-Joining works similarly to UPGMA, but it merges the two clusters minimizing $D_{ij} = d_{ij} - r_i - r_j$, where $r_i = \frac{1}{|C| - 2} \sum_k d_{ik}$ is the average distance of i from all other nodes and C is the current set of clusters.
When i and j are merged, their new ancestor x has distance from another node k equal to $d_{xk} = \frac{1}{2}(d_{ik} + d_{jk} - d_{ij})$.
The branch lengths are $d_{ix} = \frac{1}{2}(d_{ij} + r_i - r_j)$ and $d_{jx} = \frac{1}{2}(d_{ij} + r_j - r_i)$.
NJ reconstructs the correct tree if the input distance is additive.
[Figure: trees on the leaves Lemur, M. fuscata, Hylobates, Pongo, Gorilla, Homo sap., Pan, Tarsius]
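The Neighbor-Joining formulas translate almost line by line into code. The sketch below is an illustration under assumptions: the tree is returned as a list of edges, internal nodes get the hypothetical labels x0, x1, ... (which must not clash with leaf names), and the four-leaf input is made up from branch lengths a = 1, b = 2, c = 3, d = 4 and an internal branch of length 5.

```python
from itertools import combinations


def neighbor_joining(names, D):
    """Return the NJ tree as a list of (node, other_node, branch_length) edges."""
    nodes = list(names)
    d = {a: {b: D[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    edges, counter = [], 0
    while len(nodes) > 2:
        n = len(nodes)
        # r_i: sum of distances from i to the other nodes, divided by |C| - 2.
        r = {i: sum(d[i][k] for k in nodes if k != i) / (n - 2) for i in nodes}
        # Merge the pair minimizing d_ij - r_i - r_j.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        x = f"x{counter}"  # hypothetical label for the new ancestor
        counter += 1
        edges.append((i, x, 0.5 * (d[i][j] + r[i] - r[j])))
        edges.append((j, x, 0.5 * (d[i][j] + r[j] - r[i])))
        d[x] = {x: 0.0}
        for k in nodes:
            if k not in (i, j):
                d[x][k] = d[k][x] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        nodes = [k for k in nodes if k not in (i, j)] + [x]
    a, b = nodes
    edges.append((a, b, d[a][b]))  # join the last two nodes
    return edges


# Made-up additive distances for the tree ((A:1,B:2):5,(C:3,D:4)).
names = ["A", "B", "C", "D"]
D = [[0, 3, 9, 10],
     [3, 0, 10, 11],
     [9, 10, 0, 7],
     [10, 11, 7, 0]]
for edge in neighbor_joining(names, D):
    print(edge)
```

Unlike UPGMA, nothing here forces the leaves to be equidistant from a root: the result is an unrooted tree whose branch lengths can differ freely between lineages.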
27 NOT ONLY DNA EVOLVES...
28 THE END
Thanks for your attention!