EVOLUTIONARY DISTANCES


EVOLUTIONARY DISTANCES
FROM STRINGS TO TREES
Luca Bortolussi
Dipartimento di Matematica ed Informatica, Università degli Studi di Trieste
luca@dmi.units.it
Trieste, 14th November 2007

OUTLINE
1. STRINGS: DISTANCES AND EVOLUTION
2. EVOLUTIONARY MODELS (Examples)
3. INFERRING PHYLOGENIES


DIGITAL MOLECULES
DNA
DNA can be considered as a very long string over an alphabet of 4 bases (A, C, G, T). This encyclopedia stores genetic information in volumes (chromosomes), with interesting chapters (genes), reading instructions (regulatory elements) and less interesting material (junk DNA).

GENES ENCODE PROTEINS
The same gene can be present in different organisms, but with variations: the same chapter can be written in French, in Italian, in English, ...
HOW CAN WE MEASURE THE DISTANCE BETWEEN TWO GENES?
Genes are strings of DNA, so we can count the positions at which they differ (Hamming distance):
A C C T G T T A G C
A A C T G G T A C C
Strictly speaking, we should use the edit distance and construct an alignment between the strings.
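As a minimal sketch of the Hamming distance mentioned above, the following Python function counts mismatching positions. It assumes the twenty letters on the slide form two aligned strings of ten bases each; the function name `hamming_distance` is chosen here for illustration.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


# The two strings from the slide, with spaces removed (assumed split: 10 + 10).
s1 = "ACCTGTTAGC"
s2 = "AACTGGTACC"
print(hamming_distance(s1, s2))  # 3
```

The strings differ at positions 2, 6 and 9, so the distance is 3.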


HOW DOES EVOLUTION ACT ON DNA?
EVOLUTIONARY EVENTS
Evolution can modify DNA in several ways:
- Local pointwise mutations can substitute, delete or insert a base somewhere.
- Entire DNA fragments can be deleted or duplicated, possibly reversed in their order.
- Bigger pieces of DNA can be swapped or inverted.
- Entire genomes can be duplicated.
MUTATIONS HAPPEN RANDOMLY!
OUR FOCUS
For simplicity, we focus only on pointwise substitution events.

OUR SCENARIO
Consider two species (human and chimp) that evolved from a common ancestor (some old primate). As the ancestor evolved into human or chimp, its DNA mutated pointwise at some positions, chosen randomly.
evolutionary distance = number of mutations

DOES HAMMING DISTANCE COUNT THE NUMBER OF MUTATIONS?
Consider the following situation:
A → C;  A → G → C;  A → C → G → C
The same observation (A, C) can correspond to different evolutionary histories. Hamming distance ignores multiple substitutions at a site. Moreover:
A → G → A;  A → C → G → A
Hamming distance ignores back-mutations! It underestimates the number of mutations.
CORRECTING DISTANCES
The strategy is to develop a stochastic model of DNA evolution, and use it to correct the observed distance to account for multiple substitutions at a site.
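The underestimation can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not part of the slides: sites are picked uniformly at random, every substitution picks one of the three other bases with equal probability, and the sequence length and mutation counts are arbitrary. The names `evolve` and `observed_differences` are hypothetical.

```python
import random

BASES = "ACGT"


def evolve(seq: str, n_mutations: int, rng: random.Random) -> str:
    """Apply n_mutations pointwise substitutions at uniformly random sites."""
    sites = list(seq)
    for _ in range(n_mutations):
        pos = rng.randrange(len(sites))
        # A substitution always changes the base at the chosen site.
        sites[pos] = rng.choice([b for b in BASES if b != sites[pos]])
    return "".join(sites)


def observed_differences(a: str, b: str) -> int:
    """Hamming distance between ancestor and descendant."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)  # fixed seed, only for reproducibility
ancestor = "".join(rng.choice(BASES) for _ in range(100))
for n in (10, 50, 200):
    descendant = evolve(ancestor, n, rng)
    print(n, "mutations ->", observed_differences(ancestor, descendant), "observed")
```

The observed count can never exceed the true number of mutations, and the gap widens as more mutations hit sites that have already changed.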


A SIMPLE MODEL OF NUCLEOTIDE EVOLUTION
HYPOTHESES
- Time evolves continuously.
- Each site can be substituted independently.
- The rate of substitutions (expected frequency per unit of time) does not change in time (homogeneity).
- The rate of change from base i to base j does not depend on the mutation history of the site (memoryless property).
CONSEQUENCES
- The waiting time until a single mutation event is modeled by an exponential distribution.
- The number of mutations is modeled by a Poisson process.
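The link between exponential waiting times and Poisson counts can be checked numerically. The sketch below draws exponential waiting times and counts how many events fall in a time window; the rate, window length and number of runs are arbitrary illustration values, and `count_substitutions` is a hypothetical helper name.

```python
import random


def count_substitutions(rate: float, t: float, rng: random.Random) -> int:
    """Count events in [0, t] when waiting times are Exponential(rate)."""
    elapsed, events = 0.0, 0
    while True:
        elapsed += rng.expovariate(rate)  # time until the next substitution
        if elapsed > t:
            return events
        events += 1


rng = random.Random(42)  # fixed seed, only for reproducibility
rate, t, runs = 0.5, 4.0, 20000
mean = sum(count_substitutions(rate, t, rng) for _ in range(runs)) / runs
print(f"empirical mean {mean:.3f}, Poisson mean rate*t = {rate * t:.3f}")
```

The empirical mean should be close to rate × t, the mean of the corresponding Poisson distribution.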

MARKOV PROCESSES
If we consider all possible mutations (from A to C, G, T and so on), we end up with a matrix of rates and with a time-homogeneous continuous-time Markov chain.
FURTHER SIMPLIFYING HYPOTHESES
- Frequencies are in equilibrium: $\pi_A, \pi_C, \pi_G, \pi_T$ (stationary chain).
- The process is time reversible: $\pi_i P_{ij}(t) = \pi_j P_{ji}(t)$.
RATE MATRIX
Under the previous hypotheses, the off-diagonal entries of the Q-matrix decompose as
$$q_{ij} = R_{ij} \pi_j \quad (i \neq j)$$
- $R$ is a symmetric matrix.
- $\pi$ are the stationary frequencies (solution of $\pi Q = 0$, with $\pi$ as a row vector).
For nucleotide substitution models, we have 6+3 parameters to set.
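The decomposition of the rate matrix can be sketched in a few lines of Python. The values of `R` and `pi` below are hypothetical and chosen only for illustration; the diagonal is filled so that each row sums to zero, which is the usual convention for a rate matrix and is assumed here.

```python
def build_rate_matrix(R, pi):
    """Return Q with q_ij = R_ij * pi_j for i != j and rows summing to zero."""
    n = len(pi)
    Q = [[R[i][j] * pi[j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        Q[i][i] = -sum(Q[i])  # diagonal balances the off-diagonal rates
    return Q


# Hypothetical values, order A, C, G, T.
pi = [0.1, 0.2, 0.3, 0.4]
R = [[0, 1, 2, 1],
     [1, 0, 1, 2],
     [2, 1, 0, 1],
     [1, 2, 1, 0]]
Q = build_rate_matrix(R, pi)

# Stationarity: sum_i pi_i * q_ij = 0 for every j.
flow = [sum(pi[i] * Q[i][j] for i in range(4)) for j in range(4)]
# Reversibility at the level of rates: pi_i * q_ij = pi_j * q_ji.
balanced = all(abs(pi[i] * Q[i][j] - pi[j] * Q[j][i]) < 1e-12
               for i in range(4) for j in range(4))
print(flow, balanced)
```

Because `R` is symmetric, both checks hold for any choice of frequencies, which is why the symmetric factor and the frequencies can be set independently.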


EXPECTATIONS
TOTAL RATE OF CHANGE
$$\mu = -\sum_i q_{ii} \pi_i$$
EXPECTED NUMBER OF CHANGES AFTER TIME t
$$d = \mu t$$
PROBABILITY OF OBSERVING A SUBSTITUTION AFTER TIME t
$$p = 1 - \sum_i \pi_i P_{ii}(t)$$
$p$ is also the expected number of observed substitutions per site.

CORRECTING HAMMING DISTANCE
1. Estimate $p$ as $\hat{p} = \frac{\text{Hamming distance}}{\text{total length}}$.
2. From $d = \mu t$ and $p = 1 - \sum_i \pi_i P_{ii}(t)$ deduce $p = 1 - \sum_i \pi_i P_{ii}\left(\frac{d}{\mu}\right)$.
3. Solve the previous formula for $d$ and use the estimate $\hat{p}$ of $p$ to compute the estimate $\hat{d}$.

DIFFERENT EVOLUTIONARY MODELS There are 6 parameters to fix the rate matrix R and 3 to fix the equilibrium frequencies π.

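THE JUKES-CANTOR MODEL
The Jukes-Cantor model was published in 1969. It is the simplest model of evolution, assuming $R_{ij} = 1$ and $\pi_i = \frac{1}{4}$:
$$Q = \begin{pmatrix} -\frac{3}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \\ \frac{1}{4} & -\frac{3}{4} & \frac{1}{4} & \frac{1}{4} \\ \frac{1}{4} & \frac{1}{4} & -\frac{3}{4} & \frac{1}{4} \\ \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & -\frac{3}{4} \end{pmatrix}$$
SOLUTION FOR P
$$P(t) = \frac{1}{4} J - Q e^{-t}$$
where $J$ is the $4 \times 4$ matrix of ones, i.e. $P_{ii}(t) = \frac{1}{4} + \frac{3}{4} e^{-t}$ and $P_{ij}(t) = \frac{1}{4} - \frac{1}{4} e^{-t}$ for $i \neq j$.
CORRECTION FOR THE DISTANCE
$$\hat{d} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} \hat{p}\right)$$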
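The Jukes-Cantor correction is easy to evaluate numerically. The sketch below implements the correction formula and its inverse, which follows from $p = 1 - P_{ii}(t)$ with $\mu = \frac{3}{4}$; the function names and the sample values of `p_hat` are illustrative assumptions.

```python
import math


def jukes_cantor_distance(p_hat: float) -> float:
    """Corrected distance d = -(3/4) ln(1 - (4/3) p_hat)."""
    if not 0.0 <= p_hat < 0.75:
        raise ValueError("the correction is only defined for 0 <= p_hat < 3/4")
    return -0.75 * math.log(1.0 - 4.0 * p_hat / 3.0)


def jukes_cantor_p(d: float) -> float:
    """Inverse map: expected fraction of differing sites at distance d."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


for p_hat in (0.05, 0.10, 0.30, 0.60):
    print(f"p_hat = {p_hat:.2f}  ->  d = {jukes_cantor_distance(p_hat):.3f}")
```

For small `p_hat` the corrected distance is close to the observed one; as `p_hat` approaches 3/4 (the value expected for unrelated sequences under this model) the correction grows without bound.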


RECONSTRUCTING THE HISTORY OF LIFE
WHAT DOES PHYLOGENETIC INFERENCE MEAN?
All species on Earth come from a common ancestor. If we have data from a pool of species, we wish to reconstruct the history of speciation events that led to their emergence: we want to find the phylogenetic tree giving this information!
This is a hard task, because data is often incomplete (we lack information about most of the ancestor species) and noisy.

METHODS TO INFER PHYLOGENY
APPROACHES TO PHYLOGENY
- Distance-based methods
- Parsimony methods
- Likelihood methods
- Bayesian inference methods
DISTANCE-BASED METHODS
Given a matrix of pairwise distances, find the tree that explains it best. Several algorithms:
- UPGMA (clustering methods)
- Neighbor Joining
- Fitch-Margoliash (sum of squares methods)

AN EXAMPLE: PRIMATES
DNA FROM PRIMATES
Tarsius          AAGTTTCATTGGAGCCACCACTCTTATAATTGCCCATGGCCTCACCTCCT...
Lemur            AAGCTTCATAGGAGCAACCATTCTAATAATCGCACATGGCCTTACATCAT...
Homo sapiens     AAGCTTCACCGGCGCAGTCATTCTCATAATCGCCCACGGGCTTACATCCT...
Chimp            AAGCTTCACCGGCGCAATTATCCTCATAATCGCCCACGGACTTACATCCT...
Gorilla          AAGCTTCACCGGCGCAGTTGTTCTTATAATTGCCCACGGACTTACATCAT...
Pongo            AAGCTTCACCGGCGCAACCACCCTCATGATTGCCCATGGACTCACATCCT...
Hylobates        AAGCTTTACAGGTGCAACCGTCCTCATAATCGCCCACGGACTAACCTCTT...
Macaca fuscata   AAGCTTTTCCGGCGCAACCATCCTTATGATCGCTCACGGACTCACCTCTT...
DISTANCE MATRIX (columns in the same order as the rows)
Tarsius          0.00  0.29  0.40  0.39  0.38  0.34  0.38  0.37
Lemur            0.29  0.00  0.37  0.38  0.35  0.33  0.36  0.34
Homo sapiens     0.40  0.37  0.00  0.10  0.11  0.15  0.21  0.24
Chimp            0.39  0.38  0.10  0.00  0.12  0.17  0.21  0.24
Gorilla          0.38  0.35  0.11  0.12  0.00  0.16  0.21  0.26
Pongo            0.34  0.33  0.15  0.17  0.16  0.00  0.22  0.24
Hylobates        0.38  0.36  0.21  0.21  0.21  0.22  0.00  0.26
Macaca fuscata   0.37  0.34  0.24  0.24  0.26  0.24  0.26  0.00

LEAST SQUARES METHOD
We have our observed distance matrix $D_{ij}$ and a tree $T$ with branch lengths predicting an additive distance matrix $d_{ij}$.
TARGET
Find the tree $T$ minimizing the error between $d_{ij}$ and $D_{ij}$, i.e. the tree minimizing the weighted least-squares sum
$$S(T) = \sum_{i,j} w_{ij} (D_{ij} - d_{ij})^2$$
- Given a tree topology, the best branch lengths for $S$ can be computed by solving a linear system.
- A least-squares algorithm needs to search the tree space for the best tree $T$: this is an NP-hard problem.
- The search for the best tree can use branch and bound methods or heuristic state space explorations.
- This method gives the best explanation of the data in the least-squares sense.
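To make the objective concrete, the sketch below evaluates the least-squares sum for one fixed tree. It does not search the tree space or optimize branch lengths. The four-leaf tree, its branch lengths and the perturbed observed matrix are hypothetical illustration values, and the sum runs over unordered pairs i < j, which differs from the sum over all ordered pairs only by a factor of 2.

```python
from itertools import combinations


def tree_distances(edges, leaves):
    """Path lengths between leaves of a tree given as {(u, v): length}."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))

    def from_source(src):
        dist, stack = {src: 0.0}, [src]
        while stack:  # depth-first walk; paths in a tree are unique
            u = stack.pop()
            for v, w in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + w
                    stack.append(v)
        return dist

    return {a: from_source(a) for a in leaves}


def least_squares_score(D, d, leaves, weight=lambda Dij: 1.0):
    """S(T) = sum over pairs i < j of w_ij * (D_ij - d_ij)^2."""
    return sum(weight(D[a][b]) * (D[a][b] - d[a][b]) ** 2
               for a, b in combinations(leaves, 2))


# Hypothetical tree: leaves a, b hang off x; leaves c, d hang off y.
leaves = ["a", "b", "c", "d"]
edges = {("a", "x"): 1.0, ("b", "x"): 2.0, ("x", "y"): 1.0,
         ("c", "y"): 3.0, ("d", "y"): 1.0}
d = tree_distances(edges, leaves)

# Hypothetical observed matrix: the tree distances with one noisy entry.
D = {a: dict(row) for a, row in d.items()}
D["a"]["b"] = D["b"]["a"] = 3.5

print(least_squares_score(D, d, leaves))                              # 0.25
print(least_squares_score(D, d, leaves, lambda Dij: 1.0 / Dij ** 2))  # weighted
```

Passing a different `weight` function changes how strongly large observed distances count; the second call uses weights of the form $1/D_{ij}^2$.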

FITCH-MARGOLIASH ALGORITHM
Letting $w_{ij} = \frac{1}{D_{ij}^2}$ in $S(T) = \sum_{i,j} w_{ij} (D_{ij} - d_{ij})^2$, we obtain the method of Fitch-Margoliash.
[Figure: two tree drawings with leaves Lemur, M fuscata, Hylobates, Pongo, Gorilla, Pan, Homo sap, Tarsius]

HEURISTIC METHODS: UPGMA
HIERARCHICAL CLUSTERING
Hierarchical clustering works by iteratively merging the two closest clusters (sets of elements) in the current collection of clusters. It requires a matrix of distances among singletons. Different ways of computing intercluster distances give rise to different HC-algorithms.
UPGMA
UPGMA (Unweighted Pair Group Method with Arithmetic mean) computes the distance between two clusters as
$$d(A, B) = \frac{1}{|A||B|} \sum_{i \in A,\, j \in B} d_{ij}$$
When two clusters A and B are merged, their union is represented by their ancestor node in the tree. The distance between A and B is evenly split between the two branches entering A and B.
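The merging procedure can be sketched as follows. This is a simple, unoptimized illustration and not a reference implementation: cluster distances are recomputed from the original matrix at every step, the tree is returned as nested `(subtree, branch_length)` pairs, and the four-taxon matrix is a hypothetical ultrametric example.

```python
def upgma(names, dist):
    """Return (tree, root_height); dist[i][j] is the distance names[i]-names[j]."""
    # cluster id -> (subtree, member leaf indices, height above the leaves)
    clusters = {i: (name, [i], 0.0) for i, name in enumerate(names)}
    next_id = len(names)

    def avg(a, b):
        """Arithmetic mean of all leaf-to-leaf distances between two clusters."""
        ma, mb = clusters[a][1], clusters[b][1]
        return sum(dist[i][j] for i in ma for j in mb) / (len(ma) * len(mb))

    while len(clusters) > 1:
        ids = sorted(clusters)
        a, b = min(((x, y) for k, x in enumerate(ids) for y in ids[k + 1:]),
                   key=lambda pair: avg(*pair))
        height = avg(a, b) / 2.0  # distance split evenly between both sides
        ta, ma, ha = clusters.pop(a)
        tb, mb, hb = clusters.pop(b)
        subtree = ((ta, height - ha), (tb, height - hb))
        clusters[next_id] = (subtree, ma + mb, height)
        next_id += 1

    tree, _, height = next(iter(clusters.values()))
    return tree, height


# Hypothetical ultrametric distances: (A, B) and (C, D) are sister pairs.
names = ["A", "B", "C", "D"]
dist = [[0, 2, 6, 6],
        [2, 0, 6, 6],
        [6, 6, 0, 4],
        [6, 6, 4, 0]]
tree, height = upgma(names, dist)
print(tree, height)
```

On this input A and B are joined first at height 1, then C and D at height 2, and the root sits at height 3.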

UPGMA - II
HYPOTHESIS
UPGMA reconstructs the tree correctly if the input distance is an ultrametric (molecular clock).
[Figure: tree with leaves Tarsius, Lemur, Homo sap, Pan, Gorilla, Pongo, Hylobates, M fuscata]
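Whether a matrix is an ultrametric can be tested with the three-point condition: in every triple of taxa, the two largest of the three pairwise distances must be equal. The sketch below applies this check to the Homo sapiens, Chimp and Gorilla entries of the primate distance matrix and to a hypothetical clock-like matrix; the function name and tolerance are illustrative choices.

```python
from itertools import combinations


def is_ultrametric(dist, tol=1e-9):
    """Three-point condition: in every triple the two largest distances tie."""
    n = len(dist)
    for i, j, k in combinations(range(n), 3):
        _, second, largest = sorted((dist[i][j], dist[i][k], dist[j][k]))
        if largest - second > tol:
            return False
    return True


# Homo sapiens, Chimp, Gorilla entries taken from the primate distance matrix.
apes = [[0.00, 0.10, 0.11],
        [0.10, 0.00, 0.12],
        [0.11, 0.12, 0.00]]
# Hypothetical clock-like matrix for comparison.
clock = [[0, 2, 6, 6],
         [2, 0, 6, 6],
         [6, 6, 0, 4],
         [6, 6, 4, 0]]
print(is_ultrametric(apes), is_ultrametric(clock))  # False True
```

Real data rarely satisfies the condition exactly, so the UPGMA guarantee applies only approximately in practice.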

HEURISTIC METHODS: NEIGHBOR-JOINING
NEIGHBOR-JOINING
Neighbor-Joining works similarly to UPGMA, but it merges together the two clusters minimizing
$$D_{ij} = d_{ij} - r_i - r_j, \quad \text{where} \quad r_i = \frac{1}{|C| - 2} \sum_k d_{ik}$$
is the average distance of $i$ from all other nodes and $C$ is the current set of clusters.
When $i$ and $j$ are merged, their new ancestor $x$ has distance from another node $k$ equal to
$$d_{xk} = \frac{1}{2} (d_{ik} + d_{jk} - d_{ij})$$
The branch lengths are
$$d_{ix} = \frac{1}{2} (d_{ij} + r_i - r_j) \quad \text{and} \quad d_{jx} = \frac{1}{2} (d_{ij} + r_j - r_i)$$
NJ reconstructs the correct tree if the input distance is additive.
[Figure: two tree drawings with leaves Lemur, M fuscata, Hylobates, Pongo, Gorilla, Homo sap, Pan, Tarsius]
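The neighbor-joining update rules can be sketched directly from the formulas on this slide. The code below is an unoptimized illustration: internal nodes get hypothetical labels `x0`, `x1`, ..., the result is a list of `(node, node, length)` edges of an unrooted tree, and the input is a hypothetical additive matrix generated by a four-leaf tree with branch lengths a:1, b:2, c:3, d:1 and an internal edge of length 1.

```python
def neighbor_joining(names, dist):
    """Return the edges (node, node, length) of the neighbor-joining tree."""
    # d[u][v]: current distances between active nodes, keyed by node label.
    d = {a: {b: dist[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    edges, new = [], 0
    while len(d) > 2:
        nodes = list(d)
        r = {i: sum(d[i][k] for k in nodes) / (len(nodes) - 2) for i in nodes}
        # Pick the pair minimizing D_ij = d_ij - r_i - r_j.
        i, j = min(((a, b) for n, a in enumerate(nodes) for b in nodes[n + 1:]),
                   key=lambda p: d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        x = f"x{new}"  # label of the new ancestor node
        new += 1
        edges.append((i, x, 0.5 * (d[i][j] + r[i] - r[j])))
        edges.append((j, x, 0.5 * (d[i][j] + r[j] - r[i])))
        # Distances from the new ancestor x to every remaining node k.
        d[x] = {k: 0.5 * (d[i][k] + d[j][k] - d[i][j])
                for k in nodes if k not in (i, j)}
        for k in list(d[x]):
            d[k][x] = d[x][k]
        d[x][x] = 0.0
        # Retire the two merged nodes.
        for gone in (i, j):
            del d[gone]
        for row in d.values():
            row.pop(i, None)
            row.pop(j, None)
    a, b = d  # the last two nodes are joined by a single edge
    edges.append((a, b, d[a][b]))
    return edges


names = ["a", "b", "c", "d"]
dist = [[0, 3, 5, 3],
        [3, 0, 6, 4],
        [5, 6, 0, 4],
        [3, 4, 4, 0]]
edges = neighbor_joining(names, dist)
for edge in edges:
    print(edge)
```

Since the input is additive, the recovered branch lengths match the generating tree: 1, 2, 3 and 1 for the leaves and 1 for the internal edge.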

NOT ONLY DNA EVOLVES...

THE END Thanks for the attention!