Phylogenetic inference

Bas E. Dutilh
Systems Biology: Bioinformatic Data Analysis
Utrecht University, March 7th 2016

After this lecture, you can:
- discuss (dis-)advantages of different information types for making trees in light of phylogenetic resolution and noise
- explain four different tree-building algorithms (UPGMA, NJ, MP, ML) and list their (dis-)advantages
- convert between cladograms and Newick tree format
- root unrooted trees in four ways
- identify informative positions in an alignment
- calculate the number of possible trees with n leaves
- explain bootstrapping/jackknifing and their assumptions
- interpret branch support values

Characters used for phylogenetic inference
- Phenotypic characters
  - Infinite number of features
  - Subjective choice
  - Value can depend on observation (etc.)
- Sequence (protein/DNA)
  - Gene/genome is finite
  - Objective choice
  - A sequence is absolute

Phylogenetic resolution
- Use variable regions to compare closely related sequences
  - Badly conserved sequences contain too much noise to resolve distant relationships
- Use conserved regions to compare distantly related sequences
  - Highly conserved sequences contain too little information to resolve close relationships

Trimming alignments
- Conserved regions can be more confidently aligned than variable regions
- Variable regions can add noise to an alignment
- To solve this, badly aligned regions can be trimmed before further analysis
  - Gblocks eliminates poorly aligned positions and divergent regions of a DNA or protein alignment

Types of tree-building algorithms
- Distance-based approaches
  - Fastest programs for making phylogenetic trees
  - Unweighted Pair Group Method with Arithmetic mean (UPGMA)
  - Neighbor Joining (NJ)
- Maximum parsimony (MP) approaches
  - Assume the minimal number of changes or evolutionary events
- Maximum likelihood (ML) approaches
  - Depend on an explicit model of evolution
  - Considered the most reliable way to infer phylogenies

Phylogenies based on distance matrices
UPGMA and NJ follow the same workflow:
1. Multiple sequence alignment
2. Evolutionary distance matrix: calculate evolutionary divergence (Jukes-Cantor correction)
3. Cluster: UPGMA, Neighbor Joining
4. Phylogenetic tree

Evolutionary distances
- Sequence (dis-)similarity represents evolutionary distance
  - Use similarity quantification methods from last week's lectures
- But: evolutionary distance does not correlate 1:1 with sequence alignment score
  - Because multiple mutations at the same position in the sequence become increasingly likely as sequences diverge
- So we have to correct for that with the Jukes-Cantor correction:
  $d = -\frac{3}{4} \ln\left(1 - \frac{4}{3} D\right)$
  - d: actual number of mutations (per site)
  - D: observed number of mutated positions (as a fraction of all aligned positions)
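The Jukes-Cantor correction can be tried out numerically. The following is a minimal sketch, assuming two gap-free aligned DNA sequences of equal length; the function names and the two example sequences are illustrative and not part of the lecture.

```python
import math


def observed_difference(seq_a, seq_b):
    """Fraction of aligned positions that differ (D in the formula)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def jukes_cantor(D):
    """Estimated substitutions per site: d = -3/4 * ln(1 - 4/3 * D)."""
    # At D >= 0.75 the sequences are saturated and the logarithm is undefined.
    if not 0 <= D < 0.75:
        raise ValueError("D must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * D / 3)


# Two illustrative sequences that differ at one of ten positions.
D = observed_difference("AAGTCGTTAC", "AAGTTGTTAC")
d = jukes_cantor(D)
print(D, round(d, 4))
```

The corrected distance d is always at least as large as the observed fraction D, and the gap widens as D approaches 0.75, which is where repeated mutations at the same site hide most of the change.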

[Figure: UPGMA algorithm]

Newick tree format
- Phylogenetic trees can be written in bracket notation
- Also known as Newick tree format:
  ((((A:4.0,D:4.0):4.25,((B:0.5,F:0.5):5.75,G:6.25):2.0):6.25,C:14.5):2.5,E:17.0);
- Multifurcating branches can be included in the Newick tree format
  - For example, if the ((B,F),G) node has very low support:
    ((((A:4.0,D:4.0):4.25,(B:0.5,F:0.5):7.75,G:8.25):6.25,C:14.5):2.5,E:17.0);
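To make the bracket notation concrete, here is a minimal sketch of a Newick reader that returns nested (name, length, children) tuples and sums branch lengths from the root to every tip. The parser handles only the simple subset used on the slide (names, branch lengths, multifurcations); the function names are hypothetical, and the branch lengths in the example string are an assumption about the slide's values.

```python
def parse_newick(text):
    """Parse a simple Newick string into (name, length, children) tuples."""
    text = "".join(text.split())  # drop all whitespace
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # skip "("
            children.append(node())
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(node())
            pos += 1  # skip ")"
        start = pos
        while text[pos] not in ":,();":
            pos += 1
        name = text[start:pos]
        length = 0.0  # nodes without ":length" (such as the root) get 0
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return name, length, children

    return node()


def tip_depths(node, above=0.0):
    """Map each tip name to the summed branch length between root and tip."""
    name, length, children = node
    here = above + length
    if not children:
        return {name: here}
    depths = {}
    for child in children:
        depths.update(tip_depths(child, here))
    return depths


rooted = parse_newick(
    "((((A:4.0,D:4.0):4.25,((B:0.5,F:0.5):5.75,G:6.25):2.0):6.25,C:14.5):2.5,E:17.0);"
)
depths = tip_depths(rooted)
print(depths)
print(len(rooted[2]))  # number of groups directly below the top-level node
```

If every tip ends up at the same depth, the tree is consistent with a clock-like (UPGMA-style) reconstruction. The number of children at the top level shows how many basal groups the string describes.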

Placement of the root
- The last thing that is added is the root
- UPGMA assumes that the molecular clock holds
  - All tips have equal distance to the root (this is called ultrametric)
- You can tell if the tree is rooted or unrooted by looking at the number of most basal groups:
  - With root: 2 most basal groups
    ((((A:4.0,D:4.0):4.25,((B:0.5,F:0.5):5.75,G:6.25):2.0):6.25,C:14.5):2.5,E:17.0);
  - Unrooted: 3 most basal groups
    (((A:4.0,D:4.0):4.25,((B:0.5,F:0.5):5.75,G:6.25):2.0):6.25,C:14.5,E:19.5);

Question 1
[Figure: rooted tree with tips A, B, D, E and C]
1. The numbers indicated in the tree above are branch lengths.
   a) What is a common unit of branch length in molecular phylogenies?
   b) Assume that the molecular clock holds. Fill in the missing branch lengths.
   c) What algorithm was used for building this phylogenetic tree?
   d) What are d_AB and d_CD?
   e) Write this tree in Newick tree format with branch lengths.
2. Research has revealed that the molecular clock does not hold for the lineage leading to C. If d_BC = 6, what is the distance between C and its last common ancestor with A, B, D, and E?

Answers 1
[Figure: the tree from Question 1 with the missing branch lengths filled in]
1. Assume that the molecular clock holds
   a) Mutations per sequence site
   b) Missing branch lengths are now included in the tree above
   c) Unweighted Pair Group Method with Arithmetic mean (UPGMA)
   d) d_AB = 4, d_CD = 7
   e) (((A:2,B:2):1,(D:,E:):):0.5,C:3.5);
2. If d_BC is 6, then the branch length of C to its last common ancestor with A, B, D, and E is: 6 - 2 - 1 - 0.5 = 2.5 (instead of 3.5)

Answers 1 (continued)
The following bracket notations are also correct (rotated branches):
   e) (((A:2,B:2):1,(D:,E:):):0.5,C:3.5);
   e) (C:3.5,((D:,E:):,(B:2,A:2):1):0.5);
   e) (C:3.5,((B:2,A:2):1,(D:,E:):):0.5);
   e) (C:3.5,((A:2,B:2):1,(E:,D:):):0.5);
   Et cetera

Non-uniform molecular clock
- Ultrametric algorithms only work if the clock runs at the same speed in all branches
  - All distances to the root are equal
- This is often not the case
[Figure: species A and C (fast evolving) and species B and D (slow evolving), all sampled now]

Unequal rates of evolution are the rule
[Figure: tree including a protist mitochondrion and a plant mitochondrion]
- Neighbor Joining (NJ) is designed to account for a non-uniform molecular clock

Neighbor joining algorithm
- Before we know the tree, the distances between all nodes are represented as a star
- NJ accounts for different rates of evolution
- Evolutionary distances between all nodes are stored in branch lengths
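The slides do not spell out the NJ calculation, so the following is only a sketch of the first joining step, using the standard Saitou-Nei criterion Q(i,j) = (n - 2) d(i,j) - sum_k d(i,k) - sum_k d(j,k) as found in textbooks. The function name and the five-taxon distance matrix are illustrative assumptions, not data from the lecture.

```python
def nj_first_join(names, dist):
    """Pick the first pair NJ would join and the two new branch lengths."""
    n = len(names)
    totals = [sum(row) for row in dist]  # summed distance to all other taxa
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    # The two branches to the new internal node need not be equal,
    # which is how NJ accommodates different rates of evolution.
    length_i = dist[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    length_j = dist[i][j] - length_i
    return names[i], names[j], length_i, length_j


names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
first_join = nj_first_join(names, dist)
print(first_join)
```

In this example a and b are joined first with branch lengths 2 and 3, even though d and e are the closest pair by raw distance. A full NJ run would replace the joined pair by the new node, update the matrix and repeat until the star is fully resolved.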

Rooting trees using an outgroup
- Distantly related species or gene (homolog)
[Figure: trees of genes f-a, f-b, h-a and h-b, rooted with an outgroup (virus) or at a gene duplication event]

Rooting trees using prior knowledge
- You can also root trees using your prior knowledge
  - "I know that the root lies between X and Y"
[Figure: tree of Fish, Mouse, Human and Yeast (F, M, H, Y) with the root between Taxon X and Taxon Y]

Rooting trees using midpoint rooting
- If all else fails, you can also assume that the root lies halfway between the most distant tips
- UPGMA takes this approach
[Figure: tree of gene_w, gene_x, gene_y and gene_z; the root is placed halfway along the path between the two most distant tips, at (4 + 1 + 2 + 1) / 2 = 4]

Phylogenies based on models of evolution
- Evolutionary events happen on the branches of the tree
  - Their probabilities are quantified in a model of evolution
- Some algorithms search all possible trees for the one that best agrees with a given model of evolution
  - Some trees explain the data (= the alignment) better than others
- Maximum parsimony (MP): the tree that assumes the fewest evolutionary events on the branches to explain the alignment
  - This is a very simplistic model of evolution
- Maximum likelihood (ML): gives all evolutionary events a probability, finds the tree where total probability is optimized
  - ML can incorporate very sophisticated models of evolution

Maximum parsimony (MP)
MP example for an alignment in 5 species:
  Chimpanzee  AAGT C GTTAC
  Gibbon      AAGT T GTTAC
  Gorilla     AAGT C GTTAC
  Human       AAGT C GTTAC
  Orangutan   AAGT T GTTAC
- Draw all possible trees for the sequences/species present in your multiple alignment
- For each tree, identify where the mutations have taken place
- Make parsimony assumption: minimum number of required mutations (*)
- The MP tree has the minimum number of required mutations
  - It is the simplest explanation of the alignment
- Informative positions contain at least 2 different characters, each occurring in at least 2 sequences
[Figure: example tree with mutations marked on the branches (1: c-t, 3: t-a, 7: a-t, 6: a-t)]
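Counting the required mutations on one candidate tree can be sketched with Fitch's algorithm, a standard method that the slides do not name. The sketch assumes rooted binary trees written as nested tuples; both candidate topologies are hypothetical examples, and the alignment is the five-species example with the spacing removed.

```python
ALIGNMENT = {
    "Chimpanzee": "AAGTCGTTAC",
    "Gibbon": "AAGTTGTTAC",
    "Gorilla": "AAGTCGTTAC",
    "Human": "AAGTCGTTAC",
    "Orangutan": "AAGTTGTTAC",
}


def fitch(tree, column):
    """Return (possible ancestral states, minimum changes) for one column."""
    if isinstance(tree, str):  # a tip: its state is observed
        return {column[tree]}, 0
    left, right = tree
    left_states, left_cost = fitch(left, column)
    right_states, right_cost = fitch(right, column)
    shared = left_states & right_states
    if shared:  # children can agree, no extra change needed
        return shared, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch(tree, column)[1]
    return total


# Two hypothetical candidate topologies.
tree_a = ((("Human", "Chimpanzee"), "Gorilla"), ("Orangutan", "Gibbon"))
tree_b = ((("Human", "Gibbon"), "Gorilla"), ("Orangutan", "Chimpanzee"))
score_a = parsimony_score(tree_a, ALIGNMENT)
score_b = parsimony_score(tree_b, ALIGNMENT)
print(score_a, score_b)
```

Only the fifth column varies. The first topology explains it with a single change, the second needs two, so parsimony prefers the first. A complete MP analysis would repeat this scoring over all candidate trees, or over a heuristic subset of them.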

Maximum likelihood (ML)
- The simplest explanation (MP) is not always the most likely
  - Some types of mutations are more likely than others
- ML scans all possible trees for the tree that optimizes the probabilities of different events on all the branches
- Models of evolution in ML can be very sophisticated, scoring the likelihood of evolutionary events like:
  - Substitutions between different amino acids
    - E.g. the BLOSUM matrix quantifies the likelihood of all substitutions
  - Insertion or deletion events and their lengths
    - E.g. fixed, linear, or affine gap penalties
  - Site-specific mutation rates
    - E.g. identify fast/slow evolving positions in alignment
    - Give low likelihood to mutations at highly conserved positions
  - Faster/slower evolving lineages in the tree
    - E.g. use the alignment to identify lineages where mutations are more likely
    - Make mutations on those branches cheaper
  - Dependency between certain alignment positions (?)

Drawing all possible trees? Really?
- How many trees are there?
  - For 5 tips, the number of rooted/unrooted trees is: 105 / 15
  - For 10 tips, the number of rooted/unrooted trees is: 34,459,425 / 2,027,025
  - For 15 tips, the number of rooted/unrooted trees is: 213,458,046,676,875 / 7,905,853,580,625
- Number of unrooted trees: $N_U = (2n - 5)!! = (2n - 5)(2n - 7) \cdots 1$
- Number of rooted trees: $N_R = (2n - 3)!! = (2n - 3)(2n - 5)(2n - 7) \cdots 1$
- Note: the root is an additional node in the tree: $N_R(n) = N_U(n + 1)$
- Fast algorithms use heuristics

Heuristic search of tree-space
- Start with a tree (e.g. NJ or random tree)
- Change a bit (e.g. swap two branches)
- Accept change if likelihood goes up
- So only a subset of all trees needs to be evaluated
[Figure: searching tree space, with the score rising from low at the initial tree to high at the ML/MP tree]
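The tree counts quoted above follow directly from the double-factorial formulas. A minimal sketch, with hypothetical function names, assuming fully bifurcating trees with n labelled tips:

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 (for odd k)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 tips: (2n - 5)!!"""
    return double_factorial(2 * n - 5)


def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 tips: (2n - 3)!!"""
    return double_factorial(2 * n - 3)


for n in (5, 10, 15):
    print(n, rooted_trees(n), unrooted_trees(n))
```

Python integers do not overflow, so the same functions also show how quickly the numbers grow beyond 15 tips, which is why an exhaustive search is replaced by heuristics.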

Which phylogeny approach to use?
- A phylogenetic tree is a hypothesis of the evolutionary history
  - Just like for multiple sequence alignments, the tree is an algorithm's best guess based on the given model of evolution
- Distance trees are fast and give you a quick idea of the tree
  - NJ is better than UPGMA because unequal rates of evolution are common
- MP is rarely used to infer molecular phylogenies
  - Trees are not good, the model of evolution is too simplistic
- ML trees are slower to compute but tend to be the most reliable
- If you know the true branching order for some lineages, you can test which tree-building algorithm best recovers them

Garbage in = garbage out!
- Remember: any sequences will be aligned by an alignment program
  - Even if the sequences are not homologous, and thus the multiple sequence alignment is meaningless
- Similarly, any alignment will be turned into a tree by a phylogeny program
  - Even if the sequences are not homologous or if they are very badly aligned, and thus the phylogenetic tree is meaningless
- Solutions:
  - Check your multiple sequence alignment carefully before making a phylogenetic tree
  - If the tree shows unexpected branching order, think twice about your methods before interpreting it as a true biological event

Phylogenetic inconsistencies
- The phylogenies of different genes from the same genomes can be inconsistent
- This can be the result of:
  - Evolution of the gene is different from the evolution of the genome
    - Horizontal gene transfer
    - Unrecognized paralogy
  - Technical issues
    - Bad model of evolution
    - Bad alignments
    - Bad phylogenies
  - Biological noise
    - Mutational saturation: multiple mutations at the same sequence site
    - Different rates of evolution in different lineages (inconsistent molecular clock)

Branch support
- Support values show you how reliable a branching split is
- How do we calculate these values?
  - Bootstrap and jackknife statistics
    - These are a type of resampling statistic
  - Posterior probabilities
    - Based on a model of evolution
[Figure: example trees with support* values of 100%, 50% and 1%; scale bar 0.01. Note: this is the same split (in an unrooted tree)]
*more about this now
http://epidemic.bio.ed.ac.uk/how_to_read_a_phylogeny

Statistics
- We use statistics to assess confidence in an experiment
  - We repeat the experiment N times and test how robust the result is
  - How much variation is there in the result?

Bootstrapping
- We have already used all the data to create the phylogenetic tree, so how can we get information about the confidence of the nodes in the tree?
- Bootstrap re-sampling or bootstrapping:
  - Bootstrap (verb): "to pull oneself up by one's bootstraps" means to better oneself without external help
- A multiple alignment consists of many observations (i.e. the positions or columns)
  - From these many observations, we can randomly re-sample new datasets
  - Create new phylogenies from the re-sampled alignments
  - Are the branchings in the tree robust in these new datasets?

Bootstrapping
- Like many alignment algorithms, bootstrapping/jackknifing assume that all positions in the alignment evolve independently (which is not true)

Bootstrap re-sampling
- Randomly re-sample columns (positions) from a multiple alignment
  - The number of sampled columns is identical to the number of positions in the original alignment
  - Sampling is done with replacement, so some positions can be sampled multiple times, while other positions are never chosen
- Calculate a bootstrap tree based on the randomly resampled alignment
  - Use the same phylogenetic approach
- Repeat this e.g. 100-1,000 times
  - This means making 100-1,000 trees
- For each branch in the original tree, in what percentage of the bootstrap trees was it correctly recovered?
  - This is the bootstrap support of the branch
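The re-sampling and counting steps above can be sketched as follows. This is a simplified illustration, not the lecture's implementation: the tree-building step is left out, the four "bootstrap trees" are hand-written hypothetical examples, trees are nested tuples, and support is counted for rooted clades rather than for splits of an unrooted tree.

```python
import random


def bootstrap_columns(alignment, rng):
    """Resample columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


def clades(tree):
    """Return (tips below this node, set of all clades in the subtree)."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    tips = frozenset()
    found = set()
    for child in tree:
        child_tips, child_clades = clades(child)
        tips |= child_tips
        found |= child_clades
    found.add(tips)
    return tips, found


def bootstrap_support(clade, bootstrap_trees):
    """Percentage of bootstrap trees that contain the given clade."""
    hits = sum(frozenset(clade) in clades(tree)[1] for tree in bootstrap_trees)
    return 100.0 * hits / len(bootstrap_trees)


alignment = {
    "Chimpanzee": "AAGTCGTTAC",
    "Gibbon": "AAGTTGTTAC",
    "Gorilla": "AAGTCGTTAC",
    "Human": "AAGTCGTTAC",
    "Orangutan": "AAGTTGTTAC",
}
replicate = bootstrap_columns(alignment, random.Random(0))

# Hypothetical trees standing in for trees built from four replicates.
usual = ((("Human", "Chimpanzee"), "Gorilla"), ("Orangutan", "Gibbon"))
other = ((("Human", "Gorilla"), "Chimpanzee"), ("Orangutan", "Gibbon"))
trees = [usual, usual, other, usual]
print(bootstrap_support({"Human", "Chimpanzee"}, trees))  # 75.0
print(bootstrap_support({"Orangutan", "Gibbon"}, trees))  # 100.0
```

In a real analysis each replicate alignment would be passed to the same tree-building method as the original alignment, and the resulting 100-1,000 trees would take the place of the hand-written list.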

Jackknife re-sampling
- Randomly select a percentage of the columns (positions) from a multiple alignment
  - For example, 50% of the positions in the original alignment
  - Sampling is done without replacement
- Calculate a jackknife tree based on the randomly resampled alignment
  - Use the same phylogenetic approach
- Repeat this e.g. 100-1,000 times
  - This means making 100-1,000 trees
- For each branch in the original tree, in what percentage of the jackknife trees was it correctly recovered?
  - This is the jackknife support of the branch

Interpreting bootstrap values
- Any branch with 100% support is certain
  - This means that the species within it were always found together as a cluster
  - No other sequences belong to that cluster
- Exam question: Where might C. elegans CDKH be found in at least one of the bootstrap trees?
  - At 15 positions!

Exam question
- What is the maximum number of bootstrap trees where Zebrafish_1A and Stickleback_1A could have formed a clade of two leaves?
- What is the maximum number of bootstrap trees where any clawed frog sequence could have formed a clade of two leaves with any stickleback sequence?