Phylogeny. November 7, 2017

Similar documents
Phylogenetic Tree Reconstruction

Constructing Evolutionary/Phylogenetic Trees

Phylogenetic inference


Constructing Evolutionary/Phylogenetic Trees

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

Phylogenetics. BIOL 7711 Computational Bioscience

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Evolutionary Tree Analysis. Overview

BINF6201/8201. Molecular phylogenetic methods

Molecular Evolution & Phylogenetics

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Algorithms in Bioinformatics

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Dr. Amira A. AL-Hosary

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

C3020 Molecular Evolution. Exercises #3: Phylogenetics

A (short) introduction to phylogenetics

Michael Yaffe Lecture #5 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D

Phylogeny: traditional and Bayesian approaches

Bioinformatics 1 -- lecture 9. Phylogenetic trees Distance-based tree building Parsimony

EVOLUTIONARY DISTANCES

Molecular Phylogenetics (part 1 of 2) Computational Biology Course João André Carriço

Phylogeny Tree Algorithms

Theory of Evolution Charles Darwin

Phylogeny: building the tree of life

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

Phylogene)cs. IMBB 2016 BecA- ILRI Hub, Nairobi May 9 20, Joyce Nzioki

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

What is Phylogenetics

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

(Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise

Algorithmic Methods Well-defined methodology Tree reconstruction those that are well-defined enough to be carried out by a computer. Felsenstein 2004,

Estimating Evolutionary Trees. Phylogenetic Methods

Phylogenetics: Building Phylogenetic Trees

CS5263 Bioinformatics. Guest Lecture Part II Phylogenetics

8/23/2014. Phylogeny and the Tree of Life

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline

Molecular phylogeny - Using molecular sequences to infer evolutionary relationships. Tore Samuelsson Feb 2016

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

CSCI1950 Z Computa4onal Methods for Biology Lecture 5

Phylogenetic trees 07/10/13

Inferring Phylogenetic Trees. Distance Approaches. Representing distances. in rooted and unrooted trees. The distance approach to phylogenies

Consistency Index (CI)

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

CS5238 Combinatorial methods in bioinformatics 2003/2004 Semester 1. Lecture 8: Phylogenetic Tree Reconstruction: Distance Based - October 10, 2003

Multiple Sequence Alignment. Sequences

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2009 University of California, Berkeley

Anatomy of a tree. clade is group of organisms with a shared ancestor. a monophyletic group shares a single common ancestor = tapirs-rhinos-horses

Bioinformatics course

C.DARWIN ( )

How to read and make phylogenetic trees Zuzana Starostová

Phylogeny. Properties of Trees. Properties of Trees. Trees represent the order of branching only. Phylogeny: Taxon: a unit of classification

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression)

Lecture 6 Phylogenetic Inference

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution

Systematics - Bio 615

Phylogenetic inference: from sequences to trees

Phylogeny and Evolution. Gina Cannarozzi ETH Zurich Institute of Computational Science

Integrative Biology 200A "PRINCIPLES OF PHYLOGENETICS" Spring 2008

molecular evolution and phylogenetics

Intraspecific gene genealogies: trees grafting into networks

Phylogenetic Analysis

Phylogenetic Analysis

Letter to the Editor. Department of Biology, Arizona State University

66 Bioinformatics I, WS 09-10, D. Huson, December 1, Evolutionary tree of organisms, Ernst Haeckel, 1866

RECOVERING NORMAL NETWORKS FROM SHORTEST INTER-TAXA DISTANCE INFORMATION

Thanks to Paul Lewis and Joe Felsenstein for the use of slides

Concepts and Methods in Molecular Divergence Time Estimation

MOLECULAR EVOLUTION AND PHYLOGENETICS SERGEI L KOSAKOVSKY POND CSE/BIMM/BENG 181 MAY 27, 2011

Phylogenetic Analysis

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

The practice of naming and classifying organisms is called taxonomy.

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Workshop III: Evolutionary Genomics

Molecular Evolution, course # Final Exam, May 3, 2006

Molecular Evolution and Phylogenetics...a very short course

Lecture 11 Friday, October 21, 2011

Theory of Evolution. Charles Darwin

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

Plan: Evolutionary trees, characters. Perfect phylogeny Methods: NJ, parsimony, max likelihood, Quartet method

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30

Phylogenetic analyses. Kirsi Kostamo

Examining Phylogenetic Reconstruction Algorithms

A Phylogenetic Network Construction due to Constrained Recombination

Bayesian Models for Phylogenetic Trees

进化树构建方法的概率方法 第 4 章 : 进化树构建的概率方法 问题介绍. 部分 lid 修改自 i i f l 的 ih l i

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Finding the best tree by heuristic search

X X (2) X Pr(X = x θ) (3)

The course Phylogenetic data analysis (period IV, 2010) is an in-depth course on this topic

Transcription:

Phylogeny November 7, 2017

Phylogenetics Phylon = tribe/race, genetikos = relative to birth Phylogenetics: study of evolutionary relationships among organisms, sequences, or anything in between Related to multiple sequence alignment: MSAs can be used to construct trees and phylogenetic trees are used in MSA construction

Phylogenetic tree anatomy Separate sequences are taxa [singular: taxon] phylogenetically distinct units on the tree Branches connect nodes to nodes or nodes to leaves and represent evolutionary distance branch leaves node

Phylogenetic tree anatomy Trees are generally bifurcating (multifurcation possible for viruses) Trees may be rooted or unrooted OTU = Operational Taxonomic Unit bifurcation multifurcation

Phylogenetic tree anatomy Trees are probably not strictly bifurcating Some evidence for horizontal transfer, between- species transfer etc other reasons? forest of life or reticulated tree of life

Molecular Clock hypothesis MC hypothesis: the rate of evolution is the same in all tree branches. This is suitable for closely related species but isn t always useful or appropriate

Ultrametric distance Assumes that the rates of evolution are the same in all branches (molecular clock) If so: d AC max(d AB, d BC ) for all A, B, C This means that the maximum distance separating the three leaves is not unique X Y A B C AB = 2X AC = 2X BC = 2Y X>Y

Ultrametric distance

Ultrametric yes no

Phylogenetic analysis: methods Strong sequence similarity -> maximum parsimony Recognizable sequence similarity -> distance methods (UPGMA, WPGMA, Neighbor-joining) No? -> maximum likelihood methods

Distance methods Employ the number of changes between each pair in a group of sequences to create a phylogenetic tree Neighbors have the smallest number of sequence changes, so presumably they share their nearest common ancestor (minimize distance, minimize homoplasy) Pioneered by Feng and Doolittle

How to collect distance data Lab methods: Mix single strands of DNA/cDNA from different species and measure association parameters (like a CoT curve) Agglutination times Many more! Sequence analysis methods: Alignments k-mer counting Composition analysis Shared domains

Tree-making Goal: create a tree whose branch lengths reflect the distance metric A B C dab = 2 dbc = 1 dac = 2

UPGMA / WPGMA Unweighted/Weighted pair group method with arithmetic means Sneath, PHA and Sokal, RR, in Numerical Taxonomy (1973) Progressively cluster sequences by distance until a tree is formed

UPGMA Algorithm: Make sure distances are ultrametric Choose i and j such that d ij is minimal and join them to form cluster k Recompute distances for each of the other items (l) in the set: j i k dkl = d il + djl 2 dkl = Ni*dil + Nj*djl Ni + Nj

UPGMA: an example We have 5 sequences and all pairwise distances: A B C D B 8 C 8 2 D 6 8 8 E 2 8 8 6

UPGMA: an example A B C D B 8 C 8 2 D 6 8 8 d (AE) = 2 d (AE)B = [d AB + d EB ]/2 = 8 d (AE)C = [d AC + d EC ]/2 = 8 d (AE)D = [d AD + d ED ]/2 = 6 E 2 8 8 6 1 1 A E

UPGMA: an example B 8 AE B C C 8 2 d (BC) = 2 d (BC)(AE) = [d B(AE) + d C(AE) ]/2 = 8 d (BC)D = [d BD + d CD ]/2 = 8 D 6 8 8 1 1 B C

UPGMA: an example AE BC 1 BC 8 2 3 3 D 6 8 1 1 1 1 A E D B C d (AE)D = 6 d (AED)(BC) = [d (AED)B + d (AED)C ]/2 = 8

Not ultrametric? Additive distances satisfy the four-point metric condition, e.g. d AB + d CD max(d AC + d BD, d AD + d BC ) A B C D

Not ultrametric? Four point metric condition d AB + d CD max(d AC + d BD, d AD + d BC ) d AB = a+b; d CD = c+d; d AC = a+e+c; d BD = b+e+d; d AD = a+e+d; d BC =b+e+c this means that we can define and assign distances to inner points. A a e c C B b d D

Additive trees Additive trees don t strictly require a molecular clock. k If the evolutionary distance is additive, we can use simple arithmetic to infer inner nodes. k Known: d(i, j) j m d(i, k) i j i d(j, k) d(m, k) = (d(j,k) + d(i,k) - d(i, j))/2

Neighbor joining Good example of a minimum evolution method Goal is to minimize the sum of branch lengths (assumes this is the best estimate of phylogeny) In fact, rarely gives the shortest tree Requires additive distances (only additive distances can fit into an unrooted tree) Very fast No molecular clock required, so it s good for real-life data

Neighbor Joining: example A B C D E B 5 C 4 7 D 7 10 7 E 6 9 6 5 F 8 11 8 9 8 1. Compute the net divergence for every node r A = 5+4+7+6+8=30 r D =38 r B =5+7+10+9+11=42 r E =34 r C =32 r F =44 From The Phylogenetic Handbook, Salemi and Vandamme 2004

Neighbor Joining: example A B C D E B 5 C 4 7 D 7 10 7 E 6 9 6 5 F 8 11 8 9 8 2. Create a rate-corrected distance matrix where M ij = d ij -(r i +r j )/(N-2) M AB = d AB - (r A +r B )/(N-2) = 5 - (30+42)/4 = -13 From The Phylogenetic Handbook, Salemi and Vandamme 2004

Neighbor Joining: example A B C D E B -13 C -11.5-11.5 D -10-10 -10.5 E -10-10 -10.5-13 F -10.5-10.5-11 -11.5-11.5 3. Define a new node for which M ij is minimal (either AB or DE here) 4. Compute the branch lengths from a new node, U, to A and B S AU = d AB /2 + (r A -r B )/2(N-2) = 2.5 - (12/(2*4)) = 1 S BU = d AB - S AU = 5-1 = 4 (alternatively, S AU = 4, S BU = 1) From The Phylogenetic Handbook, Salemi and Vandamme 2004

Neighbor Joining: example 5. Compute new distances from the new node U to all other nodes d CU = (d AC + d BC - d AB )/2 = 3 d DU = 6 d EU = 5 d FU = 7 6. N = N -1, repeat steps 1 through 5 From The Phylogenetic Handbook, Salemi and Vandamme 2004

Neighbor Joining: example

Evaluating the trees Bootstrap analysis (Felsenstein 1985) Obtain a new alignment from the original by randomly choosing columns; each column can be selected more than once or not at all For each new dataset, a tree is produced look at the proportion of the trees that each branch is represented in; if < 70% for 200-2000 trees -> low confidence Long branch attraction will be supported at a high bootstrap level ACGGTGGT AGGCTGCT CCGGTCGT ACCGTCGT A GG GTT A GC CTT C GG GTT A CG GTT new tree D B A C

Evaluating the trees Jackknifing Randomly remove half the sites in each sequence Redo alignments as in bootstrapping Subtrees appearing < 70% are suspect

Rooting trees NJ and UPGMA produce unrooted trees Finding the root means conferring evolutionary directionality and more meaning to the tree One method: add an outgroup (species that is more distantly related to the species already in the tree than they are to each other) the point in the tree where the outgroup joins is a good candidate for the root Can also pick the midpoint of the longest chain of consecutive edges (will be OK if not too far from a molecular clock)

Parsimony Probably most widely used of tree-building algorithms Finds tree that can explain the alignment/data with fewest substitutions Assigns a cost to each possible tree No explicit measure of distance Character-based, not distance-based easy to implement Best for sequences with very strong similarity

Parsimony Suppose we have five species, identical except at one nucleotide, 3 Cs and 2Ts Minimal tree has one evolutionary change: ACGGATCAGCTTTAGCC ACGGATCAGCTTTAGCC ACGGATCAGCTTTAGCC ACGGATCAGTTTTAGCC ACGGATCAGTTTTAGCC T C C T C

Traditional Parsimony (Fitch 1971) Example four species, small sequences: w AAG 3 trees are possible if you have four species: x AAA w x y GGA A z AGA y w z C w y x B z y z x

Traditional Parsimony (Fitch 1971) w x y z AAG AAA GGA AGA first, examine tree layout A If this tree is correct, how many changes happened in each position of the ancestral sequence? w x position 1: position 2: position 3: y z

Traditional Parsimony (Fitch 1971) w AAG x AAA y GGA z AGA 1 2 3 A 1 2 1 B C A first, examine tree layout A If this tree is correct, how many changes happened in each position of the ancestral sequence? w y AAG GGA??? A?A AGA or AAA AAA x z AGA

Traditional Parsimony (Fitch 1971) Example: w AAG x AAA y GGA z AGA 1 2 3 A 1 2 1 B 1 2 1 C 1 1 1 A B w y w z x z y x w C x z y most parsimonious

Parsimony Can be misleading when rates of sequence change vary in different branches of the tree Different sections of genes and different regions of a genome will evolve at different rates (why?) has particular problems with homoplasy (same sequence change in more than one branch of the tree)

Speeding up parsimony Branch-and-bound Greedy algorithms with branch swapping Nearest-neighbor interchange Subtree pruning and regrafting Tree bisection and reconnection

Maximum likelihood methods Look for the tree that maximizes the likelihood of the data that have been observed With a few sequences, can just go through all the trees and calculate likelihoods Use Felsenstein s algorithm for larger # of sequences Large datasets, especially proteins, require new strategies (for example, sampling)

Sampling & Bayesian phylogeny P(T D) = P(D T)P(T) P(D) P(T1 D) P(T2 D) = P(D T 1)P(T1) P(D T2)P(T2)

Metropolis algorithm (MCMC) You have one tree. Propose a new tree, with features taken from some distribution (or by perturbing current tree). P 1 = P(T 1 D) = posterior probability of current tree P 2 = P(T 2 D) = posterior probability of new tree if (P 2 P 1 ) take new tree if (P 2 < P 1 ) take new tree with probability P 2 /P 1, else keep old tree

Metropolis algorithm (MCMC) After some amount of time and computation you have a set of trees and probabilities. Main idea: the frequency of property f seen in a chosen tree will converge to the posterior probability of that property, given the model

More realistic models We have used some drastic simplifications: ungapped alignments, each site independent and using the same substitution matrix, etc. Models exist for Allowing different rates at different sites (Yang, Felsenstein and Churchill) Allowing gaps (Mitchison and Durbin) Allowing different probabilistic models

Comparison of methods diagnostic tree: right: C wrong: C B B A A D D MLE and neighbor-joining will get the correct tree Parsimony picks the wrong tree

Comparison of methods Maximum likelihood phylogeny Advantages: most flexible good results using good evolutionary models end up with likelihood/uncertainty for all suboptimal trees Disadvantages computationally intensive bad results using bad evolutionary models

Comparison of methods UPGMA assumes a molecular clock, so provides a rooted tree (may be too strong an assumption) Neighbor-joining is good when evolutionary rates vary. Flexible. Proven to construct the correct tree under the proper circumstances. Parsimony is good for closely related sequences and is generally fast Likelihood method is the most general of all

The computational challenge >1.7 million known species; # trees increases exponentially with each new species added # unrooted trees for n species: (2n-5)!! = 3*5*7*... *(2n-5)

Newer approaches creative ways to determine distance fast clustering methods no good way to draw really big trees though!

Complete composition vector idea: we can count all dimers, trimers, tetramers etc in the sequences that we have from there we can estimate the expected frequency of each 5-mers if the prediction varies significantly from the final counts, there has been selection

CCV For every trimer and tetramer count the occurrences of that sequence. Then the predicted number of counts of each 5-mer are defined by the observed frequencies of its trimer & tetramers For example, the expected frequency of AGCTT is (p(agct)*p(gctt))/p(gct)

CCV Calculate the expected incidence of every kmer based on the observed frequencies of all of the (k-1)mers. The observed frequency of a kmer may vary significantly from what s expected, due to evolutionary pressures. This O/E value can be a type of signature for an organism.

CCV The differences in O/E for each kmer can be used as a distance metric. Successful studies so far: HIV Influenza Fungi

CCV for 86 fungi