Inferring phylogeny. Constructing phylogenetic trees. Tõnu Margus. Bioinformatics MTAT

Contents
- What is phylogeny?
- How and why is it possible to infer it?
- Representing evolutionary relationships on trees
- What types of questions can we ask?
- How to infer phylogeny?

Phylogenetics
In biology, phylogenetics is the study of evolutionary relatedness among various groups of organisms (for example, species or populations), which is inferred from molecular sequencing data and morphological data matrices. A phylogenetic analysis is the scientific procedure that lets you reconstruct the evolutionary history of a group of organisms or sequences.

Time aspect - We are successful descendants of our predecessors

Some surprising uses of the concept of phylogeny
Who is Probo? Warning: do not take it too seriously!


How and why do sequences evolve?
Chromosomes are copied and the copies are divided between daughter cells during cell division. However, these copies are not perfect: they contain a few mistakes, also referred to as mutations. The number of mutations accumulated in a sequence has been proposed to be proportional to the time that separates it from its ancestral sequence.
[Figure: cell division; chromosomes are shown in yellow (Wang Z et al. 2010)]
ATGTGGCATTAGCGGCTATTCGGC
ATGTGGCAGTAGCGGCTATTCGGC
ATGTGGCAGTAGCGTTTATTCGGC
ATGTGGCAGTAG--TTTATTCTGC
ATGTGGCAGTAG---TTATTCTGC
AGTTGGCATTAG---TTATTCTGC

How and why is it possible to infer phylogeny?
Because the number of mutations accumulated in a sequence is proportional to time, closely related sequences are more similar, i.e. the differences between them are smaller. Differences can be expressed as the proportion of changed positions between two sequences, for example as changes per position:
A  ATGTGGCATT
B  ATGTGGCAGT
There is one difference per 10 nt, i.e. 0.1 changes per position.

Calculating distance
Pairwise distances are calculated for each pair of sequences and expressed, for example, as changes per position.
A  ATGTGGCATT
B  ATGTGGCAGT
C  ATCTCGTAGT
Pairwise distances:
     A     B     C
A    0
B    0.1   0
C    0.4   0.3   0
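The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function names `p_distance` and `distance_matrix` are made up for this example, and skipping columns that contain a gap is an assumption.

```python
from itertools import combinations


def p_distance(seq1, seq2, gap="-"):
    """Proportion of compared positions at which two aligned sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: columns where either sequence has a gap are not compared.
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no gap-free positions to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(seqs):
    """Map every unordered pair of sequence names to its p-distance."""
    return {
        frozenset((x, y)): p_distance(seqs[x], seqs[y])
        for x, y in combinations(seqs, 2)
    }


# The three sequences from the slide.
sequences = {"A": "ATGTGGCATT", "B": "ATGTGGCAGT", "C": "ATCTCGTAGT"}
matrix = distance_matrix(sequences)
for pair, d in matrix.items():
    print("-".join(sorted(pair)), d)
```

For the slide's sequences this reproduces the distances 0.1 (A-B), 0.4 (A-C) and 0.3 (B-C).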

Drawing a tree: neighbour joining
First, the closest sequences are connected. Then the more distant ones are added.
[Figure: tree of A, B and C drawn from the pairwise distances A-B 0.1, A-C 0.4, B-C 0.3]
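A compact sketch of the standard neighbour-joining procedure is given below. It is an illustration under stated assumptions rather than the implementation used in any particular program: the function name, the `frozenset`-keyed distance dictionary, and the internal node labels `node1`, `node2`, ... are all choices made for this example.

```python
from itertools import combinations


def neighbour_joining(names, dist):
    """Neighbour joining on a distance dict keyed by frozenset({a, b}).

    Returns the edges of an unrooted tree as (node, neighbour, branch_length).
    """
    d = dict(dist)
    active = list(names)
    edges = []
    counter = 0
    while len(active) > 2:
        n = len(active)
        # Net divergence of each active node from all the others.
        r = {i: sum(d[frozenset((i, k))] for k in active if k != i) for i in active}

        def q(pair):
            i, j = pair
            return (n - 2) * d[frozenset(pair)] - r[i] - r[j]

        # Join the pair with the smallest Q value (ties: first pair found).
        i, j = min(combinations(active, 2), key=q)
        dij = d[frozenset((i, j))]
        li = dij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        lj = dij - li
        counter += 1
        u = f"node{counter}"  # hypothetical label for the new internal node
        edges += [(i, u, li), (j, u, lj)]
        # Distances from the new node to every remaining node.
        for k in active:
            if k not in (i, j):
                d[frozenset((u, k))] = (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                ) / 2
        active = [k for k in active if k not in (i, j)] + [u]
    a, b = active
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


# The three-sequence distance matrix from the slides.
slide_dist = {
    frozenset(("A", "B")): 0.1,
    frozenset(("A", "C")): 0.4,
    frozenset(("B", "C")): 0.3,
}
for node, neighbour, length in neighbour_joining(["A", "B", "C"], slide_dist):
    print(node, neighbour, round(length, 3))
```

With only three sequences every pair has the same Q value, so the choice of the first pair does not change the resulting unrooted tree; the branch lengths should come out close to 0.1 for A, 0 for B and 0.3 for C.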


Understanding a tree
A, B and C are LEAVES, or OTUs (operational taxonomic units). A hypothetical ancestral sequence could occupy a NODE.
[Figure: tree of A, B and C built from the distances A-B 0.1, A-C 0.4, B-C 0.3, with a node, a leaf (OTU) and a branch length labelled]

Rooted and unrooted trees
ROOTING means determining the ancestral node. The other leaves (OTUs) are arranged relative to it. Often the root is represented by an additional line without an OTU.
[Figure: rooted tree of A, B and C; the ROOT is the ancestral node for A, B and C]

Rooted and unrooted trees
For ROOTING we can use an OUTGROUP. An OUTGROUP can be a sequence from an organism that diverged before the species we are trying to root. For example, to root human and horse we can use mouse as the outgroup.
[Figure: tree with A = human, B = horse and C = mouse as outgroup; the ancestral node for human and horse is marked]

Rooted trees
ROOTING defines the branching order by placing it on a timeline. More ancient events are close to the ROOT and more recent events are close to the LEAVES.
[Figure: rooted tree of A, B and C with a time axis; the ROOT is the ancestral node for A, B and C]

Unrooted trees
There are different ways of presenting the same tree.
[Figure: two drawings of the same tree of A, B and C; one of the layouts is labelled as often used for unrooted trees]


Four major reasons why you may want to use phylogenetics
1. Determining the closest relatives of the organism that you're interested in: for instance, if you're studying a new bacterium, you can sequence its ribosomal RNA and use it for constructing a phylogenetic tree.
2. Discovering the function of a gene: you can use phylogenetic trees to be sure that the gene you're interested in is orthologous to another well-characterized gene in another species.
3. Retracing the origin of a gene: from time to time, individual genes may jump from one species to another. Phylogenetic trees are a great way to reveal such events, which are called horizontal (or lateral) transfers.
4. Characterizing a gene family: describing the structure of the gene/protein family, i.e. determining functionally homogeneous subsets (families/subfamilies), the gene distribution in different organisms, and gene duplications and LGT.

Determining closest relatives: example of an environmental sample on a 16S tree
[Figure: 16S tree in which an unknown species groups within Firmicutes]
https://www.cosmoss.org/physcome_project/wiki/contaminations/what_is_the_closest_sequenced_taxon

Resolving the evolutionary history of living organisms: constructing species trees

Discovering the function of a gene
Each group of proteins forms a distinct branch.
[Figure: unrooted tree of translational GTPases (Margus et al. 2007)]

Example of lateral gene transfer (LGT)
-proteobacteria genes are nested among alpha-proteobacteria genes: the -proteobacteria have acquired an extra copy of EF-G laterally.

Studying gene duplications and gene families
- Gene duplications are very widespread.
- Gene duplication generates material (copies) for evolution.
- The copies start to change and evolve and, in some cases, genes with new functions arise from them.
- This makes it difficult to recognize genes/proteins that carry out the same function in different organisms, mainly because it can be difficult to choose the proper gene among many homologs.
- Genes in two different organisms that share common ancestry and form the closest pair of proteins between them are called ORTHOLOGS.

Orthologs and paralogs
Duplicate genes in the same genome are called PARALOGS. As time passes, functional differences may appear between PARALOGS.

Orthologs and paralogs
ORTHOLOGS have the same function!

Example of a protein family tree
- A big tree is difficult to read.
- We need some marker sequences.
- We need good support.
- Similar OTUs (subfamily, orthologs...) can be compressed.
[Figure: tree with COMPRESSED branches; labels: number of sequences, good support, diversity]

Example of a protein family tree
Proteins in one compressed triangle are orthologs.
[Figure: phylogenetic tree of elongation factor G (EF-G). Four subfamilies are clearly seen.]

Associating function with orthologs
[Figure: EF-G subfamilies labelled by function: translocation and ribosome recycling; translocation; recycling; UNKNOWN]

Phylogenetic profiling: mapping subfamilies to bacterial phylogeny
[Figure: species tree based on 16S rRNA with the EF-G subfamilies mapped onto it; labels: -proteobacteria, -proteobacteria, Spirochaetes, disappearing EF-G-I, appearing spdefgs]

Going on to infer function?!
- We have well-defined sets of homologous proteins.
- Function is characterized for three of them.
- Several 3-D structures are available.
- Several functional domains, motifs and amino acids have been determined for EF-G I.
- Can we find the positions that best characterize the as yet unknown function of this subfamily?

Contents
- What is phylogeny?
- How and why is it possible to infer phylogeny?
- Representing evolutionary relationships on trees
- What types of questions can we ask?
- How to infer phylogeny? Methods, data, estimating reliability

Methods for building trees

There are two very different ways to produce trees. The first one uses ClustalW; it's quick, hassle-free, and somehow very similar to (good) fast food. The second step list is for those of you out there who love to buy fresh vegetables on the market and make your own salad dressing: sticklers for detail, if you catch our drift. With these steps, you can control every ingredient that...

Distance methods
- A single statistic, the DISTANCE, is calculated for each pair of sequences.
- Based on the distances, the final tree is built using the neighbour joining (NJ) or UPGMA method.
- Better methods for calculating the DISTANCE use:
  - an amino acid distance matrix (PAM or BLOSUM)
  - correction of the distance for multiple changes at the same position
  - position classes (invariant, slowly evolving, medium and fast)
- Distance methods are fast.
- You can use much larger (~10 times) datasets than with ML or Bayesian methods.

Parsimony methods (MP, maximum parsimony)
- are good for inferring trees from DNA data
- many models for DNA evolution
- models for coding regions and for each codon position separately
- no models for protein evolution
Likelihood methods (ML, maximum likelihood)
- take into account amino acid replacement patterns observed from sequences (PAM, BLOSUM, WAG... many different models)
- are used for computing protein trees
All these methods are CPU-expensive, and it might take days to compute a tree for 200-300 organisms.

Data: the sequences selected for input are crucial

About the importance of preparing data

Choosing the right sequences for the right tree
The assumption is that the sequences you are comparing have a common ancestor.

Using DNA or protein?
- DNA > 70% identical: you can align it directly. If it is a coding sequence, align it as protein-coding DNA.
- DNA < 70% identical: translate to protein and align the proteins; then use the protein alignment to align the DNA codon by codon (http://coot.embl.de/pal2nal).
- If most synonymous sites are different, this indicates saturation. It is then safer to use proteins for phylogeny.
- If synonymous sites are not saturated, a distance measured from DNA is more accurate.
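The idea of using a protein alignment to align DNA codon by codon can be sketched as follows. This is only an illustration of the principle, not the PAL2NAL program itself: the function name `thread_codons` is made up, and the sketch assumes the coding sequence starts in frame and does not check that each codon actually translates to the aligned residue.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Build one row of a codon alignment.

    aligned_protein: protein sequence with gap characters from the alignment.
    cds: the unaligned coding DNA that encodes this protein, in frame.
    """
    n_residues = sum(ch != gap for ch in aligned_protein)
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is too short for the protein")
    row = []
    pos = 0
    for ch in aligned_protein:
        if ch == gap:
            # One protein gap becomes a three-nucleotide gap.
            row.append(gap * 3)
        else:
            # Copy the next codon; no translation check is performed here.
            row.append(cds[pos:pos + 3])
            pos += 3
    return "".join(row)


print(thread_codons("MK-L", "ATGAAACTG"))
```

Applying the function to every sequence in the protein alignment gives DNA rows of equal length in which gaps never split a codon.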

Choosing sequences to make either a gene tree or a species tree
Homologous genes are genes that derive from a common ancestor. They can have three types of relationships:
- Orthologs: they are separated only by speciation. When you need a species tree, use ONLY orthologs.
- Paralogs: separated by a duplication event (within a genome).
- Xenologs (Greek xeno, foreigner): the result of LGT between two organisms, when the original copy of a gene is replaced by a foreign ortholog.

Pre-computed sets of orthologous proteins/genes
- COG: Clusters of Orthologous Groups at NCBI, by Eugene Koonin (www.ncbi.nlm.nih.gov/cog)
- Collections of homologous genes include HOGENOM and HOVERGEN, developed by the Pôle Bioinformatique Lyonnais (http://pbil.univ-lyon1.fr/)
- RDP II: Ribosomal Database Project (http://rdp.cme.msu.edu/). It contains mainly bacterial, structurally aligned rRNA sequences. It includes a lot of paralogues but is nevertheless widely used for inferring species trees of bacteria.

Create the perfect set

Preparing your multiple sequence alignment
- Computing the multiple sequence alignment (ClustalW, T-Coffee, MUSCLE)
- Making sure you have the right multiple sequence alignment
The quality of your multiple sequence alignment is the real limiting factor when you make a tree; there is no way you can make a good tree with a bad alignment.

To ensure that your multiple alignment is both accurate and suitable
1. Make sure there are as many gap-free columns as possible. Gaps cause trouble for most phylogeny reconstruction methods.
2. Remove the extremities of your multiple alignment. The N-terminus and the C-terminus tend to be poorly conserved and therefore poorly aligned. You can safely remove them.
3. Remove the gap-rich regions of your alignment. Internal, gap-rich regions in a multiple sequence alignment often correspond to loops.
4. Be sure to keep the most informative blocks. The ideal multiple alignment for building a tree would be a high-quality alignment of sequences with a low level of identity.
What does a good block look like? It's typically 20 to 30 amino acids long and contains a few conserved positions. Such blocks are ideal for producing high-quality trees.
You can use the T-Coffee server to evaluate your multiple sequence alignment and remove columns that are unlikely to be correctly aligned.
The best way to edit your multiple alignment is to use BioEdit or Jalview.
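Removing gap-rich columns can also be done programmatically. The sketch below is a hypothetical helper, not a feature of T-Coffee, BioEdit or Jalview; the function name and the `max_gap_fraction` threshold are assumptions made for illustration.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.0, gap="-"):
    """Drop alignment columns whose fraction of gaps exceeds the threshold.

    alignment: dict mapping sequence name to aligned sequence (equal lengths).
    max_gap_fraction: 0.0 keeps only gap-free columns (an illustrative default).
    """
    rows = list(alignment.values())
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all sequences must be aligned to the same length")
    keep = [
        i
        for i, column in enumerate(zip(*rows))
        if column.count(gap) / len(column) <= max_gap_fraction
    ]
    return {
        name: "".join(seq[i] for i in keep) for name, seq in alignment.items()
    }


toy = {"s1": "MK-LV", "s2": "MKALV", "s3": "M-ALV"}
print(trim_gappy_columns(toy))
```

A threshold of 0.0 is the strictest choice; a looser threshold keeps more columns at the cost of retaining some gaps, so the value would need to be chosen for the data set at hand.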


The spectrum of available sequences is restricted to the current time window. Generally, NO ancestral sequences have been preserved. Therefore, evolutionary history needs to be reconstructed using the data that have survived and are available NOW.
[Figure: sequences A, B, C, D and E sampled in the current timeframe, shown against a time axis]

Question about the reliability of a tree
True tree: there is only one true tree.
Inferred tree: a tree that is obtained by using a certain set of data and a certain method of tree reconstruction.
How reliable is the inferred tree?

Bootstrapping - step 1: generating alignments based on the original
Bootstrap alignments 1 to n are generated from the original alignment by sampling its columns with replacement.

Bootstrapping - step 2: computing trees
A tree is computed from each bootstrap alignment: tree 1, tree 2, ..., tree n.
[Figure: trees of seq1, seq2 and seq3 computed from the bootstrap alignments]

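Bootstrapping - step 3: computing the consensus tree
A consensus tree is computed from the bootstrap trees, and each branch is labelled with the percentage of bootstrap trees in which it was found.
- 95-100%: considered very good
- 80-95%: good
- < 50%: the branch is not supported
[Figure: consensus tree of seq1, seq2 and seq3 with a branch labelled 80: this branching order was found in 80% of cases (bootstrap trees)]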
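The three bootstrap steps can be imitated in a small sketch. Instead of building a full tree and a consensus tree for every replicate, this simplified stand-in only asks how often a chosen pair of sequences comes out as mutual closest neighbours by p-distance. The function names, the seed, and the number of replicates are assumptions made for illustration.

```python
import random


def p_distance(seq1, seq2):
    """Proportion of positions at which two aligned sequences differ."""
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)


def bootstrap_alignment(alignment, rng):
    """Step 1: resample alignment columns with replacement."""
    names = list(alignment)
    length = len(alignment[names[0]])
    columns = [rng.randrange(length) for _ in range(length)]
    return {n: "".join(alignment[n][i] for i in columns) for n in names}


def pair_support(alignment, pair, replicates=1000, seed=0):
    """Steps 2 and 3, simplified: fraction of replicates grouping the pair."""
    rng = random.Random(seed)
    x, y = pair
    others = [n for n in alignment if n not in pair]
    hits = 0
    for _ in range(replicates):
        rep = bootstrap_alignment(alignment, rng)
        d_xy = p_distance(rep[x], rep[y])
        # The pair counts as recovered only if it is strictly closer to each
        # other than either member is to any other sequence.
        if all(
            d_xy < p_distance(rep[x], rep[o]) and d_xy < p_distance(rep[y], rep[o])
            for o in others
        ):
            hits += 1
    return hits / replicates


sequences = {"A": "ATGTGGCATT", "B": "ATGTGGCAGT", "C": "ATCTCGTAGT"}
print(pair_support(sequences, ("A", "B")))
```

In a real analysis the tree-building method of choice would be run on every bootstrap alignment, and support values would be read from the consensus of those trees.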

Second round: models of sequence evolution

Flow diagram

Assumptions
Phylogenetic reconstruction from a set of homologous sequences would be considerably easier than it is if two conditions had held during sequence evolution:
First, that all the sequences evolved at a constant mutation rate for all mutations at all times. If true, then the number of observed differences between any two aligned sequences would be directly proportional to the time elapsed since they diverged from their most recent common ancestor.
Second, that the sequences have only diverged to a moderate degree, such that no position has been subjected to more than one mutation. If true, then once the sequences have been accurately aligned, all the mutational events could be observed as non-identical aligned bases, and the mutation could be assumed to be from one base to the other.

Distance correction
The simplest way of estimating the evolutionary distance from an alignment is to count the fraction of nonidentical alignment positions, to obtain a measure called the p-distance.

The rate of accepted mutation is usually not the same for all types of base substitution
The models of evolution used for phylogenetic analysis define base mutation rates and substitution preferences for each position in the alignment. The simplest models assume all rates to be identical and time-invariant with no substitution preferences. More sophisticated models have been proposed that relax these assumptions.

Transitions and transversions
If a purine base is replaced by another purine on mutation, or a pyrimidine by a pyrimidine, the structure will suffer little if any distortion. Such mutations are called transitions, as opposed to transversions, in which purines become pyrimidines or vice versa.
[Figure: diagram of transitions and transversions between the four bases]
Note that there are twice as many ways of generating transversions as transitions.
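The distinction can be made concrete with a short sketch that classifies each difference between two aligned DNA sequences. The function names are illustrative, and ignoring any character other than A, C, G or T is an assumption of this example.

```python
from collections import Counter
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(a, b):
    """Return 'transition', 'transversion', or None if there is no substitution."""
    a, b = a.upper(), b.upper()
    bases = PURINES | PYRIMIDINES
    # Assumption: identical bases, gaps and ambiguity codes are ignored.
    if a == b or a not in bases or b not in bases:
        return None
    same_class = (a in PURINES) == (b in PURINES)
    return "transition" if same_class else "transversion"


def count_ts_tv(seq1, seq2):
    """Count (transitions, transversions) between two aligned sequences."""
    kinds = Counter(substitution_type(a, b) for a, b in zip(seq1, seq2))
    return kinds["transition"], kinds["transversion"]


# All 12 possible changes from one base to a different base.
possible = Counter(substitution_type(a, b) for a, b in permutations("ACGT", 2))
print(possible["transition"], possible["transversion"])
print(count_ts_tv("ATGTGGCATT", "ATCTCGTAGT"))
```

Enumerating the 12 possible base changes gives 4 transitions and 8 transversions, which is the two-to-one ratio noted above.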

Different codon positions have different mutation rates
[Figure: percentage GC at each codon position plotted against overall genome percentage GC content] The points on each line represent percentage GC values for each of 11 bacteria at the codon position indicated, plotted against the overall genome percentage GC content. While all three codon positions adapt to some extent to the compositional bias of the genome, the third position adapts most.

Distance and distance correction
p-distance: if an alignment of two sequences has L positions (gaps excluded), of which D differ, then the fractional alignment difference, usually called the p-distance, is defined as
$p = D / L$
The Poisson distance correction takes account of multiple mutations at the same site:
$d_P = -\ln(1 - p)$
where p is the p-distance and d_P is the corrected distance, called the Poisson distance. The assumption is made that each position in a given sequence mutates at the same rate (r).
The gamma distribution assumes that different positions can mutate at different rates (r).
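As a small numerical illustration of the Poisson correction, the sketch below applies the formula to a few p-distances. The function name is made up for this example, and the input values are simply the three p-distances from the earlier A, B, C example.

```python
import math


def poisson_distance(p):
    """Poisson-corrected distance d_P = -ln(1 - p) for a p-distance p."""
    if not 0 <= p < 1:
        raise ValueError("p-distance must satisfy 0 <= p < 1")
    return -math.log(1 - p)


for p in (0.1, 0.3, 0.4):
    print(p, round(poisson_distance(p), 4))
```

The corrected distance is always at least as large as the p-distance, and the gap widens as p grows, reflecting the increasing chance that some positions have changed more than once.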

Gamma distribution (Γ)
The gamma distribution, proposed by T. Uzzell and K. Corbin in 1971, takes account of mutation rate variation at different sequence positions. The corrected distance is called the gamma distance:
$d_\Gamma = \alpha \left[ (1 - p)^{-1/\alpha} - 1 \right]$
It is a more realistic model for the distribution of sites with different mutation rate constants. Such a distribution can be written with one parameter α, which determines the site variation. Values of α have been estimated from data. When p < 0.2, d_Γ is not significantly different from the p-distance.
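The gamma distance can be compared with the Poisson distance in a short sketch. The function names are illustrative, and the values of the shape parameter used here (2.0 and a very large number) are arbitrary choices for demonstration, not estimates from data.

```python
import math


def poisson_distance(p):
    """Poisson-corrected distance, assuming one rate for all positions."""
    return -math.log(1 - p)


def gamma_distance(p, alpha):
    """Gamma distance alpha * ((1 - p) ** (-1 / alpha) - 1)."""
    if not 0 <= p < 1:
        raise ValueError("p-distance must satisfy 0 <= p < 1")
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return alpha * ((1 - p) ** (-1 / alpha) - 1)


for p in (0.1, 0.2, 0.4):
    # alpha = 2.0 is an arbitrary illustrative value.
    print(p, round(poisson_distance(p), 4), round(gamma_distance(p, 2.0), 4))
```

As the shape parameter grows, rate variation among sites vanishes and the gamma distance approaches the Poisson distance; smaller values imply stronger rate variation and a larger correction.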