Inferring phylogeny. Constructing phylogenetic trees. Tõnu Margus. Bioinformatics MTAT


Contents
- What is phylogeny?
- How and why is it possible to infer it?
- Representing evolutionary relationships on trees
- What types of questions can we ask?
- How to infer phylogeny?

Phylogenetics
In biology, phylogenetics is the study of evolutionary relatedness among various groups of organisms (for example, species or populations), which is discovered through molecular sequencing data and morphological data matrices. A phylogenetic analysis is the scientific procedure that lets you reconstruct the evolutionary history of a group of organisms or sequences.

Time aspect - We are successful descendants of our predecessors

Some surprising cases of using the concept of phylogeny
Who is Probo? Warning: do not take it too seriously!



How and why do sequences evolve?
During cell division, chromosomes are copied and the copies are divided between the daughter cells. These copies are not perfect: they contain a few mistakes, also referred to as mutations. The number of mutations accumulated in a sequence has been proposed to be proportional to the time that separates it from the ancestral sequence.
Figure: cell division; chromosomes are shown in yellow (Wang Z et al 2010).
Example of a sequence accumulating changes step by step:
ATGTGGCATTAGCGGCTATTCGGC
ATGTGGCAGTAGCGGCTATTCGGC
ATGTGGCAGTAGCGTTTATTCGGC
ATGTGGCAGTAG--TTTATTCTGC
ATGTGGCAGTAG---TTATTCTGC
AGTTGGCATTAG---TTATTCTGC

How and why is it possible to infer phylogeny?
Because the accumulation of mutations in a sequence is proportional to time, closely related sequences are more similar, i.e. the differences between them are smaller. Differences can be expressed as the proportion of changed positions between two sequences, for example as changes per position:
A  ATGTGGCATT
B  ATGTGGCAGT
There is one difference per 10 nt, i.e. 0.1 changes per position.

Calculating distance
Pairwise distances are calculated for each pair of sequences and expressed, for example, as changes per position.
A  ATGTGGCATT
B  ATGTGGCAGT
C  ATCTCGTAGT
Pairwise distances:
     A     B     C
A    0
B    0.1   0
C    0.4   0.3   0
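
The distance table above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `p_distance` and the dictionary layout are illustrative choices, not part of the original slides, and the sequences are assumed to be already aligned and gap-free.

```python
from itertools import combinations

def p_distance(seq1, seq2):
    """Proportion of aligned positions at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return differences / len(seq1)

# The three example sequences from the slide.
sequences = {
    "A": "ATGTGGCATT",
    "B": "ATGTGGCAGT",
    "C": "ATCTCGTAGT",
}

# One distance per unordered pair of sequences.
distances = {
    (x, y): p_distance(sequences[x], sequences[y])
    for x, y in combinations(sequences, 2)
}

for (x, y), d in distances.items():
    print(f"{x}-{y}: {d:.1f}")
```

The printed values are 0.1 for A-B, 0.4 for A-C and 0.3 for B-C, matching the table.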

Drawing a tree: neighbour joining
First, the closest sequences are connected (here A and B, distance 0.1). Then the more distant ones are added (here C, at distance 0.4 from A and 0.3 from B).
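
With only three sequences the unrooted tree has a single internal node, and the three branch lengths follow directly from the pairwise distances. The sketch below uses the standard three-point formula, which is also how neighbour joining resolves its final three clusters. It is an illustration of this one step, not a full neighbour-joining implementation, and the function name is hypothetical.

```python
def three_point_branch_lengths(d_ab, d_ac, d_bc):
    """Branch lengths from the single internal node to leaves A, B and C."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c

# Distances from the example matrix: A-B 0.1, A-C 0.4, B-C 0.3.
a, b, c = three_point_branch_lengths(0.1, 0.4, 0.3)
print(f"A: {abs(a):.2f}  B: {abs(b):.2f}  C: {abs(c):.2f}")
```

For the example matrix the branches come out at roughly 0.1 for A, 0 for B and 0.3 for C. B gets a branch of zero length because the distances are exactly additive along the path A, B, C (0.1 + 0.3 = 0.4).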


Understanding a tree
A, B and C are LEAVES, also called OTUs (operational taxonomic units). A hypothetical ancestral sequence could occupy a NODE.
Figure labels: node, leaf or OTU, branch length. The tree is drawn from the pairwise distances A-B 0.1, B-C 0.3, A-C 0.4.

Rooted and unrooted trees
ROOTING means determining the ancestral node. The other leaves (OTUs) are rearranged according to it. Often, the root is represented by an additional line without an OTU.
Figure labels: ROOT; ancestral node for A, B and C.

Rooted and unrooted trees
For ROOTING we can use an OUTGROUP. An outgroup can be a sequence from an organism whose lineage branched off before the species we are trying to root diverged from each other. For example, for rooting human and horse we can use mouse as the outgroup.
Figure labels: A, B, C; human, horse; mouse as outgroup; ancestral node for human and horse.

Rooted trees
ROOTING defines the branching order by placing it on a timeline. More ancient events are close to the ROOT and more recent events are close to the LEAVES.
Figure labels: ROOT; ancestral node for A, B and C; time axis.

Unrooted trees
The same tree over A, B and C can be presented in different ways; one of the layouts shown is the one often used for unrooted trees.


Four major reasons why you may want to use phylogenetics
- Determining the closest relatives of the organism that you're interested in: for instance, if you're studying a new bacterium, you can sequence its ribosomal RNA and use it for constructing a phylogenetic tree.
- Discovering the function of a gene: you can use phylogenetic trees to be sure that the gene you're interested in is orthologous to another well-characterized gene in another species.
- Retracing the origin of a gene: from time to time, individual genes may jump from one species to another. Phylogenetic trees are a great way to reveal such events, which are called horizontal (or lateral) transfers.
- Characterizing a gene family: describing the structure of the gene/protein family; determining functionally homogeneous subsets (families/subfamilies); gene distribution in different organisms; gene duplications and LGT.

Determining closest relatives
Example: an unknown species from an environmental sample placed on a 16S tree.
Figure labels: Firmicutes, unknown species, 16S.
Source: https://www.cosmoss.org/physcome_project/wiki/contaminations/what_is_the_closest_sequenced_taxon

Resolving the evolutionary history of living organisms: constructing species trees

Discovering the function of a gene
Each group of proteins forms a distinct branch.
Figure: unrooted tree of translational GTPases (Margus et al 2007).

Example of lateral gene transfer (LGT)
Figure labels: alpha-proteobacteria, -proteobacteria.
The -proteobacteria gene becomes nested among the alpha-proteobacteria genes: the -proteobacteria have acquired an extra copy of EF-G laterally.

Studying gene duplications and gene families
- Gene duplications are very widespread.
- Gene duplication generates material (copies) for evolution.
- The copies start to change and evolve and, in some cases, genes with new functions arise from them.
- This makes it difficult to recognize genes/proteins that carry the same function in different organisms, mainly because it may be difficult to choose the proper gene among many homologs.
- Genes that share common ancestry in two different organisms and are the closest pair of proteins between them are called ORTHOLOGS.

Orthologs and paralogs
Duplicate genes in the same genome are called PARALOGS. As time passes, functional differences may appear between PARALOGS.

Orthologs and paralogs
ORTHOLOGS have the same function!

Example of a protein family tree
- A big tree is difficult to read.
- We need some marker sequences.
- We need good support.
- Similar OTUs (subfamily, orthologs...) can be compressed.
Figure labels for a compressed branch: number of sequences, good support, diversity.

Example of a protein family tree
Proteins in one compressed triangle are orthologs.
Figure: phylogenetic tree of elongation factor G (EF-G). Four subfamilies are clearly seen.

Associate function to orthologs
Figure labels for the subfamilies: unknown; translocation and ribosome recycling; translocation; recycling.

Phylogenetic profiling: map subfamilies to bacterial phylogeny
Figure: species tree based on 16S rRNA, with the EF-G subfamilies mapped onto it.
Figure labels: -proteobacteria, -proteobacteria, Spirochaetes; disappearing EF-G-I; appearing spdEFGs.

Going on to infer function?!
- We have well-defined sets of homologous proteins.
- Function is characterized for three of them.
- Several 3-D structures are available.
- Several functional domains, motifs and amino acids have been determined for EF-G I.
Can we find the positions that best characterize the as yet unknown function of this subfamily?

Contents
- What is phylogeny?
- How and why is it possible to infer phylogeny?
- Representing evolutionary relationships on trees
- What types of questions can we ask?
- How to infer phylogeny? Methods, data, estimating reliability

Methods for building trees

There are two very different ways to produce trees. The first one uses ClustalW; it's quick, hassle-free, and somehow very similar to (good) fast food. The second step list is for those of you out there who love to buy fresh vegetables on the market and make your own salad dressing: sticklers for detail, if you catch our drift. With these steps, you can control every ingredient that...

Distance methods
- A single statistic, the DISTANCE, is calculated for each pair of sequences.
- Based on the distances, the final tree is built using the neighbour joining (NJ) or UPGMA method.
- Better methods for calculating the DISTANCE use: an amino acid distance matrix (PAM or BLOSUM); correction of the distance for multiple changes at the same position; position classes (invariant, slowly evolving, medium and fast).
- Distance methods are fast. You can use much larger (~10 times) datasets than for ML or Bayesian methods.
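
To make the tree-building step concrete, here is a minimal UPGMA sketch applied to the three-sequence distance matrix used earlier (A-B 0.1, A-C 0.4, B-C 0.3). The function name, the `frozenset` keys and the nested-tuple output are illustrative assumptions. Branch lengths are not computed, only the branching order, and UPGMA itself assumes that all lineages evolve at the same rate.

```python
def upgma(dist):
    """Return the UPGMA branching order as nested tuples.

    dist maps frozenset({x, y}) to the distance between leaves x and y.
    """
    # Number of leaves in each current cluster (initially every leaf alone).
    sizes = {leaf: 1 for pair in dist for leaf in pair}
    d = dict(dist)
    while len(sizes) > 1:
        pair = min(d, key=d.get)        # closest pair of clusters
        x, y = sorted(pair, key=str)    # deterministic left/right order
        merged = (x, y)
        nx, ny = sizes.pop(x), sizes.pop(y)
        del d[pair]
        for other in sizes:
            dx = d.pop(frozenset((x, other)))
            dy = d.pop(frozenset((y, other)))
            # Average distance, weighted by the number of leaves merged.
            d[frozenset((merged, other))] = (nx * dx + ny * dy) / (nx + ny)
        sizes[merged] = nx + ny
    return next(iter(sizes))

dist = {
    frozenset(("A", "B")): 0.1,
    frozenset(("A", "C")): 0.4,
    frozenset(("B", "C")): 0.3,
}
tree = upgma(dist)
print(tree)  # (('A', 'B'), 'C')
```

A and B are joined first because they are the closest pair; C is then attached to the A-B cluster at the averaged distance of 0.35.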

Parsimony methods (MP, maximum parsimony)
- are good for inferring trees from DNA data
- many models for DNA evolution; models for coding regions and for each codon position separately
- no models for protein evolution
Likelihood methods (ML, maximum likelihood)
- take into account amino acid replacement patterns observed from sequences (PAM, BLOSUM, WAG... many different models)
- use for computing protein trees
All these methods are CPU-expensive: it might take days to compute a tree for 200-300 organisms.

Data: the sequences selected for input are crucial

About importance of preparing data

Choosing the right sequences for the right tree There is the assumption that the sequences you are comparing have a common ancestor

Using DNA or protein?
- DNA > 70% identical: you can align it directly. If it is a coding sequence, align it as protein-coding DNA.
- DNA < 70% identical: translate to protein and align the proteins; then use the protein alignment to align the DNA on a codon basis (http://coot.embl.de/pal2nal).
- If most synonymous sites are different, this indicates saturation, and it is safer to use proteins for phylogeny.
- If synonymous sites are not saturated, a distance measure from DNA is more accurate.

Choosing sequences to make either a gene tree or a species tree
Homologous genes are genes that derive from a common ancestor. They can have three types of relationships:
- Orthologs: separated only by speciation. When you need a species tree, use ONLY orthologs.
- Paralogs: separated by a duplication event (within a genome).
- Xenologs (from Greek xeno, foreigner): the result of LGT between two organisms, when the original copy of a gene is replaced with a foreign ortholog.

Pre-computed sets of orthologous proteins/genes
- COG, Clusters of Orthologous Groups, at NCBI, by Eugene Koonin: www.ncbi.nlm.nih.gov/cog
- Collections of homologous genes include HOGENOM and HOVERGEN, developed by the Pôle Bioinformatique Lyonnais: http://pbil.univ-lyon1.fr/
- RDP II, the Ribosomal Database Project: http://rdp.cme.msu.edu/ contains mainly bacterial, structurally aligned rRNA sequences. It includes a lot of paralogues, but is nevertheless widely used for inferring species trees of bacteria.

Create the perfect set

Preparing your multiple sequence alignment
- Computing the multiple sequence alignment (ClustalW, T-Coffee, MUSCLE)
- Making sure you have the right multiple sequence alignment
The quality of your multiple sequence alignment is the real limiting factor when you make a tree; there is no way you can make a good tree with a bad alignment.

To ensure that your multiple alignment is both accurate and suitable
1. Make sure there are as many gap-free columns as possible. Gaps cause trouble for most phylogeny reconstruction methods.
2. Remove the extremities of your multiple alignment. The N-terminus and the C-terminus tend to be poorly conserved and therefore poorly aligned. You can safely remove them.
3. Remove the gap-rich regions of your alignment. Internal, gap-rich regions in a multiple sequence alignment often correspond to loops.
4. Be sure to keep the most informative blocks. The ideal multiple alignment for building a tree would be a high-quality alignment of sequences with a low level of identity.
What does a good block look like? It's typically 20 to 30 amino acids long and contains a few conserved positions. Such blocks are ideal for producing high-quality trees.
You can use the T-Coffee server to evaluate your multiple sequence alignment and remove columns that are unlikely to be correctly aligned. The best way to edit your multiple alignment is to use BioEdit or Jalview.
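
Removing gap-rich columns can also be scripted. The following is a minimal sketch, not a replacement for a proper alignment editor: the function name, the `max_gap_fraction` parameter and the sequence labels `seq1` to `seq3` are hypothetical, and the sequences are three of the example DNA sequences shown earlier in these slides.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.0):
    """Keep only columns whose fraction of '-' is at most max_gap_fraction."""
    n_seqs = len(alignment)
    # zip(*...) turns the rows (sequences) into columns (alignment positions).
    columns = zip(*alignment.values())
    kept = [col for col in columns if col.count("-") / n_seqs <= max_gap_fraction]
    return {
        name: "".join(col[i] for col in kept)
        for i, name in enumerate(alignment)
    }

alignment = {
    "seq1": "ATGTGGCAGTAGCGTTTATTCGGC",
    "seq2": "ATGTGGCAGTAG--TTTATTCTGC",
    "seq3": "ATGTGGCAGTAG---TTATTCTGC",
}

trimmed = drop_gappy_columns(alignment)
for name, seq in trimmed.items():
    print(name, seq)
```

With the default threshold of 0.0 every column containing at least one gap is dropped, so the three columns with gaps disappear and 21 of the 24 positions remain. A looser threshold, such as 0.5, keeps columns in which at most half of the sequences have a gap.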

The spectrum of available sequences is restricted to the current time window. Generally, NO ancestral sequences have been preserved. Therefore, evolutionary history needs to be reconstructed using the data that have survived and are available NOW.
Figure labels: A, B, C, D, E in the current timeframe; time axis.

Question about the reliability of a tree
- True tree: there is only one true tree.
- Inferred tree: a tree that is obtained by using a certain set of data and a certain method of tree reconstruction.
How reliable is the inferred tree?

Bootstrapping - step 1: generate alignments based on the original
Bootstrap alignments 1 to n are generated from the original alignment by resampling its columns with replacement.
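
Step 1 can be sketched as follows, assuming the usual bootstrap procedure of sampling alignment columns with replacement until the replicate has the same length as the original. The function name, the seed and the number of replicates (100) are illustrative choices, and the three short sequences are the A, B, C example used earlier.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    # The same column indices are used for every sequence, so columns stay intact.
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}

alignment = {"A": "ATGTGGCATT", "B": "ATGTGGCAGT", "C": "ATCTCGTAGT"}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]

print(len(replicates), "bootstrap alignments of length", len(replicates[0]["A"]))
```

Each replicate contains only columns that occur in the original alignment; some columns appear several times and others not at all.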

Bootstrapping - step 2: computing trees
A tree over seq1, seq2 and seq3 is computed from each bootstrap alignment: tree 1, tree 2, and so on up to the n-th tree for bootstrap alignment n.

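Bootstrapping - step 3: computing the consensus tree
Each branch of the consensus tree over seq1, seq2 and seq3 is labelled with the percentage of bootstrap trees in which that branching order was found. For example, a value of 80 on a branch means that this branching order was found in 80% of cases (bootstrap trees).
- 95-100% is considered very good
- 80-95% is good
- < 50%: branches are not supported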
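
Counting how often a grouping appears among the bootstrap trees can be sketched like this. The replicate trees below are made-up illustrative data (8 of 10 trees group seq1 with seq2), and representing trees as nested tuples is an assumption of this sketch, not a standard file format.

```python
from collections import Counter

def clades(tree):
    """Return (leaf set, list of clades) for a tree given as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found += child_clades
    found.append(leaves)
    return leaves, found

# Hypothetical bootstrap trees: 8 support (seq1, seq2), 2 support (seq1, seq3).
replicate_trees = (
    [(("seq1", "seq2"), "seq3")] * 8
    + [(("seq1", "seq3"), "seq2")] * 2
)

counts = Counter()
for tree in replicate_trees:
    all_leaves, found = clades(tree)
    # Skip the trivial clade that contains every leaf.
    counts.update(c for c in found if c != all_leaves)

support = {clade: 100 * n / len(replicate_trees) for clade, n in counts.items()}
for clade, value in support.items():
    print(sorted(clade), f"{value:.0f}%")
```

Here the grouping of seq1 with seq2 receives 80% support and the alternative grouping of seq1 with seq3 receives 20%.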

Second round: models of sequence evolution

Flow diagram

Assumptions
Phylogenetic reconstruction from a set of homologous sequences would be considerably easier than it is if two conditions had held during sequence evolution:
- First, that all the sequences evolved at a constant mutation rate for all mutations at all times. If true, the number of observed differences between any two aligned sequences would be directly proportional to the time elapsed since they diverged from their most recent common ancestor.
- Second, that the sequences have only diverged to a moderate degree, such that no position has been subjected to more than one mutation. If true, then once the sequences have been accurately aligned, all the mutational events could be observed as non-identical aligned bases, and the mutation could be assumed to be from one base to the other.

Distance correction The simplest way of estimating the evolutionary distance from an alignment is to count the fraction of nonidentical alignment positions, to obtain a measure called the p-distance.

The rate of accepted mutation is usually not the same for all types of base substitution
The models of evolution used for phylogenetic analysis define base mutation rates and substitution preferences for each position in the alignment. The simplest models assume all rates to be identical and time-invariant with no substitution preferences. More sophisticated models have been proposed that relax these assumptions.

Transitions and transversions
If a purine base is replaced by another purine on mutation, or a pyrimidine by a pyrimidine, the structure will suffer little if any distortion. Such mutations are called transitions, as opposed to transversions, in which purines become pyrimidines or vice versa. Note that there are twice as many ways of generating transversions as transitions.
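
The two-to-one ratio can be checked by enumerating all possible base changes. This is a small illustrative sketch; the function and variable names are not from the slides.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(a, b):
    """Return 'transition' or 'transversion' for two different DNA bases."""
    if a == b:
        raise ValueError("bases are identical, no substitution")
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

bases = "ACGT"
# All 12 ordered changes from one base to a different base.
kinds = [classify_substitution(a, b) for a in bases for b in bases if a != b]

print(kinds.count("transition"), kinds.count("transversion"))  # 4 8
```

Of the 12 possible directed changes, 4 are transitions (A to G, G to A, C to T, T to C) and 8 are transversions.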

Different codon positions have different mutation rates
Figure: the points on each line represent percentage GC values for each of 11 bacteria at the codon position indicated, plotted against the overall genome percentage GC content. While all three codon positions adapt to some extent to the compositional bias of the genome, the third position adapts most.

Distance and distance correction
p-distance: if an alignment of two sequences has $L$ positions (gaps excluded), of which $D$ differ, then the fractional alignment difference, usually called the p-distance, is defined as
$p = D / L$
The Poisson distance correction takes account of multiple mutations at the same site:
$d_P = -\ln(1 - p)$
where $p$ is the p-distance and $d_P$ is the corrected distance, called the Poisson distance. The assumption is made that each position in a given sequence mutates at the same rate ($r$). The gamma distribution, in contrast, assumes that different positions can mutate at different rates.
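
A short sketch shows how the Poisson correction changes the three example p-distances used earlier in these slides (0.1, 0.3 and 0.4). The function name is illustrative.

```python
import math

def poisson_distance(p):
    """Poisson-corrected distance d_P = -ln(1 - p) for a p-distance p."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return -math.log(1 - p)

for p in (0.1, 0.3, 0.4):
    print(f"p = {p:.1f}  ->  d_P = {poisson_distance(p):.3f}")
```

The corrected values are about 0.105, 0.357 and 0.511. The correction is small for similar sequences and grows as the sequences diverge, because more of the changes are hidden by multiple mutations at the same site.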

Gamma distribution (Γ)
The gamma distance, proposed by T. Uzzell and K. Corbin in 1971, takes account of mutation rate variation at different sequence positions. With $p$ the p-distance, the corrected distance is
$d_\Gamma = \alpha \left[ (1 - p)^{-1/\alpha} - 1 \right]$
It is a more realistic model for the distribution of sites with different mutation rate constants. Such a distribution can be written with one parameter, $\alpha$, which determines the site variation. Values of $\alpha$ have been estimated from data. When $p < 0.2$, $d_\Gamma$ is not significantly different from the p-distance.
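
The gamma distance formula, alpha times ((1 - p) to the power -1/alpha, minus 1), can be evaluated in the same way. In this sketch the value alpha = 2.0 is an arbitrary illustrative choice, not an estimate from data, and the function name is hypothetical.

```python
def gamma_distance(p, alpha):
    """Gamma-corrected distance alpha * ((1 - p) ** (-1 / alpha) - 1)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return alpha * ((1 - p) ** (-1 / alpha) - 1)

alpha = 2.0  # hypothetical shape parameter, for illustration only
for p in (0.1, 0.3, 0.4):
    print(f"p = {p:.1f}  ->  d_gamma = {gamma_distance(p, alpha):.3f}")
```

Smaller values of alpha correspond to stronger rate variation among sites and give a larger correction. For very large alpha the gamma distance approaches the Poisson distance, -ln(1 - p).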