Session 5: Phylogenomics

B.- Phylogeny based orthology assignment

REMINDER: Gene tree reconstruction is divided into three steps: homology search, multiple sequence alignment, and model selection plus tree reconstruction. There are many ways and many programs to perform each step, so it is often up to the person building the tree to decide how to build it. Here are some of the programs you can use for the different steps:

Homology search: blast, hmm, and similar programs as seen before
Alignment reconstruction: mafft, muscle, t-coffee
Tree reconstruction: phyml, raxml, fasttree, iqtree

In these exercises we are going to start with pre-built trees, which you can find in the folder called exercise 4. To build this tree we performed a blast search using protein ASPCL_0083_04882 as the starting point, then filtered the blast results to keep only the hits with an e-value below 1e-05. The multiple sequence alignment was done with muscle, and the resulting fasta alignment was converted to phylip format using readal. Finally, RAxML was used to build the phylogenetic tree using the PROTGAMMALG model and 100 rapid bootstrap repetitions.

[Figure: species tree]

4.- Obtain orthologs based on gene trees manually:

Open the following link in a browser: http://phylo.io/ Now paste the contents of the file ASPCL_0083_04882.tree.txt into the tree data field of the page and visualize the tree. Using this tree as an example, we are going to discuss how orthology inference is done using two kinds of algorithms: reconciliation and species overlap (see slides of session 5).
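The e-value filter applied to the blast results before building the tree could look roughly like the sketch below. It assumes the search was saved in BLAST tabular format (-outfmt 6), where the e-value is the 11th tab-separated column; the handout does not say which output format was actually used, and the hit names and scores are invented for illustration.

```python
# Minimal sketch: keep BLAST hits whose e-value is below a threshold.
# Assumes BLAST tabular output (-outfmt 6); the hit lines are invented.

EVALUE_COLUMN = 10  # zero-based index of the e-value in -outfmt 6
SUBJECT_COLUMN = 1  # zero-based index of the subject (hit) identifier


def filter_hits(lines, max_evalue=1e-05):
    """Return the subject IDs of hits with an e-value below max_evalue."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if float(fields[EVALUE_COLUMN]) < max_evalue:
            kept.append(fields[SUBJECT_COLUMN])
    return kept


example = [
    "ASPCL_0083_04882\thit_A\t98.0\t300\t6\t0\t1\t300\t1\t300\t1e-120\t410",
    "ASPCL_0083_04882\thit_B\t35.2\t120\t70\t4\t10\t129\t15\t130\t2e-07\t52.1",
    "ASPCL_0083_04882\thit_C\t28.0\t60\t40\t2\t5\t64\t80\t139\t0.003\t30.5",
]

print(filter_hits(example))
```

The sequences of the retained hits would then go into the alignment step.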

4.1.- Based on what we have discussed for the tree ASPCL_0083_04882.tree.txt, now do the same for PENEN_0144_10558.tree.txt, and fill in the following table of orthology and paralogy relationships with respect to PENEN_0144_10558:

BRH Orthologs ASPCL_0077_01917 Paralogs - InParanoid None (When run with outgroup) - orthomcl Tree based (Species overlap) ASPCL_0077_01917 ASPFL_0017_13218 PENCH_0037_09950 PENMQ_0029_09206 PENCH_0037_09950 PENMQ_0029_09206 ASPCL_0077_01917 See image below None See image below Tree based (Reconciliation) PENCH_0037_09950 See image below See image below

Discuss the differences between the predictions and why they occur. Which method do you think is the most reliable?

In general, tree-based methods are more reliable because they are better able to resolve complex evolutionary scenarios. There is not much difference between the species overlap and the reconciliation algorithms, though the first one is simpler and makes fewer assumptions. The reconciliation method assumes that you know your species tree and that it is correct, which may not be the case.

[Figure: reconciliation result. Blue: speciation (orthologs); red: duplication (paralogs)]
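The species overlap rule used for the tree-based predictions can be illustrated on a toy tree. This is a minimal sketch using only the Python standard library: the rooted tree is written as nested tuples, the sequence names are invented, and the species code is assumed to be the part of the name before the first underscore (as in identifiers such as ASPCL_0077_01917). A node is called a duplication when its two child clades share at least one species, and a speciation otherwise.

```python
# Sketch of the species overlap rule on a rooted binary tree.
# The tree and the sequence names are invented for illustration.


def species(leaf):
    """Assumed naming convention: species code before the first underscore."""
    return leaf.split("_")[0]


def leaves(node):
    """Return all leaf names under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)


def classify(tree, seed):
    """Return (orthologs, paralogs) of seed under the species overlap rule."""
    if seed not in leaves(tree):
        raise ValueError(f"{seed} is not in the tree")
    orthologs, paralogs = [], []
    node = tree
    while not isinstance(node, str):
        left, right = node
        # Walk from the root towards the seed, one node at a time.
        if seed in leaves(left):
            inside, outside = left, right
        else:
            inside, outside = right, left
        shared = {species(x) for x in leaves(inside)} & {
            species(x) for x in leaves(outside)
        }
        # Shared species between the two sides means a duplication node.
        (paralogs if shared else orthologs).extend(leaves(outside))
        node = inside
    return sorted(orthologs), sorted(paralogs)


toy_tree = ((("SPA_1", "SPB_1"), "SPC_1"), (("SPA_2", "SPC_2"), "SPD_1"))
orthologs, paralogs = classify(toy_tree, "SPA_1")
print("orthologs:", orthologs)
print("paralogs:", paralogs)
```

In the toy tree the root is a duplication (both sides contain SPA and SPC), so everything on the far side of the root is paralogous to SPA_1, while SPB_1 and SPC_1 are separated from it only by speciation nodes. The result depends on how the tree is rooted.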

[Figure: species overlap result]

C.- Phylogenomics

What we have done above for a single tree can be done for multiple trees at the same time. To do that we can use tools such as ape ( http://ape-package.ird.fr/ ) or ete ( http://etetoolkit.org/ ). Both are programming libraries that allow the user to work with trees, manipulate them and extract information. In the folder called extra in session 5 you can see an example of a script that uses ETE to calculate orthology relationships based on the species overlap algorithm. While orthology prediction is often one of the objectives of working with collections of trees, other kinds of analyses can also be done, such as:

1.- Search for evolutionary events
2.- Test hypotheses based on topology
3.- Population genetics analyses

We are going to focus on the first two points.

5.1.- Go to the exercise5 folder. There you will find a file called tree_collection.txt, which contains a collection of 25 trees in the following format:

tree_number <tab> newick tree

In this exercise the species to which each sequence belongs is indicated at the end of the sequence code (e.g. sequence Phy00BX4MK_ACYPI belongs to species ACYPI).

Given the species tree below:

[Figure: species tree]

Visualize each tree in the tree_collection.txt file using phylo.io and match each tree to the conditions below. Each tree can match multiple conditions. Trees may need to be rooted properly.

a.- Search for trees that have species-specific duplications.
b.- If we assume that a duplication happened at the common ancestor of the species involved in the duplication, how many duplications happened at the node X? (Use the species overlap algorithm to infer duplications.)
c.- If we assume that species DIACI and BEMTA are very distantly related to species ACYPI, search for trees that could show a horizontal gene transfer event.
d.- Search for trees where RHOPD and ACYPI form a monophyletic clade.
e.- See the two topologies below; we are interested in knowing how many trees support each of them. (Note: When doing this kind of analysis we never consider paralogy relationships, we only count orthologs. So if a tree has only orthologs between two of the species and the third is a paralog, this tree is not considered. In addition, notice that the tree needs to have an outgroup and there cannot be other species in between our species of interest.)

[Figure: TOPOLOGY 1 and TOPOLOGY 2]

f.- Search for trees that are identical to the species tree.
g.- Search for trees that are congruent with the species tree.
h.- One of the main applications of this kind of methodology is to build a species tree. When building species trees, we need groups of orthologous genes that have a one-to-one relationship in all the species of interest. Search among your trees for the ones that would be suitable for such an analysis.
i.- Search for a tree that shows a protein family that was created de novo in ACYPI.

Tree 1: E2, H
Tree 2: H
Tree 3: B, E1, G
Tree 4: E1, H
Tree 5: A, B
Tree 6: E1, G
Tree 7: A, C
Tree 8: E2, H
Tree 9: A, B, E1
Tree 10: D
Tree 11: E1, G
Tree 12: E1, H
Tree 13: E1, G
Tree 14: E2
Tree 15: A
Tree 16: A, E1
Tree 17: A
Tree 18: E1, H
Tree 19: E1, G
Tree 20: E1, F, G, H
Tree 21: A, C
Tree 22: A, D
Tree 23: E1, F, G, H
Tree 24: E2
Tree 25
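Some of the conditions in 5.1 can also be checked with a short script instead of by eye. The sketch below uses only the Python standard library, with a deliberately small Newick reader that keeps the topology and discards branch lengths and support values. It covers conditions a, d and h only. The sequence names and the two example trees are invented, the list of four species is an assumption (the real species tree is only given as a figure), and the checks take the rooting of each tree as written.

```python
# Sketch only: a tiny Newick reader plus three checks from exercise 5.1.
# Trees and sequence names are invented; the species code is taken from
# the end of each name, as described for tree_collection.txt.


def parse_newick(text):
    """Parse a Newick string into nested lists of leaf names."""
    text = text.strip().rstrip(";")
    pos = 0

    def read_label():
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos].split(":")[0].strip()  # drop branch length

    def read_node():
        nonlocal pos
        if text[pos] != "(":
            return read_label()  # a leaf
        pos += 1  # consume "("
        children = [read_node()]
        while pos < len(text) and text[pos] == ",":
            pos += 1
            children.append(read_node())
        pos += 1  # consume ")"
        read_label()  # skip support value and branch length
        return children

    return read_node()


def read_collection(lines):
    """Read 'tree_number<tab>newick' lines into a dict of parsed trees."""
    trees = {}
    for line in lines:
        if line.strip():
            number, newick = line.rstrip("\n").split("\t", 1)
            trees[number] = parse_newick(newick)
    return trees


def leaf_names(node):
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaf_names(child)]


def internal_nodes(node):
    if isinstance(node, list):
        yield node
        for child in node:
            yield from internal_nodes(child)


def species_of(name):
    return name.rsplit("_", 1)[-1]


def species_specific_duplications(tree):
    """Condition a: clades whose leaves all belong to a single species."""
    found = []
    for node in internal_nodes(tree):
        codes = {species_of(n) for n in leaf_names(node)}
        if len(codes) == 1:
            found.append(codes.pop())
    return found


def is_monophyletic(tree, wanted):
    """Condition d: some clade contains exactly the wanted species."""
    return any(
        {species_of(n) for n in leaf_names(node)} == set(wanted)
        for node in internal_nodes(tree)
    )


def is_one_to_one(tree, all_species):
    """Condition h: every species of interest appears exactly once."""
    return sorted(species_of(n) for n in leaf_names(tree)) == sorted(all_species)


example = [
    "1\t((g1_ACYPI:0.1,g2_ACYPI:0.1):0.2,(g3_RHOPD:0.3,g4_DIACI:0.4):0.1);",
    "2\t(((g5_ACYPI,g6_RHOPD),g7_DIACI),g8_BEMTA);",
]
all_species = ["ACYPI", "RHOPD", "DIACI", "BEMTA"]  # assumed species set

for number, tree in read_collection(example).items():
    print(
        number,
        species_specific_duplications(tree),
        is_monophyletic(tree, ["ACYPI", "RHOPD"]),
        is_one_to_one(tree, all_species),
    )
```

For the real file, the lines would come from open("tree_collection.txt") rather than from the example list. A tree library such as ete or ape handles rooting and malformed input far more robustly than this reader.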

5.2.- Having done the analysis above, answer the following questions:

A.- What percentage of gene trees follow the species tree? Is this more or fewer trees than you were expecting?
2 out of 25 (8%).

B.- Given the results obtained in point e, which of the two topologies is the most represented in the trees? Is it the same one we find in the species tree?
Topology 1, and yes.

C.- How many of the trees that can be used to construct the species tree are actually congruent with it? Do you think this may affect the species tree reconstruction?
Two, and yes: if the incongruent trees always show the same topological change, they could affect the species tree reconstruction.

D.- There was one tree of a family that appeared de novo in ACYPI. How sure are we that the protein was really created in ACYPI? Are there any ways we could check it?
As seen in the species tree, the gene trees were built from a limited number of species. This means that these proteins are de novo for ACYPI only in the context of this set of species. If a new species were included, it might have homologs of this protein, and the protein would then no longer count as de novo. The best way we currently have to check that a protein has no other homologs is to run BLAST searches against large databases such as NCBI or UniProt.
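The counts behind answers A to C can be recomputed from the answer key of exercise 5.1. The sketch below simply transcribes that key into a dictionary. It reads E1 and E2 as support for topology 1 and topology 2, which is an interpretation of the labels rather than something the handout states, and it leaves tree 25 empty because no condition is listed for it.

```python
# Answer key from 5.1 (tree number -> matched conditions).
# Tree 25 has no condition listed in the text, so it is left empty here.
answer_key = {
    1: {"E2", "H"}, 2: {"H"}, 3: {"B", "E1", "G"}, 4: {"E1", "H"},
    5: {"A", "B"}, 6: {"E1", "G"}, 7: {"A", "C"}, 8: {"E2", "H"},
    9: {"A", "B", "E1"}, 10: {"D"}, 11: {"E1", "G"}, 12: {"E1", "H"},
    13: {"E1", "G"}, 14: {"E2"}, 15: {"A"}, 16: {"A", "E1"}, 17: {"A"},
    18: {"E1", "H"}, 19: {"E1", "G"}, 20: {"E1", "F", "G", "H"},
    21: {"A", "C"}, 22: {"A", "D"}, 23: {"E1", "F", "G", "H"},
    24: {"E2"}, 25: set(),
}


def trees_with(condition):
    """Return the sorted tree numbers that match a condition label."""
    return sorted(t for t, conds in answer_key.items() if condition in conds)


# A: trees identical to the species tree (condition F).
identical = trees_with("F")
percent_identical = 100 * len(identical) / len(answer_key)

# B: number of trees labelled E1 versus E2.
topology_counts = {label: len(trees_with(label)) for label in ("E1", "E2")}

# C: trees usable for species tree building (H) that are also congruent (G).
usable = trees_with("H")
usable_and_congruent = [t for t in usable if "G" in answer_key[t]]

print("identical:", identical, f"({percent_identical:.0f}%)")
print("topology support:", topology_counts)
print("usable:", usable)
print("usable and congruent:", usable_and_congruent)
```

Under these readings, trees 20 and 23 are the only ones identical to the species tree, E1 is matched three times as often as E2, and only two of the eight trees suitable for species tree reconstruction are congruent with the species tree.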