Algorithms in Bioinformatics

Similar documents
Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

Phylogenetic Tree Reconstruction

Multiple Sequence Alignment. Sequences

BINF6201/8201. Molecular phylogenetic methods


What is Phylogenetics

Lecture 11 Friday, October 21, 2011

EVOLUTIONARY DISTANCES

8/23/2014. Phylogeny and the Tree of Life

Microbial Diversity and Assessment (II) Spring, 2007 Guangyi Wang, Ph.D. POST103B

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Phylogenetic trees 07/10/13

Chapter 26 Phylogeny and the Tree of Life

C3020 Molecular Evolution. Exercises #3: Phylogenetics

Dr. Amira A. AL-Hosary

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Constructing Evolutionary/Phylogenetic Trees

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Phylogeny 9/8/2014. Evolutionary Relationships. Data Supporting Phylogeny. Chapter 26

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Examples of Phylogenetic Reconstruction

Phylogeny: building the tree of life

Evolutionary Tree Analysis. Overview

A (short) introduction to phylogenetics

Chapter 26 Phylogeny and the Tree of Life

Phylogeny Tree Algorithms

Phylogenetic inference

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Molecular phylogeny - Using molecular sequences to infer evolutionary relationships. Tore Samuelsson Feb 2016

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Introduction to Bioinformatics Introduction to Bioinformatics

Theory of Evolution Charles Darwin

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Phylogeny: traditional and Bayesian approaches

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships

Phylogeny and Evolution. Gina Cannarozzi ETH Zurich Institute of Computational Science

Macroevolution Part I: Phylogenies

Name: Class: Date: ID: A

Cladistics and Bioinformatics Questions 2013

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Phylogenetics: Building Phylogenetic Trees

Organizing Life s Diversity

Anatomy of a tree. clade is group of organisms with a shared ancestor. a monophyletic group shares a single common ancestor = tapirs-rhinos-horses

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

UoN, CAS, DBSC BIOL102 lecture notes by: Dr. Mustafa A. Mansi. The Phylogenetic Systematics (Phylogeny and Systematics)

Phylogeny and systematics. Why are these disciplines important in evolutionary biology and how are they related to each other?

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

In a way, organisms determine who belongs to their species by choosing with whom they will! MODERN EVOLUTIONARY CLASSIFICATION 18-2 MATE

Modern Evolutionary Classification. Section 18-2 pgs

Constructing Evolutionary/Phylogenetic Trees

Reconstructing the history of lineages

Phylogeny. November 7, 2017

Seuqence Analysis '17--lecture 10. Trees types of trees Newick notation UPGMA Fitch Margoliash Distance vs Parsimony

CSCI1950 Z Computa4onal Methods for Biology Lecture 5

PHYLOGENY & THE TREE OF LIFE

Biology. Slide 1 of 24. End Show. Copyright Pearson Prentice Hall

Classification and Phylogeny

Phylogenetic inference: from sequences to trees

CHAPTER 26 PHYLOGENY AND THE TREE OF LIFE Connecting Classification to Phylogeny

Chapter 19: Taxonomy, Systematics, and Phylogeny

Phylogenetic analyses. Kirsi Kostamo

DNA Phylogeny. Signals and Systems in Biology Kushal EE, IIT Delhi

Classification and Phylogeny

Fig. 26.7a. Biodiversity. 1. Course Outline Outcomes Instructors Text Grading. 2. Course Syllabus. Fig. 26.7b Table

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

How to read and make phylogenetic trees Zuzana Starostová

Phylogeny is the evolutionary history of a group of organisms. Based on the idea that organisms are related by evolution

Inferring Phylogenetic Trees. Distance Approaches. Representing distances. in rooted and unrooted trees. The distance approach to phylogenies

Inferring phylogeny. Constructing phylogenetic trees. Tõnu Margus. Bioinformatics MTAT

Phylogenetic Analysis

Phylogenetic Analysis

Phylogenetic Analysis

Phylogenetics. BIOL 7711 Computational Bioscience

CLASSIFICATION OF LIVING THINGS. Chapter 18

Plan: Evolutionary trees, characters. Perfect phylogeny Methods: NJ, parsimony, max likelihood, Quartet method

Classification, Phylogeny yand Evolutionary History

Theory of Evolution. Charles Darwin

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

Chapter 16: Reconstructing and Using Phylogenies

Phylogeny and the Tree of Life

Evolution Problem Drill 10: Human Evolution

C.DARWIN ( )

Algorithmic Methods Well-defined methodology Tree reconstruction those that are well-defined enough to be carried out by a computer. Felsenstein 2004,

METHODS FOR DETERMINING PHYLOGENY. In Chapter 11, we discovered that classifying organisms into groups was, and still is, a difficult task.

CS5263 Bioinformatics. Guest Lecture Part II Phylogenetics

Evolution and Our Heritage

Chapter 26: Phylogeny and the Tree of Life

Molecular Phylogenetics (part 1 of 2) Computational Biology Course João André Carriço

AP Biology Notes Outline Enduring Understanding 1.B. Big Idea 1: The process of evolution drives the diversity and unity of life.

Michael Yaffe Lecture #5 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D

Phylogeny and the Tree of Life

Introduction to characters and parsimony analysis

AP Biology. Cladistics

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Bioinformatics 1 -- lecture 9. Phylogenetic trees Distance-based tree building Parsimony

Chapters 25 and 26. Searching for Homology. Phylogeny

Molecular Evolution and Phylogenetic Tree Reconstruction

Transcription:

Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Distance Methods Character Methods Molecular Clock UPGMA 3.1

Phylogeny Terminology Phylogeny- the history of descent of a group of organisms from a common ancestor From Greek: phylon = tribe, race genesis = source Taxonomy- the science of classification of organisms. From Greek: taxis = to arrange, classify Clade: A set of species which includes all of the species derived from a single common ancestor Phylogeny: Inference Tool Phylogeny is the inference of evolutionary relationships. Traditionally, phylogeny relied on the comparison of morphological features between organisms. Today, molecular sequence data are also used for phylogenetic analyses. 3.2

Importance of Phylogeny How many genes are related to my favorite gene? Was the extinct quagga more like a zebra or a horse? Was Darwin correct when he stated that humans are the closest to chimps and gorillas? How related are whales and dolphins to cows? Where and when did HIV originate? What is the history of life on earth? Picture of Last Quagga Died in Amsterdam zoo in 1883. 3.3

Phylogenetic Analysis A phylogenetic analysis of a family of related nucleic acid or protein sequences is a determination of how the family might have been derived during evolution. Two sequences that are very much alike will be located as neighboring outside branches (leaves) and will be joined by a common branch beneath them. Aim of Phylogeneitc Analysis The evolutionary relationships among the sequences are depicted by placing the sequences as outer branches on a tree. The branching relationships on the inner part of the tree then reflect the degree to which different sequences are related. The aim of phylogenetic analysis is to discover all of the branching relationships in the tree and the branch lengths. 3.4

Phylogenetic tree: diagram showing evolutionary paths of species/genes. Why do we construct phylogenetic trees? To understand the path (lineage) of various species. To understand how various functions evolved. To perform multiple alignment. Additional Uses of To study the evolutionary relationships of different species and to understand how species relate to one another. To predict the unknown gene s function according to its phylogenetic relationship to other genes. 3.5

Globin Family Evolution 600-800 my Fossil Dating 385 my 300 my 260 my 175 my 120 my 45 my 35 my 50 my myoglobin -globins -globins Major Divisions of Living Things by C. Woese on the basis of 15S RNA sequences 3.6

Time Line For Life Earth Forms Dinosaurs First Micro-organism 4.5 4.0 3.5 Arthropods 3.0 2.5 2.0 1.5 1.0 0.5 age of microbes humans 3.5 million Billions of years ago A Possible Evolution Tree For Humans H. sapiens Homo habilis Homo ergaster H. heidelbergensis H. neanderthalensis Australopithecus afarensis H. erectus A. robustus Ardipithecus ramidus A. africanus A. boisei 5 4 3 2 Millions of Years Ago 1 0 3.7

Terminology Leaves represent objects (genes, species) being compared Taxon refers to the leaves when they represent species and broader classifications of organisms. Internal nodes are hypothetical ancestral units In a rooted tree, the path from root to a node represents an evolutionary path. An unrooted tree specifies relationships among objects, but not evolutionary paths. Rooted and Unrooted Trees All objects in a rooted tree have a single common ancestor. In general, rooted trees require more information to construct than unrooted ones. Objects are leaves in an unrooted tree and internal nodes are common ancestors. In general, given any two leaves, we cannot tell if they have a common ancestor. 3.8

Unrooted and Rooted Trees (extinct ancestor species) Have yellow birds evolved from brown birds, or brown birds from yellow birds? Last common ancestor of all yellow and brown birds Tree construction could be based on: morphological features, or sequence data Unrooted: cannot tell Rooted: can tell Understanding Bioinformatics by M. Zvelebil and J. Baum Cladogram: Branch length carry no meaning Additive Tree: Branch length measure evolutionary divergence Ultrametric Tree: Additive tree and constant rate of mutation along branches Additive Tree: with outgroup Understanding Bioinformatics by M. Zvelebil and J. Baum 3.9

Outgroup Properties The outgroup should be a sequence known to be less closely related to the rest of the sequences than they are to each other. It should ideally be as closely related as possible to the rest of the sequences while still satisfying the first condition. The root must be somewhere between the outgroup and the rest (either on the node or in a branch). Rooting a Tree The best place for having the root of the phylogenetic tree 3.10

Gene Duplication and Speciation Understanding Bioinformatics by M. Zvelebil and J. Baum Building of a Phylogenetic Tree Sequence Selection: Identify a DNA or protein sequence. Obtain related sequences by performing a database search. Perform multiple alignment. Build a phylogenetic tree. Check the robustness of the tree. 3.11

Distance and Character Based Trees The construction of the tree is: distance-based: measures the distance between species/genes (eg. mutations, time, distance metric). First calculate the overall distance between all pairs of sequences, then construct a tree based on the distances. character-based: morphological features (eg. number of legs), DNA/protein sequences. Use the individual substitutions among sequences to determine the most likely ancestral relationships. The tree is constructed based on the gain or loss of traits. Methods for Constructing Distance-Based Methods: Unweighted Pair Group Method Using Arithmetic Averages (UPGMA) Fitch Margoliash (FM) Neighbor Joining (NJ) Character-Based Methods: Maximum Parsimony (MP) Maximum Likelihood (ML) 3.12

Other Methods for Constructing Trees Clustering Methods Follow a set of steps (an algorithm) and arrive at a tree. Use distance data. Produce a single tree. Do not use objective functions to compare the current tree to other trees. Optimality Criterion Use objective functions to compare different trees. First define an optimality criterion, i.e. minimum branch length, and then find the tree with the best value for the objective function. Clustering Algorithms The strength of clusterting algorithms is: Their speed Their robustenss Their ability to reconstruct trees for very large numbers (thousands) of sequences. Most clustering methods reconstruct phylogenetic trees for a set of sequences on the basis of their pairwise evolutionary distances. 3.13

Classification of Tree Building Methods Tree Building Methods Clustering Algorithm Optimality Criterion Type of Data Distance- Based Character- Based UPGMA Neighbor Joining Fitch-Margoliash Maximum Parsimony Maximum Likelihood Distance-Based Method Given: an n n matrix M, where M(i,j) is the distance between objects i and j Build an edge-weighted tree such that the distances between leaves i and j correspond to M(i,j) A B C A 0 B 8 0 C 8 3 0 D 5 8 8 E 3 8 8 4 3 2 1 D E 0 5 0 A E D B C 3.14

UPGMA UPGMA is a sequential clustering algorithm. It works by clustering the sequences, at each stage amalgamating two operational taxonomic units (OTUs) and at the same time creating a new node in the tree. The edge lengths are determined by the difference in the heights of the nodes at the top and bottom of an edge. The Molecular Clock UPGMA assumes that: the gene substitution rate is constant, in other words: divergence of sequences is assumed to occur at the same rate at all points in the tree. Known as the Molecular Clock. the distance is linear with evolutionary time. 3.15

UPGMA Algorithm The algorithm iteratively picks two clusters and merges them, thus creating a new node in the tree. The average distance between two clusters is determined by: d ij C i 1 C j d p C, q i C j pq, where C i and C are clusters. j The UPGMA Algorithm Initialization Assign each sequence i to its own cluster Ci, Define one leaf of T for each sequence; place at height zero. Iteration while more than two clusters, do Determine the two clusters Ci, Cj for which dij is minimal. Define a new cluster Ck = Ci Cj; compute dkl for all l. Define a node k with children i and j; place it at height dij/2. Replace clusters Ci and Cj with Ck. Termination Join last two clusters, Ci and Cj; place the root at height dij/2. 3.16

UPGMA: Example (1 st Iteration) Sequences A and D are the closest and are combined to create a new cluster V of height ½ in T. Understanding Bioinformatics by M. Zvelebil and J. Baum UPGMA: Example (2 nd Iteration) The table of distances is updated to reflect the average distances from V to the other sequences. V and E are the closest and are combined to create a new cluster W of height 1 in T. Understanding Bioinformatics by M. Zvelebil and J. Baum 3.17

UPGMA: Example (3 rd Iteration) After updating the table of distances, B and F are the closest sequences and are combined to create a new cluster X of height 2 in T. Understanding Bioinformatics by M. Zvelebil and J. Baum UPGMA: Example (4 th Iteration) Once more the table is updated. W and X are the closest sequences and are combined to create a new cluster Y of height 3 in T. Understanding Bioinformatics by M. Zvelebil and J. Baum 3.18

UPGMA: Example (Termination) The remaining 2 sequences, C and Y of distance 8 are combined to create a new cluster Z of height 4 in T. Understanding Bioinformatics by M. Zvelebil and J. Baum Software Tools PHYLIP Phylogeny Inference Package. http://evolution.genetics.washington.edu/phylip.html Free. Developed by Dr. Joe Felsenstein from the Department of Genome Sciences at the University of Washington. Source code is written in ANSI C. Executables are available for different platforms: Windows, UNIX and Macintosh. PAUP* Phylogenetic Analysis Using Parsimony. http://www.lms.si.edu/paup/about.html Developed by Dr. David Swofford of the Laboratory of Molecular Systematics, National Museum of Natural History The most sophisticated parsimony program. 3.19