Computational Genomics

Similar documents
Molecular Evolution and Comparative Genomics


Phylogeny Tree Algorithms

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Dr. Amira A. AL-Hosary

BINF6201/8201. Molecular phylogenetic methods

Theory of Evolution. Charles Darwin

Algorithms in Bioinformatics

Phylogenetic Tree Reconstruction

Constructing Evolutionary/Phylogenetic Trees

Evolutionary Models. Evolutionary Models

Michael Yaffe Lecture #5 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D

Bioinformatics 1 -- lecture 9. Phylogenetic trees Distance-based tree building Parsimony

What is Phylogenetics

Inferring Molecular Phylogeny

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline

Multiple Sequence Alignment. Sequences

Phylogeny and systematics. Why are these disciplines important in evolutionary biology and how are they related to each other?

Evolutionary Tree Analysis. Overview

Phylogeny and Evolution. Gina Cannarozzi ETH Zurich Institute of Computational Science

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Constructing Evolutionary/Phylogenetic Trees

3/1/17. Content. TWINSCAN model. Example. TWINSCAN algorithm. HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline

HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

Phylogenetic inference

Phylogenetics. BIOL 7711 Computational Bioscience

Phylogeny: building the tree of life

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

EVOLUTIONARY DISTANCES

Reading for Lecture 13 Release v10

Advanced Machine Learning

8/23/2014. Phylogeny and the Tree of Life

Inferring phylogeny. Constructing phylogenetic trees. Tõnu Margus. Bioinformatics MTAT

STA 414/2104: Machine Learning

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

进化树构建方法的概率方法 第 4 章 : 进化树构建的概率方法 问题介绍. 部分 lid 修改自 i i f l 的 ih l i

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

CS5238 Combinatorial methods in bioinformatics 2003/2004 Semester 1. Lecture 8: Phylogenetic Tree Reconstruction: Distance Based - October 10, 2003

Anatomy of a tree. clade is group of organisms with a shared ancestor. a monophyletic group shares a single common ancestor = tapirs-rhinos-horses

A (short) introduction to phylogenetics

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Dynamic Approaches: The Hidden Markov Model

Bayesian Machine Learning - Lecture 7

Estimating Evolutionary Trees. Phylogenetic Methods

Introduction to Hidden Markov Models for Gene Prediction ECE-S690

Theory of Evolution Charles Darwin

Phylogeny. November 7, 2017

CS5263 Bioinformatics. Guest Lecture Part II Phylogenetics

Lecture 4. Models of DNA and protein change. Likelihood methods

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Seuqence Analysis '17--lecture 10. Trees types of trees Newick notation UPGMA Fitch Margoliash Distance vs Parsimony

Phylogene)cs. IMBB 2016 BecA- ILRI Hub, Nairobi May 9 20, Joyce Nzioki

Assessing an Unknown Evolutionary Process: Effect of Increasing Site- Specific Knowledge Through Taxon Addition

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Phylogenetic trees 07/10/13

Week 7: Bayesian inference, Testing trees, Bootstraps

Inferring Phylogenetic Trees. Distance Approaches. Representing distances. in rooted and unrooted trees. The distance approach to phylogenies

C.DARWIN ( )

7. Tests for selection

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

O 3 O 4 O 5. q 3. q 4. Transition

Inferring Speciation Times under an Episodic Molecular Clock

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2009 University of California, Berkeley

Understanding relationship between homologous sequences

STA 4273H: Statistical Machine Learning

Quantifying sequence similarity

C3020 Molecular Evolution. Exercises #3: Phylogenetics

Phylogenetics: Building Phylogenetic Trees

An Introduction to Bayesian Inference of Phylogeny

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Lecture Notes: Markov chains

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Molecular Evolution and Phylogenetic Tree Reconstruction

Today s Lecture: HMMs

Bioinformatics 2 - Lecture 4

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Taming the Beast Workshop

Molecular phylogeny - Using molecular sequences to infer evolutionary relationships. Tore Samuelsson Feb 2016

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence

Week 8: Testing trees, Bootstraps, jackknifes, gene frequencies

Non-independence in Statistical Tests for Discrete Cross-species Data

Hidden Markov Models

Introduction to characters and parsimony analysis

Phylogeny: traditional and Bayesian approaches

10-701/15-781, Machine Learning: Homework 4

Markov Chains and Hidden Markov Models. = stochastic, generative models

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Using algebraic geometry for phylogenetic reconstruction

Transcription:

omputational Genomics Molecular Evolution: Phylogenetic trees Eric Xing Lecture, March, 2007 Reading: DTW boo, hap 2 DEKM boo, hap 7, Phylogeny In, Ernst Haecel coined the word phylogeny and presented phylogenetic trees for most nown groups of living organisms. Ernst Haecel (3-99)

Phylogenetic Inference Given a multiple alignment, how do we construct the tree? - GTTGTGTTGT B TTGTTGTTGT TTGTGGT D - TTGGTTTTT E GTGGTTTGT F - TTTTGG? What Is a Tree? tree is a mathematical structure which represents a model of an actual evolutionary history of a group of sequences or organisms. In other words, it is an evolutionary hypothesis. tree consists of nodes connected by branches. One unique internal node is the root of the tree: the ancestor of all the sequences. Internal nodes represent hypothetical ancestors Terminal nodes represent sequences or organisms for which we have data. Each is typically called a Operational Taxonomical Unit or OTU. 2

Types of Trees Rooted vs. Unrooted Branches Nodes Rooted Interior M 2 M Total 2M 2 2M Unrooted Interior M 3 M 2 Total 2M 3 2M 2 M is the number of OTU s The number of rooted and unrooted trees: Number of OTU s Possible Number of Rooted trees 2 Unrooted trees 3 3 5 3 5 05 5 95 05 7 0395 95 3535 0395 9 2027025 3535 0 35925 2027025 3

More different inds of trees Different inds of trees can be used to depict different aspects of evolutionary history. ladogram: simply shows relative order of common ancestry 2 3 3 7 5 3 2. dditive trees: a cladogram with branch lengths, also called phylograms and metric trees 3 3 2 3. Ultrametric trees: (dendograms) special ind of additive tree in which the tips of the trees are all equidistant from the root Phylogeny methods Basic principles: Degree of sequence difference is proportional to length of independent sequence evolution Only use positions where alignment is pretty certain avoid areas with (too many) gaps Major methods:

UPGM onstruction of a distance tree using clustering with the Unweighted Pair Group Method with rithmatic Mean (UPGM) First, construct a distance matrix: - GTTGTGTTGT B TTGTTGTTGT TTGTGGT D - TTGGTTTTT E GTGGTTTGT F - TTTTGG B D E F 2 B D E From http://www.icp.ucl.ac.be/~opperd/private/upgma.html Ultrametric Trees 3 2 3 a b c Metric distances (for additive trees) must obey rules: Non-negativity: d(a,b) 0 Distinctness: d(a,b) = 0 if and only if a = b Symmetry: d(a,b) = d(b,a) Triangle Inequality: d(a,c) d(a,b) + d(b,c) Ultrametric must obey one additional rule: Three point condition: d(a,b) max( d(a,c), d(b,c) ) 0. 2 a c b 5

UPGM First round B D E F 2 B D E hoose the most similar pair, cluster them together and calculate the new distance matrix. dist(,b), = (dist + distb) / 2 = dist(,b),d = (distd + distbd) / 2 = dist(,b),e = (diste + distbe) / 2 = dist(,b),f = (distf + distbf) / 2 = D E F B, D E UPGM Second round D E F B, D E Third round,d E F B,,D E

UPGM Forth round,b E,D E,D F Fifth round DE,B F Note the this method identifies the root of the tree. UPGM assumes a molecular cloc The UPGM clustering method is very sensitive to unequal evolutionary rates (assumes that the evolutionary rate is the same for all branches). lustering wors only if the data are ultrametric Ultrametric distances are defined by the satisfaction of the 'threepoint condition'. The three-point condition: B B For any three taxa, the two greatest distances are equal. 7

UPGM fails when rates of evolution are not constant B 5 B 7 D E D 7 0 7 E 9 5 F 9 From http://www.icp.ucl.ac.be/~opperd/private/upgma.html (Neighbor joining will get the right tree in this case.) Phylogeny based upon the molecular cloc Evidence for a human mitochondrial origin in frica: frican sequence diversity is twice as large as that of non-frican Gyllensten and colleagues estimate that the divergence of fricans and non-fricans occurred 52,000 to 2,000 years ago. Ingman, M., Kaessmann, H., Pääbo, S. & Gyllensten, U. (2000) Nature 0: 70-73.

pair of homologous bases ancestor? T years Q h Q m Typically, the ancestor is unnown. How does sequence variation arise? Mutation: (a) Inherent: DN replication errors are not always corrected. (b) External: exposure to chemicals and radiation. Selection: Deleterious mutations are removed quicly. Neutral and rarely, advantageous mutations, are tolerated and stic around. Fixation: It taes time for a new variant to be established (having a stable frequency) in a population. 9

Modeling DN base substitution Strictly speaing, only applicable to regions undergoing little selection. Standard assumptions (sometimes weaened). Site independence. 2. Site homogeneity. 3. Marovian: given current base, future substitutions independent of past.. Temporal homogeneity: stationary Marov chain. More assumptions Q h = s h Q and Q m = s m Q, for some positive s h, s m, and a rate matrix Q. The ancestor is sampled from the stationary distribution π of Q. Q is reversible: for a, b, t 0 π(a)p(t,a,b) = P(t,b,a)π(b), (detailed balance). 0

The stationary distribution probability distribution π on {,,G,T} is a stationary distribution of the Marov chain with transition probability matrix P = P(i,j), if for all j, i π(i) P(i,j) = π(j). Exercise. Given any initial distribution, the distribution at time t of a chain with transition matrix P converges to π as t. Thus, π is also called an equilibrium distribution. Exercise. For the Jues-antor and Kimura models, the uniform distribution is stationary. (Hint: diagonalize their infinitesimal rate matrices.) We often assume that the ancestor sequence is i.i.d π. New picture from a statistical evolution model ancestor ~ π s h T PMs Q Q s m T PMs

The Jues-antor adjustment ssume that the common ancestor has, G, or T with probability /. common ancestor Then the chance of the nt differing p = 3/ ( e αt ) = 3/ ( e /3 ), since =2 3αt 3/ orang human When =.0, described as PM t Joint probability of and Under the model in the previous slides, the joint probability is p(,) = = a a π ( a ) p( s T, Q, a ) p( s T, Q, a ) π () p( a s T, Q, ) p( s T, Q, a ) = π () p( s T + s T, Q, ) = ( t,,) h h h m m m where t = s h T+ s m T is the (evolutionary) distance between and. Note that s h, s m and T are not identifiable. The matrix F(t) is symmetric. It is equally valid to view as the ancestor of or vice versa. 2

Estimating the evolutionary distance between two sequences Suppose two aligned protein sequences a a n and b b n are separated by t PMs. Under a reversible substitution model that is IID across sites, the lielihood of t is L( t ) = p( a a, b Kb model) = = a, b K n ( t, a, b ) ( t, a, b) ( a, b ) where c(a,b) = # { : a = a, b = b}. n Maximizing this quantity gives the maximum lielihood estimate of t. This generalizes the distance correction with Jues-antor. Phylogeny The shaded nodes represent the observed nucleotides at a given site for a set of organisms The unshaded nodes represent putative ancestral nucleotides Transitions between nodes capture the dynamic of evolution 3

Lielihood methods tree, with branch lengths, and the data at a single site. t 7 t 5 t t t 2 t t 3 t GTGGGT GTGGTGT TGTGGTG TGTGGTGG TGTGGTGG Since the sites evolve independently on the same tree, m ( i ) L = P ( D T ) = i = P ( D T ) Lielihood at one site on a tree We can compute this by summing over all assignments of states x, y, z and w to the interior nodes: ( i ) P ( D T ) P (,,,,, x, y, z, w T ) = x y z w x Due to the Marov property of the tree, we can factorize the complete lielihood according to the tree topology: t t z y t 7 t 2 t w t 5 t 3 t P(,,,,, x, y, z, w T ) = P ( x ) P ( y x, t) P ( y, t ) P ( y, t2) P z x, t ) P ( y, ) ( t3 P ( w z, t7 ) P ( y, t) P ( y, t5 Summing this up, there are 25 terms in this case! )

Getting a recursive algorithm when we move the summation signs as far right as possible: t 5 ( i) P ( D T ) P(,,,,, x, y, z, w T ) = ( = x y z w x y P ( x) z P ( y x, t ) P( y, t ) P( y, t ) 2 P z x, t ) P( z, t ) ( 3 P ( w z, t7 ) P( w, t) P( w, t5) ( w )) x t t z y t 7 t 3 t 2 t w t Felsenstein s Pruning lgorithm To calculate P(x, x 2,, x N T, t) Initialization: Set = 2N Recursion: ompute P(L a) for all a Σ If is a leaf node: Set P(L a) = (a = x ) If is not a leaf node:. ompute P(L i b), P(L j c) for all b and c, for daughter nodes i, j Termination: a b c i j 2. Set P(L a) = Σ b, c P(b a, t i )P(L i b) P(c a, t j ) P(L j c) Lielihood at this column = P(x, x 2,, x N T, t) = Σ a P(L 2N- a)p(a) a This algorithm can easily handle mbiguity and error in the sequences (how?) 5

Finding the ML tree So far I have just taled about the computation of the lielihood for one tree with branch lengths nown. To find a ML tree, we must search the space of tree topologies, and for each one examined, we need to optimize the branch lengths to maximize the lielihood. Bayesian phylogeny methods Bayesian inference has been applied to inferring phylogenies (Rannala and Yang, 99;Mau and Larget, 997; Li, Pearl and Doss, 2000). ll use a prior distribution on trees. The prior has enough influence on the result that its reasonableness should be a major concern. In particular, the depth of the tree may be seriously affected by the distribution of depths in the prior. ll use Marov hain Monte arlo (MM) methods. They sample from the posterior distribution. When these methods mae sense they not only get you a point estimate of the phylogeny, they get you a distribution of possible phylogenies.

Modeling rate variation among sites G G... T G G model of variation in evolutionary rates among sites The basic idea is that the rate at each site is drawn independently from a distribution of rates. The most widely used choice is the Gamma distribution, which has density function: α α λ r e f ( r ) = Γ( α) Gamma distributions (α,θ) λr α r e = Γ( α) θ r / θ α 7

Unrealistic aspects of the model: There is no reason, aside from mathematical convenience, to assume that the Gamma is the right distribution. common variation is to assume there is a separate probability f 0 of having rate 0. Rates at different sites appear to be correlated, which this model does not allow. Rates are not constant throughout evolution, they change with time. Rates varying among sites If L (i) (r i ) is the lielihood of the tree for site i given that the rate of evolution at site i is r i, we can integrate this over a gamma density: so that the overall lielihood is L ( i ) = ( i ) f ( r i ; α) L ( ri ) dri 0 L = m i = 0 f ( r ; α) L i ( i ) ( r i ) dri Unfortunately these integrals cannot be evaluated for trees with more than a few tips as the quantities L (i) (r i ) becomes complicated.

Modeling rate variation among sites There are a finite number of rates (denote rate i as r i ). There are probabilities p i of a site having rate i. process not visible to us ("hidden") assigns rates to sites. The probability of our seeing some data are to be obtained by summing over all possible combinations of rates, weighting appropriately by their probabilities of occurrence. Rocall the HMM Y Y 2 Y 3... Y T i i i i i e e 2 e 3 e e 5 e e 7 e i i i i G T G G G T T... X X 2 X 3 X T The shaded nodes represent the observed nucleotides at particular sites of an organism's genome For discrete Y i, widely used in computational biology to represent segments of sequences gene finders and motif finders profile models of protein domains models of secondary structure 9

Definition (of HMM) Observation space lphabetic set: Euclidean space: Index set of hidden states x I = {, 2, L,M} Transition probabilities between any two states j i p y = y = ) = a, or p = d R Start probabilities p y ) ~ Multinomial( π, π, K, π ). Emission probabilities associated with each state or in general: i p( x y = ) ~ f θ, i I { c, c, L, } ( t t i,j i ( yt yt = ) ~ Multinomial ai,, ai, 2, K, ai, M, i I p ( 2 M 2 c K y ( ). i y = ) ~ Multinomial( b, b, K, b ), i. ( xt t i, i, 2 i, K I t t ( ). i y 2 y 3 x 2 x 3...... Graphical model K 2 State automata y T x T Hidden Marov Phylogeny... G G G T G Replacing the standard emission model with a tree process not visible to us (.hidden") assigns rates to sites. It is a Marov process woring along the sequence. For example it might have transition probability Prob (j i) of changing to rate j in the next site, given that it is at rate i in this site. These are the most widely used models allowing rate variation to be correlated along the sequence. 20

The Forward lgorithm α t We can compute for all, t, using dynamic programming! Initialization: α = P ( x y = ) π α = P ( x, y = P ( x y = ) = P ( x y = ) P ( y = ) = ) π Iteration: α P ( x t y t = ) i t = P ( xt y t = ) α i t ai, Termination: P (x) = α T The Bacward lgorithm β t We can compute for all, t, using dynamic programming! Initialization: β T =, Iteration: β = ) β t i i a i, ip ( x t + y t + = ) t + Termination: P (x) α β = 2

Hidden Marov Phylogeny... G G G T G this yields a gene finder that exploits evolutionary constraints omparison of comparative genomic gene-finding and isolated gene-finding Based on sequence data from 2-5 primate species, Mculiffe et al (2003) obtained sensitivity of 00%, with a specificity of 9%. Genscan (state-of-the-art gene finder) yield a sensitivity of 5%, with a specificity of 3%. 00 90 0 70 0 50 0 30 20 0 0 sensitivity specificity Phylogenetic HMM Genscan (HMM) 22

Open questions (philosophical) Observation: Finding a good phylogeny will help in finding the genes. Finding the genes will help to find biologically meaningful phylogenetic trees Which came first, the chicen or the egg? Open questions (technical) How to learn a phylogeny (topology and transition prob.)? Should different site use the same phylogeny? Functionspecific phylogeny? Other evolutionary events: duplication, rearrangement, lateral transfer, etc. 23

cnowledgments Terry Speed: for some of the slides modified from his lectures at U Bereley Phil Green and Joe Felsenstein: for some of the slides modified from his lectures at Univ. of Washington Itai Yanai: for some of the slides modified from his lectures at Weizmann Institute of Science 2