Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

1 Maximum Likelihood This presentation is based almost entirely on Peter G. Foster's "The Idiot's Guide to the Zen of Likelihood in a Nutshell in Seven Days for Dummies, Unleashed".

2 Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington. Its slow uptake by the scientific community has to do with the difficulty of understanding the theory and also the absence (initially) of good quality software with choice of models and ease of interaction with data. Also, at the time, it was computationally intractable to analyze large datasets (consider that in the mid-1980s a typical desktop computer had a processor speed of less than 30 MHz). In recent times, software, models and computer hardware have become sufficiently sophisticated that ML is becoming a method of choice.

3 ML: comparison with other methods ML assumes a model of sequence evolution. ML attempts to answer the question: what is the probability that I would observe these data (a multiple sequence alignment), given a particular model of evolution (a tree and a process)? Using a model is justifiable, since molecular sequence data can be shown to have arisen according to an evolutionary process.

4 Maximum Likelihood - goal To estimate the probability that we would observe a particular dataset, given a phylogenetic tree and some notion of how the evolutionary process worked over time. [Figure: probability of the data, given a tree, a 4 x 4 substitution matrix and a composition vector π = [a, c, g, t].]

5 What is the probability of observing a datum? If we flip a coin and get a head and we think the coin is unbiased, then the probability of observing this head is 0.5. If we think the coin is biased so that we expect to get a head 80% of the time, then the likelihood of observing this datum (a head) is 0.8. Therefore, the likelihood of making some observation depends entirely on the model that underlies our assumption. Lesson: The datum has not changed, our model has. Therefore under the new model the likelihood of observing the datum has changed.

6 What is the probability of observing a 'G' nucleotide? Question: If we have a DNA sequence of one nucleotide in length and the identity of this nucleotide is 'G', what is the likelihood that we would observe this 'G'? Answer: In the same way as the coin-flipping observation, the likelihood of observing this 'G' depends on the model of sequence evolution that is thought to underlie the data. E.g. Model 1: frequency of G = 0.40 => likelihood(G) = 0.40; Model 2: frequency of G = 0.10 => likelihood(G) = 0.10; Model 3: frequency of G = 0.25 => likelihood(G) = 0.25.
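The dependence of a single observation's likelihood on the model can be sketched in a few lines of Python. This is only an illustration: the frequency of g in each model is taken from the slide, while the frequencies of the other three bases are assumptions chosen so that each model sums to 1.

```python
# Each model is a table of base frequencies. Only the value for "g" comes
# from the text; the other frequencies are illustrative assumptions.
models = {
    "model_1": {"a": 0.20, "c": 0.20, "g": 0.40, "t": 0.20},
    "model_2": {"a": 0.30, "c": 0.30, "g": 0.10, "t": 0.30},
    "model_3": {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25},
}


def likelihood_of_base(base, frequencies):
    """Probability of observing `base` under a composition-only model."""
    return frequencies[base]


for name, freqs in models.items():
    # The probabilities of all four outcomes must add up to 1.
    assert abs(sum(freqs.values()) - 1.0) < 1e-12
    print(name, likelihood_of_base("g", freqs))
```

The datum ("g") is the same in every call; only the model passed in changes.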

7 One rule: the rule of 1 The sum of the probabilities of all the possibilities will always equal 1. E.g. for DNA, p(a) + p(c) + p(g) + p(t) = 1.

8 What about longer sequences? If we consider a gene of length 2 (Gene 1: ga), the probability of observing this gene is the product of the probabilities of observing each character. E.g. with p(g) = 0.40 and p(a) = 0.15 (for instance), Likelihood(ga) = 0.40 x 0.15 = 0.06.

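9 ...or even longer sequences? Gene 1: gactagctagacagatacgaattac Model (simple base frequency model): p(a)=0.15; p(c)=0.20; p(g)=0.40; p(t)=0.25 (the sum of all probabilities must equal 1). The gene contains 10 a, 5 c, 5 g and 5 t, so Like(Gene 1) = $0.15^{10} \times 0.20^{5} \times 0.40^{5} \times 0.25^{5} \approx 1.8 \times 10^{-17}$.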

10 Note about models You might notice that our model of base frequency is not the optimal model for our observed data. If we had used the following model: p(a)=0.40; p(c)=0.20; p(g)=0.20; p(t)=0.20, the likelihood of observing the gene would be Like(Gene 1) = $0.40^{10} \times 0.20^{15} \approx 3.4 \times 10^{-15}$ (a value that is nearly 200 times higher than $1.8 \times 10^{-17}$). Lesson: The datum has not changed, our model has. Therefore under the new model the likelihood of observing the datum has changed.
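The comparison between the two base-frequency models can be reproduced with a short Python sketch. The sequence and both sets of frequencies are those given on the slides; the function names are hypothetical, and the log-likelihood helper is an addition that reflects common practice (sums of logs avoid numerical underflow on long sequences) rather than anything stated in the slides.

```python
import math

gene_1 = "gactagctagacagatacgaattac"

# The two composition-only models from the text.
model_a = {"a": 0.15, "c": 0.20, "g": 0.40, "t": 0.25}
model_b = {"a": 0.40, "c": 0.20, "g": 0.20, "t": 0.20}


def sequence_likelihood(sequence, frequencies):
    """Product of per-site base probabilities (sites treated as independent)."""
    return math.prod(frequencies[base] for base in sequence)


def sequence_log_likelihood(sequence, frequencies):
    """Sum of log probabilities; numerically safer for long sequences."""
    return sum(math.log(frequencies[base]) for base in sequence)


like_a = sequence_likelihood(gene_1, model_a)
like_b = sequence_likelihood(gene_1, model_b)
print(f"model_a: {like_a:.2e}")
print(f"model_b: {like_b:.2e}")
print(f"ratio model_b / model_a: {like_b / like_a:.0f}")
```

The second model gives the larger likelihood because its frequencies are closer to the observed composition of the gene (10 a out of 25 sites).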

11 How does this relate to phylogenetic trees? Consider an alignment of two sequences: Gene 1: gaac Gene 2: gacc We assume these genes are related by a (simple) phylogenetic tree with branch lengths.

12 Increase in model sophistication It is no longer possible to simply invoke a model that encompasses base composition; we must also include the mechanism of sequence change and stasis. There are two parts to this model: the tree and the process (the latter is confusingly referred to as the model, although both parts really compose the model).

13 The model The two parts of the model are the tree and the process (the model). The model is composed of the composition and the substitution process, i.e. the rate of change from one character state to another character state. [Figure: Model = tree + 4 x 4 substitution matrix + composition vector π = [a, c, g, t].]

14 Simple time-reversible model A simple model is that the rate of change from a to c or vice versa is 0.4, the composition of a is 0.25 and the composition of c is 0.25 (a simplified version of the Jukes and Cantor 1969 model). [The numerical matrix P and composition vector π from the slide are not recoverable.]

15 Probability of the third nucleotide position in our current alignment p(a) = 0.25; p(c) = 0.25; $p_{a \to c}$ = 0.4. Starting with a, the likelihood of the nucleotide is 0.25 and the likelihood of the substitution (branch) is 0.4. So the likelihood of observing these data is: Likelihood* = prob(D | M) = 0.25 x 0.4 = 0.1. Note: you will get the same result if you start with c, since this model is reversible. *The likelihood of the model, given the data.

16 Substitution matrix For nucleotide sequences, there are 16 possible ways to describe substitutions, i.e. a 4 x 4 matrix: $P = \begin{pmatrix} a & b & c & d \\ e & f & g & h \\ i & j & k & l \\ m & n & o & p \end{pmatrix}$. Convention dictates that the order of the nucleotides is a, c, g, t. Note: for amino acids, the matrix is 20 x 20 and for codon-based models, the matrix is 61 x 61.

17 Substitution matrix - an example [The numerical matrix P from the slide is not recoverable.] In this matrix, the probability of an a changing to a c is 0.01 and the probability of a c remaining the same is 0.979, etc. Note: The rows of this matrix sum to 1, meaning that for every nucleotide, we have covered all the possibilities of what might happen to it. The columns do not sum to anything in particular.

18 To calculate the likelihood of the entire dataset, given a substitution matrix, base composition and a branch length of one "certain evolutionary distance" or "CED", we need the likelihood of Gene 1: ccat and Gene 2: ccgt, given P and π = [0.1, 0.4, 0.2, 0.3]. [The numerical matrix P from the slide is not recoverable.]

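19 Likelihood of a two-sequence alignment ccat / ccgt: $\pi_c P_{c \to c} \times \pi_c P_{c \to c} \times \pi_a P_{a \to g} \times \pi_t P_{t \to t} = 0.4 \times 0.983 \times 0.4 \times 0.983 \times 0.1 \times 0.007 \times 0.3 \times 0.979 \approx 3.18 \times 10^{-5}$. This is the likelihood of going from the first to the second sequence.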
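The site-by-site product for the two-sequence alignment can be written as a small Python function. This is a sketch that uses only the composition and the three entries of P quoted in the product on the slide; the dictionary layout and the function name are hypothetical.

```python
# Composition from the text, keyed by base.
pi = {"a": 0.1, "c": 0.4, "g": 0.2, "t": 0.3}

# Only the entries of P that this alignment needs, as quoted in the text:
# (from_base, to_base) -> probability over one CED.
P = {("c", "c"): 0.983, ("a", "g"): 0.007, ("t", "t"): 0.979}


def alignment_likelihood(seq_1, seq_2, pi, P):
    """Multiply pi[x] * P[x -> y] over all aligned positions."""
    likelihood = 1.0
    for x, y in zip(seq_1, seq_2):
        likelihood *= pi[x] * P[(x, y)]
    return likelihood


like = alignment_likelihood("ccat", "ccgt", pi, P)
print(f"{like:.3e}")
```

A full implementation would hold all 16 entries of P, so that any pair of aligned bases could be looked up.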

20 Different Branch Lengths For very short branch lengths, the probability of a character staying the same is high and the probability of it changing is low (for our particular matrix). For longer branch lengths, the probability of character change becomes higher and the probability of staying the same is lower. The previous calculations are based on the assumption that the branch length describes one Certain Evolutionary Distance or CED. If we want to consider a branch length that is twice as long (2 CED), then we can multiply the substitution matrix by itself ($P^2$).
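The effect of multiplying the substitution matrix by itself can be sketched in plain Python. Because the numerical matrix from the slides is not available here, the one-CED matrix below is a hypothetical stand-in: with probability 1 - EPS a base is kept, otherwise it is redrawn from the composition. Both EPS and this construction are assumptions made for illustration only.

```python
PI = [0.1, 0.4, 0.2, 0.3]   # composition (a, c, g, t) quoted in the text
EPS = 0.02                  # hypothetical chance of a redraw per CED

# Hypothetical one-CED matrix: keep the base with probability 1 - EPS,
# otherwise redraw it from PI. Every row sums to 1.
P1 = [[(1 - EPS) * (i == j) + EPS * PI[j] for j in range(4)] for i in range(4)]


def matmul(A, B):
    """Plain matrix product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matpow(A, n):
    """A raised to a positive whole-number power n."""
    result = A
    for _ in range(n - 1):
        result = matmul(result, A)
    return result


P2 = matpow(P1, 2)   # 2 CED
P3 = matpow(P1, 3)   # 3 CED
for name, M in (("1 CED", P1), ("2 CED", P2), ("3 CED", P3)):
    print(name, "diagonal:", [round(M[i][i], 4) for i in range(4)])
```

With this stand-in matrix the diagonal entries shrink and the off-diagonal entries grow as the number of CED units increases, while every row continues to sum to 1.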

21 2 CED model $P^2 = P \times P$ [The numerical matrices and the resulting likelihood value from the slide are not recoverable.] Note the higher likelihood.

22 For 3 CED $P^3$ [The numerical matrix and the resulting likelihood value from the slide are not recoverable.] Note that as the branch lengths increase, the values on the diagonal decrease and the values on the off-diagonals increase.

23 For higher values of CED units [Figure: likelihood plotted against branch length.]

24 Raising P to a large power If we raise P to a very large power, $P^{10^6}$, we find that the ML base composition pops out. [The resulting numerical matrix from the slide is not recoverable.] So the base composition is built into the probability matrix P.
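The way the composition emerges from a very large power of P can be checked numerically. The sketch below again uses a hypothetical one-CED matrix (keep the base with probability 1 - EPS, otherwise redraw from the composition), since the slide's own matrix is not available; repeated squaring is just a convenient way to reach a large power.

```python
PI = [0.1, 0.4, 0.2, 0.3]   # composition (a, c, g, t) quoted in the text
EPS = 0.02                  # hypothetical chance of a redraw per CED

# Hypothetical one-CED substitution matrix whose rows sum to 1.
P1 = [[(1 - EPS) * (i == j) + EPS * PI[j] for j in range(4)] for i in range(4)]


def matmul(A, B):
    """Plain matrix product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Square repeatedly: after k squarings we hold P1 raised to the power 2**k.
P_big = P1
for _ in range(20):          # 2**20 is roughly a million
    P_big = matmul(P_big, P_big)

for row in P_big:
    print([round(x, 4) for x in row])
```

Every row of the result is, to rounding error, the composition vector: whatever base we start from, after enough time the chance of finding each base is just its equilibrium frequency.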

25 Rate Matrices Consider the following equation: $5^4 = \exp(4 \log(5))$. In the same way, raising a matrix to the power of a number can be calculated by taking the log of the matrix, multiplying it by the branch length and taking the exponent of the product. In this way, you can raise the matrix to a power that is not a whole number. E.g. the log (natural log, base e = ln) of the previous matrix P is log P. [The numerical matrix from the slide is not recoverable.] Note: the sum of each row is zero.
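The log-then-exponentiate route can be sketched in plain Python using truncated power series for the matrix logarithm and exponential. The series used here converge only because the example matrix is close to the identity; the matrix itself is a hypothetical stand-in for the slide's P, and all function names are illustrative. A numerical library would normally be used for this instead.

```python
import math

PI = [0.1, 0.4, 0.2, 0.3]   # composition (a, c, g, t) quoted in the text
EPS = 0.02                  # hypothetical chance of a redraw per CED
P1 = [[(1 - EPS) * (i == j) + EPS * PI[j] for j in range(4)] for i in range(4)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_log(P, terms=40):
    """log(I + X) = X - X^2/2 + X^3/3 - ...; valid when P is close to I."""
    n = len(P)
    X = [[P[i][j] - (i == j) for j in range(n)] for i in range(n)]
    result = [[0.0] * n for _ in range(n)]
    power = identity(n)
    for k in range(1, terms + 1):
        power = matmul(power, X)                      # X**k
        sign = 1.0 if k % 2 == 1 else -1.0
        result = [[result[i][j] + sign * power[i][j] / k for j in range(n)]
                  for i in range(n)]
    return result


def mat_exp(M, terms=30):
    """exp(M) = I + M + M^2/2! + ... (truncated Taylor series)."""
    n = len(M)
    result, term = identity(n), identity(n)
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in matmul(term, M)]  # M**k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


def scaled(M, factor):
    return [[factor * x for x in row] for row in M]


print(5 ** 4, math.exp(4 * math.log(5)))      # the scalar identity
log_P = mat_log(P1)
P2_via_log = mat_exp(scaled(log_P, 2.0))      # should match P1 x P1
P_2_5 = mat_exp(scaled(log_P, 2.5))           # a non-whole-number power
print([round(sum(row), 12) for row in log_P]) # each row of log P sums to ~0
```

The same two functions give P for any positive branch length, not just whole multiples of one CED.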

26 log P This matrix corresponds to one CED. [The numerical matrix from the slide is not recoverable.] What we want is to derive a matrix so that when we exponentiate it, the values correspond to substitutions per site. We must therefore scale log P so that when the rows of log P are multiplied by $\pi_{row}$ the off-diagonal elements sum to 1. The resulting scaled log P (called Q), when its exponent is taken, gives a P corresponding to 1 substitution per site.

27 Converting to substitutions per site For a branch length of v: $e^{Qv} = P(v)$. If we scale log P appropriately, we will get a Q matrix. If we multiply this Q matrix by a diagonal matrix of the composition, we get a matrix where the off-diagonal elements sum to 1 and the diagonal elements sum to -1.

28 Scaling log P appropriately Log P scaled by a factor of 50 (in this case) gives Q. [The numerical matrices for Q and for Q multiplied by the diagonal matrix of the composition are not recoverable.] In the latter, the off-diagonal elements sum to 1 and the diagonal elements sum to -1. P matrices generated from this Q will give branch lengths in substitutions per site.

29 Separating composition from rates If we divide the columns of Q by $\pi_{col}$, the composition is separated from the rates. You can then use the exact same rate matrix with different base compositions. For the model we have been using, the rate matrix is R. [The numerical matrix from the slide is not recoverable.] The diagonal elements do not matter and are left out. The model is symmetrical (time reversible).

30 Relationships between R, Q and P matrices R to Q: multiply columns by the composition, then scale so that the off-diagonals sum to 1. Q to P: multiply by branch length, then exponentiate. P to Q: take the log, then scale so that the off-diagonals sum to 1. Q to R: divide columns by the composition. P = substitution matrix; Q = scaled log substitution matrix; R = rate matrix (defines the model of evolution).
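The route from R to Q to P(v) can be sketched as follows. The rate matrix R below is hypothetical (transitions set to twice the rate of transversions) because the slide's own R is not available; the composition is the one used in the slides. The matrix exponential is a truncated Taylor series, which is adequate for the short branch lengths used here but is not how production software would do it.

```python
PI = [0.1, 0.4, 0.2, 0.3]            # composition (a, c, g, t) from the text
R = [[0, 1, 2, 1],                   # hypothetical symmetric rate matrix:
     [1, 0, 1, 2],                   # a<->g and c<->t set to 2, others to 1
     [2, 1, 0, 1],
     [1, 2, 1, 0]]


def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def build_q(R, pi):
    """R -> Q: multiply columns by the composition, then scale."""
    n = len(pi)
    Q = [[R[i][j] * pi[j] if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        Q[i][i] = -sum(Q[i])                         # rows sum to zero
    # Composition-weighted off-diagonal total; dividing by it makes one
    # unit of branch length equal one expected substitution per site.
    mu = -sum(pi[i] * Q[i][i] for i in range(n))
    return [[q / mu for q in row] for row in Q]


def mat_exp(M, terms=30):
    """Truncated Taylor series for exp(M)."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in matmul(term, M)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


def substitution_matrix(Q, v):
    """Q -> P: multiply by branch length v, then exponentiate."""
    return mat_exp([[q * v for q in row] for row in Q])


Q = build_q(R, PI)
P = substitution_matrix(Q, 0.3)      # branch of 0.3 substitutions per site
for row in P:
    print([round(x, 4) for x in row])
```

Dividing the columns of Q by the composition recovers a symmetric matrix proportional to R, which is the Q to R step in the list above.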

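31 Likelihood of the alignment at various branch lengths ccat / ccgt [Figure: likelihood of the alignment plotted against branch length. The branch length at which the likelihood is maximal is not recoverable from the slide.]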
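Finding the branch length that maximizes the likelihood can be sketched as a simple grid search. To keep the example self-contained, the substitution probabilities below come from a closed-form model in which every change is a redraw from the composition (often called F81); this is an assumed substitute for the slide's own matrix, so the optimum it finds need not match the value on the original slide.

```python
import math

PI = {"a": 0.1, "c": 0.4, "g": 0.2, "t": 0.3}   # composition from the text

# Scaling constant so that branch lengths are in substitutions per site.
BETA = 1.0 / (1.0 - sum(p * p for p in PI.values()))


def transition_prob(x, y, v):
    """P(x -> y) over a branch of length v under the assumed model."""
    decay = math.exp(-BETA * v)
    same = 1.0 if x == y else 0.0
    return decay * same + (1.0 - decay) * PI[y]


def alignment_likelihood(seq_1, seq_2, v):
    """Product over sites of pi[x] * P(x -> y) for branch length v."""
    like = 1.0
    for x, y in zip(seq_1, seq_2):
        like *= PI[x] * transition_prob(x, y, v)
    return like


# Evaluate the likelihood on a grid of branch lengths and keep the best.
grid = [i / 1000 for i in range(1, 2001)]
best_v = max(grid, key=lambda v: alignment_likelihood("ccat", "ccgt", v))
print(best_v, alignment_likelihood("ccat", "ccgt", best_v))
```

Real programs use numerical optimizers rather than a grid, but the idea is the same: the branch length is a free parameter, and the ML estimate is the value at which the likelihood curve peaks.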

32 Likelihood of a two-branch tree [Figure: root O joined to A by a branch of length 0.1 and to B by a branch of length 0.2.] O is the origin or root; the numbers represent branch lengths. The likelihood can be calculated in three ways: (1) from A to B in one step (this amounts to the previous method); (2) from A to B in two steps (through O); (3) in two parts starting at O.

33 Lesson about O O is an unknown sequence. We can only speculate what each position in the alignment would be if we could observe the sequence of O. What we do know is that the sum of all possibilities is equal to 1. Therefore we must sum the likelihoods for all possibilities of O. This becomes computationally intensive. [Figure: for position 1, O is one of {a,c,g,t}; the branch of length 0.1 leads to A = {c} and the branch of length 0.2 leads to B = {c}.]
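Summing over the unknown state at O can be sketched for a single site. As before, the substitution probabilities come from an assumed closed-form model (every change is a redraw from the composition) rather than from the slide's matrix; the branch lengths 0.1 and 0.2 and the composition are those in the slides.

```python
import math

PI = {"a": 0.1, "c": 0.4, "g": 0.2, "t": 0.3}   # composition from the text
BETA = 1.0 / (1.0 - sum(p * p for p in PI.values()))


def transition_prob(x, y, v):
    """P(x -> y) over a branch of length v under the assumed model."""
    decay = math.exp(-BETA * v)
    same = 1.0 if x == y else 0.0
    return decay * same + (1.0 - decay) * PI[y]


def site_likelihood_via_root(a, b, t_a, t_b):
    """Start at O: sum over every possible root state x."""
    return sum(PI[x] * transition_prob(x, a, t_a) * transition_prob(x, b, t_b)
               for x in "acgt")


def site_likelihood_direct(a, b, t):
    """Go from A to B in one step along the combined branch."""
    return PI[a] * transition_prob(a, b, t)


via_root = site_likelihood_via_root("c", "c", 0.1, 0.2)
direct = site_likelihood_direct("c", "c", 0.1 + 0.2)
print(via_root, direct)
```

For a time-reversible model the two numbers agree, which is why the position of the root does not affect the likelihood. With more than two tips the sum over internal states has to be repeated at every internal node, and that is where the computational cost comes from.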

34 A three branch tree [Figure: a tree with three tips A, B and C; one branch is labelled 0.2.] The tree can be rooted anywhere and the substitutions calculated accordingly. There are many ways of doing this and this is left as an exercise for the student.

35 Increasing the sophistication of models So far, the models we have dealt with assume that change is equally likely at all positions and that the rate of change is constant for the entire duration of the phylogeny. This is not a realistic model for all sequences (it is a neutral model with a constant molecular clock). [Figure: the correct tree for four taxa A, B, C and D, with branch lengths p and q where p > q, shown beside the wrong tree.]

36 Small subunit ribosomal RNA 18S or 16S rRNA

37 The molecular clock for alpha-globin Each point represents the number of substitutions separating each animal from humans. [Figure: number of substitutions plotted against time to common ancestor (millions of years) for cow, platypus, chicken, shark and carp.]

38 Invariable sites For a given dataset we can assume that a certain proportion of sites are not free to vary (e.g. purifying selection prevents these sites from changing). We can therefore observe invariable positions either because they are under this selective constraint or because they have not had a chance to vary or because there is homoplasy in the dataset and a reversal (say) has caused the site to appear constant. The likelihood that a site is invariable can be calculated by incorporating this possibility into our model and calculating for every site the likelihood that it is an invariable site.
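One common way to build invariable sites into the likelihood is as a two-part mixture: with probability p_inv a site cannot change at all, and otherwise it evolves under the ordinary substitution model. The sketch below assumes that formulation, a hypothetical value of p_inv, and the same assumed closed-form substitution model used for illustration elsewhere; none of these specifics come from the slides.

```python
import math

PI = {"a": 0.1, "c": 0.4, "g": 0.2, "t": 0.3}   # composition from the text
BETA = 1.0 / (1.0 - sum(p * p for p in PI.values()))


def transition_prob(x, y, v):
    """P(x -> y) over a branch of length v under the assumed model."""
    decay = math.exp(-BETA * v)
    same = 1.0 if x == y else 0.0
    return decay * same + (1.0 - decay) * PI[y]


def site_likelihood(x, y, v, p_inv):
    """Mixture of an invariable class and a variable class for one site."""
    # An invariable site can only produce a constant column.
    invariable_part = PI[x] if x == y else 0.0
    variable_part = PI[x] * transition_prob(x, y, v)
    return p_inv * invariable_part + (1.0 - p_inv) * variable_part


P_INV = 0.2   # hypothetical proportion of invariable sites
V = 0.3       # hypothetical branch length

constant_site = site_likelihood("c", "c", V, P_INV)
variable_site = site_likelihood("a", "g", V, P_INV)
print(constant_site, variable_site)
```

A constant column receives likelihood from both classes, because it may be invariable or may simply not have changed yet; a column that differs between the sequences can only have come from the variable class.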

39 Variable sites Obviously other sites in the dataset are free to vary. Selection intensity on these sites is rarely uniform, so it is desirable to model site-by-site rate variation. This is done in two ways: site specific (codon position, or alpha helix etc.), or using a discrete approximation to a continuous distribution (gamma distribution). Again, these variables are modeled over all possibilities of sequence change over all possibilities of branch length over all possibilities of tree topology.

40 The shape of the gamma distribution for different values of alpha

41 Does changing a model affect the outcome? There are different models. Jukes and Cantor (JC69): all base compositions equal (0.25 each), and the rate of change from one base to another is the same. Kimura 2-Parameter (K2P): all base compositions equal (0.25 each), with different substitution rates for transitions and transversions. Hasegawa-Kishino-Yano (HKY): like the K2P, but with base composition free to vary. General Time Reversible (GTR): base composition free to vary, and all possible substitutions can differ. All these models can be extended to accommodate invariable sites and site-to-site rate variation.
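The four models differ only in which entries of the rate matrix and composition are allowed to vary, which can be sketched as data. All numerical rates and the non-uniform composition below are hypothetical placeholders; the free-parameter counts follow one common convention (relative rates, excluding branch lengths) and other conventions exist.

```python
BASES = "acgt"
TRANSITIONS = {("a", "g"), ("g", "a"), ("c", "t"), ("t", "c")}


def two_rate_matrix(kappa):
    """Symmetric R with rate kappa for transitions, 1 for transversions."""
    return [[0.0 if x == y else (kappa if (x, y) in TRANSITIONS else 1.0)
             for y in BASES] for x in BASES]


def gtr_matrix(ac, ag, at, cg, ct, gt):
    """Symmetric R with a separate rate for each of the six base pairs."""
    return [[0.0, ac, ag, at],
            [ac, 0.0, cg, ct],
            [ag, cg, 0.0, gt],
            [at, ct, gt, 0.0]]


EQUAL = [0.25, 0.25, 0.25, 0.25]
UNEQUAL = [0.1, 0.4, 0.2, 0.3]        # placeholder composition

MODELS = {
    # name: rate matrix, composition, number of free parameters
    "JC69": {"R": two_rate_matrix(1.0), "pi": EQUAL, "free": 0},
    "K2P": {"R": two_rate_matrix(2.0), "pi": EQUAL, "free": 1},
    "HKY": {"R": two_rate_matrix(2.0), "pi": UNEQUAL, "free": 4},
    "GTR": {"R": gtr_matrix(1.0, 2.5, 0.8, 1.2, 3.0, 1.0),
            "pi": UNEQUAL, "free": 8},
}

for name, model in MODELS.items():
    print(name, "free parameters:", model["free"])
```

Each simpler model is a special case of GTR obtained by forcing some rates or frequencies to be equal.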

43 Long-branch attraction (LBA) In the case below, the wrong tree is often selected. ML will not be prone to this problem if the correct model of sequence evolution is used. [Figure: the correct tree for four taxa A, B, C and D, with branch lengths p and q where p > q, shown beside the wrong tree.] This is often called the Felsenstein zone.

44 Strengths of ML Does not try to make an observation of sequence change and then a correction for superimposed substitutions. There is no need to correct for anything; the models take care of superimposed substitutions. Accurate branch lengths. Each site has a likelihood. If the model is correct, we should retrieve the correct tree*. You can use a model that fits the data. ML uses all the data (no selection of sites based on informativeness, all sites are informative). ML can not only tell you about the phylogeny of the sequences, but also the process of evolution that led to the observations of today's sequences. *If we have long-enough sequences and a sophisticated-enough model.

45 Models You can use models that: Deal with different transition/transversion ratios. Deal with unequal base composition. Deal with heterogeneity of rates across sites. Deal with heterogeneity of the substitution process (different rates across lineages, different rates at different parts of the tree). The more free parameters, the better your model fits your data (good). The more free parameters, the higher the variance of the estimate (bad). Use a model that fits your data.

46 Over-fitting a model to your data

47 Weaknesses of ML Can be inconsistent if we use models that are not accurate. Model might not be sophisticated enough (you can max-out on models). Very computationally-intensive. Might not be possible to examine all models (substitution matrices, tree topologies, etc.).

48 Recommendations Interact with the data. If you have collected enough data, you might get a good picture of the underlying model of sequence evolution. Use a test of alternative models (implemented in the Modeltest software). Don't just choose a model; use a model that fits your data. Don't over-fit a model to your data.

49 "How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?" Sherlock Holmes to Dr. Watson in The Sign of Four, by A. Conan Doyle.

Consensus methods. Strict consensus methods

Consensus methods A consensus tree is a summary of the agreement among a set of fundamental trees There are many consensus methods that differ in: 1. the kind of agreement 2. the level of agreement Consensus

The Idiot s Guide to the Zen of Likelihood in a Nutshell in Seven Days for Dummies, Unleashed

The Idiot s Guide to the Zen of Likelihood in a Nutshell in Seven Days for Dummies, Unleashed A gentle introduction, for those of us who are small of brain, to the calculation of the likelihood of molecular

Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018

Maximum Likelihood Tree Estimation Carrie Tribble IB 200 9 Feb 2018 Outline 1. Tree building process under maximum likelihood 2. Key differences between maximum likelihood and parsimony 3. Some fancy extras

Lab 9: Maximum Likelihood and Modeltest

Integrative Biology 200A University of California, Berkeley "PRINCIPLES OF PHYLOGENETICS" Spring 2010 Updated by Nick Matzke Lab 9: Maximum Likelihood and Modeltest In this lab we re going to use PAUP*

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics - in deriving a phylogeny our goal is simply to reconstruct the historical relationships between a group of taxa. - before we review the

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

Probabilistic modeling and molecular phylogeny

Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical University of Denmark (DTU) What is a model? Mathematical

Dr. Amira A. AL-Hosary

Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Sequence Analysis 17: lecture 5 Substitution matrices Multiple sequence alignment Substitution matrices Used to score aligned positions, usually of amino acids. Expressed as the log-likelihood ratio of

Mutation models I: basic nucleotide sequence mutation models

Mutation models I: basic nucleotide sequence mutation models Peter Beerli September 3, 009 Mutations are irreversible changes in the DNA. This changes may be introduced by chance, by chemical agents, or

Inferring Molecular Phylogeny

Dr. Walter Salzburger he tree of life, ustav Klimt (1907) Inferring Molecular Phylogeny Inferring Molecular Phylogeny 55 Maximum Parsimony (MP): objections long branches I!! B D long branch attraction

What Is Conservation?

What Is Conservation? Lee A. Newberg February 22, 2005 A Central Dogma Junk DNA mutates at a background rate, but functional DNA exhibits conservation. Today s Question What is this conservation? Lee A.

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 24. Phylogeny methods, part 4 (Models of DNA and

Phylogenetics: Building Phylogenetic Trees

1 Phylogenetics: Building Phylogenetic Trees COMP 571 Luay Nakhleh, Rice University 2 Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary model should

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Distance Methods COMP 571 - Spring 2015 Luay Nakhleh, Rice University Outline Evolutionary models and distance corrections Distance-based methods Evolutionary Models and Distance Correction

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 27. Phylogeny methods, part 4 (Models of DNA and

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Phylogenetics: Building Phylogenetic Trees COMP 571 - Fall 2010 Luay Nakhleh, Rice University Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Paul has many great tools for teaching phylogenetics at his web site: http://hydrodictyon.eeb.uconn.edu/people/plewis

How should we go about modeling this? Model parameters? Time Substitution rate Can we observe time or subst. rate? What can we observe?

How should we go about modeling this? gorilla GAAGTCCTTGAGAAATAAACTGCACACACTGG orangutan GGACTCCTTGAGAAATAAACTGCACACACTGG Model parameters? Time Substitution rate Can we observe time or subst. rate? What

Lecture 4. Models of DNA and protein change. Likelihood methods

Lecture 4. Models of DNA and protein change. Likelihood methods Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 4. Models of DNA and protein change. Likelihood methods p.1/36

Quantifying sequence similarity

Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity

Understanding relationship between homologous sequences

Molecular Evolution Molecular Evolution How and when were genes and proteins created? How old is a gene? How can we calculate the age of a gene? How did the gene evolve to the present form? What selective

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

1 Bioinformatics: In-depth PROBABILITY & STATISTICS Spring Semester 2011 University of Zürich and ETH Zürich Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM). Dr. Stefanie Muff

Preliminaries Download PAUP* from: http://people.sc.fsu.edu/~dswofford/paup_test 1 A model of the Boston T System 1 Idea from Paul Lewis A simpler model? 2 Why do models matter? Model-based methods including

Phylogenetic Inference using RevBayes

Phylogenetic Inference using RevBayes Model section using Bayes factors Sebastian Höhna 1 Overview This tutorial demonstrates some general principles of Bayesian model comparison, which is based on estimating

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Additive distances Let T be a tree on leaf set S and let w : E R + be an edge-weighting of T, and assume T has no nodes of degree two. Let D ij = e P ij w(e), where P ij is the path in T from i to j. Then

Likelihood Ratio Tests for Detecting Positive Selection and Application to Primate Lysozyme Evolution

Likelihood Ratio Tests for Detecting Positive Selection and Application to Primate Lysozyme Evolution Ziheng Yang Department of Biology, University College, London An excess of nonsynonymous substitutions

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Bioinformatics 1 Biology, Sequences, Phylogenetics Part 4 Sepp Hochreiter Klausur Mo. 30.01.2011 Zeit: 15:30 17:00 Raum: HS14 Anmeldung Kusss Contents Methods and Bootstrapping of Maximum Methods Methods

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

Massachusetts Institute of Technology 6.877 Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution 1. Rates of amino acid replacement The initial motivation for the neutral

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from

Phylogenetic Assumptions

Substitution Models and the Phylogenetic Assumptions Vivek Jayaswal Lars S. Jermiin COMMONWEALTH OF AUSTRALIA Copyright htregulation WARNING This material has been reproduced and communicated to you by

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

Estimating Phylogenies (Evolutionary Trees) II Biol4230 Thurs, March 2, 2017 Bill Pearson wrp@virginia.edu 4-2818 Jordan 6-057 Tree estimation strategies: Parsimony?no model, simply count minimum number

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees 2 broad categories: istance-based methods Ultrametric Additive: UPGMA Transformed istance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

C3020 Molecular Evolution. Exercises #3: Phylogenetics

C3020 Molecular Evolution Exercises #3: Phylogenetics Consider the following sequences for five taxa 1-5 and the known outgroup O, which has the ancestral states (note that sequence 3 has changed from

Reconstruire le passé biologique modèles, méthodes, performances, limites

Reconstruire le passé biologique modèles, méthodes, performances, limites Olivier Gascuel Centre de Bioinformatique, Biostatistique et Biologie Intégrative C3BI USR 3756 Institut Pasteur & CNRS Reconstruire

Maximum Likelihood in Phylogenetics

Maximum Likelihood in Phylogenetics June 1, 2009 Smithsonian Workshop on Molecular Evolution Paul O. Lewis Department of Ecology & Evolutionary Biology University of Connecticut, Storrs, CT Copyright 2009

How Molecules Evolve. Advantages of Molecular Data for Tree Building. Advantages of Molecular Data for Tree Building

How Molecules Evolve Guest Lecture: Principles and Methods of Systematic Biology 11 November 2013 Chris Simon Approaching phylogenetics from the point of view of the data Understanding how sequences evolve

Lecture Notes: Markov chains

Computational Genomics and Molecular Biology, Fall 5 Lecture Notes: Markov chains Dannie Durand At the beginning of the semester, we introduced two simple scoring functions for pairwise alignments: a similarity

Phylogeny Estimation and Hypothesis Testing using Maximum Likelihood

Phylogeny Estimation and Hypothesis Testing using Maximum Likelihood For: Prof. Partensky Group: Jimin zhu Rama Sharma Sravanthi Polsani Xin Gong Shlomit klopman April. 7. 2003 Table of Contents Introduction...3

Evolutionary Models. Evolutionary Models

Edit Operators In standard pairwise alignment, what are the allowed edit operators that transform one sequence into the other? Describe how each of these edit operations are represented on a sequence alignment

Phylogenetic inference

Phylogenetic inference Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 7 th 016 After this lecture, you can discuss (dis-) advantages of different information types

Evolutionary Analysis of Viral Genomes

University of Oxford, Department of Zoology Evolutionary Biology Group Department of Zoology University of Oxford South Parks Road Oxford OX1 3PS, U.K. Fax: +44 1865 271249 Evolutionary Analysis of Viral

Substitution = Mutation followed. by Fixation. Common Ancestor ACGATC 1:A G 2:C A GAGATC 3:G A 6:C T 5:T C 4:A C GAAATT 1:G A

GAGATC 3:G A 6:C T Common Ancestor ACGATC 1:A G 2:C A Substitution = Mutation followed 5:T C by Fixation GAAATT 4:A C 1:G A AAAATT GAAATT GAGCTC ACGACC Chimp Human Gorilla Gibbon AAAATT GAAATT GAGCTC ACGACC

Molecular Evolution, course # Final Exam, May 3, 2006

Molecular Evolution, course #27615 Final Exam, May 3, 2006 This exam includes a total of 12 problems on 7 pages (including this cover page). The maximum number of points obtainable is 150, and at least

7. Tests for selection

Sequence analysis and genomics 7. Tests for selection Dr. Katja Nowick Group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute for Brain Research www. nowicklab.info

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees 2 broad categories: Distance-based methods Ultrametric Additive: UPGMA Transformed Distance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

Lie Markov models. Jeremy Sumner. School of Physical Sciences University of Tasmania, Australia

Lie Markov models Jeremy Sumner School of Physical Sciences University of Tasmania, Australia Stochastic Modelling Meets Phylogenetics, UTAS, November 2015 Jeremy Sumner Lie Markov models 1 / 23 The theory

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley B.D. Mishler Feb. 14, 2018. Phylogenetic trees VI: Dating in the 21st century: clocks, & calibrations;

Natural selection on the molecular level

Natural selection on the molecular level Fundamentals of molecular evolution How DNA and protein sequences evolve? Genetic variability in evolution } Mutations } forming novel alleles } Inversions } change

Letter to the Editor. Department of Biology, Arizona State University

Letter to the Editor Traditional Phylogenetic Reconstruction Methods Reconstruct Shallow and Deep Evolutionary Relationships Equally Well Michael S. Rosenberg and Sudhir Kumar Department of Biology, Arizona

PHYLOGENY ESTIMATION AND HYPOTHESIS TESTING USING MAXIMUM LIKELIHOOD

Annu. Rev. Ecol. Syst. 1997. 28:437 66 Copyright c 1997 by Annual Reviews Inc. All rights reserved PHYLOGENY ESTIMATION AND HYPOTHESIS TESTING USING MAXIMUM LIKELIHOOD John P. Huelsenbeck Department of

KaKs Calculator: Calculating Ka and Ks Through Model Selection and Model Averaging

Method KaKs Calculator: Calculating Ka and Ks Through Model Selection and Model Averaging Zhang Zhang 1,2,3#, Jun Li 2#, Xiao-Qian Zhao 2,3, Jun Wang 1,2,4, Gane Ka-Shu Wong 2,4,5, and Jun Yu 1,2,4 * 1

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

CHAPTERS 24-25: Evidence for Evolution and Phylogeny 1. For each of the following, indicate how it is used as evidence of evolution by natural selection or shown as an evolutionary trend: a. Paleontology

METHODS FOR DETERMINING PHYLOGENY. In Chapter 11, we discovered that classifying organisms into groups was, and still is, a difficult task.

Chapter 12 (Strikberger) Molecular Phylogenies and Evolution METHODS FOR DETERMINING PHYLOGENY In Chapter 11, we discovered that classifying organisms into groups was, and still is, a difficult task. Modern

Base Range A 0.00 to 0.25 C 0.25 to 0.50 G 0.50 to 0.75 T 0.75 to 1.00

PhyloMath Lecture 1 by Paul O. Lewis, 22 January 2004 Simulation of a single sequence under the JC model We drew 10 uniform random numbers to simulate a nucleotide sequence 10 sites long. The JC model

T R K V CCU CG A AAA GUC T R K V CCU CGG AAA GUC. T Q K V CCU C AG AAA GUC (Amino-acid

Lecture 11 Increasing Model Complexity I. Introduction. At this point, we ve increased the complexity of models of substitution considerably, but we re still left with the assumption that rates are uniform

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

MOLECULAR PHYLOGENY "Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky EVOLUTION - theory that groups of organisms change over time so that descendeants differ structurally

Phylogenetics. BIOL 7711 Computational Bioscience

Consortium for Comparative Genomics! University of Colorado School of Medicine Phylogenetics BIOL 7711 Computational Bioscience Biochemistry and Molecular Genetics Computational Bioscience Program Consortium

Lecture Notes: BIOL2007 Molecular Evolution

Lecture Notes: BIOL2007 Molecular Evolution Kanchon Dasmahapatra (k.dasmahapatra@ucl.ac.uk) Introduction By now we all are familiar and understand, or think we understand, how evolution works on traits

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression)

Using phylogenetics to estimate species divergence times... More accurately... Basics and basic issues for Bayesian inference of divergence times (plus some digression) "A comparison of the structures

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29):

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29): Statistical estimation of models of sequence evolution Phylogenetic inference using maximum likelihood:

Maximum Likelihood in Phylogenetics

Maximum Likelihood in Phylogenetics 26 January 2011 Workshop on Molecular Evolution Český Krumlov, Česká republika Paul O. Lewis Department of Ecology & Evolutionary Biology University of Connecticut,

C.DARWIN (1809-1882)

C.DARWIN (1809-1882) LAMARCK Each evolutionary lineage has evolved, transforming itself, from an ancestor that appeared by spontaneous generation DARWIN All organisms are historically interconnected. Their relationships

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This helps to derive functional, structural and evolutionary relationships between

Markov Chains and Related Matters

Markov Chains and Related Matters The four nodes are called states. The numbers on the arrows are called transition probabilities. For example if we are in state, there is a probability of going

Bayes Formula. MATH 107: Finite Mathematics University of Louisville. March 26, 2014

Bayes Formula MATH 107: Finite Mathematics University of Louisville March 26, 2014 Test Accuracy Conditional reversal A motivating question A rare disease occurs in out of every 0,000 people. A test

Lecture 4. Models of DNA and protein change. Likelihood methods

Lecture 4. Models of DNA and protein change. Likelihood methods Joe Felsenstein Department of Genome Sciences and Department of Biology

Phylogenetic Inference using RevBayes

Phylogenetic Inference using RevBayes Substitution Models Sebastian Höhna 1 Overview This tutorial demonstrates how to set up and perform analyses using common nucleotide substitution models. The substitution

Molecular Evolution and Comparative Genomics

Molecular Evolution and Comparative Genomics --- the phylogenetic HMM model 10-810, CMB lecture 5---Eric Xing Some important dates in history (billions of years ago) Origin of the universe 15 ±4 Formation

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200 Spring 2018 University of California, Berkeley

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200 Spring 2018 University of California, Berkeley D.D. Ackerly Feb. 26, 2018 Maximum Likelihood Principles, and Applications to

Improving divergence time estimation in phylogenetics: more taxa vs. longer sequences

Mathematical Statistics Stockholm University Improving divergence time estimation in phylogenetics: more taxa vs. longer sequences Bodil Svennblad Tom Britton Research Report 2007:2 ISSN 650-0377 Postal

Chapter 7: Models of discrete character evolution

Chapter 7: Models of discrete character evolution pdf version R markdown to recreate analyses Biological motivation: Limblessness as a discrete trait Squamates, the clade that includes all living species

Local Alignment Statistics

Local Alignment Statistics Stephen Altschul National Center for Biotechnology Information National Library of Medicine National Institutes of Health Bethesda, MD Central Issues in Biological Sequence Comparison

Consensus Methods. * You are only responsible for the first two

Consensus Trees * consensus trees reconcile clades from different trees * consensus is a conservative estimate of phylogeny that emphasizes points of agreement * philosophy: agreement among data sets is

Inferring phylogeny. Today's topics. Milestones of molecular evolution studies Contributions to molecular evolution

Today's topics Inferring phylogeny Introduction Distance methods Parsimony method Overview of phylogenetic inferences Methodology Methods

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Evolutionary trees Bonobo Chimpanzee Human Neanderthal Gorilla Orangutan Describe the relationship between objects, e.g. species or genes Early evolutionary studies The evolutionary relationships between

Integrative Biology 200A "PRINCIPLES OF PHYLOGENETICS" Spring 2008

Integrative Biology 200A "PRINCIPLES OF PHYLOGENETICS" Spring 2008 University of California, Berkeley B.D. Mishler March 18, 2008. Phylogenetic Trees I: Reconstruction; Models, Algorithms & Assumptions

Week 5: Distance methods, DNA and protein models

Week 5: Distance methods, DNA and protein models Genome 570 February, 2016 A tree and the expected distances it predicts

Efficiencies of maximum likelihood methods of phylogenetic inferences when different substitution models are used

Molecular Phylogenetics and Evolution 31 (2004) 865-873 www.elsevier.com/locate/ympev Efficiencies of maximum likelihood methods of phylogenetic inferences when different

Systematics - Bio 615

Bayesian Phylogenetic Inference 1. Introduction, history 2. Advantages over ML 3. Bayes Rule 4. The Priors 5. Marginal vs Joint estimation 6. MCMC Derek S. Sikes University of Alaska 7. Posteriors vs Bootstrap

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Bio 1B Lecture Outline (please print and bring along) Fall, 2007 B.D. Mishler, Dept. of Integrative Biology 2-6810, bmishler@berkeley.edu Evolution lecture #5 -- Molecular genetics and molecular evolution

Bootstrapping and Tree reliability. Biol4230 Tues, March 13, 2018 Bill Pearson Pinn 6-057

Bootstrapping and Tree reliability Biol4230 Tues, March 13, 2018 Bill Pearson wrp@virginia.edu 4-2818 Pinn 6-057 Rooting trees (outgroups) Bootstrapping given a set of sequences sample positions randomly,

Phylogenetic methods in molecular systematics

Phylogenetic methods in molecular systematics Niklas Wahlberg Stockholm University Acknowledgement Many of the slides in this lecture series modified from slides by others www.dbbm.fiocruz.br/james/lectures.html

Chapter 2.5 Random Variables and Probability The Modern View (cont.)

Chapter 2.5 Random Variables and Probability The Modern View (cont.) I. Statistical Independence A crucially important idea in probability and statistics is the concept of statistical independence. Suppose

Today. Statistical Learning. Coin Flip. Coin Flip. Experiment 1: Heads. Experiment 1: Heads. Which coin will I use? Which coin will I use?

Today Statistical Learning Parameter Estimation: Maximum Likelihood (ML) Maximum A Posteriori (MAP) Bayesian Continuous case Learning Parameters for a Bayesian Network Naive Bayes Maximum Likelihood estimates

Probability and Estimation. Alan Moses

Probability and Estimation Alan Moses Random variables and probability A random variable is like a variable in algebra (e.g., y = e^x), but where at least part of the variability is taken to be stochastic.

Week 8: Testing trees, Bootstraps, jackknifes, gene frequencies

Week 8: Testing trees, Bootstraps, jackknifes, gene frequencies Genome 570 February, 2016 Normal distribution:

Estimating Evolutionary Trees. Phylogenetic Methods

Estimating Evolutionary Trees v if the data are consistent with infinite sites then all methods should yield the same tree v it gets more complicated when there is homoplasy, i.e., parallel or convergent

Molecular Phylogenetics (part 1 of 2) Computational Biology Course João André Carriço

Molecular Phylogenetics (part 1 of 2) Computational Biology Course João André Carriço jcarrico@fm.ul.pt Charles Darwin (1809-1882) Charles Darwin's tree of life in Notebook B, 1837-1838 Ernst Haeckel (1834-1919)

The following are generally referred to as the laws or rules of exponents. x a x b = x a+b (5.1) 1 x b a (5.2) (x a ) b = x ab (5.

Chapter 5 Exponents 5.1 Exponent Concepts An exponent means repeated multiplication. For instance, 10^6 means 10 × 10 × 10 × 10 × 10 × 10, or 1,000,000. You've probably noticed that there is a logical progression of operations.

Molecular Evolution and Phylogenetic Tree Reconstruction

Molecular Evolution and Phylogenetic Tree Reconstruction Orthology, Paralogy, Inparalogs, Outparalogs Phylogenetic Trees Nodes: species Edges: time of independent evolution Edge length

$\pi_b = \sum_a \pi_a P_{a,b}$, $P_{a,b} = Q_{a,b}\delta + o(\delta)$, $P_{a,a} = 1 + Q_{a,a}\delta + o(\delta)$, $P = I_4 + Q\delta + o(\delta)$,

ABC estimation of the scaled effective population size. Geoff Nicholls, DTC 07/05/08 Refer to http://www.stats.ox.ac.uk/~nicholls/dtc/tt08/ for material. We will begin with a practical on ABC estimation

CS 124 Math Review Section January 29, 2018

CS 124 Math Review Section CS 124 is more math intensive than most of the introductory courses in the department. You're going to need to be able to do two things: 1. Perform some clever calculations to

BIOINFORMATICS TRIAL EXAMINATION MASTERS KT-OR

BIOINFORMATICS KT Maastricht University Faculty of Humanities and Science Knowledge Engineering Study TRIAL EXAMINATION MASTERS KT-OR Examiner: R.L. Westra Date: March 30, 2007 Time: 13:30-15:30 Place:

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2016 University of California, Berkeley. Parsimony & Likelihood [draft]

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2016 University of California, Berkeley K.W. Will Parsimony & Likelihood [draft] 1. Hennig and Parsimony: Hennig was not concerned with parsimony

MULTIPLE SEQUENCE ALIGNMENT FOR CONSTRUCTION OF PHYLOGENETIC TREE

MULTIPLE SEQUENCE ALIGNMENT FOR CONSTRUCTION OF PHYLOGENETIC TREE Manmeet Kaur 1, Navneet Kaur Bawa 2 1 M-tech research scholar (CSE Dept) ACET, Manawala,Asr 2 Associate Professor (CSE Dept) ACET, Manawala,Asr

Introduction to MEGA

Introduction to MEGA Download at: http://www.megasoftware.net/index.html Thomas Randall, PhD tarandal@email.unc.edu Manual at: www.megasoftware.net/mega4 Use of phylogenetic analysis software tools Bioinformatics