Lab 9: Maximum Likelihood and Modeltest

Similar documents
Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018

Phylogenetics: Building Phylogenetic Trees

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

How should we go about modeling this? Model parameters? Time Substitution rate Can we observe time or subst. rate? What can we observe?

Preliminaries. Download PAUP* from: Tuesday, July 19, 16

Phylogenetics. Andreas Bernauer, March 28, Expected number of substitutions using matrix algebra 2

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29):

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Mutation models I: basic nucleotide sequence mutation models

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2016 University of California, Berkeley. Parsimony & Likelihood [draft]

Molecular Evolution, course # Final Exam, May 3, 2006

Inferring Molecular Phylogeny

LAB 2 - ONE DIMENSIONAL MOTION

Lab 2 Worksheet. Problems. Problem 1: Geometry and Linear Equations

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

KaKs Calculator: Calculating Ka and Ks Through Model Selection and Model Averaging

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

Week 5: Distance methods, DNA and protein models

What Is Conservation?

T R K V CCU CG A AAA GUC T R K V CCU CGG AAA GUC. T Q K V CCU C AG AAA GUC (Amino-acid

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Experiment 1: The Same or Not The Same?

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Evolutionary Models. Evolutionary Models

Pinvar approach. Remarks: invariable sites (evolve at relative rate 0) variable sites (evolves at relative rate r)

Lie Markov models. Jeremy Sumner. School of Physical Sciences University of Tasmania, Australia

Phylogenetic Assumptions

Lecture 4. Models of DNA and protein change. Likelihood methods

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

X X (2) X Pr(X = x θ) (3)

Lecture Notes: Markov chains

Phylogenetic Inference using RevBayes

Phylogenetic Inference using RevBayes

Constructing Evolutionary/Phylogenetic Trees

Handout 1 EXAMPLES OF SOLVING SYSTEMS OF LINEAR EQUATIONS

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Substitution = Mutation followed. by Fixation. Common Ancestor ACGATC 1:A G 2:C A GAGATC 3:G A 6:C T 5:T C 4:A C GAAATT 1:G A

Physics 212E Spring 2004 Classical and Modern Physics. Computer Exercise #2

Constructing Evolutionary/Phylogenetic Trees

Letter to the Editor. Department of Biology, Arizona State University

Phylogenetics. BIOL 7711 Computational Bioscience

1 Newton s 2nd and 3rd Laws

Retrieve and Open the Data

MATERIAL MECHANICS, SE2126 COMPUTER LAB 3 VISCOELASTICITY. k a. N t

Lab I. 2D Motion. 1 Introduction. 2 Theory. 2.1 scalars and vectors LAB I. 2D MOTION 15

SEM Day 1 Lab Exercises SPIDA 2007 Dave Flora

Probabilistic modeling and molecular phylogeny

Consensus Methods. * You are only responsible for the first two

Lab I. 2D Motion. 1 Introduction. 2 Theory. 2.1 scalars and vectors LAB I. 2D MOTION 15

Determinants of 2 2 Matrices

v Prerequisite Tutorials GSSHA WMS Basics Watershed Delineation using DEMs and 2D Grid Generation Time minutes

Phylogenetics - Orthology, phylogenetic experimental design and phylogeny reconstruction. Lesser Tenrec (Echinops telfairi)

Phylogenetic inference

Row Reduction

The Idiot s Guide to the Zen of Likelihood in a Nutshell in Seven Days for Dummies, Unleashed

EXPERIMENT: REACTION TIME

Designing Information Devices and Systems I Spring 2017 Babak Ayazifar, Vladimir Stojanovic Homework 2

DOAS measurements of Atmospheric Species

Homework 9: Protein Folding & Simulated Annealing : Programming for Scientists Due: Thursday, April 14, 2016 at 11:59 PM

Inferring Speciation Times under an Episodic Molecular Clock

The Geodatabase Working with Spatial Analyst. Calculating Elevation and Slope Values for Forested Roads, Streams, and Stands.

( )( b + c) = ab + ac, but it can also be ( )( a) = ba + ca. Let s use the distributive property on a couple of

Relative Photometry with data from the Peter van de Kamp Observatory D. Cohen and E. Jensen (v.1.0 October 19, 2014)

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

EXPERIMENT 7: ANGULAR KINEMATICS AND TORQUE (V_3)

Week 8: Testing trees, Bootstraps, jackknifes, gene frequencies

Phylogenetic analyses. Kirsi Kostamo

Introduction to Computer Tools and Uncertainties

[Disclaimer: This is not a complete list of everything you need to know, just some of the topics that gave people difficulty.]

Using models of nucleotide evolution to build phylogenetic trees

Phylogeny Estimation and Hypothesis Testing using Maximum Likelihood

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Hypothesis testing I. - In particular, we are talking about statistical hypotheses. [get everyone s finger length!] n =

Lab 1: Dynamic Simulation Using Simulink and Matlab

PHYLOGENY ESTIMATION AND HYPOTHESIS TESTING USING MAXIMUM LIKELIHOOD

Phylogenetic Tree Reconstruction

Homoplasy. Selection of models of molecular evolution. Evolutionary correction. Saturation

Math 31 Lesson Plan. Day 5: Intro to Groups. Elizabeth Gillaspy. September 28, 2011

CE 365K Exercise 1: GIS Basemap for Design Project Spring 2014 Hydraulic Engineering Design

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200 Spring 2018 University of California, Berkeley

Overlay Analysis II: Using Zonal and Extract Tools to Transfer Raster Values in ArcMap

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

SpeciesNetwork Tutorial

Calculating NMR Chemical Shifts for beta-ionone O

One sided tests. An example of a two sided alternative is what we ve been using for our two sample tests:

AMS 132: Discussion Section 2

Students will explore Stellarium, an open-source planetarium and astronomical visualization software.

PHY 111L Activity 2 Introduction to Kinematics

Downloading GPS Waypoints

Introduction to MEGA

Infer relationships among three species: Outgroup:

Using Microsoft Excel

Lab 4. Current, Voltage, and the Circuit Construction Kit

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

EXPERIMENT 2 Reaction Time Objectives Theory

GMS 8.0 Tutorial MT3DMS Advanced Transport MT3DMS dispersion, sorption, and dual domain options

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Transcription:

Integrative Biology 200A University of California, Berkeley "PRINCIPLES OF PHYLOGENETICS" Spring 2010 Updated by Nick Matzke Lab 9: Maximum Likelihood and Modeltest In this lab we re going to use PAUP* to find a phylogeny using molecular data and Maximum Likelihood as the optimality criterion. The computer evaluates the likelihood of each tree, including topology and branch lengths, one at a time. It calculates the probability of each base pair changing in such a way as to generate the states observed at the tips of the branches based on the tree and a set of parameters describing how the bases change with time. The likelihood of a data set for a given tree is the product of these probabilities for all the base pairs. The computer chooses the topology and branch lengths that produce the highest likelihood for the data set. So what parameters of nucleotide change do we use and what values do we give them? This is called the model of nucleotide change and today we will pick a model using ModelTest. There are an infinite number of possible models. Many have been implemented in various programs, many have been suggested and never implemented, and even more have never been conceived. Today we are only going to deal with a few models that are implemented in PAUP* and evaluated by ModelTest. A model is considered nested within another model if its parameters are a limited set of the parameters in the other model. For example the Jukes-Cantor model, which assumes that every nucleotides has the same rate of change to any other, is nested within the Kimura two parameter model, which assumes different transition and transversion rates. A model without any invariant sites would be nested within one with some percentage of invariant sites. Any two models are not necessarily nested. Adding parameters to a model always increases the maximum likelihood of the data. However, if a model has too many parameters, then maximum likelihood becomes unreliable. Therefore to accept a new parameter into your model it must produce a significant increase in the likelihood. How do you tell if a difference in likelihood is significant? Well, I m sure you ll be shocked to learn that there is a formula. It is called the Likelihood Ratio Test (LRT). For a given model with likelihood, Λ 1, nested within another model with likelihood, Λ 2, with n less parameters: Χ 2 (chi squared) = 2 * (ln (Λ 2 ) ln (Λ 1 )) with n degrees of freedom. You can use this equation to pick the most inclusive model that can not be significantly improved on. The only drawback of this equation is that you can not use it to compare different trees, because different trees are not different models they are more like alternative parameter values. Therefore, you have to compare the different models on a single tree, and which tree to compare them on may not be obvious. Luckily, you tend to get similar results as long as you use a reasonable tree.

Models of Nucleotide Change The Transition Matrix The transition matrix (not as in transition/transversion) is a matrix showing the instantaneous stochastic rate of change between any two nucleotides. It can be used to calculate the chance of one nucleotide changing into another on a branch with a given length. The most unrestrained matrix would look like this: A C G T A α β γ α β γ C δ δ ε ζ ε ζ G η θ η θ ι ι T κ λ µ κ λ µ As you can see, the diagonals are all negative as each nucleotide will be changing away from itself at any instant, so that each row adds up to 0. Furthermore, the average rate of change of all the off diagonals is normalized to 1, so that you can eliminate another parameter for a total of 11 parameters. On the other hand the Kimura two parameter model would look like this: A C G T A α 2β β α β C β α 2β β α G α β α 2β β T β α β α 2β Here there are two parameters, transition and transversion rate, which can be reduced to just one by normalizing the matrix. Most programs, PAUP* included, can only calculate matrices with reversible models. This means that change has an equal probability of happening in either direction on a branch. Thus trees can be evaluated as unrooted networks, making the computationally-intensive likelihood calculations much easier. If you used an irreversible model then you could assign a root without the use of an outgroup, although I don t know how reliable an estimate that would be. For a model to be reversible it must be true that: π X R X>Y =π Y R Y>X where R X>Y is the instantaneous rate of change from nucleotide X to nucleotide Y, and π X is the equilibrium frequency of nucleotide X. The equilibrium frequency is the frequency of that nucleotide if the substitution process is allowed to run forever, and can be considered another parameter. Thus any model in which R X>Y =π Y r XY, will be reversible. So the General Time Reversible (GTR) matrix looks like:

A C G T A π C r AC π G r AG π T r AT C π A r AC π G r CG π T r CT G π A r AG π C r CG π T r GT T π A r AT π C r CT π G r GT with the diagonal filled in appropriately. The sum of the equilibrium frequencies for all four bases must equal one, so that there are three equilibrium frequency parameters. Furthermore, one of the rate parameters can be eliminated by normalizing the matrix, leaving eight parameters total. Some special cases of the GTR that are commonly used (and that might be familiar from last week s lab on distance methods) are: JC : Jukes and Cantor (1969) - All nucleotide substitutions are equal and all base frequencies are equal. This is the most restricted (=specific) model of substitution because it assumes all changes are equal. F81 : Felsenstein (1981) - All nucleotide substitutions are equal, base frequencies allowed to vary. K2P : Kimura two-parameter model, Kimura (1980) - Two nucleotide substitutions types are allowed, those between transitions and transversions. Base frequencies are assumed equal. HKY85: Hasegawa-Kishino-Yano (1985) - Two nucleotide substitutions types are allowed, those between transitions and transversions. Base frequencies are allowed to vary. Proportion of Invariable Sites (I) This is a model that assumes some proportion of the sites, p i, can not change. Thus it makes two calculations for each base pair. First it calculates the chance, λ i, that that base pair would have the observed distribution that it does if it could not change. This will be 1, if it is the same in all taxa, or 0, if there are any differences among the taxa. It then calculates the probability, λ v, that it would have the observed distribution if it could change, using the transition matrix and the tree. Then it calculates the overall likelihood for that base as: λ = p i λ i + (1-p i ) λ v Among-site rate variation (Γ, also abbreviated G by ModelTest) Under the null hypothesis, all sites are assumed to have equal rates of substitution. One way of relaxing this assumption is to allow the rates at different sites to be drawn from a gamma distribution (with the mean value across all sites within a class, such as A-

T, represented in the substitution matrix). The gamma distribution is used because the shape of the curve (α = shape parameter) changes dramatically depending on the parameter values of the distribution. This calculation is done essentially the same way as it is for invariable sites. The likelihood is calculated for each value of the gamma distribution for each base pair and added together. In practice this is only done for a few values of the gamma distribution, as there are an infinite number of possible values for the gamma distribution and each likelihood calculation is computationally burdensome. This serves as a good approximation of a true gamma distribution. Choosing a Model Using ModelTest ModelTest is an extension for PAUP* by Posada and Crandall, which is freely available at http://darwin.uvigo.es/software/modeltest.html. It uses PAUP* to calculate the likelihoods of several different models. The Modeltest program chooses among the models using two different criteria. The first is the LRT that we discussed above. The other is the Akaike Information Criterion (AIC), which makes slightly different calculations to compare the models, but the principle of comparison is basically the same. Each criterion produces a different model choice, although they often agree. 1. Download the Nexus file of Cephalopod COI genes that we ve been using from http://ib.berkeley.edu/courses/ib200a/cephalopod_coi_clustalw.nex or the sylalbus page, whichever is easier. Save it to a folder that you make on your desktop. Copy the folder Applications>IB200 >Modeltest3.7 folder into this folder. 2. Open PAUP*. 3. Execute the sequence file in PAUP*. 4. Execute the Modeltest PAUP block: File>Open, then navigate to Modeltest3.7 folder>paupblock >modelblockpaupb10 and choose execute. (You may need to switch the display from showing only NEXUS files to showing all files to make this file show up.) PAUP* will now run, while it evaluates the different models. This may take a few minutes. When it is done it will stop running and say that it is completed. 5. Now, go to the desktop, and open yourfoldername>modeltest3.7 folder>paupblock. You will find a file model.scores in the paupblock folder. Rename this file, but make sure it still ends in.scores then copy it, and paste it into yourfoldername>modeltest3.7 folder>bin 6. In order to run Modeltest, we will need to use the command prompt. Go to the Start menu and choose All Programs>Accessories>Command Prompt 7. Once the command prompt is open, you need to move into the bin directory where you just pasted your.scores file. There are a couple of ways to do this, but one of the easiest is to type cd then open the Modeltest3.7 folder and drag-and-drop the bin folder onto the command prompt window. This is almost perfect, unfortunately you need to delete the marks that Windows inserts before you press return.

8. Ok, now we are ready to go. Type the following at the command line: modeltest3.7 < filename.scores > anotherfilename.modeltest You should replace filename with whatever name you used above and anotherfilename with any other name you d like. Press return. Once the data is processed and a new prompt (C: > ) appears, you can close the command prompt. 9. A new output file with the name you gave it will have appeared in your bin folder. 10. Open this file in a text editor. (Select program from a list>notepad) Now, we will get to see which models Modeltest suggests based on several different modeltesting criteria. Both are ways to pick the most inclusive model that can not be significantly improved on. Scroll down until you see the banner indicating the results from the Hierarchical Likelihood Ratio Tests: * HIERARCHICAL LIKELIHOD RATIO TESTS (hlrts) * Then keep going until you see: Model selected: followed by a type of model. Make a note of which model Modeltest chose under this criterion. If you get the same answer as me, the model will be Model selected: GTR+I+G Which stands for a General Time Reversible model with a certain proportion of Invariant sites and a Gamma distribution of changes. 11. Keep scrolling down until you see the results under the Akaike Information Criterion: * AKAIKE INFORMATION CRITERION (AIC) * Right under it you should see Model selected: followed by a type of model. What model did Modeltest choose under each criterion? Are they the same? Go ahead a peruse some of the other statistics in the file, and see what other models Modeltest looked at and rejected. Finding a Maximum Likelihood Tree in PAUP* Fixed Parameter Values First let s use the parameter values chosen by Modeltest. 1. In the Modeltest output file you will find a PAUP block that can be inserted directly into the Nexus file. It starts BEGIN PAUP; and ends with

END; This block changes the Likelihood Settings (Lset), by setting the base frequencies at equilibrium (Base), the number of substitution types (Nst), the rate matrix of instantaneous substitution rates (Rmat), the among site rate variation (Rates), the shape of the gamma distribution (Shape), and the proportion of invariant sites (Pinvar). 2. Copy the PAUP block from the text file. Edit your Nexus file in PAUP* (or a text editor) and paste the PAUP block from Modeltest directly into it. It can go after any END; statement. 3. Execute the newly-edited sequence file in PAUP* again. 4. Set the search criterion to maximum likelihood: set criterion = likelihood; 5. Then do a heuristic search hs; Fit the Parameter Values Along with Finding the Tree It is also possible for PAUP* to search for the parameter values at the same time as it searches for the best tree using the model - but not the parameter values- chosen by Modeltest. The following Lset command will require PAUP to estimate the base frequencies, rate matrix, shape of the gamma distribution, and proportion of invariant sites: Lset Base=estimate Nst=6 Rmat=estimate Rates=gamma Shape=estimate Pinvar=estimate; If you ran this, are you still waiting? Yeah this takes way too long. Just stop it. Why do you think it takes so much longer? If you did let it finish would the best tree have a higher or lower likelihood than with the fixed parameter values? What are the advantages of each method? If you try to estimate all of these at once, it will take an extremely long time (it is even possible, with this much uncertainty, that the search will never converge and go on forever.) If you want to try this, it is probably better to try just one or two parameters at a time. It will still probably take a while. Likelihood searches are pretty slow.