Special features of phangorn (Version 2.3.1)

Similar documents
samplesizelogisticcasecontrol Package

CGEN(Case-control.GENetics) Package

Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018

Estimating phylogenetic trees with phangorn

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

林仲彥. Dec 4,

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Unstable Laser Emission Vignette for the Data Set laser of the R package hyperspec

Introduction to pepxmltab

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

7. Tests for selection

How should we go about modeling this? Model parameters? Time Substitution rate Can we observe time or subst. rate? What can we observe?

Lecture 4. Models of DNA and protein change. Likelihood methods

Quantitative genetic (animal) model example in R

Lecture Notes: Markov chains

Week 5: Distance methods, DNA and protein models

Visualising very long data vectors with the Hilbert curve Description of the Bioconductor packages HilbertVis and HilbertVisGUI.

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

KaKs Calculator: Calculating Ka and Ks Through Model Selection and Model Averaging

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

Edward Susko Department of Mathematics and Statistics, Dalhousie University. Introduction. Installation

Phylogenetic Tree Reconstruction

Constructing Evolutionary/Phylogenetic Trees

Molecular Evolution & Phylogenetics

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Phylogenetics. BIOL 7711 Computational Bioscience

Dr. Amira A. AL-Hosary

Constructing Evolutionary/Phylogenetic Trees

The NanoStringDiff package

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Disease Ontology Semantic and Enrichment analysis

Lecture 4. Models of DNA and protein change. Likelihood methods

Phylogenetics: Building Phylogenetic Trees

What Is Conservation?

Estimating maximum likelihood phylogenies with PhyML

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

How Molecules Evolve. Advantages of Molecular Data for Tree Building. Advantages of Molecular Data for Tree Building

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Natural selection on the molecular level

Letter to the Editor. Department of Biology, Arizona State University

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29):

coenocliner: a coenocline simulation package for R

EVOLUTIONARY DISTANCES

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Phylogeny Estimation and Hypothesis Testing using Maximum Likelihood

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Molecular Evolution & Phylogenetics Traits, phylogenies, evolutionary models and divergence time between sequences

Chapter 7: Models of discrete character evolution

Lab 9: Maximum Likelihood and Modeltest

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

C3020 Molecular Evolution. Exercises #3: Phylogenetics

Understanding relationship between homologous sequences

Computational Biology

Probabilistic modeling and molecular phylogeny

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Phylogenetic Inference using RevBayes

Evolutionary Tree Analysis. Overview

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Introduction to phylogenetics using

Lecture Notes: BIOL2007 Molecular Evolution

Improving divergence time estimation in phylogenetics: more taxa vs. longer sequences

Molecular Evolution, course # Final Exam, May 3, 2006

Concepts and Methods in Molecular Divergence Time Estimation

Evolutionary Models. Evolutionary Models

Phylogenetic inference

Maximum Likelihood in Phylogenetics

Phylogeny of Mixture Models

MEGA5: Molecular Evolutionary Genetics Analysis using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Michael Yaffe Lecture #5 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D

Phylogenetic analyses. Kirsi Kostamo

EECS730: Introduction to Bioinformatics

Maximum Likelihood Estimation on Large Phylogenies and Analysis of Adaptive Evolution in Human Influenza Virus A

PHYLOGENY ESTIMATION AND HYPOTHESIS TESTING USING MAXIMUM LIKELIHOOD

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Maximum Likelihood in Phylogenetics

The Generalized Neighbor Joining method

pensim Package Example (Version 1.2.9)

Nucleotide substitution models


Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution

In: P. Lemey, M. Salemi and A.-M. Vandamme (eds.). To appear in: The. Chapter 4. Nucleotide Substitution Models

Likelihood Ratio Tests for Detecting Positive Selection and Application to Primate Lysozyme Evolution

Phylogenetic Inference using RevBayes

C.DARWIN ( )

BIOINFORMATICS TRIAL EXAMINATION MASTERS KT-OR

Evolutionary Genomics: Phylogenetics

HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

Mutation models I: basic nucleotide sequence mutation models

Substitution = Mutation followed. by Fixation. Common Ancestor ACGATC 1:A G 2:C A GAGATC 3:G A 6:C T 5:T C 4:A C GAAATT 1:G A

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Efficiencies of maximum likelihood methods of phylogenetic inferences when different substitution models are used

Transcription:

Special features of phangorn (Version 2.3.1) Klaus P. Schliep November 1, 2017 Introduction This document illustrates some of the phangorn [4] specialised features which are useful but maybe not as well-known or just not (yet) described elsewhere. This is mainly interesting for someone who wants to explore different models or set up some simulation studies. We show how to construct data objects for different character states other than nucleotides or amino acids or how to set up different models to estimate transition rate. The vignette Trees describes in detail how to estimate phylogenies from nucleotide or amino acids. 1 User defined data formats To better understand how to define our own data type it is useful to know a bit more about the internal representation of phydat objects. The internal representation of phydat object is very similar to factor objects. As an example we will show here several possibilities to define nucleotide data with gaps defined as a fifth state. Ignoring gaps or coding them as ambiguous sites - as it is done in most programs, also in phangorn as default - may be misleading (see Warnow(2012)[6]). When the number of gaps is low and the gaps are missing at random coding gaps as separate state may be not important. Let assume we have given a matrix where each row contains a character vector of a taxonomical unit: library(phangorn) data = matrix(c("r","a","y","g","g","a","c","-","c","t","c","g", "a","a","t","g","g","a","t","-","c","t","c","a", "a","a","t","-","g","a","c","c","c","t","?","g"), dimnames = list(c("", "", ""),NULL), nrow=3, byrow=true) data mailto:klaus.schliep@gmail.com 1

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] "r" "a" "y" "g" "g" "a" "c" "-" "c" "t" "c" "g" "a" "a" "t" "g" "g" "a" "t" "-" "c" "t" "c" "a" "a" "a" "t" "-" "g" "a" "c" "c" "c" "t" "?" "g" Normally we would transform this matrix into an phydat object and gaps are handled as ambiguous character like?. gapsdata1 = phydat(data) gapsdata1 3 sequences with 12 character and 11 different site patterns. The states are a c g t Now we will define a USER defined object and have to supply a vector levels of the character states for the new data, in our case the for nucleotide states and the gap. Additional we can define ambiguous states which can be any of the states. gapsdata2 = phydat(data, type="user", levels=c("a","c","g","t","-"), ambiguity = c("?", "n")) gapsdata2 3 sequences with 10 character and 9 different site patterns. The states are a c g t - This is not yet what we wanted as two sites of our alignment, which contain the ambiguous characters r and y, got deleted. To define ambiguous characters like r and y explicitly we have to supply a contrast matrix similar to contrasts for factors. contrast = matrix(data = c(1,0,0,0,0, 0,1,0,0,0, 0,0,1,0,0, 0,0,0,1,0, 1,0,1,0,0, 0,1,0,1,0, 0,0,0,0,1, 1,1,1,1,0, 1,1,1,1,1), ncol = 5, byrow = TRUE) dimnames(contrast) = list(c("a","c","g","t","r","y","-","n","?"), c("a", "c", "g", "t", "-")) contrast a c g t - a 1 0 0 0 0 c 0 1 0 0 0 2

g 0 0 1 0 0 t 0 0 0 1 0 r 1 0 1 0 0 y 0 1 0 1 0-0 0 0 0 1 n 1 1 1 1 0? 1 1 1 1 1 gapsdata3 = phydat(data, type="user", contrast=contrast) gapsdata3 3 sequences with 12 character and 11 different site patterns. The states are a c g t - Here we defined n as a state which can be any nucleotide but not a gap - and? can be any state including a gap. These data can be used in all functions available in phangorn to compute distance matrices or perform parsimony and maximum likelihood analysis. 2 Markov models of character evolution To model nucleotide substitutions across the edges of a tree T we can assign a transition matrix. In the case of nucleotides, with four character states, each 4 4 transition matrix has, at most, 12 free parameters. Time-reversible Markov models are used to describe how characters change over time, and use fewer parameters. Time-reversible means that these models need not be directed in time, and the Markov property states that these models depend only on the current state. These models are used in analysis of phylogenies using maximum likelihood and MCMC, computing pairwise distances, as well in simulating sequence evolution. We will now describe the General Time-Reversible (GTR) model [5]. The parameters of the GTR model are the equilibrium frequencies π = (π A, π C, π G, π T ) and a rate matrix Q which has the form απ C βπ G γπ T Q = απ A δπ G ɛπ T βπ A δπ C ηπ T (1) γπ A ɛπ C ηπ G where we need to assign 6 paramters α,..., η. The elements on the diagonal are chosen so that the rows sum to zero. The Jukes-Cantor (JC) [1] model can be derived as special case from the GTR model, for equal equilibrium frequencies π A = π C = π G = π T = 0.25 and equal rates set to α = β = γ = δ = η. Table 2 lists all built-in nucleotide models in phangorn. The transition probabilities which describe the probabilities of change from character i to j in time t, are given by the corresponding entries of the matrix 3

exponential P (t) = (p ij (t)) = e Qt, p ij = 1 where P (t) is the transition matrix spanning a period of time t. 3 Estimation of non-standard transition rate matrices In the last section 1 we described how to set up user defined data formats. Now we describe how to estimate transition matrices with pml. Again for nucleotide data the most common models can be called directly in the optim.pml function (e.g. JC69, HKY, GTR to name a few). Table 2 lists all the available nucleotide models, which can estimated directly in optim.pml. For amino acids several transition matrices are available ( WAG, JTT, LG, Dayhoff, cprev, mtmam, mtart, MtZoa, mtrev24, VT, RtREV, HIVw, HIVb, FLU, Blossum62, Dayhoff DCMut and JTT-DCMut ) or can be estimated with optim.pml. For example Mathews et al. (2010) [2] used this function to estimate a phytochrome amino acid transition matrix. We will now show how to estimate a rate matrix with different transition (α) and transversion ratio (β) and a fixed rate to the gap state (γ) - a kind of Kimura twoparameter model (K81) for nucleotide data with gaps as fifth state (see table 1). a c g t - a c β g α β t β α β - γ γ γ γ Table 1: Rate matrix K to optimise. If we want to fit a non-standard transition rate matrices, we have to tell optim.pml, which transitions in Q get the same rate. The parameter vector subs accepts a vector of consecutive integers and at least one element has to be zero (these gets the reference rate of 1). Negative values indicate that there is no direct transition is possible and the rate is set to zero. tree = unroot(rtree(3)) fit = pml(tree, gapsdata3) fit = optim.pml(fit, optq=true, subs=c(1,0,1,2,1,0,2,1,2,2), control=pml.control(trace=0)) fit 4 j

loglikelihood: -33.00773 unconstrained loglikelihood: -28.43259 Rate matrix: a c g t - a 0.000000e+00 2.584351e-06 1.000000e+00 2.584351e-06 0.6911908 c 2.584351e-06 0.000000e+00 2.584351e-06 1.000000e+00 0.6911908 g 1.000000e+00 2.584351e-06 0.000000e+00 2.584351e-06 0.6911908 t 2.584351e-06 1.000000e+00 2.584351e-06 0.000000e+00 0.6911908-6.911908e-01 6.911908e-01 6.911908e-01 6.911908e-01 0.0000000 Base frequencies: 0.2 0.2 0.2 0.2 0.2 Here are some conventions how the models are estimated: If a model is supplied the base frequencies bf and rate matrix Q are optimised according to the model (nucleotides) or the adequate rate matrix and frequencies are chosen (for amino acids). If optq=true and neither a model or subs are supplied than a symmetric (optbf=false) or reversible model (optbf=true, i.e. the GTR for nucleotides) is estimated. This can be slow if the there are many character states, e.g. for amino acids. 4 Codon substitution models A special case of the transition rates are codon models. phangorn now offers the possibility to estimate the d N /d S ratio (sometimes called ka/ks), for an overview see [7]. These functions extend the option to estimates the d N /d S ratio for pairwise sequence comparison as it is available through the function kaks in seqinr. The transition rate between between codon i and j is defined as follows: 0 if i and j differ in more than one position π j for synonymous transversion q ij = π j κ for synonymous transition π j ω for non-synonymous transversion π j ωκ for non synonymous transition where ω is the d N /d S ratio, κ the transition transversion ratio and π j is the the equilibrium frequencies of codon j. For ω 1 the an amino acid change is neutral, for ω < 1 purifying selection and ω > 1 positive selection. There are four models available: codon0, where both parameter κ and ω are fixed to 1, codon1 where both parameters are estimated and codon2 or codon3 where κ or ω is fixed to 1. 5

model optq optbf subs df JC FALSE FALSE c(0, 0, 0, 0, 0, 0) 0 F81 FALSE TRUE c(0, 0, 0, 0, 0, 0) 3 K80 TRUE FALSE c(0, 1, 0, 0, 1, 0) 1 HKY TRUE TRUE c(0, 1, 0, 0, 1, 0) 4 TrNe TRUE FALSE c(0, 1, 0, 0, 2, 0) 2 TrN TRUE TRUE c(0, 1, 0, 0, 2, 0) 5 TPM1 TRUE FALSE c(0, 1, 2, 2, 1, 0) 2 K81 TRUE FALSE c(0, 1, 2, 2, 1, 0) 2 TPM1u TRUE TRUE c(0, 1, 2, 2, 1, 0) 5 TPM2 TRUE FALSE c(1, 2, 1, 0, 2, 0) 2 TPM2u TRUE TRUE c(1, 2, 1, 0, 2, 0) 5 TPM3 TRUE FALSE c(1, 2, 0, 1, 2, 0) 2 TPM3u TRUE TRUE c(1, 2, 0, 1, 2, 0) 5 TIM1e TRUE FALSE c(0, 1, 2, 2, 3, 0) 3 TIM1 TRUE TRUE c(0, 1, 2, 2, 3, 0) 6 TIM2e TRUE FALSE c(1, 2, 1, 0, 3, 0) 3 TIM2 TRUE TRUE c(1, 2, 1, 0, 3, 0) 6 TIM3e TRUE FALSE c(1, 2, 0, 1, 3, 0) 3 TIM3 TRUE TRUE c(1, 2, 0, 1, 3, 0) 6 TVMe TRUE FALSE c(1, 2, 3, 4, 2, 0) 4 TVM TRUE TRUE c(1, 2, 3, 4, 2, 0) 7 SYM TRUE FALSE c(1, 2, 3, 4, 5, 0) 5 GTR TRUE TRUE c(1, 2, 3, 4, 5, 0) 8 Table 2: DNA models available in phangorn, how they are defined and number of parameters to estimate. The elements of the vector subs correspond to α,..., η in equation (1) 6

We compute the d N /d S for some sequences given a tree using the ML functions pml and optim.pml. First we have to transform the the nucleotide sequences into codons (so far the algorithms always takes triplets). library(phangorn) fdir <- system.file("extdata/trees", package = "phangorn") primates <- read.phydat(file.path(fdir, "primates.dna"), format = "phylip", type = tree <- NJ(dist.ml(primates)) dat <- phydat(as.character(primates), "CODON") fit <- pml(tree, dat) fit0 <- optim.pml(fit, control = pml.control(trace = 0)) fi <- optim.pml(fit, model="codon1", control=pml.control(trace=0)) fi <- optim.pml(fit, model="codon2", control=pml.control(trace=0)) fi <- optim.pml(fit, model="codon3", control=pml.control(trace=0)) anova(fit0, fi, fi, fi) Likelihood Ratio Test Table Log lik. Df Df change Diff log lik. Pr(> Chi ) 1-2291.4 27 2-2385.7 26-1 -188.661 3-2292.6 26 0 186.268 <2e-16 4-2291.4 27 1 2.394 0.1218 The models described here all assume equal frequencies for each codon (=1/61). One can optimise the codon frequencies setting the option to optbf=true. As the convergence of the 61 parameters the convergence is likely slow set the maximal iterations to a higher value than the default (e.g. control = pml.control(maxit=50)). The YN98 model [?] is similar to the codon1, but addional optimises the codon frequencies. 5 Generating trees phangorn has several functions to generate tree topologies, which may are interesting for simulation studies. alltrees computes all possible bifurcating tree topologies either rooted or unrooted for up to 10 taxa. One has to keep in mind that the number of trees is growing exponentially, use (howmanytrees) from ape as a reminder. trees = alltrees(5) par(mfrow=c(3,5), mar=rep(0,4)) for(i in 1:15)plot(trees[[i]], cex=1, type="u") nni returns a list of all trees which are one nearest neighbor interchange away. trees = nni(trees[[1]]) rnni and rspr generate trees which are a defined number of NNI (nearest neighbor interchange) or SPR (subtree pruning and regrafting) away. 7

Figure 1: all (15) unrooted trees with 5 taxa 8

References [1] Thomas H. Jukes and Charles R. Cantor. {CHAPTER} 24 - evolution of protein molecules. In H.N. Munro, editor, Mammalian Protein Metabolism, pages 21 132. Academic Press, 1969. [2] S. Mathews, M.D. Clements, and M.A. Beilstein. A duplicate gene rooting of seed plants and the phylogenetic position of flowering plants. Phil. Trans. R. Soc. B, 365:383 395, 2010. [3] Emmanuel Paradis. Analysis of Phylogenetics and Evolution with R. Springer, New York, second edition, 2012. [4] Klaus Peter Schliep. phangorn: Phylogenetic analysis in R. Bioinformatics, 27(4):592 593, 2011. [5] Simon Tavaré. Some probabilistic and statistical problems in the analysis of dna sequences. Lectures on Mathematics in the Life Sciences, (17):57 86, 1986. [6] Tandy Warnow. Standard maximum likelihood analyses of alignments with gaps can be statistically inconsistent. PLOS Currents Tree of Life, 2012. [7] Ziheng Yang. Molecular Evolution: A Statistical Approach. Oxford University Press, Oxford, 2014. 6 Session Information The version number of R and packages loaded for generating the vignette were: R version 3.4.2 (2017-09-28), x86_64-pc-linux-gnu Locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=C, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8, LC_IDENTIFICATION=C Running under: Ubuntu 17.04 Matrix products: default BLAS: /usr/lib/openblas-base/libblas.so.3 LAPACK: /usr/lib/libopenblasp-r0.2.19.so Base packages: base, datasets, grdevices, graphics, grid, methods, stats, utils 9

Other packages: ape 5.0, magrittr 1.5, phangorn 2.3.1, seqlogo 1.42.0, xtable 1.8-2 Loaded via a namespace (and not attached): Matrix 1.2-11, Rcpp 0.12.13, compiler 3.4.2, fastmatch 1.1-0, igraph 1.1.2, knitr 1.17, lattice 0.20-35, nlme 3.1-131, parallel 3.4.2, pkgconfig 2.0.1, quadprog 1.5-5, stats4 3.4.2, tools 3.4.2 10