Package WGDgc. October 23, 2015

Version 1.2
Date 2015-10-22
Title Whole genome duplication detection using gene counts
Depends R (>= 3.0.1), phylobase, phyext, ape
Description Detection of whole genome duplications and triplications on phylogenies using gene count data, with estimation of background rates of gene duplication and loss and estimation of gene retention rates following whole genome duplications/triplications
Encoding UTF-8
License GPL (>= 3) | file LICENSE
URL http://www.stat.wisc.edu/~ane/wgd/
NeedsCompilation no
Author Tram Ta [aut], Charles-Elie Rabier [aut], Cécile Ané [aut, cre]
Maintainer Cécile Ané <cecile.ane@wisc.edu>

R topics documented:

getEdgeOrder
getLikGeneCount
logLik_CsurosMiklos
MLEGeneCount
processInput
rGeneCount
sampleData1
sampleData2

getEdgeOrder    list the tree nodes in a post-order traversal

Description

Preprocessing to list the edges in a post-order traversal, for future use in likelihood calculation. The output includes information on which edges the birth-death process applies to, and which edges represent a whole genome duplication or triplication event.

Usage

getEdgeOrder(phyloMat, nLeaf, wgdTab)

Arguments

phyloMat    Matrix representation of the species tree and WGD events
nLeaf       Number of present-day species (i.e. number of leaves)
wgdTab      Table representation of WGD events with retention rates

Details

This function assumes that speciation nodes in phyloMat are given lower indices than singleton nodes when the tree is read in by phyext, that speciation nodes are in pre-order in phyloMat, and that 2 singleton nodes are used to represent each WGD.

Value

Data frame listing the edges in a post-order traversal, with the following components:

child     index of the edge's child node
edge      index of the edge, i.e. its row in phyloMat
type      "BD" if birth-death edge, "WGD" or "WGT" if the edge is modelling a WGD/T event, or "rootprior" if the edge is parent to the root node
scdsib    TRUE if the edge is listed after a sibling edge, FALSE otherwise

Author(s)

Cécile Ané

See Also

processInput.
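As an illustration of what a post-order listing of edges looks like, here is a minimal base-R sketch. It is not the package's implementation: the helper post_order_edges and the toy parent/child vectors are hypothetical, and the sibling flag is only an assumed reading of the scdsib component for a binary tree.

```r
# Hypothetical helper (not part of WGDgc): list the edges of a rooted tree
# in post-order, so that every edge appears after all edges below its child.
post_order_edges <- function(parent, child, root) {
  visit <- function(node) {
    below <- which(parent == node)        # rows (edges) whose parent is this node
    out <- integer(0)
    for (e in below) {
      out <- c(out, visit(child[e]), e)   # edges of the subtree first, then the edge itself
    }
    out
  }
  visit(root)
}

# Toy tree ((A,B),C): tips A = 1, B = 2, C = 3; node 5 is the ancestor of
# A and B; node 4 is the root. Each position is one edge (one row of a matrix).
parent <- c(4, 4, 5, 5)
child  <- c(5, 3, 1, 2)

ord <- post_order_edges(parent, child, root = 4)

# Assumed analogue of 'scdsib': TRUE when an edge sharing the same parent
# was already listed earlier in the traversal.
second_sibling <- duplicated(parent[ord])

print(data.frame(edge = ord, child = child[ord], scdsib = second_sibling))
```

With this ordering, quantities computed for a child node are always available before the edge above it is processed, which is the property a bottom-up likelihood calculation relies on.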

getLikGeneCount    Negative log-likelihood of gene count data

Description

Calculates the overall negative log-likelihood of gene count data on a phylogenetic tree under a birth-and-death process and whole genome duplication events.

Usage

getLikGeneCount(para, input, geneCountData, mMax=NULL, geomProb=NULL, dirac=NULL, useRootStateMLE=FALSE, conditioning=c("oneOrMore", "twoOrMore", "oneInBothClades", "none"), equalBDrates=FALSE, fixedRetentionRates=TRUE)

Arguments

para                   vector of parameters (see Details)
input                  object output by function processInput
geneCountData          data frame with one column for each species and one row for each family, containing the number of gene copies in each species for each gene family. The column names must match the species names in the tree.
mMax                   maximum number of surviving lineages at the root, at which the likelihood will be evaluated.
geomProb               inverse of the prior mean number of gene lineages at the root.
dirac                  value for the number of genes at the root, when this is assumed to have a fixed value (according to a dirac prior distribution).
useRootStateMLE        if TRUE, the most likely number of genes at the root is determined for each family separately and is used to evaluate the likelihood function.
conditioning           type of conditioning for the likelihood calculation. The default is to calculate conditional probabilities on observing families with at least 1 gene copy (see Details in MLEGeneCount).
equalBDrates           if TRUE, the duplication and loss rates are equal.
fixedRetentionRates    if TRUE, it uses retention rates present in input$wgdTab. If FALSE, it uses retention rates in para.

Details

The vector para for the parameters to be used is of size 1+number of WGD/Ts if the birth and death rates are assumed equal, or 2+number of WGD/Ts otherwise. It starts with log(startingBDrates[1]) if equalBDrates is TRUE, with log(startingBDrates) otherwise. The remaining components correspond to retention rates.

Value

negative log-likelihood value
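The layout of the parameter vector described in the Details can be sketched in base R as follows. The helper build_para is hypothetical (it is not a WGDgc function) and the rate and retention values are arbitrary; the sketch only mirrors the stated layout: log rates first, retention rates after.

```r
# Hypothetical helper (not part of WGDgc): assemble a parameter vector with
# log duplication/loss rate(s) first, followed by one retention rate per WGD/T.
build_para <- function(startingBDrates, startingQ, equalBDrates = FALSE) {
  rates <- if (equalBDrates) startingBDrates[1] else startingBDrates
  c(log(rates), startingQ)
}

# Two WGD/T events, separate duplication and loss rates: length 2 + 2.
para <- build_para(c(0.01, 0.02), startingQ = c(0.6, 0.2))

# Same events, equal rates: only the first rate is used, length 1 + 2.
para_eq <- build_para(c(0.01, 0.02), startingQ = c(0.6, 0.2), equalBDrates = TRUE)

# Rates are stored on the log scale, so exponentiate to read them back.
rates_back <- exp(para[1:2])
print(para)
print(rates_back)
```

Working on the log scale keeps the rates positive without imposing bounds on the optimizer, while the retention rates remain on their natural scale.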

References

Csuros M and Miklos I (2009). Streamlining and large ancestral genomes in archaea inferred with a phylogenetic birth-and-death model. Molecular Biology and Evolution. 26:2087-2095.

Charles-Elie Rabier, Tram Ta and Cécile Ané (2013). Detecting and locating whole genome duplications on a phylogeny: a probabilistic approach. Molecular Biology and Evolution. 31(3):750-762.

See Also

MLEGeneCount, logLik_CsurosMiklos.

Examples

tre.string = "(D:{0,18.03},(C:{0,12.06},(B:{0,7.06}, A:{0,7.06}):{0,2.49:wgd,0:0,2.50}):{0, 5.97});"
tre.phylo4d = read.simmap(text=tre.string)
dat = data.frame(A=c(2,2,3,1), B=c(3,0,2,1), C=c(1,0,2,2), D=c(2,1,1,1))
a = processInput(tre.phylo4d, startingQ=0.9)
getLikGeneCount(log(c(.01,.02)), a, dat, mMax=8, geomProb=1/1.5, conditioning="oneOrMore")


logLik_CsurosMiklos    Log-likelihood of count data on a phylogenetic tree

Description

Calculates the probability of gene count data on a phylogenetic tree under a birth-and-death process and whole genome duplication (or triplication) events, conditional on n surviving gene lineages at the root. Also computes the probability of a family going extinct.

Usage

logLik_CsurosMiklos(logLamlogMu, nLeaf, nFamily, phyloMat, geneCountData, mMax, wgdTab, edgeOrder)

Arguments

logLamlogMu      vector of size 1 or 2, for the log of the duplication and loss rates. When a single rate is provided, the duplication and loss rates are assumed to be equal.
nLeaf            number of present-day species.
nFamily          number of gene families.
phyloMat         a phylogenetic matrix with 4 columns: parent (ancestor node), child (descendant node), time (branch length), and species names. The number of rows is the number of nodes in the tree.
geneCountData    data frame with one column for each species and one row for each family, containing the number of gene copies in each species for each gene family. The column names must match the species names in the tree.
mMax             maximum number of surviving lineages at the root, at which the likelihood will be computed.

wgdTab           data frame with 5 columns: node before event, event type (WGD or WGT) and retention rates of 1, 2 and 3 gene copies. The number of rows is the number of WGD events.
edgeOrder        a data frame listing the tree edges in post-order traversal with information on which are birth-death and WGD/T edges.

Value

loglikRoot         matrix of size mMax+1 by nFamily giving the log likelihood of each gene family given that there are n surviving gene lineages at the root in row n+1. Column k corresponds to family k.
doomedRoot         probability that a single gene lineage present at the root goes extinct.
doomedRootLeft     probability that a single gene lineage at the root goes extinct in the clade on the left side of the root.
doomedRootRight    probability that a single gene lineage at the root goes extinct in the clade on the right side of the root.

Author(s)

Cécile Ané

References

Csuros M and Miklos I (2009). Streamlining and large ancestral genomes in archaea inferred with a phylogenetic birth-and-death model. Molecular Biology and Evolution. 26:2087-2095.

Charles-Elie Rabier, Tram Ta and Cécile Ané (2013). Detecting and locating whole genome duplications on a phylogeny: a probabilistic approach. Molecular Biology and Evolution. 31(3):750-762.

See Also

processInput, getEdgeOrder.

Examples

tre.string = "(D:{0,18.03},(C:{0,12.06},(B:{0,7.06}, A:{0,7.06}):{0,2.49:wgd,0:0,2.50}):{0, 5.97});"
tre.phylo4d = read.simmap(text=tre.string)
dat = data.frame(A=c(2,2,3,1), B=c(3,0,2,1), C=c(1,0,2,2), D=c(2,1,1,1))
a = processInput(tre.phylo4d, startingQ=0.9)
logLik_CsurosMiklos(log(c(.01,.02)), nLeaf=4, nFamily=4, a$phyloMat, dat, mMax=8, a$wgdTab, a$edgeOrder)
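The extinction probabilities in the Value section suggest how a conditioning probability might be formed. The sketch below is an assumption-laden illustration, not the package's code: the helper prob_observed is hypothetical, the values of dL and dR are made up, and it assumes that gene lineages evolve independently of each other and that the two clades below the root evolve independently given the root.

```r
# Hypothetical helper (not part of WGDgc): probability that a family with
# n gene lineages at the root passes a data filter, given the probability
# that a single root lineage goes extinct in the left clade (dL) and in the
# right clade (dR). Assumes independence between lineages and between clades.
prob_observed <- function(n, dL, dR) {
  c(
    # at least one copy somewhere: not all n lineages extinct in both clades
    oneOrMore = 1 - (dL * dR)^n,
    # at least one copy on each side of the root
    oneInBothClades = (1 - dL^n) * (1 - dR^n)
  )
}

p1 <- prob_observed(n = 1, dL = 0.3, dR = 0.5)
p2 <- prob_observed(n = 2, dL = 0.3, dR = 0.5)
print(rbind(n1 = p1, n2 = p2))
```

Under these assumptions, the extinction probability of a single root lineage over the whole tree is the product dL * dR, and dividing an unconditional family likelihood by the relevant entry gives a likelihood conditional on the family having passed the filter.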

MLEGeneCount    Maximum likelihood estimation of gene turnover rates with WGD

Description

Uses gene count data to estimate rates of gene duplication and gene loss along a phylogeny with zero, one or more whole genome duplication (WGD) or triplication (WGT) events. Also estimates the gene retention rate after each WGD/WGT event.

Usage

MLEGeneCount(tr, geneCountData, mMax=NULL, geomMean=NULL, dirac=NULL, useRootStateMLE=FALSE, conditioning=c("oneOrMore", "twoOrMore", "oneInBothClades", "none"), equalBDrates=FALSE, fixedRetentionRates=FALSE, startingBDrates=c(0.01, 0.02), startingQ=NULL)

Arguments

tr                     a species tree in SIMMAP format (see Details).
geneCountData          data frame with one column for each species and one row for each family, containing the number of gene copies in each species for each gene family. The column names must match the species names in the tree.
mMax                   maximum number of surviving lineages at the root, at which the likelihood will be computed.
geomMean               the mean of the prior geometric distribution for the number of genes at the root.
dirac                  value for the number of genes at the root, when this is assumed to have a fixed value (according to a dirac prior distribution).
useRootStateMLE        if TRUE, the most likely number of surviving genes at the root is determined for each family separately, and is used to calculate the overall likelihood of the data. This value at the root may vary with the parameter values during likelihood optimization.
conditioning           type of conditioning for the likelihood calculation. The default is to calculate conditional probabilities on observing families with at least 1 gene copy (see Details).
equalBDrates           if TRUE, the duplication and loss rates are constrained to be equal.
fixedRetentionRates    if TRUE, retention rates from the user-defined tree are fixed and used as provided. If FALSE, retention rates are considered as parameters and are estimated by maximum likelihood.
startingBDrates        Vector of size 2, for the starting values of the duplication and loss rates. When equalBDrates=TRUE, only the first component is used.
startingQ              Vector of starting values for the retention rates at the WGD and WGT events.

Details

The tree needs to be in simmap format (version 1.1). This format is similar to the newick parenthetical format, except that branch lengths are given inside brackets where states are indicated at specific times along each branch. Along a given branch, the token "0,18" indicates state 0 for a duration of 18 time units. Tokens are separated with ":". State 0 is used to indicate branch segments where only the birth/death process applies for gene duplications and losses. Labels "wgd" or "WGD" are used for branch segments at WGD events, and "wgt" or "WGT" for segments at WGT events. Such segments need to have a length of 0.

For WGT events, the 2 extra copies are assumed to be retained independently. With retention rate q, the probability to retain all 3 gene copies is then q^2, the probability to retain 2 gene copies is 2q(1-q), and the probability to retain the original gene only is (1-q)^2.

Four types of conditional likelihoods are implemented. The option conditioning should match the data filtering process: use conditioning="oneOrMore" if all families with one or more gene copies are included in the data, use "twoOrMore" to condition on families having two or more genes, and use "oneInBothClades" if the data set was filtered to include only families with at least one gene copy in each of the two main clades stemming from the root. Unconditional likelihoods are used with conditioning="none".

The geomMean, dirac and useRootStateMLE options are incompatible.

By default, mMax is set to the maximum family size for an exact likelihood calculation. For data sets with one or more very large families, this can cause mMax to be very large and calculation to be very slow. In such cases, the user can set mMax to a lower value to speed up calculations, at the cost of an approximation to the likelihood of families with a larger family size.
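To make the branch notation concrete, the following base-R sketch splits the text found between the braces of one branch into its state/duration tokens. The function parse_branch is hypothetical and only handles a single, well-formed branch specification; it is not how the package or phyext reads trees.

```r
# Hypothetical parser (not part of WGDgc or phyext): split one SIMMAP branch
# specification, i.e. the text between "{" and "}", into state/length segments.
parse_branch <- function(spec) {
  tokens <- strsplit(spec, ":", fixed = TRUE)[[1]]   # tokens are separated by ":"
  parts  <- strsplit(tokens, ",", fixed = TRUE)      # each token is "state,duration"
  data.frame(
    state  = trimws(vapply(parts, `[`, character(1), 1)),
    length = as.numeric(vapply(parts, `[`, character(1), 2)),
    stringsAsFactors = FALSE
  )
}

# Branch with one hypothesized WGD between two birth-death segments.
seg <- parse_branch("0,2.49:wgd,0:0,2.50")
print(seg)

# Event segments (anything other than state "0") should have length 0.
events_ok <- all(seg$length[seg$state != "0"] == 0)
total_length <- sum(seg$length)
```

A check such as events_ok can catch a mistyped tree string before any likelihood is computed.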
Value

birthrate        birth or duplication rate
deathrate        death or loss rate
loglikelihood    log of the likelihood
WGDtable         a WGD table with 5 columns: node before WGD/WGT, event type, and probabilities that 1, 2 or 3 gene copies are retained. The number of rows is the number of WGD/WGT events.
phyloMat         data frame with 5 columns to describe the phylogeny: parent (ancestor node), child (descendant node), time (branch length), species names and edge type (e.g. BD or WGD). The number of rows is the number of nodes in the tree.
call             initial call to the function
convergence      optimization convergence flag from the optim call. 0 means successful convergence.
mMax             mMax value used for the likelihood calculations

Author(s)

Tram Ta, Charles-Elie Rabier

References

Bailey, N. (1964) The Elements of Stochastic Processes. New York: John Wiley & Sons

Bollback J. P. (2006) SIMMAP: Stochastic character mapping of discrete traits on phylogenies. BMC Bioinformatics. 7:88

De Bie, T. and Cristianini, N. and Demuth, J.P. and Hahn, M.W. (2006) CAFE: a computational tool for the study of gene family evolution. Bioinformatics. 22:1269-1271

Hahn, M.W. and De Bie, T. and Stajich, J.E. and Nguyen, C. and Cristianini, N. (2005) Estimating the tempo and mode of gene family evolution from comparative genomic data. Genome Res. 15:1153-1160

Crawford, F., Suchard, M. (2012) Transition probabilities for general birth-death processes with applications in ecology, genetics, and evolution. J Math Biol. 65:553-580

Rabier, C., Ta, T. and Ané, C. (2013) Detecting and Locating Whole Genome Duplications on a phylogeny: a probabilistic approach. Molecular Biology and Evolution. 31(3):750-762.

See Also

sampleData1, sampleData2 for more examples.

Examples

tre.string = "(D:{0,18.03},(C:{0,12.06},(B:{0,7.06}, A:{0,7.06}):{0,2.49:wgd,0:0,2.50}):{0, 5.97});"
# tree with a single hypothesized WGD event, along the
# internal edge leading to the MRCA of species A and B
tre.phylo4d = read.simmap(text=tre.string)
tre.phylo = as(tre.phylo4d, "phylo")
## Not run: plot(collapse.singles(tre.phylo))
dat = data.frame(A=c(2,2,3,1), B=c(3,0,2,1), C=c(1,0,2,2), D=c(2,1,1,1))
MLEGeneCount(tre.phylo4d, dat, geomMean=1.5, conditioning="oneOrMore", fixedRetentionRates=TRUE)


processInput    Preprocessing function

Description

Checking arguments and preparing data for future optimization

Usage

processInput(tr, equalBDrates=FALSE, fixedRetentionRates=TRUE, startingBDrates=c(0.01, 0.02), startingQ=NULL)

Arguments

tr                     a species tree in SIMMAP format (see Details of function MLEGeneCount).
equalBDrates           if TRUE, the duplication and loss rates are equal.
fixedRetentionRates    if TRUE, retention rates will be fixed to startingQ during the future optimization. If FALSE, retention rates will be considered as parameters and will be estimated by maximum likelihood.
startingBDrates        Vector of size 2 as starting values for the duplication and loss rates. When equalBDrates=TRUE only the first component is used.
startingQ              Vector of starting values for retention rates. Default is 0.5 for all WGD events.

Details

The vector para of starting values for the parameters to be optimized is of size 1+number of WGDs if the birth and death rates are assumed equal, or 2+number of WGDs otherwise. It starts with log(startingBDrates[1]) if equalBDrates is TRUE, with log(startingBDrates) otherwise, and the remaining components (corresponding to the retention rates) are startingQ if startingQ is provided, 0.5 otherwise.

For WGT events, the 2 extra copies are assumed to be retained independently. With retention rate q, the probability to retain all 3 gene copies is then q^2, the probability to retain 2 gene copies is 2q(1-q), and the probability to retain the original gene only is (1-q)^2.

lower and upper are vectors whose sizes correspond to the number of parameters for the lower and upper bounds of the different parameters in a subsequent optimization search. The log of the duplication and loss rates are unconstrained, while duplicate retention rates are constrained in [0,1].

Value

phyloMat    data frame to represent the phylogeny. The number of rows is the number of nodes in the species tree. There are 5 columns (Parent, Child, Time, Species, type).
nLeaf       number of present-day species (i.e. number of leaves)
nNode       number of nodes in the species tree
wgdTab      data frame with 5 columns. Each row corresponds to a WGD or WGT. The first column gives the node just before the WGD/T. The second column, type, says if the event is a WGD or WGT. The remaining columns contain the probabilities that only the original gene is retained, or that 2 (or 3) gene copies are retained.
para        Vector of parameters to be optimized, see Details.
lower       Lower bounds for later optimization, see Details.
upper       Upper bounds for later optimization, see Details.
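The retention probabilities described in the Details can be tabulated with a few lines of base R. The helper retention_probs is hypothetical and not part of the package. The WGT branch follows the formulas stated above; the WGD branch assumes that the single extra copy is retained with probability q, which is the natural reading of "retention rate" but is not spelled out in this manual.

```r
# Hypothetical helper (not part of WGDgc): probabilities of retaining
# 1, 2 or 3 gene copies after an event with retention rate q.
retention_probs <- function(q, type = c("WGD", "WGT")) {
  type <- match.arg(type)
  if (type == "WGD") {
    # Assumption: the one extra copy is kept with probability q.
    c(one = 1 - q, two = q, three = 0)
  } else {
    # WGT: the 2 extra copies are retained independently, each with probability q.
    c(one = (1 - q)^2, two = 2 * q * (1 - q), three = q^2)
  }
}

wgd <- retention_probs(0.6, "WGD")
wgt <- retention_probs(0.6, "WGT")
print(rbind(WGD = wgd, WGT = wgt))
```

For q = 0.6 the WGT row is 0.16, 0.48 and 0.36, and each row sums to 1, as expected for a probability distribution over the number of retained copies.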
Examples

tre.string = "(D:{0,18.03},(C:{0,12.06},(B:{0,7.06}, A:{0,7.06}):{0,2.49:wgd,0:0,2.50}):{0, 5.97});"
tre.phylo4d = read.simmap(text=tre.string)
processInput(tre.phylo4d)


rGeneCount    Random generation of family sizes

Description

Generates gene count data for multiple families along a phylogeny, using background rates of duplication and loss and possible whole genome duplication (WGD) or triplication (WGT) event(s), each with its own retention rate.

Usage

rGeneCount(nfam, tre, lambdamu, retention, geomMean=NULL, dirac=NULL, conditioning=c("none"))

Arguments

nfam            number of families to simulate
tre             a species tree in SIMMAP format.
lambdamu        vector of size 1 or 2, for the duplication rate (λ) and loss rate (µ). A vector of size 1 sets λ=µ.
retention       vector whose length is the number of WGD/WGT events in the tree, giving the retention rate at each event.
geomMean        the mean of the prior geometric distribution for the number of genes at the root.
dirac           value for the number of genes at the root, if fixed to the same value for all families.
conditioning    type of filtering. No filtering implemented yet.

Details

For the simmap format, see MLEGeneCount.

For WGT events, the 2 extra copies are assumed to be retained independently with the same retention rate. With retention rate q, the probability to retain all 3 gene copies is then q^2, the probability to retain 2 gene copies is 2q(1-q), and the probability to retain the original gene only is (1-q)^2.

The geomMean and dirac options are incompatible.

Value

matrix with nfam rows, one per simulated family, and one column per node in the tree (tips and internal nodes).

Author(s)

Cécile Ané

Examples

# tree with 2 WGDs. The second is placed immediately after
# the split between C and AB:
tre.string <- "(D:{0,18.03},(C:{0,12.06},(B:{0,7.06}, A:{0,7.06}):{0,2.49:wgd,0:0,2.50:wgd,0:0,1e-10}):{0, 5.97});"
tre.phylo4d = read.simmap(text=tre.string)
# do this to see how edges and nodes are numbered,
# which WGD is the first, which is the second:
processInput(tre.phylo4d, startingQ=c(.6,.2))
rGeneCount(nfam=10, tre.phylo4d, lambdamu=c(.03,.04), retention=c(.6,.2), dirac=1)


sampleData1    Simulated gene count data with 1 WGD event

Description

Sample gene count data simulated with 1 WGD, 4 species (A, B, C, D) and 6000 families.

Usage

data(sampleData1)

Format

A data frame with 6000 observations on the following 4 species as 4 named variables: A, B, C, D. These data were generated according to the following species tree (in simmap format version 1.1), with a single WGD event located on the internal edge leading to the MRCA of species A and B and retention rate 0.6:

"(D:{0,18.03}, (C:{0,12.06},(B:{0,7.06},A:{0,7.06}):{0,2.50 :wgd,0:0,2.50}):{0, 5.97});"

The duplication and loss rates used for simulation were 0.02 and 0.03. Families with 0 or 1 copy were excluded. All families were started with only one ancestral gene at the root of the species tree.

Examples

data(sampleData1)
dat <- sampleData1[1:100,] # reducing data to run examples faster
tree1WGD.str = "(D:{0,18.03}, (C:{0,12.06},(B:{0,7.06},A:{0,7.06}) :{0,2.50 :wgd,0:0,2.50}):{0, 5.97});"
# tree with a single WGD event along the edge to MRCA of species A and B
tree1WGD = read.simmap(text=tree1WGD.str)
MLEGeneCount(tree1WGD, dat, dirac=1, conditioning="twoOrMore")
# to estimate retention, duplication and loss rates
MLEGeneCount(tree1WGD, dat, dirac=1, conditioning="twoOrMore", fixedRetentionRates=TRUE, startingQ=0.6)
# to estimate the duplication and loss rates only,
# based on a hypothesized retention rate 0.6 at the WGD.
filtered <- subset(dat, (A>0 | B>0 | C>0) & D>0)
# families with at least one copy in both clades at the root
MLEGeneCount(tree1WGD, filtered, dirac=1, conditioning="oneInBothClades")
# uses the appropriate filtering

## Analysis under a tree with no WGD
tree0WGD.str = "(D:{0,18.03}, (C:{0,12.06},(B:{0,7.06},A:{0,7.06}) :{0,5.00}):{0, 5.97});"
tree0WGD = read.simmap(text=tree0WGD.str)
MLEGeneCount(tree0WGD, dat, dirac=1, conditioning="twoOrMore", fixedRetentionRates=TRUE)

## Analysis under a tree with 2 events: one WGD and one WGT
tree2events.str = "(D:{0,18.03}, (C:{0,12.06},(B:{0,7.06},A:{0,7.06}): {0,2.50 :wgt,0:0,2.50}):{0, 2.985: wgd,0:0,2.985});"
# oldest event: WGD on edge to MRCA of species A, B and C.
# recent event: WGT on edge to MRCA of species A, B
tree2events = read.simmap(text=tree2events.str)
MLEGeneCount(tree2events, dat, dirac=1, conditioning="twoOrMore")
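Because the conditioning option has to match how the data were filtered, it can help to see the filters side by side. The sketch below uses a small made-up data frame rather than sampleData1, and it assumes that "two or more genes" refers to the total count across all species and that the two clades at the root are {A, B, C} and {D}, as in the trees above.

```r
# Made-up gene counts for 4 families in species A, B, C, D (illustration only).
toy <- data.frame(A = c(2, 0, 0, 1),
                  B = c(3, 0, 0, 0),
                  C = c(1, 0, 2, 0),
                  D = c(2, 1, 0, 0))

total <- rowSums(toy)   # family size summed over all species

# Filter matching conditioning = "oneOrMore": at least one copy overall.
one_or_more <- toy[total >= 1, ]

# Filter matching conditioning = "twoOrMore" (assumed to mean total count >= 2).
two_or_more <- toy[total >= 2, ]

# Filter matching conditioning = "oneInBothClades": at least one copy in the
# clade {A, B, C} and at least one copy in D.
both_clades <- subset(toy, (A > 0 | B > 0 | C > 0) & D > 0)

print(c(oneOrMore = nrow(one_or_more),
        twoOrMore = nrow(two_or_more),
        oneInBothClades = nrow(both_clades)))
```

Running the filter first and then passing the matching conditioning value keeps the likelihood consistent with the families that could actually have been observed.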

sampleData2    Simulated gene count data with two WGD events

Description

Sample gene count data simulated with 2 WGDs on the same branch, 4 species (A, B, C, D) and 6000 families.

Usage

data(sampleData2)

Format

A data frame with 6000 observations on the following 4 species as 4 named variables: A, B, C, D. These data were generated according to the following species tree (in simmap format version 1.1), with both WGD events located along the edge leading to species D, with retention rate 0.6 for the oldest event and 0.2 for the most recent event:

"(D:{0,6.01:0.2,0:0,6.01:0.6,0:0,6.01}, (C:{0,12.06},(B:{0,7.06},A:{0,7.06}):{0,4.99}):{0,5.97});"

The duplication and loss rates used for simulation were 0.02 and 0.03. Families with 0 or 1 copy were excluded. All families were started with only one ancestral gene at the root of the species tree.

Examples

data(sampleData2)
dat <- sampleData2[1:200,] # reducing data to run examples faster
tree2WGD.str = "(D:{0,6.01:wgd,0:0,6.01:wgd,0:0,6.01}, (C:{0,12.06}, (B:{0,7.06},A:{0,7.06}):{0,4.99}):{0,5.97});"
# both WGD events are located on the edge leading to species D
tree2WGD = read.simmap(text=tree2WGD.str)
MLEGeneCount(tree2WGD, dat, dirac=1, conditioning="twoOrMore")
