Phylogenetic Graphical Models and RevBayes: Introduction. Fred(rik) Ronquist Swedish Museum of Natural History, Stockholm, Sweden

Similar documents
BIG4: Biosystematics, informatics and genomics of the big 4 insect groups- training tomorrow s researchers and entrepreneurs

Phylogenetic Inference using RevBayes

Advanced Statistical Modelling

Phylogenetic Inference using RevBayes

Markov chain Monte-Carlo to estimate speciation and extinction rates: making use of the forest hidden behind the (phylogenetic) tree

STA 4273H: Statistical Machine Learning

Probabilistic Machine Learning

Bayesian Graphical Models

Tutorial on Probabilistic Programming with PyMC3

EE562 ARTIFICIAL INTELLIGENCE FOR ENGINEERS

Identifiability of the GTR+Γ substitution model (and other models) of DNA evolution

36-463/663: Hierarchical Linear Models

Bayesian Networks to design optimal experiments. Davide De March

13: Variational inference II

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley

VIBES: A Variational Inference Engine for Bayesian Networks

Biogeography. An ecological and evolutionary approach SEVENTH EDITION. C. Barry Cox MA, PhD, DSc and Peter D. Moore PhD

Phylogenetic Inference using RevBayes

Bayesian Learning in Undirected Graphical Models

Geography of Evolution

Biogeography expands:

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

AP Environmental Science I. Unit 1-2: Biodiversity & Evolution

Bayesian Learning in Undirected Graphical Models

Introduction to Restricted Boltzmann Machines

Climate, niche evolution, and diversification of the bird cage evening primroses (Oenothera, sections Anogra and Kleinia)

Bayesian Inference for Regression Parameters

Estimating Evolutionary Trees. Phylogenetic Methods

Edward Susko Department of Mathematics and Statistics, Dalhousie University. Introduction. Installation

Introduction to Probabilistic Graphical Models

Modelling and forecasting of offshore wind power fluctuations with Markov-Switching models

FAV i R This paper is produced mechanically as part of FAViR. See for more information.

Stat 544 Exam 2. May 5, 2008

Recent Advances in Bayesian Inference Techniques

Latent Dirichlet Allocation (LDA)

STA 4273H: Statistical Machine Learning

Biol 206/306 Advanced Biostatistics Lab 12 Bayesian Inference

Darw r i w n n a nd n t h t e e G ala l pa p gos Biolo l gy g L c e t c u t re r e 16 1 : 6 Ma M cr c o r ev e olu l ti t on

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas

Reading for Lecture 13 Release v10

Biol 206/306 Advanced Biostatistics Lab 12 Bayesian Inference Fall 2016

EVOLUTIONARY DISTANCES

Sampling Methods (11/30/04)

Bayesian Nonparametrics for Speech and Signal Processing

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

Time-Sensitive Dirichlet Process Mixture Models

A primer on phylogenetic biogeography and DEC models. March 13, 2017 Michael Landis Bodega Bay Workshop Sunny California

Modeling Data Correlations in Private Data Mining with Markov Model and Markov Networks. Yang Cao Emory University

What is the range of a taxon? A scaling problem at three levels: Spa9al scale Phylogene9c depth Time

Statistical nonmolecular phylogenetics: can molecular phylogenies illuminate morphological evolution?

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies

An Introduction to Reversible Jump MCMC for Bayesian Networks, with Application

Chapter 20. Deep Generative Models

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Outline. CSE 573: Artificial Intelligence Autumn Agent. Partial Observability. Markov Decision Process (MDP) 10/31/2012

Infer relationships among three species: Outgroup:

Mathematical models in population genetics II


Intelligent Systems:

Lecture 6: Graphical Models: Learning

Hmms with variable dimension structures and extensions

Study Notes on the Latent Dirichlet Allocation

COMS 4771 Probabilistic Reasoning via Graphical Models. Nakul Verma

Graphical Models 359

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29):

Spheres of Life. Ecology. Chapter 52. Impact of Ecology as a Science. Ecology. Biotic Factors Competitors Predators / Parasites Food sources

Adaptive Radiations. Future of Molecular Systematics. Phylogenetic Ecology. Phylogenetic Ecology. ... Systematics meets Ecology...

Spatial Statistics with Image Analysis. Outline. A Statistical Approach. Johan Lindström 1. Lund October 6, 2016

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California

Dynamic Approaches: The Hidden Markov Model

Machine Learning Summer School

STA 4273H: Statistical Machine Learning

p L yi z n m x N n xi

Metropolis-Hastings Algorithm

Open projects for BSc & MSc

DIC: Deviance Information Criterion

Bayesian statistics and modelling in ecology

Metacommunities Spatial Ecology of Communities

Georgia Performance Standards for Urban Watch Restoration Field Trips

Historical Biogeography. Historical Biogeography. Historical Biogeography. Historical Biogeography

15 Darwin's Theory of Natural Selection 15-1 The Puzzle of Life's Diversity

Bayesian Nonparametrics

An Introduction to Bayesian Networks: Representation and Approximate Inference

CHAPTER 52: Ecology. Name: Question Set Define each of the following terms: a. ecology. b. biotic. c. abiotic. d. population. e.

IDENTIFICATION: Label each of the parts of the illustration below by identifying what the arrows are pointing at. Answer the questions that follow.

Constructing Evolutionary/Phylogenetic Trees

A Probabilistic Framework for solving Inverse Problems. Lambros S. Katafygiotis, Ph.D.

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week:

Learning Bayesian network : Given structure and completely observed data

Advanced Machine Learning

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang

Probabilistic Models in Theoretical Neuroscience

Species diversification in space: biogeographic patterns

Non-Parametric Bayes

ECOLOGICAL PLANT GEOGRAPHY

Towards a formal description of models and workflows

Stochastic prediction of train delays with dynamic Bayesian networks. Author(s): Kecman, Pavle; Corman, Francesco; Peterson, Anders; Joborn, Martin

Taming the Beast Workshop

Lecture Notes: Markov chains

Transcription:

Phylogenetic Graphical Models and RevBayes: Introduction Fred(rik) Ronquist Swedish Museum of Natural History, Stockholm, Sweden

Statistical Phylogenetics Statistical approaches increasingly important: Difficult problems requiring accurate and unbiased inference (e.g., structure of rapid radiations) More aspects of molecular evolution being examined (structural dependencies, etc) Combination of background knowledge and sequence information (e.g., divergence time estimation) Modeling explosion, especially in the Bayesian context Challenging for empiricists to communicate and correctly understand models Challenging for developers of inference software

Probabilistic Graphical Models Theoretical framework for specifying dependencies in complex statistical models Allows a complex model to be broken down into conditionally independent distributions Closely related to standard statistical model formulae: x ~ Norm( μ, σ ) Extensive literature on generic algorithms that apply to model graphs

x ~ Norm(m,s ) m s m s factor Norm x Stochastic node x Graphical Model Compact Form Factor Graph

a b c Constant node m s x ~ Norm(m,s ) x Clamped stochastic node Hierarchical Graphical Model

a b c m s x ~ Norm(m,s ) x i Plate i =1,, N Hierarchical Graphical Model

General Time Reversible (GTR) substitution model ø ö ç ç ç ç ç è æ - - - - = GT G CT C AT A GT T CG C AG A CT T CG G AC A AT T AG G AC C r r r r r r r r r r r r Q p p p p p p p p p p p p p r Stationary state frequencies Exchangeability rates

p r Q Deterministic node

a a p r s i, j Q v s i, j i =1,, N

GTR Phylogeny Model

Tree Plate Representation Tree plate

Pivot variable Modular Representation

RevBayes Project Interactive computing environment intended primarily for Bayesian phylogenetic inference Uses a special language, Rev, for constructing probabilistic phylogenetic and evolutionary graphical models interactively, step by step Rev is similar to R and the BUGS modeling language RevBayes provides generic computing machinery for simulation, inference and model testing

Basic properties of the Rev language # There are three kinds of statements in the language # 1. Arrow assignment (value assignment, create constant nodes) > a <- 4 # Give a the value 4 > b <- sqrt(a) # Giva b the value of sqrt(a), that is, 2 > b # Print the value of b 2 # 2. Equation assignment (create deterministic nodes) > c := sqrt(a) # Make c a dynamic function node evaluating sqrt(a) > c 2 > a <- 9 # Give a the value 9 > b # Print the value of b 2 > c # Print the value of c 3

Basic properties of the Rev language # 3. Tilde assignment (create stochastic variables (nodes)) > a ~ dnexp( rate = x ) # a is drawn from exp dist with rate = x

Basic properties of the Rev language # -------------------------------- # Declaring and defining functions # -------------------------------- > function foo ( x ) { x * x } > foo( 2 ) 4 # If you wish, you can specify types as well > function PosReal foo ( Real x ) { x * x } # Without explicit types, RevObject is the assumed type # -------------------------------- # Declaring and defining new types # -------------------------------- > class myclass : Move { + Real mytuningparam; + procedure Real move( Real x ) { mytuningparam * x } + } # Inheritance, function overriding and overloading

A complete MCMC analysis in Rev a <- -1.0 b <- 1.0 mu ~ dnunif(a, b) sigma ~ dnexp(1.0) for (i in 1:10) { x[i] ~ dnnorm(mu, sigma) x[i].clamp(0.5) } mymodel = model(mu) # Any stochastic node in the model works mymcmc = mcmc(mymodel) mymcmc.run(1000)

# definition of the mygtr function ( Ziheng's favorite ) function model mygtr (CharacterMatrix data) { # describe Q matrix pi ~ dflatdir(4); r ~ dflatdir(6); Q := gtr(pi, r); # describe tree tau ~ dtopuni(data.taxa(), rooted=false); Definition of a new phylogenetic model # gamma shape alpha ~ dunif(0.0, 50.0); # discrete gamma mixture for (i in 1:4) catrate[i] := qgamma(i*0.25-0.125, alpha, alpha); for (i in 1:data.size()) ratecat[i] ~ dcat(simplex(0.25,0.25,0.25,0.25)); Appr. 20 lines # associate distributions with tree parts for (i in 1:data.size()) { for (n in 1:tau.numNodes()) { if (tau.isterminal(n)) { tau.length[n] ~ exp(1.0); tau.state[n] ~ ctmc(q, e.length*catrate[ratecat[i]], tau.state[tau.parent(n)); tau.state[n] <- data[i][tau.tipindex(n)]; } else { tau.length[n] ~ exp(10.0); tau.state[n] ~ ctmc(q, e.length*catrate[ratecat[i]], tau.state[n]); } } } } # return model return model( Q );

Complexity hidden from normal user # Read in data mydata <- read( data.nex ) # Apply model mymodel = zihenggtr( mydata ) # Construct mcmc mymcmc = mcmc( mymodel ) # Run mcmc mymcmc.run(10000)

RevBayes RevBayes is still experimental software Help is still incomplete There may be various bugs and other problems, for instance related to type conversion A number of practical features are still missing Post-hoc analysis very limited No support for convergence diagnostics on the fly

RevBayes Limitations Coarse-grained representation of tree plates -> some models are impossible to specify No ability to manipulate modules No default moves or monitors Some programming features are missing, notably specification of user-defined types Most additions need to be made in the back end using heavily templated C++ code

RevLang Project Rev -> probabilistic programming language Moves, distributions etc programmed in Rev itself Professional JIT compiler Fully supported interactive environment, like R Modular design makes it easy to extend the environment

RevBayes Exercises Download and install RevBayes according to instructions at the RevBayes web site (http://revbayes.com). Download material and follow tutorials of interest RevBayes code at github: https://github.com/revbayes/revbayes

The Canary Islands

Dolichoiulus (Diplopoda) Dolichoiulus (Diplopoda, Julida, Julidae, Pachyulinae) 46 endemic species

Organism groups Model distribution T 1 DNA data 1 GTR 1 μ 1 m 1 T 2 DNA data 2 GTR 2 μ 2 m 2 IM T 3 DNA data 3 GTR 3 μ 3 m 3 IM island model μ i mutation rate - dispersal rate m i Inference Bayesian inference using MCMC sampling, accommodating uncertainty in all model parameters

Canary Islands: 3-island model [2-1 Ma] [14-10 Ma] [20-15 Ma] 86 km Western Central 81 km Eastern 110 km 1007 km 2 3968 km 2 2521 km 2 Mainland

Instantaneous rate matrix to from Q [A] [B] [C] [D] [A] A A A r r r AB AC AD [B] r B B B r r AB BC BD [C] r C r C C r AC BC CD [D] D D D r r r AD BD CD i r ij Relative carrying capacity of island i Relative dispersal rate between islands i and j

Island GTR model Time reversible continuous time Markov chain D C B A Differing relative carrying capacities of islands Differing intensities of biotic exchange between islands

Canary Island flora 15 groups, 567 lineages Laura Martinez Javier Fuertes

Canary Island fauna 19 groups, 578 lineages Raquel Martin Javier Fuertes

Colonization patterns (plants) (0.203) W (0.279) C (0.451) E M Western Central Eastern (0.028) (0.074) (0.073) Continent (0.824)

Colonization patterns (animals) (0.182) W (0.061) (0.456) C (0.244) E Western Central Eastern (0.083) (0.205) (0.059) M Continent (0.653)

Colonization models Animals Animals Animals Plants Within-island diversification Niche evolution Inter-island colonization of similar biomes (niche conservatism)

I move, you change

Ecological zones Coastal belt Open habitat Thermophilous forest Laurisilva Pine forest Sub-alpine Ten island habitat types M1 Other Mainland E2 Eastern-Open C2 Central-Open W2 Western Open C3 Central-laurel forest W3 Western-laurel forest C4 Central-pine forest W4 Western-pine forest C5 Central-alpine vegetation W5 Western-alpine vegetation L. Martinez 2010

Separating island-hopping and niche-shift rates r r i r e r i r e Shift between islands Shift between niches Shift between islands and niches Standard biogeography model: r ~ dirichlet( 1, 1, 1,...) Islands-ecology model: mu ~ dirichlet( 1, 1 ) r := simplex( mu[1], mu[2], mu[1] * mu[2],...)

Posterior probability Density 0 2 4 6 8 10 density.default(x = data_plants_ecology$mu.1., bw = 0.015) Island hopping Niche shifts 0.0 0.2 0.4 0.6 0.8 1.0 N = 1001 Bandwidth = 0.015 Proportion of total rate

Separating area and ecology contributions to carrying capacity a, e βa + (1-β)e a β e (1-β) Area and ecology components Linear mix Power mix S = c A z z = 0.25 (0.22, 0.28) Standard model: pi ~ dirichlet( 1, 1, 1,...) Power-mix model: areak <- simplex( A_1 ^ z, A_2 ^z,...) ecok ~ dirichlet( 1, 1,...) prop ~ dirichlet( 1, 1 ) pi := powermix( areak, ecok, prop )

Posterior probability Carrying capacity factored into area and other factors (ecology etc) Ecology animals Ecology plants Area plants Area animals Proportion of total

Posterior probability Density 0.0 0.2 0.4 0.6 0.8 1.0 Comparison of migration rate in animals and plants density.default(x = data_powermix$log_mean_a, bw = 0.1) Plants Animals -4-3 -2-1 0 1 2 3 Ln migration rate N = 1001 Bandwidth = 0.1

Summary Carrying capacities and relative magnitudes of biotic exchange between islands are similar in animals and plants Nevertheless, animal dispersal between islands is an order of magnitude slower than plant dispersal Area effects are more important, relatively speaking, in determining island carrying capacities in animals Plants shift between islands much more readily than they adapt to new niches Phylogenetic graphical models are good tools in modeling and analyzing these phenomena