types of codon models


part 3: analysis of natural selection pressure

omega models!

types of codon models

Q_ij = 0       if i and j differ at more than one position
     = π_j     for a synonymous transversion
     = κπ_j    for a synonymous transition
     = ωπ_j    for a non-synonymous transversion
     = ωκπ_j   for a non-synonymous transition

Goldman and Yang (1994); Muse and Gaut (1994)
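A minimal sketch of this rate matrix in Python (not PAML's implementation; the κ, ω, and codon-frequency values are purely illustrative, and names like gy94_Q are mine):

```python
# Minimal sketch of a Goldman-Yang (1994)-style codon rate matrix.
import itertools
import numpy as np

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")   # standard code, TCAG order
CODONS = ["".join(c) for c in itertools.product(BASES, repeat=3)]
CODE = dict(zip(CODONS, AMINO))                   # codon -> amino acid ('*' = stop)
SENSE = [c for c in CODONS if CODE[c] != "*"]     # the 61 sense codons
TS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}  # transitions

def gy94_Q(kappa, omega, pi):
    n = len(SENSE)
    Q = np.zeros((n, n))
    for i, ci in enumerate(SENSE):
        for j, cj in enumerate(SENSE):
            diffs = [(a, b) for a, b in zip(ci, cj) if a != b]
            if len(diffs) != 1:
                continue                 # rate 0: i and j differ at > 1 position
            rate = pi[j]                 # pi_j: synonymous transversion baseline
            if diffs[0] in TS:
                rate *= kappa            # kappa * pi_j for a transition
            if CODE[ci] != CODE[cj]:
                rate *= omega            # times omega for a non-synonymous change
            Q[i, j] = rate
        Q[i, i] = -Q[i].sum()            # each row sums to zero
    return Q                             # (PAML also rescales Q; omitted here)

Q = gy94_Q(kappa=2.0, omega=0.5, pi=np.full(61, 1 / 61))
```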

this codon model: M0

omega models! Q_ij as above (Goldman and Yang 1994; Muse and Gaut 1994)

[tree figure: four sequences x1-x4, branch lengths t1-t5, one ω shared by every branch]
[alignment figure: example codon alignment, one ω shared by every site; dots mark identity with the first sequence]

same ω for all branches
same ω for all sites

two basic types of models

branch models (ω varies among branches)
site models (ω varies among sites)

interpretation of a branch model

[tree figure: one foreground branch with its own ω]

episodic adaptive evolution of a novel function with ω > 1

branch models*

variation (ω) among branches:       approach
Yang, 1998                          fixed effects
Bielawski and Yang, 2003            fixed effects
Seo et al., 2004                    auto-correlated rates
Kosakovsky Pond and Frost, 2005     genetic algorithm
Dutheil et al., 2012                clustering algorithm

* these methods can be useful when selection pressure is strongly episodic

site models*

[alignment figure: example codon alignment; ω varies among sites]

variation (ω) among sites:                            approach
Yang and Swanson, 2002                                fixed effects (ML)
Bao, Gu and Bielawski, 2006                           fixed effects (ML)
Massingham and Goldman, 2005                          site-wise (LRT)
Kosakovsky Pond and Frost, 2005                       site-wise (LRT)
Nielsen and Yang, 1998                                mixture model (ML)
Kosakovsky Pond, Frost and Muse, 2005                 mixture model (ML)
Huelsenbeck and Dyer, 2004; Huelsenbeck et al., 2006  mixture (Bayesian)
Rubinstein et al., 2011                               mixture model (ML)
Bao, Gu, Dunn and Bielawski, 2008 & 2011              mixture (LiBaC/MBC)
Murrell et al., 2013                                  mixture (Bayesian)

* useful when some sites evolve under diversifying selection pressure over long periods of time; this is not a comprehensive list

site models: discrete model (M3)

mixture-model likelihood:

P(x_h) = Σ_i p_i P(x_h | ω_i), summed over the K site classes

[bar chart: proportions of sites in three ω classes, with ω_2 = 2.0]

conditional likelihood calculation (see part 1)
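As a concrete illustration, the M3 likelihood at one site is just a weighted sum over classes. A minimal sketch, assuming the per-class conditional likelihoods P(x_h | ω_i) have already been computed by the pruning algorithm (the numbers below are hypothetical):

```python
import numpy as np

p = np.array([0.85, 0.10, 0.05])               # class proportions p_i (illustrative)
site_lik = np.array([1.2e-5, 8.0e-6, 3.1e-6])  # P(x_h | omega_i), hypothetical values

P_site = p @ site_lik          # P(x_h) = sum_i p_i * P(x_h | omega_i)
lnL_contrib = np.log(P_site)   # sites are independent: lnL is a sum of such terms
```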

interpretation of a sites-model

[bar chart: site-class proportions, 5% of sites in the class with ω_2 = 2.0]

diversifying selection (frequency dependent) at 5% of sites with ω_2 = 2

models for variation among branches & sites

[tree and alignment figures as above]

branch models (ω varies among branches)
site models (ω varies among sites)
branch-site models (combine the features of the above models)

models for variation among branches & sites

variation (ω) among branches & sites:   approach
Yang and Nielsen, 2002                  fixed + mixture (ML)
Forsberg and Christiansen, 2003         fixed + mixture (ML)
Bielawski and Yang, 2004                fixed + mixture (ML)
Guindon et al., 2004                    switching (ML)
Zhang et al., 2005                      fixed + mixture (ML)
Kosakovsky Pond et al., 2011, 2012      full mixture (ML)

* these methods can be useful when selection pressures change over time at just a fraction of sites
* it can be a challenge to apply these methods properly (more about this later)

branch-site Model B

mixture-model likelihood:

P(x_h) = Σ_i p_i P(x_h | ω_i)

[bar chart, foreground branch only: ω_0 = 0.1, ω_1 = 0.9, ω_2 = 5.55]

ω for background branches are from site classes 0 and 1 (0.1 or 0.9)
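The branch-site mixture works the same way, except each site class pairs a background ω with a foreground ω. A minimal sketch with hypothetical class proportions and per-class site likelihoods (a real implementation would obtain each likelihood from a pruning pass run with that class's branch-specific ω values):

```python
# (p_i, omega_background, omega_foreground, P(x_h | class i)) -- all illustrative
classes = [
    (0.45, 0.1, 0.1,  2.0e-5),   # purifying selection on all branches
    (0.45, 0.9, 0.9,  1.1e-5),   # weak purifying selection on all branches
    (0.10, 0.1, 5.55, 4.0e-6),   # positive selection on the foreground branch only
]

P_site = sum(p * lik for p, bg, fg, lik in classes)
```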

two scenarios can yield branch-sites with dN/dS > 1

[bar chart, foreground (FG) branch only: ω_0 = 0.1, ω_1 = 0.9, ω_FG = 5.55 at 10% of sites]

scenario 1: 10% of sites have shifting balance on a fixed peak (same function)
scenario 2: episodic adaptive evolution at 10% of sites for a novel function

branch-site codon models cannot tell which scenario is correct without external information! Jones et al. (2016) MBE

omega models! model-based inference

Q_ij as above (Goldman and Yang 1994; Muse and Gaut 1994)

model-based inference: 3 analytical tasks

task 1. parameter estimation (e.g., ω)
task 2. hypothesis testing
task 3. make predictions (e.g., sites having ω > 1)

task 1: parameter estimation

t, κ, ω = unknown constants, estimated by ML
π's = empirical [GY: F3x4 or F61 in the lab]

use a numerical hill-climbing algorithm to maximize the likelihood function
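A minimal sketch of that hill-climb using scipy; the negative log-likelihood below is a smooth stand-in for a real codon-model lnL(t, κ, ω), with a made-up optimum:

```python
import numpy as np
from scipy.optimize import minimize

def neg_lnL(params):
    t, kappa, omega = params
    # hypothetical surface with its peak at t = 0.6, kappa = 2.5, omega = 0.4
    return (np.log(t / 0.6) ** 2 + np.log(kappa / 2.5) ** 2
            + np.log(omega / 0.4) ** 2)

fit = minimize(neg_lnL, x0=[1.0, 1.0, 1.0], method="L-BFGS-B",
               bounds=[(1e-6, None)] * 3)      # t, kappa, omega must stay positive
t_hat, kappa_hat, omega_hat = fit.x            # the MLEs
```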

task 1: parameter estimation

Parameters: t and ω
Gene: acetylcholine α receptor (mouse vs. human, from their common ancestor)

[likelihood-surface figure, "Sooner or later you'll get it": the hill-climb converges at lnL = -2399]

task 2: statistical significance

task 1. parameter estimation (e.g., ω)
task 2. hypothesis testing (LRT)
task 3. prediction / site identification

task 2: likelihood ratio test for positive selection

H0: variable selective pressure but NO positive selection (M1a)
H1: variable selective pressure with positive selection (M2a)

compare 2Δℓ = 2(ℓ_1 - ℓ_0) with a χ² distribution

[bar charts of MLEs: Model 1a (M1a), ω̂_0 = 0.5 (ω_1 = 1); Model 2a (M2a), ω̂_0 = 0.5 (ω_1 = 1) and ω̂_2 = 3.25]

task 2: likelihood ratio test for positive selection

H0: beta-distributed variable selective pressure (M7)
H1: beta plus positive selection (M8)

compare 2Δℓ = 2(ℓ_1 - ℓ_0) with a χ² distribution

[density plots over the ω ratio (0 to 1): M7, beta; M8, beta plus an extra class with ω > 1]
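A minimal sketch of the test itself, with hypothetical lnL values from two nested fits (for both M1a vs. M2a and M7 vs. M8 the alternative adds two free parameters, so df = 2 is the usual choice):

```python
from scipy.stats import chi2

lnL0 = -2399.0                   # H0 fit (illustrative value)
lnL1 = -2391.5                   # H1 fit (illustrative value)
stat = 2 * (lnL1 - lnL0)         # 2 * delta-lnL = 15.0
p_value = chi2.sf(stat, df=2)    # compare with a chi-square distribution
```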

task 3: identify the selected sites

task 1. parameter estimation (e.g., ω)
task 2. hypothesis testing
task 3. prediction / site identification (Bayes rule)

task 3: which sites have dN/dS > 1?

model: 9% of sites have ω > 1
Bayes rule: sites 4, 2 & 3
structure: the sites are in contact

[alignment and protein-structure figures with the identified sites highlighted]

review: the mixture likelihood (model M3)

P(x_h) = Σ_i p(ω_i) P(x_h | ω_i)
(total probability = sum over site classes of prior × likelihood)

[bar chart of the example model:]
ω_0 = 0.03   p_0 = 0.85
ω_1 = 0.4    p_1 = 0.10
ω_2 = 4.0    p_2 = 0.05

Bayes rule for identifying selected sites?

Site class 0: ω_0 = 0.03, 85% of codon sites
Site class 1: ω_1 = 0.4, 10% of codon sites
Site class 2: ω_2 = 4.0, 5% of codon sites

P(ω_2 | x_h) = P(ω_2) P(x_h | ω_2) / Σ_i P(ω_i) P(x_h | ω_i)

posterior probability of the hypothesis (ω_2) = [prior probability of the hypothesis (ω_2) × likelihood of the hypothesis (ω_2)] / marginal (total) probability of the data
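A minimal sketch of that calculation for one site, using the priors above and hypothetical conditional likelihoods:

```python
import numpy as np

prior = np.array([0.85, 0.10, 0.05])           # p(omega_0), p(omega_1), p(omega_2)
site_lik = np.array([2.0e-6, 5.0e-6, 4.0e-5])  # P(x_h | omega_i), hypothetical

posterior = prior * site_lik / (prior @ site_lik)
# posterior[2] = P(omega_2 | x_h): relative support for positive selection here
```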

task 3: Bayes rule for which sites have dN/dS > 1

[plot: posterior probability (y-axis, 0 to 1) at each codon site (x-axis, sites 1-201) for the three site classes]

Site class 0: ω_0 = 0.03 (strong purifying selection)
Site class 1: ω_1 = 0.4 (weak purifying selection)
Site class 2: ω_2 = 4.0 (positive selection)

NOTE: the posterior probability should NOT be interpreted as a P-value; it can be interpreted as a measure of relative support, although there is rarely any attempt at calibration.

task 3: Bayes rule for which sites have dN/dS > 1

Bayes rule: empirical Bayes and bootstrap approaches

NEB (Naive Empirical Bayes; Nielsen and Yang, 1998): assumes no MLE errors
BEB (Bayes Empirical Bayes; Yang et al., 2005): accommodates MLE errors for some model parameters via uniform priors
SBA (Smoothed bootstrap aggregation; Mingrone et al., 2016, MBE 33:2976-2989): accommodates MLE errors via bootstrapping; ameliorates biases and MLE instabilities with kernel smoothing and aggregation

critical question: have the requirements for maximum likelihood inference been met? (rarely addressed in real data analyses)

regularity conditions have been met

[histograms of bootstrap MLEs under M2a: p_{ω>1} and ω_{ω>1}, both approximately normal]

normal MLE uncertainty (M2a): with a large sample size and the regularity conditions met, MLEs are approximately unbiased with minimum variance:

θ̂ ~ N(θ, I(θ̂)^-1)
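A minimal sketch of that normal approximation for a single parameter: the observed Fisher information is the negative second derivative of lnL at the MLE, and its inverse approximates the sampling variance (the quadratic lnL is a toy stand-in):

```python
import numpy as np

def lnL(theta):
    return -0.5 * 50.0 * (theta - 0.4) ** 2   # toy surface with its MLE at 0.4

theta_hat, h = 0.4, 1e-5
d2 = (lnL(theta_hat + h) - 2 * lnL(theta_hat) + lnL(theta_hat - h)) / h ** 2
fisher_info = -d2                  # observed information I(theta_hat)
se = np.sqrt(1.0 / fisher_info)    # theta_hat ~ N(theta, I(theta_hat)^-1)
```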

regularity conditions have NOT been met

[histograms of bootstrap MLEs under M2a: p_{ω>1} and ω_{ω>1}, skewed and piled against boundaries]

bootstrapping can be used to diagnose this problem: Bielawski et al. (2016) Curr. Protoc. Bioinf. 56:6.15; Mingrone et al., 2016, MBE 33:2976-2989

MLE instabilities (M2a):
small sample sizes and θ on a boundary
continuous θ has been discretized (e.g., M2a)
non-Gaussian, over-dispersed, divergence among datasets
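A minimal sketch of the bootstrap diagnostic, in the spirit of the references above: refit the model to resampled data many times and inspect the distribution of the MLEs. Here fit_model and data are hypothetical stand-ins for a real codeml refit:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(2.0, size=300)       # stand-in for per-site data

def fit_model(resample):
    # placeholder for re-estimating, e.g., omega_2 from a bootstrap replicate
    return resample.mean() + rng.normal(0.0, 0.2)

boot = np.array([fit_model(rng.choice(data, size=data.size, replace=True))
                 for _ in range(200)])
# a multi-modal histogram of boot, or mass piled on a boundary, flags instability
```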