Phylogenetic Gibbs Recursive Sampler. Locating Transcription Factor Binding Sites

Similar documents
A Phylogenetic Gibbs Recursive Sampler for Locating Transcription Factor Binding Sites

What Is Conservation?

Alignment. Peak Detection

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Prediction o f of cis -regulatory B inding Binding Sites and Regulons 11/ / /

Design of an Enterobacteriaceae Pan-genome Microarray Chip

Stat 516, Homework 1

Substitution = Mutation followed. by Fixation. Common Ancestor ACGATC 1:A G 2:C A GAGATC 3:G A 6:C T 5:T C 4:A C GAAATT 1:G A

What DNA sequence tells us about gene regulation

PhyloGibbs: A Gibbs Sampling Motif Finder That Incorporates Phylogeny

Fitness constraints on horizontal gene transfer

SI Materials and Methods

Intro Gene regulation Synteny The End. Today. Gene regulation Synteny Good bye!

How should we go about modeling this? Model parameters? Time Substitution rate Can we observe time or subst. rate? What can we observe?

Lecture Notes: Markov chains

Supplementary Information for Hurst et al.: Causes of trends of amino acid gain and loss

Stochastic processes and

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Comparison of 61 E. coli genomes

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Theoretical distribution of PSSM scores

Chapter 8. Regulatory Motif Discovery: from Decoding to Meta-Analysis. 1 Introduction. Qing Zhou Mayetri Gupta

Inferring Molecular Phylogeny

Matrix-based pattern discovery algorithms

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

The binding of transcription factors (TFs) to specific sites is a

Applications of hidden Markov models for comparative gene structure prediction

De novo identification of motifs in one species. Modified from Serafim Batzoglou s lecture notes

MCMC: Markov Chain Monte Carlo

Supporting Information

The Phylo- HMM approach to problems in comparative genomics, with examples.

Modeling Noise in Genetic Sequences

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

On the Monotonicity of the String Correction Factor for Words with Mismatches

2 Genome evolution: gene fusion versus gene fission

Genetic Variation: The genetic substrate for natural selection. Horizontal Gene Transfer. General Principles 10/2/17.

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

Introduction to the SNP/ND concept - Phylogeny on WGS data

Getting statistical significance and Bayesian confidence limits for your hidden Markov model or score-maximizing dynamic programming algorithm,

Markov Chain Monte Carlo Lecture 6

CHAPTER : Prokaryotic Genetics

Whole Genome based Phylogeny

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

7 Multiple Genome Alignment

Estimating Evolutionary Trees. Phylogenetic Methods

Lab 9: Maximum Likelihood and Modeltest

CREATING PHYLOGENETIC TREES FROM DNA SEQUENCES

Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles

Motivating the need for optimal sequence alignments...

Comparison of Cost Functions in Sequence Alignment. Ryan Healey

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

ydci GTC TGT TTG AAC GCG GGC GAC TGG GCG CGC AAT TAA CGG TGT GTA GGC TGG AGC TGC TTC

Network motifs in the transcriptional regulation network (of Escherichia coli):

BIOINFORMATICS. Neighbourhood Thresholding for Projection-Based Motif Discovery. James King, Warren Cheung and Holger H. Hoos

Phylogenetics: Parsimony

Stochastic processes and Markov chains (part II)

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Time-Sensitive Dirichlet Process Mixture Models

Molecular Evolution, course # Final Exam, May 3, 2006

A Cluster Refinement Algorithm For Motif Discovery

Reconstruire le passé biologique modèles, méthodes, performances, limites

Exercise 5. Sequence Profiles & BLAST

Comparative Studies of Transcriptional Regulation Mechanisms in a Group of Eight Gamma-proteobacterial Genomes

A Brief Overview of Gibbs Sampling

Mutation models I: basic nucleotide sequence mutation models

Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018

Neyman-Pearson. More Motifs. Weight Matrix Models. What s best WMM?

Computational methods for predicting protein-protein interactions

Steven L. Scott. Presented by Ahmet Engin Ural

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling

Lecture 4. Models of DNA and protein change. Likelihood methods

Gibbs Sampling Methods for Multiple Sequence Alignment

Lecture 3: Markov chains.

Fitness landscapes and seascapes

Taming the Beast Workshop

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Molecular Phylogenetics (part 1 of 2) Computational Biology Course João André Carriço

Evolutionary analysis of the well characterized endo16 promoter reveals substantial variation within functional sites

The Evolution of Infectious Disease

A Combined Motif Discovery Method

Introduction to Computation & Pairwise Alignment

Supplemental Material

Quantifying sequence similarity

Phylogenetic Inference using RevBayes

V19 Metabolic Networks - Overview

Graph Alignment and Biological Networks

V14 extreme pathways

Algorithmische Bioinformatik WS 11/12:, by R. Krause/ K. Reinert, 14. November 2011, 12: Motif finding

Variable Selection in Structured High-dimensional Covariate Spaces

Topology. 1 Introduction. 2 Chromosomes Topology & Counts. 3 Genome size. 4 Replichores and gene orientation. 5 Chirochores.

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Microbial Typing by Machine Learned DNA Melt Signatures

Multiple Sequence Alignment using Profile HMM

Evolutionary Analysis of Viral Genomes

Comparing whole genomes

Gene Regula*on, ChIP- X and DNA Mo*fs. Statistics in Genomics Hongkai Ji

The accumulation of knowledge on control of transcription in

Transcription:

A for Locating Transcription Factor Binding Sites Sean P. Conlan 1 Lee Ann McCue 2 1,3 Thomas M. Smith 3 William Thompson 4 Charles E. Lawrence 4 1 Wadsworth Center, New York State Department of Health 2 Pacific Northwest National Laboratory 3 Department of Computer Science, Rensselaer Polytechnic Institute 4 Department of Applied Mathematics, Brown University International Conference in Phylogenomics 2006

Conclusions Introduction Goal: Finding Cis-Regulatory Elements De Novo Previous Work Take-Home Points Phylogenetic modeling (with full Felsenstein s Algorithm) helps Use of ensemble centroids helps New nucleotide selection model is intriguing

Goal: Finding Cis-Regulatory Elements De Novo Previous Work Goal: Finding Cis-Regulatory Elements De Novo What We re Looking For Seeking elements that are short: 6 30 bp Only partial conserved Isolated elements or multiple elements per module Single or multiple intergenic regions per genome Alignable and unalignable sequence data across genomes Measures of Success Sensitivity minimize false negatives Selectivity minimize false positives

Previous Work Introduction Goal: Finding Cis-Regulatory Elements De Novo Previous Work Non-Phylogenetic Algorithms Many good algorithms including Gibbs Recursive Sampler (Thompson et al., 2003) But need to be better when analyzing closely related species. Phylogenetic Algorithms Several good algorithms Non-statistical and/or two-species only PhyloGibbs (Siddharthan et al., 2005). Uses successive star-toplogy approximations, maximum likelihood But improvement is possible with full Felsenstein s Algorithm and with ensemble centroids

Gibbs Sampling Introduction Gibbs Sampling Felsenstein s Algorithm Ensemble Centroid Gibbs Sampling Overview Move from proposed solution to proposed solution via Gibbs Sampling. From any proposed set of sites Re-choose sites in one multi-sequence a, with probability conditioned on sites in remaining multi-sequences Iterate to explore parameter space. Explores each proposed set of sites with probability proportional to its likelihood. a An unalignable sequence or a set of aligned sequences

Gibbs Sampling Felsenstein s Algorithm Ensemble Centroid Probability Conditioned on Remaining Sites An slight oversimplification... Probability Calculation Current Iteration has a position-weight matrix, which gives current motif descriptions, and is built from counts & pseudocounts A position s weights parameterize a Dirichlet distribution, which is used to draw an equilibrium distribution The equilibrium is used to parameterize a nucleotide substitution model (e.g., HKY85). The substitution model is used to evaluate all positions posited to belong

Ensemble Centroid Introduction Gibbs Sampling Felsenstein s Algorithm Ensemble Centroid Computing the Ensemble Centroid With each sample from the Gibbs Sampler (after burn in iterations) For each sequence position record a 1 if it is part of a cis-element, record 0 otherwise. The vector of 0 s and 1 s is the corner of a hypercube Ensemble centroid = corner nearest to the center of mass of the collected samples

Advantages Introduction Gibbs Sampling Felsenstein s Algorithm Ensemble Centroid Advantages of Ensemble Centroid Expensive a posteriori probability calculation not needed Star-topology approximation unnecessary Gives entropic solutions their due

Synthesizing the Data Introduction Synthetic Escherichia coli Crp Binding Sites A data set: Eight species 500 bp sequences 2776 578 379 Expected number of substitutions 10 4 2645 Klebsiella pneumoniae 1470 60 Escherichia coli 247 Shigella flexneri 1268 693 Salmonella bongori 852 Salmonella enterica Typhi 1435 Citrobacter rodentium 5588 Proteus mirabilis 6619 Vibrio cholerae Gapless sequence data generated according to tree P. mirabilis and V. cholerae subsequently treated as not alignable

Synthesizing the Data Introduction Synthetic Escherichia coli Crp Binding Sites Five Collections of Data Sets Five collections of data sets: k {0, 1, 2, 3, 4} 100 data sets in each collection A data set is 8 sequences one for each species each of length 500 bp each with k planted Escherichia coli Crp binding sites related by phylogenetic tree Data Analysis Each data set run separately 500 runs total Accumulate results across data sets in each collection.

Sensitivity & Selectivity Synthetic Escherichia coli Crp Binding Sites E.g., second entry shows 116 E. coli sites found across 61 data sets. Our Algorithm (top) and PhyloGibbs (bottom) Data Collection #0 #1 #2 #3 #4 Sites Found 17/17 116/61 154/82 176/93 (True Positives) 0/0 13/8 54/26 75/35 False Sites 3/3 5/4 2/2 0/0 0/0 (False Positives) 47/46 60/51 63/44 40/30 30/24 Sites Missed 83/83 84/45 146/100 224/100 (False Negatives) 100/100 187/95 246/89 325/97 BRASS implementation of our algorithm (Smith, 2006), configured to find up to two sites per multi-sequence

Modeling How Nucleotides Evolve Nucleotide Substitution Model Conclusions References Existing Models Arbitrary equilibria Transition/transversion rate ratio Mutation rate variation within a genome Selection effects via scaled fixation rates (Halpern & Bruno, 1998) Context sensitive: Di- and tri-nucleotide models Indel support, though difficult with Felsenstein s Algorithm A New Model for Selection Effects Newberg (2005) allows that SNPs are not improbable. (I.e., without the specious fixation on species fixation.)

Nucleotide Substitution Model Conclusions References Traditional Nucleotide Substitution Model Traditional Mutation (without Selection) M x = = Pr[A A] Pr[C A] Pr[G A] Pr[T A] Pr[A C] Pr[C C] Pr[G C] Pr[T C] Pr[A G] Pr[C G] Pr[G G] Pr[T G] Pr[A T] Pr[C T] Pr[G T] Pr[T T] 0.96 0.01 0.02 0.01 0.01 0.96 0.01 0.02 0.02 0.01 0.96 0.01 0.01 0.02 0.01 0.96 Each row sums to 1.0.

Nucleotide Substitution Model Conclusions References A New Nucleotide Substitution Model Population Model for Selection (without Mutation) M x = = Pr[A A] Pr[C A] Pr[G A] Pr[T A] Pr[A C] Pr[C C] Pr[G C] Pr[T C] Pr[A G] Pr[C G] Pr[G G] Pr[T G] Pr[A T] Pr[C T] Pr[G T] Pr[T T] 1.1 0 0 0 0 1.0 0 0 0 0 1.0 0 0 0 0 1.0 Each row no longer sums to 1.0 but...

Nucleotide Substitution Model Conclusions References A New Nucleotide Substitution Model Starting with 100 organisms of each type... Population Model for Selection (without Mutation) M x = 1.1 0 0 0 0 1.0 0 0 0 0 1.0 0 0 0 0 1.0 (100, 100, 100, 100)M x = (110, 100, 100, 100) (110, 100, 100, 100) 410 = (0.268, 0.244, 0.244, 0.244)

Nucleotide Substitution Model Conclusions References A New Nucleotide Substitution Model Combining the two... Mutation and Selection M x = = Pr[A A] Pr[C A] Pr[G A] Pr[T A] Pr[A C] Pr[C C] Pr[G C] Pr[T C] Pr[A G] Pr[C G] Pr[G G] Pr[T G] Pr[A T] Pr[C T] Pr[G T] Pr[T T] 1.056 0.01 0.02 0.01 0.011 0.96 0.01 0.02 0.022 0.01 0.96 0.01 0.011 0.02 0.01 0.96

Nucleotide Substitution Model Conclusions References A New Nucleotide Substitution Model Some Details Each generation: mutation at DNA replication; selection between replications. Instantaneous rate formalism M x = exp(xr) still applies, so generation length need not be known. 2x invocations of Felsenstein s Algorithm, because each row no longer sums to 1.0. Easily computed, one-to-one correspondence between nucleotide equilibria θ and diagonal selection matrix a a assuming, e.g., asymptotic population stability

Conclusions Introduction Nucleotide Substitution Model Conclusions References Take-Home Points Phylogenetic modeling (with full Felsenstein s Algorithm) helps Use of ensemble centroids helps New nucleotide selection model is intriguing

Nucleotide Substitution Model Conclusions References Contact Information lnewberg c wadsworth org www.rpi.edu/~newbel/ Felsenstein, J. (1981) PubMed 7288891. Halpern, A. L. & Bruno, W. J. (1998) PubMed 9656490. Hasegawa, M., Kishino, H. & Yano, T. (1985) PubMed 3934395. Newberg, L. A. (2005). http://www.cs.rpi.edu/research/pdf/05-08.pdf. Siddharthan, R., Siggia, E. D. & van Nimwegen, E. (2005) PubMed 16477324. Smith, T. M. (2006). PhD thesis, Rensselaer Polytechnic Institute Troy, NY. In preparation. Thompson, W., Rouchka, E. C. & Lawrence, C. E. (2003) PubMed 12824370.