Theoretical distribution of PSSM scores


Regulatory Sequence Analysis

Theoretical distribution of PSSM scores

Jacques van Helden
Jacques.van-Helden@univ-amu.fr
Aix-Marseille Université, France
Technological Advances for Genomics and Clinics (TAGC, INSERM Unit U1090)
http://jacques.van-helden.perso.luminy.univmed.fr/

FORMER ADDRESS (1999-2011)
Université Libre de Bruxelles, Belgique
Bioinformatique des Génomes et des Réseaux (BiGRe lab)
http://www.bigre.ulb.ac.be/

Scanning a sequence with a position-specific scoring matrix

- P(S|M): probability for site S to be generated as an instance of the motif.
- P(S|B): probability for site S to be generated as an instance of the background.
- W: weight, i.e. the log ratio of the two probabilities above. A positive weight indicates that a site is more likely to be an instance of the motif than of the background.

$$P(S|M) = \prod_{j=1}^{w} f'_{r_j j} \qquad P(S|B) = \prod_{j=1}^{w} p_{r_j} \qquad W_S = \ln\left(\frac{P(S|M)}{P(S|B)}\right)$$

Sand, O., Turatsinze, J.V. and van Helden, J. (2008). Evaluating the prediction of cis-acting regulatory elements in genome sequences. In Frishman, D. and Valencia, A. (eds.), Modern genome annotation: the BioSapiens network. Springer.
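The weight defined on this slide can be illustrated with a minimal Python sketch. The 3-column frequency matrix, the equiprobable background and the function names `site_weight` and `scan` are all assumptions made for illustration; they are not taken from the slides or from RSAT.

```python
import math

# Hypothetical 3-column matrix of corrected frequencies f'[residue][column];
# these numbers are illustrative only, not taken from a real motif.
FREQ = {
    "A": [0.7, 0.1, 0.1],
    "C": [0.1, 0.1, 0.7],
    "G": [0.1, 0.7, 0.1],
    "T": [0.1, 0.1, 0.1],
}
# Assumed Bernoulli background with equiprobable residues.
PRIOR = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def site_weight(site, freq=FREQ, prior=PRIOR):
    """Return W_S = ln(P(S|M) / P(S|B)) for one site."""
    p_motif = 1.0
    p_background = 1.0
    for j, residue in enumerate(site):
        p_motif *= freq[residue][j]       # f'_{r_j j}
        p_background *= prior[residue]    # p_{r_j}
    return math.log(p_motif / p_background)


def scan(sequence, width=3):
    """Score every window of the given width along the sequence."""
    return [
        (start, site_weight(sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]


if __name__ == "__main__":
    for start, weight in scan("AAGCT"):
        print(start, round(weight, 3))
```

With this toy matrix, the window matching the consensus AGC gets a positive weight, whereas windows that resemble the background more than the motif get negative weights.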

Score distribution: random expectation

The theoretical distribution of probabilities for position-weight matrices has been discussed in several articles (Staden, 1989; Hertz & Stormo, 1999).

- The computation is based on the probability-generating function.
- This function can be used to compute the probability P(W) to obtain exactly a score value of W.
- Each position-weight matrix has its own probability distribution.

$$G_j(x) = \sum_{i} f_i \, x^{w_{ij}}$$

1. Staden, R. (1989). Methods for calculating the probabilities of finding patterns in sequences. Comput Appl Biosci 5, 89-96.
2. Bailey, T. L. & Gribskov, M. (1997). Score distributions for simultaneous matching to multiple motifs. J Comput Biol 4, 45-59.
3. Hertz, G. Z. & Stormo, G. D. (1999). Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15, 563-77.
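Multiplying the per-column generating functions G_j(x) over all columns gives a polynomial in which the coefficient of x^s is the probability of obtaining score s. Expanding that product amounts to convolving the columns one at a time, which can be sketched as follows. The function name `score_distribution`, the `decimals` parameter and the toy weights are assumptions for illustration; this is not the matrix-distrib code.

```python
from collections import defaultdict


def score_distribution(weights, prior, decimals=1):
    """Distribution of matrix scores under a Bernoulli background.

    weights: list of dicts, one per column, mapping residue -> weight.
    prior:   dict mapping residue -> background probability.
    Weights are rounded to `decimals` and handled as integers so that
    equal scores reached by different residue combinations are merged.
    """
    scale = 10 ** decimals
    dist = {0: 1.0}  # before any column: score 0 with probability 1
    for column in weights:
        updated = defaultdict(float)
        for score, prob in dist.items():
            for residue, weight in column.items():
                updated[score + round(weight * scale)] += prob * prior[residue]
        dist = dict(updated)
    return {score / scale: prob for score, prob in dist.items()}


if __name__ == "__main__":
    # Hypothetical 2-column weight matrix and equiprobable prior.
    column = {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0}
    prior = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    for score, prob in sorted(score_distribution([column, column], prior).items()):
        print(score, prob)
```

Because scores are merged after each column, the number of stored entries is bounded by the number of distinct rounded scores rather than by the number of possible sites.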

Theoretical distribution of matrix scores

The RSAT program matrix-distrib computes the distribution of score probabilities for a given PSSM.

The distribution is completely determined by
- the values in the cells of the matrix;
- the prior residue probabilities.

This method can be used to compute
- P(X = x | M): the probability to obtain by chance a given score x with a given matrix;
- P(X >= x | M): the probability to obtain by chance a score higher than or equal to x. This inverse cumulative distribution gives a P-value, which indicates the risk of false positive for a given score.

Computing time increases exponentially with the number of columns if every combination of residues is enumerated, but once the weights are rounded to a fixed precision it becomes asymptotically linear.

The original method is based on a Bernoulli assumption for the background model, but we extended it to Markov chains. Computing time increases exponentially with Markov order.

Turatsinze, J. V., Thomas-Chollier, M., Defrance, M. and van Helden, J. matrix-scan: predicting transcription factor binding sites and cis-regulatory modules (in prep.).
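The step from P(X = x | M) to the P-value P(X >= x | M) is a cumulative sum taken from the highest score downwards. The sketch below assumes the score distribution is already available as a dictionary; the names `inverse_cumulative` and `threshold_for` and the example distribution are hypothetical.

```python
def inverse_cumulative(dist):
    """Turn {score: P(X = score)} into {score: P(X >= score)}."""
    pvalues = {}
    running = 0.0
    for score in sorted(dist, reverse=True):  # highest score first
        running += dist[score]
        pvalues[score] = running
    return pvalues


def threshold_for(dist, max_pvalue):
    """Lowest score whose P-value does not exceed max_pvalue (None if none)."""
    best = None
    for score, pvalue in sorted(inverse_cumulative(dist).items(), reverse=True):
        if pvalue > max_pvalue:
            break
        best = score
    return best


if __name__ == "__main__":
    # Hypothetical distribution of a small weight matrix.
    example = {-2.0: 0.5625, 0.0: 0.375, 2.0: 0.0625}
    print(inverse_cumulative(example))
    print(threshold_for(example, 0.1))
```

Choosing a score threshold from a target P-value in this way is one possible use of the inverse cumulative distribution when scanning sequences.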

Theoretical distribution for the PHO matrix

The program matrix-distrib (RSAT) computes the complete theoretical distribution of scores for a given PSSM, using the algorithm proposed by Staden (1989) and previously implemented in patser (Hertz, 1990, 1999) and MAST (Bailey, 1994, 1997).

The theoretical distribution P(W) is quite erratic, because each possible score value has its own probability, which depends on the actual weight values in the matrix and on the prior residue probabilities.

Figure (Yeast PHO4): probability distribution of the weight score according to a Bernoulli model.
- P(W = w | M): probability to obtain by chance a precise weight score w with a given matrix.
- P(W >= w | M): probability to obtain by chance a weight score higher than or equal to w. This inverse cumulative distribution gives a P-value, which indicates the risk of false positive for a given score.

Theoretical distribution for the PHO matrix

matrix-distrib also supports computation of P-values with Markov models of any order (algorithm adapted from Touzet & Varré, 2007).

- Computing time increases exponentially with the number of columns, but by rounding values, it is asymptotically linear.
- The original method is based on a Bernoulli assumption for the background model, but we extended it to Markov chains. Computing time increases exponentially with Markov order.

Figure (Yeast PHO4): probability distribution of the weight score according to a Markov model of order 1.
- P(W = w | M): probability to obtain by chance a precise weight score w with a given matrix.
- P(W >= w | M): probability to obtain by chance a weight score higher than or equal to w. This inverse cumulative distribution gives a P-value, which indicates the risk of false positive for a given score.

Touzet, H. and Varré, J.S. (2007). Efficient and accurate P-value computation for Position Weight Matrices. Algorithms Mol Biol 2, 15.
Turatsinze, J. V., Thomas-Chollier, M., Defrance, M. and van Helden, J. (2008). matrix-scan: predicting transcription factor binding sites and cis-regulatory modules (in prep.).
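To see why a Markov background costs more than a Bernoulli one, consider a plain dynamic-programming sketch for order 1: the running score alone no longer suffices, and the previous residue must be tracked as well. This is only an illustration under stated assumptions, not the algorithm of Touzet & Varré (2007) nor the matrix-distrib implementation; the function name, parameters and toy values are hypothetical.

```python
from collections import defaultdict


def score_distribution_markov1(weights, initial, transition, decimals=1):
    """Score distribution under an order-1 Markov background.

    weights:    list of dicts, one per column, residue -> weight.
    initial:    dict residue -> probability of the first residue.
    transition: dict prev -> dict residue -> P(residue | prev).
    """
    scale = 10 ** decimals
    # State = (previous residue, integer score) -> probability.
    state = {
        (residue, round(weights[0][residue] * scale)): prob
        for residue, prob in initial.items()
    }
    for column in weights[1:]:
        updated = defaultdict(float)
        for (prev, score), prob in state.items():
            for residue, weight in column.items():
                key = (residue, score + round(weight * scale))
                updated[key] += prob * transition[prev][residue]
        state = dict(updated)
    # Sum over the last residue to obtain the distribution of scores.
    dist = defaultdict(float)
    for (_, score), prob in state.items():
        dist[score / scale] += prob
    return dict(dist)


if __name__ == "__main__":
    column = {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0}
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    # Hypothetical transitions: A is always followed by A.
    transition = {r: dict(uniform) for r in "ACGT"}
    transition["A"] = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
    print(score_distribution_markov1([column, column], uniform, transition))
```

For order m the state would have to hold the last m residues, so the number of states grows as 4^m, which is consistent with the exponential dependence on Markov order stated above.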

Theoretical distributions

We used matrix-distrib to analyse the theoretical distributions for some matrices according to various background (BG) models:
- IID (independently and identically distributed) nucleotides (blue);
- Markov chains of orders 1 to 5, trained on the whole set of upstream sequences of the considered organism.

Figure panels: Yeast PHO4, Drosophila Kr, Mammalian Sp1.

Impact of Markov order on the right tail of theoretical distributions

In Escherichia coli, using a higher-order background model has a weak effect on the core of the distribution, but affects its right tail, which corresponds to high-scoring sites.

For TrpR, we observe differences for lower-scoring sites, due to the presence of a particular 4-nucleotide word in the motif.