Position-specific scoring matrices (PSSM)

Similar documents
Theoretical distribution of PSSM scores

Substitution matrices

Matrix-based pattern matching

Matrix-based pattern discovery algorithms

Alignment. Peak Detection

Quantitative Bioinformatics

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling

MEME - Motif discovery tool REFERENCE TRAINING SET COMMAND LINE SUMMARY

Bioinformatics. Transcriptome

Simulation of the Evolution of Information Content in Transcription Factor Binding Sites Using a Parallelized Genetic Algorithm

98 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, December 6, 2006

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008

Motivating the need for optimal sequence alignments...

Jianlin Cheng, PhD. Department of Computer Science University of Missouri, Columbia. Fall, 2014

RNALogo: a new approach to display structural RNA alignment

Pairwise sequence alignment

Sequence analysis and comparison

BLAST: Target frequencies and information content Dannie Durand

Computational Biology and Chemistry

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Introduction to Bioinformatics

Motifs and Logos. Six Introduction to Bioinformatics. Importance and Abundance of Motifs. Getting the CDS. From DNA to Protein 6.1.

Gibbs Sampling Methods for Multiple Sequence Alignment

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

QB LECTURE #4: Motif Finding

Information Tutorial

Transcrip:on factor binding mo:fs

Modeling Motifs Collecting Data (Measuring and Modeling Specificity of Protein-DNA Interactions)

On the Monotonicity of the String Correction Factor for Words with Mismatches

Introduction to Bioinformatics Online Course: IBT

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Lecture Notes: Markov chains

DNA Binding Proteins CSE 527 Autumn 2007

Introduction to Bioinformatics

Intro Protein structure Motifs Motif databases End. Last time. Probability based methods How find a good root? Reliability Reconciliation analysis

Computational Genomics and Molecular Biology, Fall

CSCE 478/878 Lecture 9: Hidden. Markov. Models. Stephen Scott. Introduction. Outline. Markov. Chains. Hidden Markov Models. CSCE 478/878 Lecture 9:

Assessing the quality of position-specific scoring matrices

Stephen Scott.

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

2 Spial. Chapter 1. Chapter 2 Chapter 3 Chapter 4 Chapter 5 Chapter 6. Pathway level. Atomic level. Cellular level. Proteome level.

Markov Models & DNA Sequence Evolution

CSE 527 Autumn Lectures 8-9 (& part of 10) Motifs: Representation & Discovery

Graph Alignment and Biological Networks

Algorithmische Bioinformatik WS 11/12:, by R. Krause/ K. Reinert, 14. November 2011, 12: Motif finding

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

De novo identification of motifs in one species. Modified from Serafim Batzoglou s lecture notes

MCMC: Markov Chain Monte Carlo

A genomic-scale search for regulatory binding sites in the integration host factor regulon of Escherichia coli K12

Tools and Algorithms in Bioinformatics

Neyman-Pearson. More Motifs. Weight Matrix Models. What s best WMM?

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics Chapter 1. Introduction

Exercise 5. Sequence Profiles & BLAST

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:

Local Alignment Statistics

A Method for Aligning RNA Secondary Structures

Inferring Models of cis-regulatory Modules using Information Theory

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Advanced topics in bioinformatics

13 Comparative RNA analysis

Sequence and Structure Alignment Z. Luthey-Schulten, UIUC Pittsburgh, 2006 VMD 1.8.5

Pairwise sequence alignment and pair hidden Markov models

Sequence Database Search Techniques I: Blast and PatternHunter tools

Hidden Markov Models (HMMs) and Profiles

An Introduction to Sequence Similarity ( Homology ) Searching

Discovering MultipleLevels of Regulatory Networks

Inferring Models of cis-regulatory Modules using Information Theory

Basic Local Alignment Search Tool

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Computational Biology: Basics & Interesting Problems

F. Piazza Center for Molecular Biophysics and University of Orléans, France. Selected topic in Physical Biology. Lecture 1

Bandwidth: Communicate large complex & highly detailed 3D models through lowbandwidth connection (e.g. VRML over the Internet)

Computational Molecular Biology (

Lesson Overview The Structure of DNA

Quantifying sequence similarity

UDC Comparative Analysis of Methods Recognizing Potential Transcription Factor Binding Sites

Homologous proteins have similar structures and structural superposition means to rotate and translate the structures so that corresponding atoms are

Sequence variability,long-range. dependence and parametric entropy

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Similarity Analysis between Transcription Factor Binding Sites by Bayesian Hypothesis Test *

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Chapter 9 DNA recognition by eukaryotic transcription factors

Phylogeny Estimation and Hypothesis Testing using Maximum Likelihood

Hidden Markov Models and some applications

Study and Implementation of Various Techniques Involved in DNA and Protein Sequence Analysis

Scoring Matrices. Shifra Ben-Dor Irit Orr

Week 10: Homology Modelling (II) - HHpred

Computational Genomics and Molecular Biology, Fall

Genome 559 Wi RNA Function, Search, Discovery

Chapter 7: Regulatory Networks

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 5 G R A T I V. Pair-wise Sequence Alignment

GS 559. Lecture 12a, 2/12/09 Larry Ruzzo. A little more about motifs

Algorithms in Bioinformatics II SS 07 ZBIT, C. Dieterich, (modified script of D. Huson), April 25,

Information Theory Primer With an Appendix on Logarithms

DNA THE CODE OF LIFE 05 JULY 2014

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Transcription:

Regulatory Sequence nalysis Position-specific scoring matrices (PSSM) Jacques van Helden Jacques.van-Helden@univ-amu.fr Université d ix-marseille, France Technological dvances for Genomics and Clinics (TGC, INSERM Unit U1090) http://jacques.van-helden.perso.luminy.univmed.fr/ FORMER DDRESS (1999-2011) Université Libre de Bruxelles, Belgique Bioinformatique des Génomes et des Réseaux (BiGRe lab) http://www.bigre.ulb.ac.be/ 1

Consensus representation The TRNSFC database contains 8 binding sites for the yeast transcription factor Pho4p 5/8 contain the core of high-affinity binding sites (CCGTG) 3/8 contain the core of medium-affinity binding sites (CCGTT) The IUPC ambigous nucleotide code allows to represent variable residues. 15 letters to represent any possible combination between the 4 nucleotides (2 4 1=15). This representation however gives a poor idea of the relative importance of residues. R06098 \TCCCGTGGG\! R06099 \GGCCCGTGCG\! R06100 \TGCCGTGGGT\! R06102 \CGCCGTGGGG\! R06103 \TTCCCGTGCG\! R06104 \CGCCGTTGGT\! R06097 \CGCCGTTTTC\! R06101 \TCCCGTTTTC\!! Cons nnvccgtkbdn! IUPC ambiguous nucleotide code denine C C Cytosine G G Guanine T T Thymine R or G purine Y C or T pyrimidine W or T Weak hydrogen bonding S G or C Strong hydrogen bonding M or C amino group at common position K G or T Keto group at common position H, C or T not G B G, C or T not V G,, C not T D G, or T not C N G,, C or T any TRNSFC pulic version: http://www.gene-regulation.com/cgi-bin/pub/databases/transfac/search.cgi 2

Regulatory Sequence nalysis From alignments to weights 3

Residue count matrix Tom Schneider s sequence logo (generated with Web Logo http://weblogo.berkeley.edu/logo.cgi) Schneider and Stephens. Sequence logos: a new way to display consensus sequences. Nucleic cids Res (1990) vol. 18 (20) pp. 6097-100. PMID 2172928. 4

Frequency matrix f i, j = " n i, j n i, j alphabet size (=4) n i,j, occurrences of residue i at position j p i prior residue probability for residue i f i,j relative frequency of residue i at position j Reference: Hertz (1999). Bioinformatics 15:563-577. PMID 10487864 5

Corrected frequency matrix 1st option: identically distributed pseudo-weight f i, j = n i, j + k / " n i, j + k 2nd option: pseudo-weighst distributed according to residue priors f i, j = n i, j + p i k " n i, j + k alphabet size (=4) n i,j, occurrences of residue i at position j p i prior residue probability for residue i f i,j relative frequency of residue i at position j k pseudo weight (arbitrary, 1 in this case) f i,j corrected frequency of residue i at position j Reference: Hertz (1999). Bioinformatics 15:563-577. PMID 10487864 6

Weight matrix (Bernoulli model) Prior Pos 1 2 3 4 5 6 7 8 9 10 11 12 0.325-0.79 0.13-0.23-2.20 1.05-2.20-2.20-2.20-2.20-2.20-0.79-0.23 0.175 C 0.32 0.32 0.70 1.65-2.20 1.65-2.20-2.20-2.20 0.32-2.20 0.32 0.175 G -0.29 0.32 0.70-2.20-2.20-2.20 1.65-2.20 1.19 0.97 1.19 0.32 0.325 T 0.39-0.79-2.20-2.20-2.20-2.20-2.20 1.05 0.13-0.23-0.23-0.23 1.000 Sum -0.37-0.02-1.02-4.94-5.55-4.94-4.94-5.55-3.08-1.13-2.03 0.19 f i, j = n + p k i, j i! n r, j + k r=1 " W i, j = ln f % i, j $ # p i & alphabet size (=4) n i,j, occurrences of residue i at position j p i prior residue probability for residue i f i,j relative frequency of residue i at position j k pseudo weight (arbitrary, 1 in this case) f i,j corrected frequency of residue i at position j W i,j weight of residue i at position j The use of a weight matrix relies on Bernoulli assumption If we assume, for the background model, an independent succession of nucleotides (Bernoulli model), the weight W S of a sequence segment S is simply the sum of weights of the nucleotides at successive positions of the matrix (W i,j ). In this case, it is convenient to convert the PSSM into a weight matrix, which can then be used to assign a score to each position of a given sequence. Reference: Hertz (1999). Bioinformatics 15:563-577. PMID 10487864 7

Properties of the weight function " W i, j = ln f i, j $ # p i % & f i, j = n i, j + p i k " n i, j + k " f i, j =1 The weight is positive when f i,j > p i (favourable positions for the binding of the transcription factor) negative when f i,j < p i (unfavourable positions) 8

Regulatory Sequence nalysis Information content 9

Shannon uncertainty Shannon uncertainty H s (j): uncertainty of a column of a PSSM Hg: uncertainty of the background (e.g. a genome) Special cases of uncertainty (for a 4 letter alphabet) min(h)=0 No uncertainty at all: the nucleotide is completely specified (e.g. p={1,0,0,0}) H=1 Uncertainty between two letters (e.g. p= {0.5,0,0,0.5}) R seq R * seq max(h) = 2 (Complete uncertainty) One bit of information is required to specify the choice between each alternative (e.g. p={0.25,0.25,0.25,0.25}). Two bits are required to specify a letter in a 4-letter alphabet. Schneider (1986) defines an information content based on Shannons uncertainty. For skewed genomes (i.e. unequal residue probabilities), Schneider recommends an alternative formula for the information content. This is the formula that is nowadays used. H s # ( j)= " f i, j log 2 ( f i, j ) # H g = " p i log 2 ( p i ) R seq * R seq ( j) = H g " H s j ( ) R seq = R seq ( j) p i w # j=1 $ f ( j) = f i, j log i, j w * * # 2 & ) R seq = # R seq % ( Shannon j=1 ( j) dapted from Schneider et al. Information content of binding sites on nucleotide sequences. J Mol Biol (1986) vol. 188 (3) pp. 415-31. PMID 3525846. 10

Information content of a PSSM f i, j = n i, j + p i k " n i, j + k I i, j = f i, j " ln f i, j $ # p i % I j = " I i, j & I matrix = w "" j=1 I i, j alphabet size (=4) n i,j, occurrences of residue i at position j w matrix width (=12) p i prior residue probability for residue i f i,j relative frequency of residue i at position j k pseudo weight (arbitrary, 1 in this case) f i,j corrected frequency of residue i at position j W i,j weight of residue i at position j I i,j information of residue i at position j Reference: Hertz (1999). Bioinformatics 15:563-577. PMID 10487864 11

Information content I ij of a cell of the matrix For a given cell of the matrix I ij is positive when f ij > p i (i.e. when residue i is more frequent at position j than expected by chance) I ij is negative when f ij < p i I ij tends towards 0 when f ij -> 0 because limit x->0 (x ln(x)) = 0 Information content as a function of residue frequency Information content as a function of residue frequency (log scale) 12

Information content of a column of the matrix For a given column i of the matrix The information of the column (I j ) is the sum of information of its cells. I j is always positive I j is 0 when the frequency of all residues equal their prior probability (f ij =p i ) I j is maximal when the residue i m with the lowest prior probability has a frequency of 1 (all other residues have a frequency of 0) and the pseudo-weight is null (k=0). # I j = " I i, j = f i, j ln f & i, j " % ( $ p i i m = argmin i ( p i ) k = 0 # max(i j ) =1" ln 1 & % ( = )ln(p $ p i ) i 13

Schneider logos Schneider & Stephens(1990) propose a graphical representation based on his previous entropy (H) for representing the importance of each residue at each position of an alignment. He provides a new formula for Rseq H s (j) uncertainty of column j R seq (j) information content of column j (beware, this definition differs from Hertz information content) e(n) correction for small samples (pseudo-weight) Remarks This information content does not include any correction for the prior residue probabilities (pi) This information content is expressed in bits. Boundaries min(r seq )=0 equiprobable residues max(r seq )=2 perfect conservation of 1 residue with a pseudo-weight of 0, Sequence logos can be generated from aligned sequences on the Weblogo server http://weblogo.berkeley.edu/logo.cgi From matrices or sequences on enologos http://www.benoslab.pitt.edu/cgi-bin/enologos/enologos.cgi H s R seq # ( j)= " f ij log 2 ( f ij ) ( j) = 2 " H s j h ij = f ij R seq ( j) ( ) + e( n) Pho4p binding motif Schneider and Stephens. Sequence logos: a new way to display consensus sequences. Nucleic cids Res (1990) vol. 18 (20) pp. 6097-100. PMID 2172928. 14

Information content of the matrix The total information content represents the capability of the matrix to make the distinction between a binding site (represented by the matrix) and the background model. The information content also allows to estimate an upper limit for the expected frequency of the binding sites in random sequences. The pattern discovery program consensus (developed by Jerry Hertz) optimises the information content in order to detect over-represented motifs. Note that this is not the case of all pattern discovery programs: the gibbs sampler algorithm optimizes a log-likelihood. I matrix = w "" j=1 I i, j P( site) " e #I matrix Reference: Hertz (1999). Bioinformatics 15:563-577. PMID 10487864. 15

Information content: effect of prior probabilities The upper bound of I j increases when p i decreases I j -> Inf when p i -> 0 The information content, as defined by Gerald Hertz, has thus no upper bound. 16

References - PSSM information content Seminal articles by Tom Schneider Schneider, T.D., G.D. Stormo, L. Gold, and. Ehrenfeucht. 1986. Information content of binding sites on nucleotide sequences. J Mol Biol 188: 415-431. Schneider, T.D. and R.M. Stephens. 1990. Sequence logos: a new way to display consensus sequences. Nucleic cids Res 18: 6097-6100. Tom Schneiders publications online http://www.lecb.ncifcrf.gov/~toms/paper/index.html Seminal article by Gerald Hertz Hertz, G.Z. and G.D. Stormo. 1999. Identifying DN and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15: 563-577. Software tools to draw sequence logos Weblogo http://weblogo.berkeley.edu/logo.cgi Enologos http://biodev.hgen.pitt.edu/cgi-bin/enologos/enologos.cgi 17