MEME - Motif discovery tool


MEME version 3.0 (Release date: 2002/04/02 00:11:59)

For further information on how to interpret these results or to get a copy of the MEME software please access http://meme.sdsc.edu.

This file may be used as input to the MAST algorithm for searching sequence databases for matches to groups of motifs. MAST is available for interactive use and downloading at http://meme.sdsc.edu.

REFERENCE

If you use this program in your research, please cite:

Timothy L. Bailey and Charles Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, California, 1994.

TRAINING SET

DATAFILE= training_set_full.fasta
ALPHABET= ACDEFGHIKLMNPQRSTVWY

Sequence name             Weight  Length
-------------             ------  ------
Pfa3D7 chr7_000020.gen_1  1.0000  100
Pfa3D7 pfal_chr2 PFB0095  1.0000  100
Pfa3D7 pfal_chr2 PFB0100  1.0000  100
Pfa3D7 chr10 PF10_0159 A  1.0000  100
Pfa3D7 pfal_chr5 PFE0040  1.0000  100

COMMAND LINE SUMMARY

This information can also be useful in the event you wish to report a problem with the MEME software.

command: meme training_set_full.fasta -mod oops -w 5

model:  mod= oops  nmotifs= 1  evt= inf
object function= E-value of product of p-values
width:  minw= 5  maxw= 5  minic= 0.00
width:  wg= 11  ws= 1  endgaps= yes

file:///c /Documents%20and%20Settings/haldarlab/Desktop/Supplemental/meme1.html (1 of 8) [7/11/2004 4:40:13 PM]

nsites: minsites= 5  maxsites= 5  wnsites= 0.8
theta:  prob= 1  spmap= pam  spfuzz= 120
em:     prior= dmix  b= 0  maxiter= 50  distance= 1e-05
data:   n= 500  N= 5
sample: seed= 0  seqfrac= 1

Dirichlet mixture priors file: prior30.plib

Letter frequencies in dataset:
A 0.098  C 0.010  D 0.060  E 0.078  F 0.016  G 0.028  H 0.136  I 0.030  K 0.106  L 0.052
M 0.006  N 0.070  P 0.024  Q 0.068  R 0.052  S 0.046  T 0.046  V 0.050  W 0.002  Y 0.022

Background letter frequencies (from dataset with add-one prior applied):
A 0.096  C 0.012  D 0.060  E 0.077  F 0.017  G 0.029  H 0.133  I 0.031  K 0.104  L 0.052
M 0.008  N 0.069  P 0.025  Q 0.067  R 0.052  S 0.046  T 0.046  V 0.050  W 0.004  Y 0.023

MOTIF 1  width = 5  sites = 5  llr = 56  E-value = 1.3e+002

Simplified pos.-specific probability matrix:

A  : : : 6 :
C  : : : : :
D  : : : : :
E  : : : : 6
F  : : : : :
G  : : : : :
H  : : : 2 :
I  : 4 : : :
K  : : : : :
L  : 2 a : :
M  : : : : :
N  : : : : :
P  : : : : :
Q  : : : : 4
R  a : : : :
S  : 2 : 2 :
T  : 2 : : :
V  : : : : :
W  : : : : :
Y  : : : : :

[Information content diagram; vertical axis in bits]

Total information content: 16.2 bits

Multilevel  R I L A E
consensus     L   H Q
sequence      S   S
              T

NAME                      START  P-VALUE   SITES
Pfa3D7 chr10 PF10_0159 A  15     6.14e-07  KAVDYGFRES RILAE GEDTCARKEK
Pfa3D7 pfal_chr5 PFE0040  19     1.44e-06  SIKNNNDRYV RILSE TEPPMSLEEI
Pfa3D7 pfal_chr2 PFB0100  20     8.53e-06  SGDSFDFRNK RTLAQ KQHEHHHHHH
Pfa3D7 pfal_chr2 PFB0095  25     9.65e-06  VYNIFEIRLK RSLAQ VLGNTRLSSR
Pfa3D7 chr7_000020.gen_1  18     4.92e-05  KNAKGLNLNK RLLHE TQAHVDDAHH

Motif 1 block diagrams

Name                      Lowest p-value  Motifs
Pfa3D7 chr10 PF10_0159 A  6.1e-07         1
Pfa3D7 pfal_chr5 PFE0040  1.4e-06         1
Pfa3D7 pfal_chr2 PFB0100  8.5e-06         1
Pfa3D7 pfal_chr2 PFB0095  9.7e-06         1
Pfa3D7 chr7_000020.gen_1  4.9e-05         1
SCALE: 1 25 50 75

Motif 1 in BLOCKS format

BL   MOTIF 1 width=5 seqs=5
Pfa3D7 chr10 PF10_0159 A ( 15) RILAE 1
Pfa3D7 pfal_chr5 PFE0040 ( 19) RILSE 1
Pfa3D7 pfal_chr2 PFB0100 ( 20) RTLAQ 1
Pfa3D7 pfal_chr2 PFB0095 ( 25) RSLAQ 1
Pfa3D7 chr7_000020.gen_1 ( 18) RLLHE 1
//

Motif 1 position-specific scoring matrix

log-odds matrix: alength= 20 w= 5 n= 480 bayes= 6.56986 E= 1.3e+002
 -378  -203  -440  -456  -324  -270  -435  -306  -200  -309  -194  -369  -254  -293   411  -293  -328  -416  -106  -340
 -165   -65  -328  -293    -1  -162  -454   367  -312   144   156  -286  -196  -283  -235    98   127    93   -20  -112
 -354  -210  -486  -436   -17  -326  -584     5  -463   402   158  -473  -290  -380  -340  -322  -272  -106  -102  -230
  268   -11  -258  -247  -148   -21  -102  -154  -291  -174   -13  -259  -214  -278  -230   125  -110   -80  -112  -230
 -206  -389   -18   279  -329  -166  -439  -273  -192  -266  -106  -239  -174   193  -203  -153  -190  -259  -257  -313

Motif 1 position-specific probability matrix

letter-probability matrix: alength= 20 w= 5 n= 480 E= 1.3e+002
 0.007020  0.002817  0.002823  0.003256  0.001836  0.004443  0.006508  0.003699  0.025876  0.006119  0.002011  0.005356  0.004313  0.008859  0.897413  0.006070  0.004739  0.002801  0.001850  0.002190
 0.030571  0.007371  0.006155  0.010107  0.017184  0.009415  0.005699  0.390698  0.011973  0.140988  0.022756  0.009565  0.006425  0.009474  0.010151  0.091188  0.111069  0.095261  0.003356  0.010594
 0.008279  0.002685  0.002049  0.003747  0.015422  0.003007  0.002312  0.031813  0.004183  0.845274  0.022920  0.002618  0.003358  0.004824  0.004909  0.004967  0.007026  0.024032  0.001899  0.004678
 0.618100  0.010671  0.009978  0.013906  0.006220  0.024888  0.065502  0.010565  0.013773  0.015552  0.007015  0.011480  0.005689  0.009767  0.010564  0.109558  0.021604  0.028694  0.001774  0.004700
 0.022985  0.000778  0.052449  0.532784  0.001769  0.009124  0.006349  0.004638  0.027404  0.008211  0.003699  0.013226  0.007480  0.256397  0.012746  0.016011  0.012341  0.008315  0.000649  0.002645

Time 0.19 secs.

SUMMARY OF MOTIFS

Combined block diagrams: non-overlapping sites with p-value < 0.0001

Name                      Combined p-value  Motifs
Pfa3D7 chr7_000020.gen_1  4.72e-03          1
Pfa3D7 pfal_chr2 PFB0095  9.27e-04          1
Pfa3D7 pfal_chr2 PFB0100  8.19e-04          1
Pfa3D7 chr10 PF10_0159 A  5.89e-05          1
Pfa3D7 pfal_chr5 PFE0040  1.39e-04          1
SCALE: 1 25 50 75

Stopped because nmotifs = 1 reached.

CPU: bigbird

EXPLANATION OF MEME RESULTS

The MEME results consist of:

- The version of MEME and the date it was released.
- The reference to cite if you use MEME in your research.
- A description of the sequences you submitted (the "training set") showing the name, "weight" and length of each sequence.
- The command line summary detailing the parameters with which you ran MEME.
- Information on each of the motifs MEME discovered, including:
  1. A summary line showing the width, number of occurrences, log likelihood ratio and statistical significance of the motif.
  2. A simplified position-specific probability matrix.
  3. A diagram showing the degree of conservation at each motif position.
  4. A multilevel consensus sequence showing the most conserved letter(s) at each motif position.
  5. The occurrences of the motif sorted by p-value and aligned with each other.
  6. Block diagrams of the occurrences of the motif within each sequence in the training set.
  7. The motif in BLOCKS or FASTA format.
  8. A position-specific scoring matrix (PSSM) for use by the MAST database search program.
  9. The position-specific probability matrix (PSPM) describing the motif.
- A summary of motifs showing an optimized (non-overlapping) tiling of all of the motifs onto each of the sequences in the training set.
- The reason why MEME stopped and the name of the CPU on which it ran.

- This explanation of how to interpret MEME results.

MOTIFS

For each motif that it discovers in the training set, MEME prints the following information:

Summary Line

This line gives the width (`width'), number of occurrences in the training set (`sites'), log likelihood ratio (`llr') and E-value of the motif. Each motif describes a pattern of a fixed width--no gaps are allowed in MEME motifs. MEME numbers the motifs consecutively from one as it finds them. MEME usually finds the most statistically significant (low E-value) motifs first.

The statistical significance of a motif is based on its log likelihood ratio, its width and number of occurrences, the background letter frequencies (given in the command line summary), and the size of the training set. The E-value is an estimate of the expected number of motifs with the given log likelihood ratio (or higher), and with the same width and number of occurrences, that one would find in a similarly sized set of random sequences. (In random sequences each position is independent with letters chosen according to the background letter frequencies.)

The log likelihood ratio is the logarithm of the ratio of the probability of the occurrences of the motif given the motif model (likelihood given the motif) versus their probability given the background model (likelihood given the null model). (Normally the background model is a 0-order Markov model using the background letter frequencies, but higher order Markov models may be specified via the -bfile option to MEME.)

Clicking on the buttons to the left of the motif summary line takes you to the previous motif (P) or next motif (N).

Simplified Position-Specific Probability Matrix

MEME motifs are represented by position-specific probability matrices that specify the probability of each possible letter appearing at each possible position in an occurrence of the motif.
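As a rough illustration of such a matrix, the sketch below counts letter frequencies column by column over the five aligned sites reported for motif 1 in this output. It is not MEME's own estimator: the letter-probability matrix printed by MEME also reflects its prior (here a Dirichlet mixture), so the raw frequencies differ from the printed values (for example, R at position 1 is 1.0 here but 0.897413 in the printed matrix). The function name `site_frequencies` is hypothetical and chosen only for this example.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

# The five motif 1 sites listed in this MEME output.
SITES = ["RILAE", "RILSE", "RTLAQ", "RSLAQ", "RLLHE"]


def site_frequencies(sites, alphabet=ALPHABET):
    """Return one {letter: frequency} dict per motif position.

    Raw observed frequencies only; no prior or pseudocounts are applied,
    which is an assumption of this sketch and not what MEME itself does.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same width")
    matrix = []
    for pos in range(width):
        counts = Counter(site[pos] for site in sites)
        matrix.append({letter: counts[letter] / len(sites) for letter in alphabet})
    return matrix


freqs = site_frequencies(SITES)
for pos, column in enumerate(freqs, start=1):
    # Show only the letters that actually occur at this position.
    observed = {letter: p for letter, p in column.items() if p > 0}
    print(pos, observed)
```

Each row of the result corresponds to one motif position and each key to one letter of the alphabet, which is the same orientation MEME uses when it prints the matrix "sideways".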
In order to make it easier to see which letters are most likely in each of the columns of the motif, the simplified motif shows the letter probabilities multiplied by 10 and rounded to the nearest integer. Zeros are replaced by ":" (the colon) for readability.

Information Content Diagram

The information content diagram provides an idea of which positions in the motif are most highly conserved. Each column (position) in a motif can be characterized by the amount of information it contains (measured in bits). Highly conserved positions in the motif have high information; positions where all letters are equally likely have low information. (The information content is relative to the background letter frequencies, which are given in the command line summary section.) The diagram is printed so that each column lines up with the same column in the simplified position-specific probability matrix above it.

Columns in the information content diagram are colored according to the majority category of the letters occurring in that column of the alignment. If no letter category has frequency above 0.5, the column in the diagram is colored black. For DNA sequences, the letter categories contain one letter each. For proteins, the categories are based on the biochemical properties of the various amino acids. The categories and their colors are:

NUCLEIC ACIDS  COLOR
A              RED
C              BLUE
G              ORANGE
T              GREEN

AMINO ACIDS    COLOR
ACFILVM        BLUE
NQST           GREEN
DE             MAGENTA
KR             RED
H              PINK
G              ORANGE
P              YELLOW
Y              TURQUOISE

Summing the information content for each position in the motif gives the total information content of the motif (shown in parentheses to the left of the diagram). The total information content is approximately equal to the log likelihood ratio divided by the product of the number of occurrences and ln(2). The total information content gives a measure of the usefulness of the motif for database searches. For a motif to be useful for database searches, it must as a rule contain at least log_2(N) bits of information, where N is the number of sequences in the database being searched. For example, to effectively search a database containing 100,000 sequences for occurrences of a single motif, the motif should have an IC of at least 16.6 bits. Motifs with lower information content are still useful when a family of sequences shares more than one motif since they can be combined in multiple motif searches (using MAST).

Multilevel Consensus Sequence

The multilevel consensus sequence corresponding to the motif is an aid in remembering and understanding the motif. It is calculated from the motif position-specific probability matrix as follows. Separately for each column of the motif, the letters in the alphabet are sorted in decreasing order by the probability with which they are expected to occur in that position of motif occurrences. The sorted letters are then printed vertically with the most probable letter on top. Only letters with probabilities of 0.2 or higher at that position in the motif are printed. As an example, the multilevel consensus sequence of motif 1 in the sample output is:

Multilevel  TTATGTGAACGACGTCACACT
consensus   AA  T A G A GA     AA
sequence            T C TT     T

This multilevel consensus sequence says several things about the motif. First, the most likely form of the motif can be read from the top line as TTATGTGAACGACGTCACACT.
Second, only letter A has probability more than 0.2 in position 3 of the motif, both T and A have probability greater than 0.2 in position 1, etc. Third, a rough approximation of the motif can be made by converting the multilevel consensus sequence into the Prosite signature [TA]-[TA]-A-T-[GT]-[T]-[GA]-A-[AGT]-C-[GAC]-A-[CGT]-[GAT]-T-C-A-C-A-[CAT]-[TA].

Occurrences of the Motif

MEME displays the occurrences (sites) of the motif in the training set. The sites are shown aligned with each other, and the ten sequence positions preceding and following each site are also shown. Each site is identified by the name of the sequence where it occurs, the strand (if both strands of DNA sequences are being used), and the position in the sequence where the site begins. When the DNA strand is specified, `+' means the sequence in the training set, and `-' means the reverse complement of the training set sequence. (For `-' strands, the `start' position is actually the position on the positive strand where the site ends.) The sites are listed in order of increasing p-value, that is, most statistically significant first. The p-value of a site is computed from the match score of the site with the position-specific scoring matrix for the

motif. The p-value gives the probability of a random string (generated from the background letter frequencies) having the same match score or higher. (This is referred to as the position p-value by the MAST algorithm.)

Block Diagrams of Motif Occurrences

The occurrences of the motif in the training set sequences are shown with MAST-style block diagrams. One diagram is printed for each sequence showing all the occurrences of the motif in that sequence. The sequences are sorted by the lowest p-value among all occurrences of the motif in a given sequence. (The p-value of an occurrence is the probability of a single random subsequence the length of the motif, generated according to the 0-order background model, having a score at least as high as the score of the occurrence.) When the DNA strand is specified, `+' means the motif appears from left to right on the sequence, and `-' means the motif appears from right to left on the complementary strand. A sequence position scale is shown at the end of each table of block diagrams. Very long sequences are shown with thick lines connecting the motifs and are not drawn to scale.

Motif in BLOCKS format or FASTA format

For use with BLOCKS tools, MEME prints the occurrences of the motif in BLOCKS format. You can convert these blocks to PSSMs (position-specific scoring matrices), LOGOS (color representations of the motifs) and phylogeny trees, and search them against a database of other blocks, by pasting everything from the "BL" line to the "//" line (inclusive) into the Multiple Alignment Processor. If you include the -print_fasta switch on the command line, MEME prints the motif sites in FASTA format instead of BLOCKS format.

Position-Specific Scoring Matrix

The position-specific scoring matrix corresponding to the motif is printed for use by database search programs such as MAST.
This matrix is a log-odds matrix calculated by taking the log (base 2) of the ratio p/f at each position in the motif, where p is the probability of a particular letter at that position in the motif, and f is the background frequency of the letter (given in the command line summary section). This is the same matrix that is used above in computing the p-values of the occurrences of the motif in the Occurrences of the Motif and Block Diagrams of Motif Occurrences sections. The scoring matrix is printed "sideways"--columns correspond to the letters in the alphabet (in the same order as shown in the simplified motif) and rows correspond to the positions of the motif, position one first. The scoring matrix is preceded by a line starting with "log-odds matrix:" and containing the length of the alphabet, width of the motif, number of characters in the training set and the scoring threshold used in the list of possible motif examples.

Position-Specific Probability Matrix

The motif itself is a position-specific probability matrix giving, for each position in the pattern, the probabilities of each possible letter occurring there. The probability matrix is printed "sideways"--columns correspond to the letters in the alphabet (in the same order as shown in the simplified motif) and rows correspond to the positions of the motif, position one first. The motif is preceded by a line starting with "letter-probability matrix:" and containing the length of the alphabet, width of the motif and number of characters in the training set.
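The log-odds calculation described under Position-Specific Scoring Matrix can be checked against the numbers printed for motif 1. The following is a minimal sketch, assuming (from the values in this file, not from the explanation text) that the printed scores are log2(p/f) multiplied by 100 and rounded to an integer. The helper name `log_odds_score` is hypothetical. The probabilities p come from the letter-probability matrix and the background frequencies f from the command line summary; because the printed background frequencies are rounded to three decimals, the recomputed scores can differ from the printed ones by a unit or two.

```python
import math


def log_odds_score(p, f, scale=100):
    """Scaled log-odds score: scale * log2(p / f), rounded to an integer.

    The scale factor of 100 is an assumption inferred from the sample
    output in this file, not something stated in the explanation text.
    """
    return round(scale * math.log2(p / f))


# (position, letter): (motif probability p, background frequency f, printed score)
examples = {
    (1, "R"): (0.897413, 0.052, 411),
    (2, "I"): (0.390698, 0.031, 367),
    (3, "L"): (0.845274, 0.052, 402),
    (4, "A"): (0.618100, 0.096, 268),
    (5, "E"): (0.532784, 0.077, 279),
}

for (pos, letter), (p, f, printed) in examples.items():
    print(pos, letter, "recomputed:", log_odds_score(p, f), "printed:", printed)
```

A positive score means the letter is more likely at that motif position than in the background, and a negative score means it is less likely.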