Sequence Database Search Techniques I: Blast and PatternHunter tools

Similar documents
Local Alignment Statistics

In-Depth Assessment of Local Sequence Alignment

EECS730: Introduction to Bioinformatics

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2,

L3: Blast: Keyword match basics

Tools and Algorithms in Bioinformatics

Basic Local Alignment Search Tool

Bioinformatics for Biologists

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Sequence Alignment Techniques and Their Uses

Optimization of a New Score Function for the Detection of Remote Homologs

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Tools and Algorithms in Bioinformatics

Substitution matrices

Single alignment: Substitution Matrix. 16 march 2017

Algorithms in Bioinformatics

An Introduction to Sequence Similarity ( Homology ) Searching

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Introduction to Bioinformatics

BLAST: Target frequencies and information content Dannie Durand

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Practical considerations of working with sequencing data

Bioinformatics and BLAST

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Computational Biology

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Whole Genome Alignments and Synteny Maps

Exercise 5. Sequence Profiles & BLAST

Scoring Matrices. Shifra Ben-Dor Irit Orr

DNA and protein databases. EMBL/GenBank/DDBJ database of nucleic acids

Sequence Comparison. mouse human

Alignment & BLAST. By: Hadi Mozafari KUMS

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Statistical Distributions of Optimal Global Alignment Scores of Random Protein Sequences

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Fundamentals of database searching

Foreword by. Stephen Altschul. An Essential Guide to the Basic Local Alignment Search Tool BLAST. Ian Korf, Mark Yandell & Joseph Bedell

Sequence analysis and comparison

Chapter 7: Rapid alignment methods: FASTA and BLAST

Large-Scale Genomic Surveys

BLAST: Basic Local Alignment Search Tool

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

BLAST. Varieties of BLAST

SUPPLEMENTARY INFORMATION

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Genomics and bioinformatics summary. Finding genes -- computer searches

Biologically significant sequence alignments using Boltzmann probabilities

Advanced topics in bioinformatics

Introduction to protein alignments

Heuristic Alignment and Searching

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University

A New Similarity Measure among Protein Sequences

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

SEQUENCE alignment is an underlying application in the

Similarity searching summary (2)

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Quantifying sequence similarity

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

Performing local similarity searches with variable length seeds

CSE : Computational Issues in Molecular Biology. Lecture 6. Spring 2004

Pairwise & Multiple sequence alignments

Pairwise sequence alignment

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Comparative genomics: Overview & Tools + MUMmer algorithm

Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are:

MPIPairwiseStatSig: Parallel Pairwise Statistical Significance Estimation of Local Sequence Alignment

Improved hit criteria for DNA local alignment

bioinformatics 1 -- lecture 7

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

Collected Works of Charles Dickens

Protein function prediction based on sequence analysis

BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University

Tutorial 4 Substitution matrices and PSI-BLAST

CSE 549: Computational Biology. Substitution Matrices

BIOINFORMATICS: An Introduction

Week 10: Homology Modelling (II) - HHpred

BLAT The BLAST-Like Alignment Tool

Text Algorithms (4AP) Biological measures of similarity. Jaak Vilo 2008 fall. MTAT Text Algorithms. Complete genomes

Scoring Matrices. Shifra Ben Dor Irit Orr

Sequence analysis and Genomics

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

BMC Bioinformatics. Open Access. Abstract

Generalized Affine Gap Costs for Protein Sequence Alignment

1.5 Sequence alignment

A Browser for Pig Genome Data

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

Bioinformatics Chapter 1. Introduction

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm

G4120: Introduction to Computational Biology

Homology Modeling. Roberto Lins EPFL - summer semester 2005

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

Handling Rearrangements in DNA Sequence Alignment

On Spaced Seeds for Similarity Search

Transcription:

Sequence Database Search Techniques I: Blast and PatternHunter tools Zhang Louxin National University of Singapore

Outline. Database search 2. BLAST (and filtration technique) 3. PatternHunter (empowered with spaced seeds) 4. Good spaced seeds

. Sequence database search Problem: Find all highly similar segments (called homologies) in the query sequence and sequences in a database, which are listed as local alignments. >gi 9785 gb AC060809.7 Homo sapiens chromosome 5, clone RP-404B3, complete sequence Length = 7223 Query: 6 g a a -- t t t t a c a c t t t c a a a -- g 36 Sbjct: 3078 g a a a t t t g a c a c t t t c a a a g g 3098

Local alignment Mathematically, the local alignment problem is to find a local alignment with maximum score. Smith-Waterman Algorithm --- dynamic programming algorithm --- output optimal local alignments --- quadratic time O(mn), and so not scalable.

Scalability is critical The genetic data grows exponentially. Nucleotides(billion) 8 7 6 5 4 3 2 0 980 985 990 995 2000 Years Genomes: Human, Mouse, Fly, and etc. To meet this demand, many programs were created: Blast (MegaBlast, WU- Blast, psi-blast), FSATA,SENSEI, MUMmer, BLAT, etc. 30 billions in 2005.

2. Blast Family Based on filtration technique.. Filtering stage: identify short matches of length k (=) in both query and target sequences. 2. Alignment stage: extend each match found in Stage into a gapped alignment, and report it if significance. Its running time is linear time c (m+n), where constant factor c depends on k.

ACTCATCGCTGATGCCCATCCTCACTTTAAAAATATATAGACTAGGGCATTGGGA GCAAAGGATTTACGCATTGATGCCCATCCTGCAGGCGACTAGGGCATTGG

Dilemma Increasing match size k speeds up the program, but loses sensitivity (i.e. missing homology region that are highly similar but do not contain k consecutive base matches). Decreasing size k gains sensitivity but loses speed. >gi 9785 gb AC060809.7 Homo sapiens chromosome 5, clone RP-404B3, complete sequence Length = 7223 Can the dilemma be solved? We need to have both sensitivity and speed. Query: 6 g a a -- t t t t a c a c t t t c a a a -- g 36 Sbjct: 3078 g a a a t t t g a c a c t t t c a a a g g 3098

Dilemma Increasing match size k speeds up the program, but loses sensitivity (i.e. missing homology regions that are highly similar but do not contain k consecutive base matches). Decreasing match size k gains sensitivity but loses speed. Can the dilemma be solved? We need to have both sensitivity and speed.

3. Spacing out matching positions --- PatternHunter s approach (Ma, Tromp, and Li, Bioinformatics, 2002) Filtering stage: looks for matches in k(=) noncontinous positions specified by an optimal pattern, for example, *** or several patterns. Such a pattern is called a spaced seed. Alignment stage: same as Blast. G C A A T T G C C G G A T C T T I I I I I I G C G A T T G C T G G C T C T A G C A A T T G C C G G A T C T T I I I I I G C G A T T G C T G G C T C T A

Simple idea makes a big difference A good spaced seed not only increases hits in homology regions, but also reduces running time. In a region of length 64 with similarity 70%, PH has probability of 0.466 to hit vs Blast 0.3, 50% increase. Time reduction comes from that the average number of matches found in Stage decreases. Adopted by BLASTZ, MegaBlast progams. Used by Mouse Genome Consortium.

Simple idea makes a big difference A good spaced seed not only increases hits in homology regions, but also reduces running time. In a region of length 64 with similarity 70%, PH has probability of 0.466 to hit vs Blast 0.3, 50% increase. Time reduction comes from that the average number of matches found in Stage decreases. Adopted by BLASTZ, MegaBlast progams. Used by Mouse Genome Consortium.

Not just spaced seed PatternHunter uses a variety of advanced data structures including priority queues, red-black tree, queues, hash tables. Several other algorithmic improvements.

Comparison with Blastn, MegaBlast (A slide from M. Li) On Pentium III 700MH, GB Blastn MB PH E.coli vs H.inf 76s 5s/56M 4s/68M Arabidopsis 2 vs 4 -- 2720s/087M 498s/280M Human 2 vs 22 -- -- 5250s/47M 6M vs 6M 3hr37m/700M 00M vs 35M 6m Human vs Mouse 20 days All with filter off and identical parameters 6M reads of Mouse genome against Human genome for Whiteheads & UCSC. Best Blast program takes 9 years at the same sensitivity (seed length ).

Questions Why is the PH seed better than Blast consecutive seed ( for weight ) of the same weight? PH spaced seed is `less regular than Blast seed; A random sequence should contain more `less regular patterns Are all spaced seeds better than Blast seed of the same weight? ****** is worse than Which spaced seeds are optimal? A difficult problem. No polynomial-time algorithm is known for finding them

4. Identifying Good Spaced Seed --- Ungapped alignment model Given two DNA sequences S, S with similarity p, we assume the events that they have a base-match at each position are jointly independent, each with probability p. Under this model, an ungapped alignment between S and S corresponds to a 0- random sequence S in which 0 and appear in each position with probability -p and p respectively.

Example G C A A T T G C C G G A T C T T I I I I I I I I I I I I G C G A T T G C T G G C T C T A Translate a match to and a mismatch to 0 0 0 0 0 If spaced seed *** is used, there are two seed matches in the alignment. G C A A T T G C C G G A T C T T I I I I I I G C G A T T G C T G G C T C T A G C A A T T G C C G G A T C T T I I I I I G C G A T T G C T G G C T C T A 0 0 0 0 0 0 0 0

Definitions Under the model of similarity p, (the sensitivity of a spaced seed Q) (the prob. of Q hitting a random 0- sequence of a fixed length N=64) Sensitivity depends on the similarity p if N is fixed. Optimal spaced seed is the one with largest sensitivity, over all the seeds of same weight..

Computing Sensitivity n Q. Consecutive Seed B= of weight w. Let S be a random 0-sequence of length n Let be the event of the seed B occurring at position k <= n. For n>=w+, A i ) )( ( ] [ ] ) [( ] [ 2 2 2 2 + = + = = = w n w n n n n n n n n n Q p p Q A A A A P Q A A A A A A A P A A A P Q L L L L Let n n Q Q = ) ( = w n w n n Q p p Q Q

tr

(-Sensitivity) grows exponentially with length Q n βλ (Buhler et al.) n Consecutive seed

Expected Number of Exact Matches Let Q be a spaced seed of weight w and length L. Under our model, the expected number E of the exact matches found in an ungapped alignment of length n is equal to the expected value of times T the seed Q occurs in an 0- random sequence of length n. Consider a length-n ungapped alignment with similarity p and hence a length-n 0- random sequence in which appears with probability p in each A position. Let j denote the event that seed Q occurs at position j ( ) and be the indicator function of A : I j j L j N Then I I j j A i A i = if the event occurs =0 if the event does not occur T = E( T ) = L j N I j w E( I = = + L j N j ) P[ A L j N j ] ( N L ) p

Good Spaced Seeds We identified good spaced seeds of weight from 9 to 8 in terms of their optimum span (i.e. the similarity interval in which the seed is optimal).

Experimental Validation We conducted genomic sequence comparison with PatternHunter on (i) H. influenza (.83Mbp) and E. coli (4.63Mbp) (ii) A.7Mbp segment in mouse ChX and a Mbp segment in human ChX (iii) A.3Mbp segment in mouse Ch0 and a 2Mbp segment in human Ch9 We evaluate a spaced seed by relative performance. (setting the performance of Blast seed as )

H. Influenza vs E. coli Threshold 35 Threshold 50.3.2 C P.3.2. G. 0.9 0.9 0.8 8 9 0 2 3 4 5 0.8 8 9 0 2 3 4 5 Threshold 70 Threshold 00.3.3.2.2.. 0.9 0.9 0.8 8 9 0 2 3 4 5 0.8 8 9 0 2 3 4 5

Human X vs Mouse X Threshold 35 Threshold 50.5.4.3.2. C P G.5.4.3.2. 0.9 0.9 0.8 8 9 0 2 3 4 5 0.8 8 9 0 2 3 4 5 Threshold 70 Threshold 00.4.3.3.2. 0.9 0.8.2. 0.9 0.8 0.7 8 9 0 2 3 4 5 0.7 8 9 0 2 3 4 5

Human 9 vs Mouse 0 Threshold 35 Threshold 50.8.7.6.5.4.3.2. 0.9 C P G 8 9 0 2 3 4 5.8.7.6.5.4.3.2. 0.9 8 9 0 2 3 4 5 Threshold 70 Threshold 00.8.7.6.5.4.3.2. 0.9 8 9 0 2 3 4 5.8.7.6.5.4.3.2. 0.9 8 9 0 2 3 4 5

Some Recommendations There are two competing seeds of weight Optimum Interval ******* (PH seed) [6%, 73%] ******* (Buhler et al) [74%, 96%] PH seed is good for distant homology search, while the latter one for aligning sequences with high similarity.

Recommendations (con t) Spaced seed ****** of weight 2 is probably the best for fast genomic database search 0.9 0.8 0.7 ****** 0.6 () faster and more sensitive than the Blast default seed weight (2) widest optimum interval [59%, 96%] 0.2 (3) good for aligning coding regions since 0. it contains 4 repeats of * in its 6-codon 0 span in a reverse direction: 0. 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 sensitivity 0.5 0.4 0.3 similarity * * ** * *

Recommendations (con t) The larger the weight of a spaced seed, the narrower its optimum interval. So, for database search, a larger weight spaced seed should be carefully selected.

References Altschul, S.F. et al. "Basic local alignment search tool." J. Mol. Biol. 990; 25:403-40. Altschul, S.F. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 997; 25:3389-3402. Ma, B., Tromp, J., Li, M., "PatternHunter: faster and more sensitive homology search", Bioinformatics 2002;8:440-5 Choi, K.P. Zeng F. and Zhang L.X., Good Spaced Seeds for Homology search, Bioinformatics 2004 (to appear). http://www.math.nus.edu.sg/~matzlx/papers/bio2003_05.pdf

Filtration-based Homology Search Algorithm Filtering stage: look for matches in k(=) noncontinous positions specified by a spaced seed such as *** or several seeds. Alignment stage: Extend matches found in above stage into an (ungapped or gapped) alignment. G C A A T T G C C G G A T C T T I I I I I I G C G A T T G C T G G C T C T A G C A A T T G C C G G A T C T T I I I I I G C G A T T G C T G G C T C T A Optimal spaced seeds have larger hitting prob., but smaller expected number of hits found in stage in an alignment.

. BLASTP Protein Sequence Database Search

Amino Acid Substitution Score Matrixes The theory is fully developed for scores used to find ungapped local alignments. Let F be a class of protein sequences in which amino acid i has q background frequency i, and each match of residue i verse residue j has target frequency. q i q ij q ij For finding local sequence comparison in F, up to a constant scaling factor, every appropriate substitution score matrix ( s ij ) is uniquely determined by {, }: s = ij qij ln( q q i j )

Idea: In scoring a local alignment, we would like to assess how strong the length-n alignment of x and y can be expected from chance alone Prob[ x and y have common ancestor] P= Prob[ x and y are aligned by chance] Let n y n y y y x x x x L L 2 2, = = = = n i y x y x n i y x n i y x i i i i i i i i q q q q q q P ) ( = n i y x y x i i i i q q q P ln ln

PAM and BLOSUM Substitution Matrices Hence, all substitution matrices are implicitly of log-odds form. But how to estimate the target frequencies? Different methods have been used for the task and result in two series of substitution matrices. PAM Matrices: the target frequencies were estimated from the observed residue replacements in closely related proteins within a given evolutionary distance (Dayhoff et al. 978). BLOSUM Matrices: the target frequencies were estimated from multiple alignments of distantly related protein regions directly (Henikoff & Henikoff, 992).

DNA vs. Protein Comparison If the sequences of interest are code for protein, it is almost always better to compare the protein translations than to compare the DNA sequences directly. The reason is () many changes in DNA sequences do not change protein, and (2) substitution matrices for amino acids represents more biochemical information.

Statistics of Local Ungapped Alignment It is well understood. The theory is based on the following simple alignment model: i). All the amino acid appear in each position independently with specific background probabilities. ii). the expected score for aligning a random pair of amino acid is required to be negative: p i, j 20 i p j s ij < 0

Statistics 2: E-value The BLAST program was designed to find all the maximal local ungapped alignments whose scores cannot be improved by extension or trimming. These are called high-scoring segment pairs (HSPs). In the limit of sufficiently large sequence lengths m and n, the expected number (E-value) of HSPs with score at least S is given by the formula: E = Kmne λs Where K and lambda can be considered as scales for the database size and the scoring system respectively.

Statistics 3: P-Value E-Value: E = Kmne λs The number of random HSPs with score >= S can be described by a Poisson distribution. This means that the probability of finding exactly x HSPs with score >=S is given by: e E x E x! By setting x=0, the probability of finding at least one such HSP is e E This is called the P-value associated with score S.

References. Karlin, S. & Altschul, S.F. (990) "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proc. Natl. Acad. Sci. USA 87:2264-2268. 2. Dembo, A., Karlin, S. & Zeitouni, O. (994) "Limit distribution of maximal nonaligned two-sequence segmental score." Ann. Prob. 22:2022-2039. 3. Dayhoff, M.O., Schwartz, R.M. & Orcutt, B.C. (978) "A model of evolutionary change in proteins." In "Atlas of Protein Sequence and Structure," Vol. 5, Suppl. 3 (ed. M.O. Dayhoff), pp. 345-352. Natl. Biomed. Res. Found., Washington, DC. 4. Henikoff, S. & Henikoff, J.G. (992) "Amino acid substitution matrices from protein blocks." Proc. Natl. Acad. Sci. USA 89:095-099. (PubMed)