BLAST: Basic Local Alignment Search Tool

Similar documents
Grundlagen der Bioinformatik, SS 08, D. Huson, May 2,

Chapter 7: Rapid alignment methods: FASTA and BLAST

Single alignment: Substitution Matrix. 16 march 2017

BLAST: Target frequencies and information content Dannie Durand

In-Depth Assessment of Local Sequence Alignment

Computational Biology

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

L3: Blast: Keyword match basics

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

CSE182-L7. Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding CSE182

Sequence Analysis '17 -- lecture 7

Sequence Database Search Techniques I: Blast and PatternHunter tools

Basic Local Alignment Search Tool

bioinformatics 1 -- lecture 7

Local Alignment Statistics

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Introduction to Bioinformatics Algorithms Homework 3 Solution

Sequence Comparison. mouse human

Bioinformatics and BLAST

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

CSE : Computational Issues in Molecular Biology. Lecture 6. Spring 2004

Algorithms in Bioinformatics

Define M to be a binary n by m matrix such that:

BLAST. Varieties of BLAST

Sequence analysis and Genomics

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Sequence Alignment Techniques and Their Uses

On-line String Matching in Highly Similar DNA Sequences

Fast profile matching algorithms A survey

An Introduction to Sequence Similarity ( Homology ) Searching

Introduction to Bioinformatics

Compressed Index for Dynamic Text

Quantifying sequence similarity

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

String Matching Problem

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

Lecture 18 April 26, 2012

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

String Matching with Variable Length Gaps

Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VCF Files On-line

Heuristic Alignment and Searching

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

Large-Scale Genomic Surveys

Efficient High-Similarity String Comparison: The Waterfall Algorithm

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Subset seed automaton

Scoring Matrices. Shifra Ben-Dor Irit Orr

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Tools and Algorithms in Bioinformatics

SEQUENCE alignment is an underlying application in the

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Sequence Bioinformatics. Multiple Sequence Alignment Waqas Nasir

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Optimal spaced seeds for faster approximate string matching

Optimal spaced seeds for faster approximate string matching

Searching Sear ( Sub- (Sub )Strings Ulf Leser

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Pattern Matching (Exact Matching) Overview

Lecture 1 : Data Compression and Entropy

Hierarchical Overlap Graph

EECS730: Introduction to Bioinformatics

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Bioinformatics for Biologists

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Module 9: Tries and String Matching

MEME - Motif discovery tool REFERENCE TRAINING SET COMMAND LINE SUMMARY

Pairwise sequence alignment

arxiv: v1 [cs.ds] 9 Apr 2018

Pairwise alignment. 2.1 Introduction GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Small-Space Dictionary Matching (Dissertation Proposal)

MPIPairwiseStatSig: Parallel Pairwise Statistical Significance Estimation of Local Sequence Alignment

Exercise 5. Sequence Profiles & BLAST

BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

Substitution matrices

Online Sorted Range Reporting and Approximating the Mode

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Alignment & BLAST. By: Hadi Mozafari KUMS

Motivating the need for optimal sequence alignments...

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

Lecture 5,6 Local sequence alignment

Performing local similarity searches with variable length seeds

arxiv: v2 [cs.ds] 16 Mar 2015

Lecture Notes: Markov chains

Bloom Filters, Minhashes, and Other Random Stuff

arxiv: v1 [cs.ds] 15 Feb 2012

Variable-length Intervals in Homology Search

Tools and Algorithms in Bioinformatics

Collected Works of Charles Dickens

Discovering Most Classificatory Patterns for Very Expressive Pattern Classes

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Hidden Markov Models

Multiseed Lossless Filtration

Fast String Kernels. Alexander J. Smola Machine Learning Group, RSISE The Australian National University Canberra, ACT 0200

Reducing storage requirements for biological sequence comparison

Transcription:

.. CSC 448 Bioinformatics Algorithms Alexander Dekhtyar.. (Rapid) Local Sequence Alignment BLAST BLAST: Basic Local Alignment Search Tool BLAST is a family of rapid approximate local alignment algorithms[2]. BLAST is usually used to match a single DNA sequence S to a database D = {D 1,... D N } of DNA sequences. Different variants of BLAST produce alignments for S and D represented in either an alphabet of nucleotides or an alphabet of amino acids (in fact, S and D may be in different alphabets!). BLAST outputs two things: Good alignments between the query string S and strings from D; A p-value: the estimate of probability that the reported alignment can occur by chance. The output of BLAST is usually sorted in ascending order by p-value (i.e., the lower the probability of a chance match, the better the local alignment is). BLAST in a Nutshell BLAST consists of three key procedures: 1. Rapid search for seed matches. On this stage, for each string D i D, any locations that can have a good local match are rapidly identified. (the specifics of this part is one of two things what makes BLAST different from other methods). 2. Completion of local matches. Seed matches are extended to form local alignments. (Different variations of BLAST have used different strategies for extending seed matches.). 3. Estimation of p-values. For each local alignment, the probability that it may occur by chance is estimated. The produced matches are sorted in ascending order by the p-value. (This is the second specialty of BLAST). 1

Rapid Search for Seed Matches Inputs. The problem involves the following inputs: alphabet Σ = {a 1..., a M }; string S = s 1...s n called query string; string T = t 1... t m called database string; substitution matrix Score : Σ Σ R; value τ, a similarity threshold, for seed alignments; an integer k << min(m, n), the length of the seed alignments. Seed alignments. A pair of substrings S i = s i...s i+k 1 and T j = t j... t j+k 1 of length k is called a seed alignment iff k 1 Score(S i, T j ) = Score[s i+l, t j+l ] τ. l=0 Problem: given the inputs above, find all pairs (i, j), such that (S i, T j ) is a seed alignment. Do it fast! Idea. Σ is a constant-length alphabet. In BLAST, k - the length of a seed alignment is a constant. For Σ = {A, T, C, G} (alphabet of nucleotides), k is usually set to be in the range between 9 and 12 (a common value is 11). For Σ = the amino acid alphabet, k = 3. BLAST uses the following key observations: Observation 1: The number of all possible strings of size k in alphabet Σ, Σ k = M k is a constant! Observation 2: For each k-tuple V = v 1...v k, the set of all k-tuples W 1,...,W s, such that Score[V, W i ] τ can be precomputed in advance in constant time! Note: In fact, given k, BLAST precomputes the matrix Score k of similarity scores between all pairs of k-tuples from Σ : such that Score k : Σ k Σ k R, Score k [v 1... v k, w 1... w k ] = k Score[v i, w i ]. Then, based on the input value of τ, for each k-tuple V, BLAST computes the list Neighbors(V ) of all k-tuples W 1,...W s, such that Score k [V, W i ] τ. i=1 2

Rapid Search Procedure. String S = s 1... s n has n k + 1 k-tuples: S 1 = s 1...s k 1 S 2 = s 2...s k... S n k+1 = s n k+1...s n The rapid seed match search proceeds as follows: 1. For each S i, retrieve Neighbors(S i ). 2. Construct set Neighbors(S) = n k+1 i=1 Neighbors(S i ). 3. For each string V Neighbors(S), search for all occurrences of V in T. 4. Report a seed match between each S i, such that V Neighbors(S i ) and each T j, such that T j = V. Why this works. Step 3 in the procedure above is a repeated search for exact matches between a k-tuple V and substrings of T. This can be done in time linear in m (length of T). Because the size of Neighbors(S) is constant (it is less than Σ k ) 1, Step 3 takes O(m) time. In practice, the rapid matches can be done in a number of ways: Suffix trees. A suffix tree for T can be precomputed. If k is known in advance, a traversal of the suffix tree can be used to label all internal nodes with node-paths of size k (or the next closest size) with the list of leaf nodes in the subtree. This suffix tree can be used to efficiently produce the list of seed matches, when searches for k-tuples from Neighbors(S). Indexing. An index of k-tuple occurrences in T can be precomputed in advance and stored in an easy-to-access structure - e.g., in a hashmap. Given a k-tuple V from Neighbors(S), the list of all its occurrences is retrieved in O(1) time from the index structure. Aho-Corasick algorithm. Aho-Corasick algorithm[1] solves the problem of searching for a set V = {V 1,..., V s } of exact matches in a string T in time O(n + m + z) where, n is the length of all strings from V, m is the length of T and z is the total number of matches found. The algorithm works by efficiently representing the collection of strings V as a keyword tree with backlinks the provide for efficient navigation when mismatches are found. The algorithm itself uses T to traverse the keyword tree for V. Each time a match is found, the keyword tree is navigated following a tree path. Each time, there is a mismatch, a sequence of fail jumps occurs. Aho-Corasick algorithm can be used to match any string T and the keyword tree constructed from the set Neighbors(S). By the numbers. What does it take to precompute the necessary structures? 1 Although it can be a rather large number. 3

Amino Acid alphabet. For the alphabet of amino acids we have: Σ = 21 (20 amino acids plus the stop codon). k = 3. This is the usual length used in BLAST. Substitution matrices used. BLOSUM62 is typically used. Other matrices in the BLOSUM family can be used. PAM family is used but not as commonly. Total number of combinations: 21 3 = 9261. Size of the Score k matrix: 9261 9261 = 85, 766, 121 (85+ million). Nucleotide alphabet. Things are a bit more interesting : Σ = 4. k {9, 10, 11, 12}. Usualy value is k = 11. Substitution matrices used. Usually, Score[X, X] = 5, Score = [X, Y ] = 4 for X Y is used. Total number of combinations: 4 1 1 = 2 2 2 = 4, 194, 304 (over four million). Size of Score k matrix: 4 1 1 4 1 1 = 2 4 4 = 17, 592, 186, 044, 416 (over 17.5 trillion). Completion of Local Matches Step 2 of BLAST. Once seed matches are found, each of them needs to be extended to the best/longest possible match. Different BLAST versions differ on how this step is handled. In the original BLAST algorithm[2], the process of extension was as follows: 1. For each seed match (i, j) reported on Step 1 of BLAST: extend it in both directions for as long as the score of the new match is above the threshold τ. stop, when the match cannot be extended on either side without its score falling below τ. Report the computed match. Variants. The origial version did not create local alignments with gaps. Subsequent versions of BLAST improved on this process in a number of ways: Gapped alignments. When extending the seed matches use a substitution matrix and and indel score δ to score the alignments. Filter out seed matches. Only extend seed matches which show up in pairs on the same diagonal within a given number A of positions. This variant filtered out a large number of seed matches and improved the performance of BLAST. 4

Computing p-values of Alignments Poisson distribution. A discrete random variable X has Poisson distribution with the mean (expected) value λ if the probability P(X = k) is P(X = k) = λk e λ Intuition. Consider a certain, low-probability event α that can occur with probability p at each moment of time. Consider a sequence of n independent trials for α. Let X be a discrete random variable that counts, how many times α occurs. When p is reasonable and n is relatively small, the probability that α occurs exactly k n times can be described exactly using the binomial distibution: P(X = k) = p k (1 p) n k. However, when p is very small, but n is very large, binomial distibution is not convenient to use. Poisson distribution is an approximation of the binomial distibution in such a situation. Given n trials, the expected number of times α occurs is np. If n is very large and p is very small, np may be a mid-range number. Variable X will have Poisson distribution with the expected value λ = np. k! Match by chance. Consider a query string S and a database string T, for which BLAST returns a local alignment with a score τ τ. In general, two DNA sequences have a good local alignment if they are related and/or serve similar purposes. So, what is the probability that this local alignment of S and T occurred by chance? Simple example. Let (i, j) be a pair of random positions in S and T. Let p be the probability that two letters occurring at random positions of two strings match 2. An exact match of length k starting at position i in S and j in T, then has the following probability of happening by random chance: p = (1 p)p k. (Here, 1 p is responsible for the match starting at s i and t j. This means that s i 1 t j 1, which has the probability of 1 p). There are nm possible alignments of S and T. Therefore, the expected number of random alignments of two strings of length k in S and T is λ = nm(1 p)p k. The number of random alignments is a random variable with Poisson distribution with the expected value of λ. 2 For the nucleotide alphabet this probability, under the assumption of uniform distribution of nucleotides in the DNA strings is p = 1 4 4 = 1. For the alphabet of amino acids, this 16 probability is p = 1 21 21 = 1 441. 5

Altschul-Dembo-Karlin statistics. In BLAST, the match between two substrings does not have to be exact, in order to qualify for a good local alignment, so the math is a bit more complex. The Altschul-Dembo-Karlin statistic estimates the expected number of such matches in a pair of strings S = s 1... s n and T = t 1... t m as E(τ) = Kmne λ τ, where: K is constant. τ is the similarity threshold. λ is the positive root of the following equation: p s p t e λ Score(s,t) = 1. s Σ t Σ Here p s and p t are the frequencies of characters s and t. The probability that there is a match of a score greater than τ between two random subsequences of S and T is P = 1 e E(τ). From these statistics, the p-values for each alignment are computed. References [1] Alfred Aho, Margaret Corasick (1975) Efficient String Matching: an Aid to Bibliographic Search, Communications of the ACM, Vol. 18, No. 6, pp. 333-340. [2] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, David J. Lipman (1990), Journal of Molecular Biology, Vol. 215, pp. 403 410. 6