L3: Blast: Keyword match basics


L3: Blast: Keyword match basics Fa05 CSE 182

Silly Quiz TRUE or FALSE: In New York City at any moment, there are 2 people (not bald) with exactly the same number of hairs!

Assignment 1 is online. Due 10/6 (Thursday) in class.

An O(nm) algorithm for score computation
For i = 1 to n
  For j = 1 to m
    $$S[i,j] = \max \begin{cases} S[i-1,j-1] + C(s_i,t_j) \\ S[i-1,j] + C(s_i,-) \\ S[i,j-1] + C(-,t_j) \end{cases}$$
Here "-" denotes a gap. The iteration ensures that all values on the right are computed in earlier steps. How much space do you need?

Alignment?
$$S[i,j] = \max \begin{cases} S[i-1,j-1] + C(s_i,t_j) \\ S[i-1,j] + C(s_i,-) \\ S[i,j-1] + C(-,t_j) \end{cases}$$
Is O(nm) space too much? What if the query and database are each 1 Mbp?

Alignment (Linear Space)
Score computation
For i = 1 to n
  For j = 1 to m
    $$S[i,j] = \max \begin{cases} S[i-1,j-1] + C(s_i,t_j) \\ S[i-1,j] + C(s_i,-) \\ S[i,j-1] + C(-,t_j) \end{cases}$$
Keep only two rows, indexed by $i_2 = i \bmod 2$ and $i_1 = (i-1) \bmod 2$:
    $$S[i_2,j] = \max \begin{cases} S[i_1,j-1] + C(s_i,t_j) \\ S[i_1,j] + C(s_i,-) \\ S[i_2,j-1] + C(-,t_j) \end{cases}$$
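The two-row trick can be sketched in Python as follows. This is only an illustration: the cost function `score` (match +1, mismatch -1) and the gap cost of -1 are assumptions chosen for the example, not values given on the slides.

```python
def score(a, b, match=1, mismatch=-1):
    """Hypothetical substitution cost C(a, b); the values are assumptions."""
    return match if a == b else mismatch


def global_score_two_rows(s, t, gap=-1):
    """Global alignment score of s and t, storing only two rows of S."""
    n, m = len(s), len(t)
    # rows[i % 2] holds row i of S; rows[(i - 1) % 2] holds row i - 1.
    rows = [[0] * (m + 1) for _ in range(2)]
    for j in range(1, m + 1):
        rows[0][j] = rows[0][j - 1] + gap  # row 0: t[:j] aligned to gaps
    for i in range(1, n + 1):
        cur, prev = rows[i % 2], rows[(i - 1) % 2]
        cur[0] = prev[0] + gap  # column 0: s[:i] aligned to gaps
        for j in range(1, m + 1):
            cur[j] = max(prev[j - 1] + score(s[i - 1], t[j - 1]),
                         prev[j] + gap,
                         cur[j - 1] + gap)
    return rows[n % 2][m]


print(global_score_two_rows("ACGT", "AGT"))
```

Only the score survives; the alignment itself is lost because earlier rows are overwritten, which is the problem the following slides address.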

Linear Space
In linear space, we can compute each row of the D.P. table, but the earlier rows are discarded. We still need to recover the optimum path from the origin (0,0) to (n,m).

Linear Space (cont'd)
At i = n/2, we know the scores of all the optimal paths ending at that row. Define F[j] = S[n/2, j]. One of these j is on the true path. Which one?

Backward alignment
Let $S^b[i,j]$ be the optimal score of aligning s[i..n] with t[j..m]. Define $B[j] = S^b[n/2, j]$. One of these j is on the true path. Which one?

Forward, Backward computation
At the optimal coordinate j*, F[j*] + B[j*] = S[n,m]; for every other j, F[j] + B[j] <= S[n,m]. In O(nm) time and O(m) space, we can compute one of the coordinates on the optimum path.

Linear Space Alignment
Align(1..n, 1..m)
  For all 1 <= j <= m: compute F[j] = S(n/2, j)
  For all 1 <= j <= m: compute B[j] = S^b(n/2, j)
  j* = argmax_j { F[j] + B[j] }
  X = Align(1..n/2, 1..j*)
  Y = Align(n/2..n, j*..m)
  Return X, j*, Y
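The midpoint step of this procedure can be sketched as below. The scoring values (match +1, mismatch -1, gap -1) and the helper names `last_row` and `midpoint` are assumptions made for illustration. The sketch finds j* only; the recursive calls on the two halves are left out.

```python
def last_row(s, t, gap=-1, match=1, mismatch=-1):
    """Last row of the global-alignment table for s vs t, in O(len(t)) space."""
    prev = [j * gap for j in range(len(t) + 1)]
    for a in s:
        cur = [prev[0] + gap]
        for j, b in enumerate(t, start=1):
            sub = match if a == b else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev


def midpoint(s, t):
    """Return (j*, score): where an optimal path crosses row len(s) // 2."""
    half = len(s) // 2
    m = len(t)
    forward = last_row(s[:half], t)               # F[j]: s[:half] vs t[:j]
    backward = last_row(s[half:][::-1], t[::-1])  # scores on reversed suffixes
    # B[j] = score of s[half:] vs t[j:], which is backward[m - j].
    totals = [forward[j] + backward[m - j] for j in range(m + 1)]
    j_star = max(range(m + 1), key=totals.__getitem__)
    return j_star, totals[j_star]


print(midpoint("ACGTACGT", "ACGTACGT"))
```

The returned score equals the full alignment score S[n,m], which gives a quick sanity check against `last_row(s, t)[-1]`.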

Linear Space complexity
$T(nm) = c \cdot nm + T(nm/2) = O(nm)$
Space = $O(m)$
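The recurrence can be unrolled as a geometric series. The step that makes this work is that the two recursive subproblems together cover half the area of the table, whatever the value of j*:

```latex
\frac{n}{2}\, j^* + \frac{n}{2}\,(m - j^*) = \frac{nm}{2}
\qquad\Longrightarrow\qquad
T(nm) = c\,nm \left(1 + \tfrac{1}{2} + \tfrac{1}{4} + \cdots\right) \le 2c\,nm = O(nm).
```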

Why is Blast Fast?

Large database search
Database (n), Query (m)
Database size n = 10M ($10^7$), query size m = 300.
$O(nm) = 3 \cdot 10^9$ computations

Observations
Much of the database is random from the query's perspective. Consider a random DNA string of length n, with Pr[A] = Pr[C] = Pr[G] = Pr[T] = 0.25. Assume for the moment that the query is all 1's (length m). What is the probability that an exact match to the query can be found?

Basic probability
Probability that there is a match starting at a fixed position i = $0.25^m$. What is the probability that some position i has a match? Dependencies confound probability estimates.

Basic Probability: Expectation
Q: Toss a coin: each time it comes up heads, you get a dollar. What is the money you expect to get after n tosses?
Let $X_i$ be the amount earned in the i-th toss:
$$E(X_i) = 1 \cdot p + 0 \cdot (1-p) = p$$
Total money you expect to earn:
$$E\Big(\sum_i X_i\Big) = \sum_i E(X_i) = np$$

Expected number of matches
The expected number of matches can still be computed. Let $X_i = 1$ if there is a match starting at position i, and $X_i = 0$ otherwise.
Pr(Match at position i) = $p_i = 0.25^m$
$E(X_i) = p_i = 0.25^m$
Expected number of matches:
$$E\Big(\sum_i X_i\Big) = \sum_i E(X_i) = n \left(\tfrac{1}{4}\right)^m$$

Expected number of exact matches is small!
Expected number of matches = $n \cdot 0.25^m$
If $n = 10^7$, $m = 10$: expected number of matches = 9.537
If $n = 10^7$, $m = 11$: expected number of hits = 2.38
If $n = 10^7$, $m = 12$: expected number of hits = 0.6 < 1
Bottom line: an exact match to a substring of the query is unlikely just by chance.
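These figures are easy to reproduce. The short sketch below uses the slide's approximation of n start positions (rather than n - m + 1); the function name is made up for the example.

```python
def expected_exact_matches(n, m, p=0.25):
    """Expected number of chance exact matches of a length-m word in n positions."""
    return n * p ** m


for m in (10, 11, 12):
    print(m, round(expected_exact_matches(10**7, m), 3))
```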

Observation 2
What is the pigeonhole principle? Suppose we are looking for a database string with greater than 90% identity to the query (length 100), that is, fewer than 10 mismatches. Partition the query into 10 substrings of size 10. At least one must match the database string exactly.
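A small experiment can make the pigeonhole argument concrete. It assumes substitutions only (no insertions or deletions), so that query and subject positions line up; the helper names are hypothetical.

```python
import random


def has_exact_block(query, subject, block=10):
    """True if some aligned, non-overlapping block of the query matches exactly."""
    return any(query[i:i + block] == subject[i:i + block]
               for i in range(0, len(query), block))


def mutate(seq, n_mismatches, rng):
    """Copy of seq with exactly n_mismatches substituted positions."""
    out = list(seq)
    for pos in rng.sample(range(len(seq)), n_mismatches):
        out[pos] = rng.choice([c for c in "ACGT" if c != seq[pos]])
    return "".join(out)


rng = random.Random(0)
query = "".join(rng.choice("ACGT") for _ in range(100))
# 9 mismatches spread over 10 blocks: at least one block is untouched.
print(all(has_exact_block(query, mutate(query, 9, rng)) for _ in range(1000)))
```

With 10 mismatches the guarantee disappears, since each block can receive exactly one.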

Why is this important?
Suppose we are looking for sequences that are 80% identical to the query sequence of length 100. Assume that the mismatches are randomly distributed. What is the probability that there is no stretch of 10 bp where the query and the subject match exactly?
$$\left(1 - \left(\tfrac{8}{10}\right)^{10}\right)^{90} = 0.000036$$
Rough calculations show that it is very low, so the probability of an exact match of a short query substring to a truly similar subject is very high. The above equation does not take dependencies into account. Reality is better because the matches are not randomly distributed.
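The rough calculation can be checked directly. Like the slide, this sketch treats the 90 windows as independent, which they are not; the function and parameter names are assumptions.

```python
def prob_no_exact_window(identity=0.8, w=10, windows=90):
    """Probability that none of `windows` length-w windows matches exactly,
    treating the windows as independent (an approximation)."""
    return (1 - identity ** w) ** windows


print(f"{prob_no_exact_window():.6f}")
```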

Just the Facts Consider the set of all substrings of the query string of fixed length W. Prob. of exact match to a random database string is very low. Prob. of exact match to a true homolog is very high. Keyword Search (exact matches) is MUCH faster than sequence alignment

BLAST
Database (n), Query (m)
Consider all (m - W + 1) query words of size W (default = 11).
Scan the database for exact matches to all such words.
For all regions that hit, extend using a dynamic programming alignment.
Can be many orders of magnitude faster than SW over the entire string.
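The word-scanning step can be sketched as follows. This is a toy illustration of the idea and not BLAST's actual implementation; the function names are hypothetical, and the extension step is omitted.

```python
from collections import defaultdict


def query_words(query, w):
    """Map each length-w word of the query to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index


def find_seed_hits(query, database, w=11):
    """Return (query_pos, db_pos) pairs where a length-w word matches exactly."""
    index = query_words(query, w)
    hits = []
    for j in range(len(database) - w + 1):
        for i in index.get(database[j:j + w], ()):
            hits.append((i, j))
    return hits


print(find_seed_hits("ACGTAC", "TTACGTTT", w=4))
```

Each hit would then be handed to an extension routine restricted to the region around it.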

Why is BLAST fast?
Assume that keyword searching does not consume any time and that alignment computation is the expensive step.
Query m = 1000, random Db n = $10^7$, no TP
SW = O(nm) = $1000 \cdot 10^7 = 10^{10}$ computations
BLAST, W = 11: E(#11-mer hits) = $1000 \cdot (1/4)^{11} \cdot 10^7 = 2384$
Number of computations = $2384 \cdot 100 \cdot 100 = 2.384 \cdot 10^7$
Ratio = $10^{10} / (2.384 \cdot 10^7) = 420$
Further speed improvements are possible.
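The estimate can be reproduced in a few lines. The 100 x 100 extension region per hit is the slide's assumption, and the function name is invented for the example. Note that n and m cancel, so the ratio is simply 4^W / region^2.

```python
def blast_speedup(m=1000, n=10**7, w=11, region=100):
    """Ratio of full Smith-Waterman cost to the cost of extending chance hits."""
    sw_cost = n * m
    expected_hits = m * n * 0.25 ** w
    blast_cost = expected_hits * region * region
    return sw_cost / blast_cost


print(round(blast_speedup()))
```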

Keyword Matching
How fast can we match keywords?
Hash table/db index? What is the size of the hash table for word length 11?
Suffix trees? What is the size of the suffix trees?
Trie-based search. We will do this in class.
(Figure: an index entry associating the keyword AATCA with the value 567.)
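A trie-based keyword search can be sketched as below. This is the naive version that restarts from the root at every text position, not the failure-link automaton; the nested-dict representation and the "$" end marker are assumptions, and "$" must not occur in the text or keywords.

```python
def build_trie(keywords):
    """Build a trie as nested dicts; the key "$" stores a completed keyword."""
    root = {}
    for word in keywords:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word
    return root


def search(text, trie):
    """Return (start, keyword) for every keyword occurrence in text."""
    hits = []
    for start in range(len(text)):
        node, pos = trie, start
        while pos < len(text) and text[pos] in node:
            node = node[text[pos]]
            pos += 1
            if "$" in node:
                hits.append((start, node["$"]))
    return hits


print(search("AATCAATC", build_trie(["ATC", "AATC"])))
```

The worst-case cost is the text length times the longest keyword; avoiding the restarts is what the more refined automaton constructions provide.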

Related notes
How to choose the alignment region? Extend greedily until the score falls below a certain threshold.
What about protein sequences? Default word size = 3, and mismatches are allowed.
Like sequences, BLAST has been evolving continuously: banded alignment; seed selection; scanning for exact matches, keyword search versus database indexing.
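One common way to realize greedy extension is an "X-drop" rule: keep extending while the running score stays within X of the best score seen so far. The sketch below is ungapped and extends to the right only; the scoring values, the `x_drop` default, and the function name are assumptions for illustration, not parameters taken from the slides.

```python
def extend_right(query, subject, qpos, spos, match=1, mismatch=-1, x_drop=3):
    """Extend an ungapped alignment rightwards from (qpos, spos).

    Returns (best_score, length) for the best-scoring extension found before
    the running score dropped x_drop or more below the best score.
    """
    score = best = 0
    length = best_len = 0
    while qpos + length < len(query) and spos + length < len(subject):
        same = query[qpos + length] == subject[spos + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= x_drop:
            break
    return best, best_len


print(extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0))
```

A leftward extension is symmetric, and the two scores add up to the score of the ungapped segment around the seed.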

P-value computation
How significant is a score? What happens to significance when you change the score function?
A simple empirical method: compute a distribution of scores against a random database. Use an estimate of the area under the curve to get the probability. OR, fit the distribution to one of the standard distributions.

Z-scores for alignment
Initial assumption was that the scores followed a normal distribution.
Z-score computation: for any alignment with score S, shuffle one of the sequences many times and recompute the alignment. Get the mean $\mu$ and standard deviation $\sigma$ of the shuffled scores.
$$Z_S = \frac{S - \mu}{\sigma}$$
Look up a table to get a P-value.
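The shuffling procedure can be sketched as follows. The local scoring scheme (match +1, mismatch -1, gap -1), the number of shuffles, and the function names are assumptions; the sketch also returns the empirical fraction of shuffled scores that reach the observed score.

```python
import random
import statistics


def local_score(s, t, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score, keeping one row at a time."""
    best = 0
    prev = [0] * (len(t) + 1)
    for a in s:
        cur = [0]
        for j, b in enumerate(t, start=1):
            sub = match if a == b else mismatch
            cur.append(max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


def shuffle_z_score(s, t, shuffles=200, seed=0):
    """Return (Z, empirical_p) for the score of s vs t against shuffles of t."""
    rng = random.Random(seed)
    observed = local_score(s, t)
    letters = list(t)
    null_scores = []
    for _ in range(shuffles):
        rng.shuffle(letters)
        null_scores.append(local_score(s, "".join(letters)))
    mu = statistics.mean(null_scores)
    sigma = statistics.stdev(null_scores)
    if sigma == 0:
        raise ValueError("shuffled scores have zero variance")
    empirical_p = sum(x >= observed for x in null_scores) / shuffles
    return (observed - mu) / sigma, empirical_p
```

Shuffling preserves the composition of the sequence, so the null scores reflect chance similarity between sequences with the same letter frequencies.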

Blast E-value
Initial (and natural) assumption was that scores followed a normal distribution.
In 1990, Karlin and Altschul showed that ungapped local alignment scores follow an extreme value distribution whose tail decays exponentially.
Practical consequence: longer tail. Previously significant hits are now not so significant.

Exponential distribution
Random database, Pr(1) = p. What is the expected number of hits to a sequence of k 1's?
$$(n-k)\,p^k \approx n\,e^{k \ln p} = n\,e^{-k \ln \frac{1}{p}}$$
Instead, consider a random binary matrix. Expected number of diagonals of k 1's:
$$\Lambda = (n-k)(m-k)\,p^k \approx nm\,e^{k \ln p} = nm\,e^{-k \ln \frac{1}{p}}$$

As you increase k, the number decreases exponentially. The number of diagonal runs of k 1's can be approximated by a Poisson process:
$$\Pr[u] = \frac{\Lambda^u e^{-\Lambda}}{u!}, \qquad \Pr[u > 0] = 1 - e^{-\Lambda}$$
In ungapped alignments, we replace the coin tosses by column scores, but the behaviour does not change (Karlin & Altschul). As the score increases, the number of alignments that achieve the score decreases exponentially.
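The Poisson approximation is straightforward to evaluate numerically. The sketch below computes $\Lambda$ for a random binary matrix and the chance of seeing at least one diagonal run; the function names and the example sizes are assumptions.

```python
import math


def expected_diagonal_runs(n, m, k, p):
    """Lambda: expected number of diagonal runs of k 1's in an n x m matrix."""
    return (n - k) * (m - k) * p ** k


def poisson_pmf(u, lam):
    """Pr[u] for a Poisson variable with mean lam."""
    return lam ** u * math.exp(-lam) / math.factorial(u)


def prob_at_least_one(lam):
    """Pr[u > 0] = 1 - exp(-Lambda)."""
    return 1.0 - math.exp(-lam)


lam = expected_diagonal_runs(n=1000, m=1000, k=10, p=0.25)
print(lam, prob_at_least_one(lam))
```

Increasing k by one multiplies $\Lambda$ by roughly p, which is the exponential decay the slide refers to.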

Blast E-value
Choose a score such that the expected score between a pair of residues is < 0.
Expected number of alignments with a particular score:
$$E = Kmn\,e^{-\lambda S} = mn\,2^{-\frac{\lambda S - \ln K}{\ln 2}}$$
$$\Pr(S \ge x) = 1 - e^{-Kmn\,e^{-\lambda x}}$$
For small values, E-value and P-value are the same.
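The relationship between raw score, bit score, E-value and P-value can be sketched as below. The values of `lam` and `K` in the example are illustrative placeholders only; in practice they depend on the scoring system and are estimated separately.

```python
import math


def e_value(score, m, n, lam, K):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * score)


def bit_score(score, lam, K):
    """Normalized score S' = (lambda * S - ln K) / ln 2, so that E = m*n*2^-S'."""
    return (lam * score - math.log(K)) / math.log(2)


def p_value(score, m, n, lam, K):
    """P = 1 - exp(-E), computed with expm1 for accuracy when E is small."""
    return -math.expm1(-e_value(score, m, n, lam, K))


# Placeholder parameters, chosen only to show the arithmetic.
lam, K, m, n = 0.318, 0.134, 300, 10**7
E = e_value(100, m, n, lam, K)
print(E, p_value(100, m, n, lam, K), bit_score(100, lam, K))
```

For small E, 1 - exp(-E) is approximately E, which is why the two values coincide for significant hits.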

Blast Variants
1. What is mega-blast? Longer seeds.
2. What is discontiguous megablast? Seeds with don't-care values.
3. Phi-Blast/Psi-Blast? Later.
4. BLAT? Database pre-processing.
5. PatternHunter? Seeds with don't-care values.
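Seeds with don't-care values (spaced seeds) can be sketched by hashing only the "care" positions of each window. The mask below is a made-up example, not the seed used by any particular tool, and the function name is hypothetical.

```python
def spaced_seed_hits(query, subject, mask="1101011"):
    """Return (query_pos, subject_pos) pairs matching at every '1' of the mask.

    In the mask, '1' means the position must match and '0' means don't care.
    """
    care = [i for i, c in enumerate(mask) if c == "1"]
    span = len(mask)
    index = {}
    for i in range(len(query) - span + 1):
        key = "".join(query[i + k] for k in care)
        index.setdefault(key, []).append(i)
    hits = []
    for j in range(len(subject) - span + 1):
        key = "".join(subject[j + k] for k in care)
        hits.extend((i, j) for i in index.get(key, ()))
    return hits


# The two strings differ only at don't-care positions 2 and 4.
print(spaced_seed_hits("ACGTACG", "ACTTCCG"))
```

A contiguous seed of the same span (mask "1111111") would miss this pair, which illustrates why spaced seeds can be more sensitive at the same weight.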

Keyword Matching
(Figure: a keyword trie containing words such as POTATO, scanned against the text POTASTPOTATO using two pointers, l and c.)