CSE182-L7 Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding 10-07 CSE182
Bell Labs Honors Pattern matching
Just the Facts Consider the set of all substrings of the query string of fixed length W. Prob. of exact match to a random database string is very low. Prob. of exact match to a true homolog is very high. Keyword Search (exact matches) is MUCH faster than sequence alignment 10/28/14 CSE182
Speeding up via an exact-match heuristic
  Consider a query string of length m and a db string of length n.
  Start by looking for exact matches of keywords of length W between the query and the database string.
  Wherever there is an exact match, perform a SW local alignment.
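The keyword step above can be sketched as follows. This is a minimal sketch, assuming a plain hash index over the query's W-mers; the name `find_seeds` is hypothetical, and the SW extension around each seed is left out.

```python
from collections import defaultdict

def find_seeds(query, db, w):
    """Return (query_pos, db_pos) pairs where a length-w keyword matches exactly."""
    # Index every length-w substring of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    # Scan the database once and look each length-w window up in the index.
    seeds = []
    for j in range(len(db) - w + 1):
        for i in index.get(db[j:j + w], ()):
            seeds.append((i, j))
    return seeds
```

Each returned pair is a candidate position around which a local alignment would then be computed.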
Why is BLAST fast?
  Assume that keyword searching does not consume any time and that alignment computation is the expensive step.
  Query m = 1000, random Db n = 10^7, no TP.
  SW = O(nm) = 1000 * 10^7 = 10^10 computations.
  BLAST, W = 11: E(#11-mer hits) = 1000 * (1/4)^11 * 10^7 = 2384.
  Number of computations = 2384 * 100 * 100 = 2.384 * 10^7 (a 100 x 100 alignment around each hit).
  Ratio = 10^10 / (2.384 * 10^7) ≈ 420.
  Further speed improvements are possible.
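The arithmetic on this slide can be reproduced with a few lines. This is a sketch under the slide's assumptions (uniform 4-letter alphabet, free keyword search); the function name and the `window` parameter for the 100 x 100 alignment per hit are illustrative.

```python
def blast_speedup(m=1000, n=10**7, w=11, window=100, alphabet=4):
    """Compare full SW cost with the cost of aligning only around keyword hits."""
    sw_cost = m * n                                  # O(nm) cells for Smith-Waterman
    expected_hits = m * n * (1 / alphabet) ** w      # chance w-mer matches in random data
    blast_cost = expected_hits * window * window     # one small alignment per hit
    return sw_cost, expected_hits, blast_cost, sw_cost / blast_cost

sw_cost, hits, blast_cost, ratio = blast_speedup()
print(round(hits), round(ratio))
```

Changing `w` shows how quickly the expected number of random hits, and with it the alignment work, falls as the word size grows.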
Keyword (Dictionary) Matching
  How fast can we match keywords?
  Hash table/db index? What is the size of the hash table for keywords of length 11?
  Suffix trees? What is the size of the suffix trees?
  Trie-based search. We will do this in class.
  [Figure: index entry AATCA, 567]
The last step in Blast
  We have discussed: alignments, Db filtering using keywords, scoring matrices, E-values and P-values.
  The last step: database filtering requires us to scan a large sequence fast for matching keywords.
Dictionary Matching
  Dictionary: 1:POTATO 2:POTASSIUM 3:TASTE
  Database: P O T A S T P O T A T O
  Q: Given k words (s_i has length l_i) and a database of size n, find all matches to these words in the database string.
  How fast can this be done?
Dict. Matching & string matching
  How fast can you do it if you only had one word of length m?
    Trivial algorithm: O(nm) time.
    Pre-processing O(m), search O(n) time.
  Dictionary matching:
    Trivial algorithm: O((l_1 + l_2 + l_3) n).
    Using a keyword tree: O(l_p n) (l_p is the length of the longest pattern).
    Aho-Corasick: O(n) after preprocessing O(l_1 + l_2 + ...).
  We will consider the most general case.
Direct Algorithm
  [Figure: each pattern compared against the database P O T A S T P O T A T O at successive start positions]
  Observations:
  When we mismatch, we (should) know something about where the next match will be.
  When there is a mismatch, we (should) know something about other patterns in the dictionary as well.
The Trie Automaton
  Construct an automaton A from the dictionary.
  A[v,x] describes the transition from node v to a node w upon reading x.
  Example: A[u,T] = v, and A[u,S] = w.
  Special root node r.
  Some nodes are terminal, and labeled with the index of the dictionary word.
  [Figure: keyword tree rooted at r for 1:POTATO, 2:POTASSIUM, 3:TASTE, with nodes u, v, w marked and terminal nodes labeled 1, 2, 3]
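One way to build such an automaton is sketched below. The representation is an assumption made for illustration: nodes are integers, node 0 plays the role of the root r, `A[v]` is a dict from character to child node, and `terminal` maps a node to its dictionary index.

```python
def build_trie(words):
    """Return (A, terminal) for the keyword tree of `words` (indexed from 1)."""
    A = [{}]            # A[v][x] = node reached from v upon reading x; node 0 is the root
    terminal = {}       # terminal[v] = index of the dictionary word ending at v
    for idx, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:        # create the missing edge and node
                A.append({})
                A[v][x] = len(A) - 1
            v = A[v][x]
        terminal[v] = idx
    return A, terminal

A, terminal = build_trie(["POTATO", "POTASSIUM", "TASTE"])
```

For this dictionary the tree has 17 nodes, because POTATO and POTASSIUM share the prefix POTA, and the node reached by POTA has outgoing edges T and S, like node u on the slide.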
An O(l_p n) algorithm for keyword matching
  Start with the first position in the db, and the root node.
  If successful transition:
    Increment current pointer.
    Move to a new node.
    If terminal node: success.
  Else:
    Retract current pointer.
    Increment start pointer.
    Move to root & repeat.
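The steps above can be sketched as follows. This is a minimal sketch, assuming the same integer-node trie as a plain list of dicts; `scan_with_restarts` is a hypothetical name, l is the start pointer and c the current pointer (0-based here).

```python
def build_trie(words):
    """Keyword tree: A[v][x] is the child of v on x; terminal[v] is a word index."""
    A, terminal = [{}], {}
    for idx, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})
                A[v][x] = len(A) - 1
            v = A[v][x]
        terminal[v] = idx
    return A, terminal

def scan_with_restarts(words, db):
    """Return (start, word_index) hits; each start costs at most l_p steps."""
    A, terminal = build_trie(words)
    hits = []
    for l in range(len(db)):                 # increment start pointer
        v, c = 0, l                          # move to root, retract current pointer
        while c < len(db) and db[c] in A[v]: # successful transition
            v = A[v][db[c]]
            c += 1
            if v in terminal:                # terminal node: success
                hits.append((l, terminal[v]))
    return hits
```

On the slide's database, `scan_with_restarts(["POTATO", "POTASSIUM", "TASTE"], "POTASTPOTATO")` reports POTATO at 0-based position 6; the partial matches POTAS and TAST are found and then discarded, which is the wasted work the failure function removes.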
Illustration
  [Figure: database P O T A S T P O T A T O with start pointer l and current pointer c, and the keyword tree for 1:POTATO, 2:POTASSIUM, 3:TASTE with current node v]
Idea for improving the time
  Suppose we have partially matched pattern i (indicated by l and c), but fail subsequently.
  If some other pattern j is to match, then prefix(pattern j) = suffix(first c-l characters of pattern i).
  Example: after matching POTAS (from POTASSIUM) and failing, TAS is both a suffix of POTAS and a prefix of TASTE.
  [Figure: database P O T A S T P O T A T O with pointers l and c, and patterns i and j aligned beneath it. Dictionary: 1:POTATO 2:POTASSIUM 3:TASTE]
An O(n) alg. for keyword matching
  Start with the first position in the db, and the root node.
  If successful transition:
    Increment current pointer.
    Move to a new node.
    If terminal node: success.
  Else (if at root):
    Increment current pointer.
    Move start pointer.
    Move to root.
  Else:
    Move start pointer forward.
    Move to failure node.
Failure function
  Every node v corresponds to a string s_v that is a prefix of some pattern.
  Define F[v] to be the node u such that s_u is the longest proper suffix of s_v (among strings that correspond to nodes).
  If we fail to match at v, we should jump to F[v], and commence matching from there.
  Let lp[v] = |s_u|.
  [Figure: keyword tree for POTATO, POTASSIUM, TASTE with nodes labeled n_1 ... n_10 and current node v]
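A minimal sketch of the failure function and the scan it enables, assuming an integer-node trie with node 0 as the root. The breadth-first computation of F and the `out` lists (which also report words that end as a suffix of the current node's string) follow the usual Aho-Corasick construction; they are an addition for completeness, not something shown on the slide. All names are illustrative.

```python
from collections import deque

def build_automaton(words):
    """Return (A, F, out): transitions, failure links, word indices ending at each node."""
    A, out = [{}], [[]]
    for idx, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})
                out.append([])
                A[v][x] = len(A) - 1
            v = A[v][x]
        out[v].append(idx)
    F = [0] * len(A)                      # root and its children fail to the root
    queue = deque(A[0].values())
    while queue:                          # breadth-first: shallower nodes are done first
        v = queue.popleft()
        for x, w in A[v].items():
            u = F[v]
            while u and x not in A[u]:    # shorten the suffix until it can be extended by x
                u = F[u]
            F[w] = A[u].get(x, 0)
            out[w] = out[w] + out[F[w]]   # inherit words that end at the suffix node
            queue.append(w)
    return A, F, out

def node_for(A, s):
    """Node reached from the root by reading s (s must be a prefix of some word)."""
    v = 0
    for x in s:
        v = A[v][x]
    return v

def search(words, db):
    """Return (start, word_index) hits in one left-to-right pass over db."""
    A, F, out = build_automaton(words)
    lengths = {i: len(w) for i, w in enumerate(words, start=1)}
    v, hits = 0, []
    for c, x in enumerate(db):
        while v and x not in A[v]:        # mismatch: jump to the failure node
            v = F[v]
        v = A[v].get(x, 0)
        for idx in out[v]:
            hits.append((c - lengths[idx] + 1, idx))
    return hits
```

With the slide's dictionary, the node for POTAS fails to the node for TAS, so after the mismatch in P O T A S T the scan continues from TAS instead of restarting at the root.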
Illustration
  What is F(n_10)? What is F(n_5)? F(n_3)? lp(n_10)?
  [Figure: the same keyword tree with nodes n_1 ... n_10]
Illustration: database P O T A S T P O T A T O, l = 1, c = 1. [Figure: keyword tree with the current node v marked]
Illustration: database P O T A S T P O T A T O, l = 1, c = 2. [Figure: keyword tree with the current node v marked]
Illustration: database P O T A S T P O T A T O, l = 1, c = 6. [Figure: keyword tree with the current node v marked]
Illustration: database P O T A S T P O T A T O, l = 3, c = 6. [Figure: keyword tree with the current node v marked]
Illustration: database P O T A S T P O T A T O, l = 3, c = 7. [Figure: keyword tree with the current node v marked]
Illustration: database P O T A S T P O T A T O, l = 7, c = 7. [Figure: keyword tree with the current node v marked]
Illustration: database P O T A S T P O T A T O, l = 7, c = 8. [Figure: keyword tree with the current node v marked]
Illustration: database P O T A S T P O T A T O, l = 7, c = 7. [Figure: keyword tree with the current node v marked]
Time analysis
  In each step, either c is incremented, or l is incremented.
  Neither pointer is ever decremented (lp[v] < c-l).
  l and c do not exceed n.
  Total time <= 2n.
  [Figure: database P O T A S T P O T A T O with pointers l and c]
Blast: Putting it all together
  Input: query of length m, database of size n.
  Select word-size, scoring matrix, gap penalties, E-value cutoff.
  [Figure: box labeled Blast]
Blast Steps
  1. Generate an automaton of all query keywords.
  2. Scan the database using a Dictionary Matching algorithm (O(n) time). Identify all hits.
  3. Extend each hit using a variant of the local alignment algorithm. Use the scoring matrix and gap penalties.
  4. For each alignment with score S, compute the E-value and the P-value. Sort according to increasing E-value until the cut-off is reached.
  5. Output results.
BLAST output
Distant hits
Family assignment question
  Query A has a distant match to B and C from the database.
  Is A similar to B, or to C?
  Should A inherit the function of B, or of C?
  [Figure: A shown together with B and C]
Silly Quiz Skin patterns Facial Features Fa 07 CSE182
Not all features (residues) are important: skin patterns, facial features
Diverged family members provide key features
Protein sequence motifs
  Premise: the sequence of a protein gives clues about its structure and function.
  Not all residues are equally important in determining function.
  Suppose we knew the key residues of a family. If our query matches in those residues, it is a member. Otherwise, it is not.
  How can we identify these key residues?
  [Figure: A, B, C and the family Fam(B)]
Regular expressions as Protein sequence motifs
  C-X-[DE]-X{10,12}-C-X-C-[STYLV]
  [Figure: Fam(B), with labels A, C, E, V]
The sequence analysis perspective
  Zinc Finger motif (Prosite database): C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
  2 conserved C, and 2 conserved H.
  How can we search a database using these motifs?
  The motif is described using a regular expression. What is a regular expression?
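One practical way to search with such a motif is to translate it into the regular-expression syntax of a programming language. The sketch below assumes only the Prosite elements that appear in this motif family (a residue, x for any residue, [..] for alternatives, {..} for excluded residues, and a repeat count in parentheses); `prosite_to_regex` is a hypothetical helper, not part of any Prosite tool.

```python
import re

# One dash-separated element: residue spec, optionally followed by (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a simple Prosite-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported pattern element: " + element)
        residue, lo, hi = m.groups()
        if residue in ("x", "X"):
            residue = "."                            # any residue
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"     # any residue except these
        if lo is None:
            repeat = ""
        elif hi is None:
            repeat = "{%s}" % lo
        else:
            repeat = "{%s,%s}" % (lo, hi)
        parts.append(residue + repeat)
    return "".join(parts)

ZINC_FINGER = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
```

Here `ZINC_FINGER` is the string `C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H`, and `re.search(ZINC_FINGER, protein)` then reports the first substring of `protein` that matches the motif, if any.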
End of L7
Regular Expressions
  Concise representation of a set of strings over an alphabet Σ.
  Described by a string over Σ together with operator symbols such as + and *.
  R is a r.e. if and only if it has one of these forms, where R_1 and R_2 are regular expressions:
    R = {ε} (base case)
    R = {σ}, σ ∈ Σ (base case)
    R = R_1 + R_2 (union of strings)
    R = R_1 R_2 (concatenation)
    R = R_1* (0 or more repetitions)
Regular Expression
  Q: Let Σ = {A,C,E}. Is (A+C)*EEC* a regular expression? *(A+C)? AC*..E?
  Q: When is a string s in a regular expression? Let R = (A+C)*EEC*. Is CEEC in R? AEC? ACEE?
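The membership questions can be checked mechanically. This sketch rewrites R = (A+C)*EEC* in Python's `re` syntax, where the union written + on the slide becomes |; `in_R` is an illustrative name, and `fullmatch` is used because the whole string must be generated by R.

```python
import re

# (A+C)*EEC* in Python syntax: '+' (union) on the slide is written '|'.
R = re.compile(r"(?:A|C)*EEC*")

def in_R(s):
    """True if the entire string s belongs to the language of R."""
    return R.fullmatch(s) is not None

for s in ["CEEC", "AEC", "ACEE"]:
    print(s, in_R(s))
```

CEEC and ACEE are in R, while AEC is not, because every string in R must contain the two consecutive E's.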
Regular Expression & Automata
  Every R.E. can be expressed by an automaton (a directed graph) with the following properties:
    The automaton has a start and an end node.
    Each edge is labeled with a symbol from Σ, or ε.
  Suppose R is described by automaton A. Then s ∈ R if and only if there is a path from start to end in A, labeled with s.
Examples: Regular Expression & Automata
  (A+C)*EEC*
  [Figure: automaton from start to end with edges labeled A, C, E, E, C]
Constructing automata from R.E.
  [Figure: automaton fragments for each case: R = {ε}; R = {σ}, σ ∈ Σ; R = R_1 + R_2; R = R_1 R_2; R = R_1*. The fragments are joined by ε edges.]
Matching Regular expressions
  A string s belongs to R if and only if there is a path from START to END in R_A, labeled by s.
  Given a regular expression R (automaton R_A) and a database D, is there a substring D[b..c] that matches R_A (D[b..c] ∈ R)?
  Simpler Q: Is D[1..c] accepted by the automaton of R?
Alg. for matching R.E.
  If D[1..c] is accepted by the automaton R_A:
    There is a path labeled D[1]..D[c] that goes from START to END in R_A.
  [Figure: path from START to END with edges labeled D[1], ε, D[2], ..., D[c]]
Alg. for matching R.E.
  If D[1..c] is accepted by the automaton R_A:
    There is a path labeled D[1]..D[c] that goes from START to END in R_A.
    There is a path labeled D[1]..D[c-1] from START to a node u, and a path labeled D[c] from u to END.
  [Figure: path labeled D[1]..D[c-1] from START to u, followed by D[c] from u to END]
D.P. to match regular expression
  Define:
    A[u,σ] = automaton node reached from u after reading σ.
    Eps(u) = set of all nodes reachable from node u using epsilon transitions.
    N[c] = subset of nodes reachable from the START node after reading D[1..c].
  Q: When is v ∈ N[c]?
  [Figure: node u, an edge labeled σ to v, and ε edges into Eps(u)]
D.P. to match regular expression
  Q: When is v ∈ N[c]?
  A: If for some u ∈ N[c-1], w = A[u,D[c]] and v ∈ {w} ∪ Eps(w).
Algorithm
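One way to turn the recurrence for N[c] into code is sketched below. It is not a transcription of the slide's algorithm: the automaton for (A+C)*EEC* is written out by hand with assumed node names (START, q1, q2, q3, END), and A[u,σ] is stored as a set of nodes so that nondeterministic automata also work.

```python
# Hand-built automaton for (A+C)*EEC*; node names are illustrative.
A = {
    ("q1", "A"): {"q1"}, ("q1", "C"): {"q1"},   # (A+C)* loop
    ("q1", "E"): {"q2"}, ("q2", "E"): {"q3"},   # EE
    ("q3", "C"): {"q3"},                        # C* loop
}
EPS = {"START": {"q1"}, "q3": {"END"}}          # epsilon transitions

def eps_closure(nodes, eps):
    """The given nodes plus everything reachable from them by epsilon transitions."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        u = stack.pop()
        for v in eps.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def reachable_sets(D, A, eps, start="START"):
    """Return the list N, where N[c] holds the nodes reachable after reading D[1..c]."""
    N = [eps_closure({start}, eps)]              # N[0]: nothing read yet
    for sigma in D:
        step = set()
        for u in N[-1]:                          # some u in N[c-1] ...
            step |= A.get((u, sigma), set())     # ... with w = A[u, D[c]]
        N.append(eps_closure(step, eps))         # v in {w} or Eps(w)
    return N

def accepts(D, A, eps, start="START", end="END"):
    """True if the whole string D is accepted, i.e. END is in N[len(D)]."""
    return end in reachable_sets(D, A, eps, start)[-1]
```

Each step costs time proportional to the size of the automaton, so checking D[1..c] takes time proportional to c times the automaton size.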
The final step
  We have answered the question: is D[1..c] accepted by R? Yes, if END ∈ N[c].
  We need to answer: is D[l..c] (for some l and some c) accepted by R?
  D[l..c] ∈ R for some l if and only if D[1..c] ∈ Σ*R.
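The Σ*R idea can be sketched by letting a new match begin at every position: after each character, the START node (with its epsilon closure) is added back into the current node set. The automaton and all names below are illustrative assumptions, using (A+C)*EEC* as the example R.

```python
# Hand-built automaton for (A+C)*EEC*; node names are illustrative.
A = {
    ("q1", "A"): {"q1"}, ("q1", "C"): {"q1"},
    ("q1", "E"): {"q2"}, ("q2", "E"): {"q3"},
    ("q3", "C"): {"q3"},
}
EPS = {"START": {"q1"}, "q3": {"END"}}

def eps_closure(nodes, eps):
    """The given nodes plus everything reachable from them by epsilon transitions."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        u = stack.pop()
        for v in eps.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def match_ends(D, A, eps, start="START", end="END"):
    """Return every c (1-based) such that some non-empty D[l..c] is accepted by R."""
    fresh = eps_closure({start}, eps)    # where a match starting right here begins
    current = set(fresh)
    ends = []
    for c, sigma in enumerate(D, start=1):
        step = set()
        for u in current:
            step |= A.get((u, sigma), set())
        reached = eps_closure(step, eps)
        if end in reached:
            ends.append(c)
        current = reached | fresh        # the Σ* prefix: a match may also start at c+1
    return ends
```

For D = AAEECA this returns [4, 5]: the substrings EE, AEE and AAEE end at position 4 and EEC ends at position 5, while nothing in R ends at position 6.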
END of L7