CSE182-L7. Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding CSE182


CSE182-L7 Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding 10-07 CSE182

Bell Labs Honors Pattern matching 10-07 CSE182

Just the Facts
  Consider the set of all substrings of the query string of fixed length W.
  The probability of an exact match to a random database string is very low.
  The probability of an exact match to a true homolog is very high.
  Keyword search (exact matches) is MUCH faster than sequence alignment.
10/28/14 CSE182

Speeding up via an exact-match heuristic
  Consider a query string of length m and a database string of length n.
  Start by looking for exact matches of keywords of length W between the query and the database string.
  Wherever there is an exact match, perform a Smith-Waterman (SW) local alignment.

Why is BLAST fast?
  Assume that keyword searching does not consume any time and that alignment computation is the expensive step.
  Query m = 1000, random database n = 10^7, no true positives.
  SW = O(nm) = 1000 * 10^7 = 10^10 computations
  BLAST, W = 11:
    E(#11-mer hits) = 1000 * (1/4)^11 * 10^7 = 2384
    Number of computations = 2384 * 100 * 100 = 2.384 * 10^7
  Ratio = 10^10 / (2.384 * 10^7) = 420
  Further speed improvements are possible.
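The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch of the estimate: the variable names are made up for illustration, and the factor 100 * 100 is read here as an assumed 100 x 100 alignment window around each hit, which the slide does not state explicitly.

```python
# Back-of-the-envelope cost comparison, using the numbers from the slide.
m = 1000        # query length
n = 10**7       # database length (random DNA, no true positives)
W = 11          # keyword length
window = 100    # ASSUMPTION: side of the alignment window around each hit

sw_cost = n * m                                # full Smith-Waterman: O(nm)
expected_hits = m * n * (1 / 4) ** W           # chance 11-mer hits in random DNA
blast_cost = expected_hits * window * window   # one small alignment per hit
speedup = sw_cost / blast_cost

print(f"expected hits : {expected_hits:.0f}")
print(f"BLAST cost    : {blast_cost:.3e}")
print(f"speed-up      : {speedup:.1f}")
```

The unrounded ratio comes out slightly below the slide's rounded value of 420.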

Keyword (Dictionary) Matching
  How fast can we match keywords?
  Hash table/db index? What is the size of the hash table, for m=11?
  Suffix trees? What is the size of the suffix trees?
  Trie based search. We will do this in class.
[Figure: index entry AATCA, 567]
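A minimal sketch of the hash-table option, assuming a DNA alphabet of four letters and a plain Python dict as the index. The function names and the toy strings are hypothetical. With keywords of length 11 there are 4^11 possible keys, which bounds the size of a direct-address table.

```python
from collections import defaultdict

def build_kmer_index(db, w):
    """Map every length-w substring of db to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(db) - w + 1):
        index[db[i:i + w]].append(i)
    return dict(index)

def keyword_hits(query, index, w):
    """Return (query_pos, db_pos) pairs where a length-w keyword matches exactly."""
    hits = []
    for q in range(len(query) - w + 1):
        for d in index.get(query[q:q + w], []):
            hits.append((q, d))
    return hits

db = "GGAATCATTAATCAG"            # toy database, 0-based positions
index = build_kmer_index(db, 5)
hits = keyword_hits("CCAATCACC", index, 5)
table_size = 4 ** 11              # number of distinct DNA 11-mers
print(hits, table_size)
```

The dict only stores keywords that actually occur, so its size is bounded by the database length rather than by 4^11.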

The last step in Blast
  We have discussed:
    Alignments
    Db filtering using keywords
    Scoring matrices
    E-values and P-values
  The last step: database filtering requires us to scan a large sequence fast for matching keywords.

Dictionary Matching
  Dictionary: 1:POTATO 2:POTASSIUM 3:TASTE
  Database: P O T A S T P O T A T O
  Q: Given k words (s_i has length l_i), and a database of size n, find all matches to these words in the database string.
  How fast can this be done?

Dict. Matching & string matching
  How fast can you do it, if you only had one word of length m?
    Trivial algorithm: O(nm) time
    Pre-processing O(m), search O(n) time
  Dictionary matching
    Trivial algorithm: O((l_1 + l_2 + l_3) n)
    Using a keyword tree: O(l_p n) (l_p is the length of the longest pattern)
    Aho-Corasick: O(n) after preprocessing O(l_1 + l_2 + ...)
  We will consider the most general case.
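The trivial dictionary-matching algorithm can be sketched as follows. The function name is hypothetical and positions are 0-based; each database position is compared against every word, which gives the O((l_1 + l_2 + l_3) n) bound.

```python
def naive_dictionary_match(db, words):
    """Try every word at every database position; report (start, word_index)."""
    matches = []
    for start in range(len(db)):
        for idx, word in enumerate(words, start=1):
            if db.startswith(word, start):   # costs up to len(word) comparisons
                matches.append((start, idx))
    return matches

words = ["POTATO", "POTASSIUM", "TASTE"]
matches = naive_dictionary_match("POTASTPOTATO", words)
print(matches)
```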

Direct Algorithm
[Figure: dictionary words compared against the database P O T A S T P O T A T O at successive start positions]
Observations:
  When we mismatch, we (should) know something about where the next match will be.
  When there is a mismatch, we (should) know something about other patterns in the dictionary as well.

The Trie Automaton
  Construct an automaton A from the dictionary.
  A[v,x] describes the transition from node v to a node w upon reading x.
    Example: A[u,'T'] = v, and A[u,'S'] = w
  Special root node r.
  Some nodes are terminal, and labeled with the index of the dictionary word.
[Figure: trie for 1:POTATO, 2:POTASSIUM, 3:TASTE with root r; node u (after P O T A) branches on T to v and on S to w; terminal nodes are labeled 1, 2, 3]
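One way to hold the trie automaton in memory is a list of per-node transition dictionaries. This is a sketch under that assumption: the node numbering, the `terminal` map and the function name are illustrative and do not come from the slides.

```python
def build_trie(words):
    """Return (A, terminal): A[v][x] is the child of node v on character x,
    terminal[v] is the dictionary index of the word ending at v. Node 0 is the root r."""
    A = [{}]
    terminal = {}
    for idx, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:          # create the child on first use
                A.append({})
                A[v][x] = len(A) - 1
            v = A[v][x]
        terminal[v] = idx
    return A, terminal

A, terminal = build_trie(["POTATO", "POTASSIUM", "TASTE"])

# Walk P-O-T-A from the root to reach the node called u on the slide.
u = 0
for x in "POTA":
    u = A[u][x]
print(len(A), sorted(A[u]), terminal)
```

Shared prefixes such as POTA are stored once, so the trie has 17 nodes for 20 pattern characters.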

An O(l_p n) algorithm for keyword matching
  Start with the first position in the db, and the root node.
  If successful transition:
    Increment current pointer
    Move to a new node
    If terminal node: "success"
  Else:
    Retract current pointer
    Increment start pointer
    Move to root & repeat
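The steps above can be sketched in Python as follows. The trie representation (a list of dicts) and all names are assumptions made for this example. Retracting the current pointer is implicit: each new start position begins its own walk from the root.

```python
def build_trie(words):
    A, terminal = [{}], {}
    for idx, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})
                A[v][x] = len(A) - 1
            v = A[v][x]
        terminal[v] = idx
    return A, terminal

def trie_search(db, words):
    """Restart from the root at every start position: O(l_p * n) time."""
    A, terminal = build_trie(words)
    matches = []
    for start in range(len(db)):             # start pointer l
        v, c = 0, start                      # root node, current pointer c
        while c < len(db) and db[c] in A[v]:
            v = A[v][db[c]]                  # successful transition
            c += 1
            if v in terminal:                # terminal node: report success
                matches.append((start, terminal[v]))
        # mismatch: fall out of the loop, increment start, go back to the root
    return matches

result = trie_search("POTASTPOTATO", ["POTATO", "POTASSIUM", "TASTE"])
print(result)
```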

Illustration
[Figure: pointers l and c on the database P O T A S T P O T A T O, and the current node v in the trie for 1:POTATO, 2:POTASSIUM, 3:TASTE]

Idea for improving the time
  Suppose we have partially matched pattern i (indicated by l and c), but fail subsequently.
  If some other pattern j is to match, then
    prefix(pattern j) = suffix(first c-l characters of pattern i)
[Figure: database P O T A S T P O T A T O with pointers l and c; pattern i = POTASSIUM and pattern j = TASTE aligned below it; dictionary 1:POTATO 2:POTASSIUM 3:TASTE]

An O(n) alg. for keyword matching
  Start with the first position in the db, and the root node.
  If successful transition:
    Increment current pointer
    Move to a new node
    If terminal node: "success"
  Else (if at root):
    Increment current pointer
    Move start pointer
    Move to root
  Else:
    Move start pointer forward
    Move to failure node
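A compact sketch of this scan in the standard Aho-Corasick form. All names are illustrative. Two details go beyond the slide and are labeled as such in the comments: failure links are computed breadth-first, and each node inherits the outputs of its failure node so that words ending inside a longer partial match are also reported.

```python
from collections import deque

def build_automaton(words):
    A, fail, out = [{}], [0], [[]]
    for idx, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})
                fail.append(0)
                out.append([])
                A[v][x] = len(A) - 1
            v = A[v][x]
        out[v].append((idx, len(word)))
    queue = deque(A[0].values())             # depth-1 nodes fail to the root
    while queue:                             # breadth-first (not on the slide)
        v = queue.popleft()
        for x, w in A[v].items():
            f = fail[v]
            while f and x not in A[f]:
                f = fail[f]
            fail[w] = A[f].get(x, 0)
            out[w] = out[w] + out[fail[w]]   # inherit outputs (not on the slide)
            queue.append(w)
    return A, fail, out

def aho_corasick(db, words):
    """Return (start, word_index) for every dictionary word found in db."""
    A, fail, out = build_automaton(words)
    matches, v = [], 0
    for c, x in enumerate(db):
        while v and x not in A[v]:
            v = fail[v]                      # move to failure node
        v = A[v].get(x, 0)                   # successful transition, or stay at root
        for idx, length in out[v]:
            matches.append((c - length + 1, idx))
    return matches

print(aho_corasick("POTASTPOTATO", ["POTATO", "POTASSIUM", "TASTE"]))
```

The database is read once from left to right; the current pointer never moves backwards.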

Failure function
  Every node v corresponds to a string s_v that is a prefix of some pattern.
  Define F[v] to be the node u such that s_u is the longest proper suffix of s_v (among all nodes of the trie).
  If we fail to match at v, we should jump to F[v], and commence matching from there.
  Let lp[v] = |s_u|.
[Figure: trie for POTATO, POTASSIUM, TASTE with nodes labeled n_1 ... n_10 and current node v]
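The definition can be evaluated directly on strings, identifying each trie node v with its string s_v. This is a deliberately naive sketch for checking small examples, not the efficient preprocessing; the helper names are hypothetical.

```python
def trie_nodes(words):
    """Every prefix of every word is the string s_v of some trie node (root = '')."""
    return {w[:k] for w in words for k in range(len(w) + 1)}

def failure(s_v, nodes):
    """s_u for u = F[v]: the longest proper suffix of s_v that is itself a node."""
    for k in range(1, len(s_v) + 1):   # drop 1, 2, ... leading characters
        if s_v[k:] in nodes:
            return s_v[k:]
    return ""                          # the root fails to itself

nodes = trie_nodes(["POTATO", "POTASSIUM", "TASTE"])
for s_v in ["POTAS", "POTAT", "TAST", "POTASS"]:
    s_u = failure(s_v, nodes)
    print(f"s_v={s_v!r:10} s_u={s_u!r:6} lp={len(s_u)}")
```

For example, failing after POTAS leads to the node for TAS, so lp = 3 characters of the database are already matched when the scan resumes.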

Illustration
  What is F(n_10)? What is F(n_5)? F(n_3)? lp(n_10)?
[Figure: trie for POTATO, POTASSIUM, TASTE with nodes labeled n_1 ... n_10]

Illustration
[Figure sequence: the O(n) scan of the database P O T A S T P O T A T O on the trie for POTATO, POTASSIUM, TASTE; v marks the current node. Pointer values on successive slides:]
  l = 1, c = 1
  l = 1, c = 2
  l = 1, c = 6
  l = 3, c = 6
  l = 3, c = 7
  l = 7, c = 7
  l = 7, c = 8
  l = 7, c = 7

Time analysis
  In each step, either c is incremented, or l is incremented.
  Neither pointer is ever decremented (lp[v] < c-l).
  l and c do not exceed n.
  Total time <= 2n
[Figure: pointers l and c on the database P O T A S T P O T A T O]
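The 2n bound can be observed by simulating the two pointers and counting steps. In this sketch the current node is represented by the string db[l:c], and the failure link is followed by advancing l until db[l:c] is again a trie node. Names and the 0-based indexing are assumptions of the example. As on the slide, a match is reported only when the current node itself is terminal.

```python
def scan_with_pointers(db, words):
    nodes = {w[:k] for w in words for k in range(len(w) + 1)}
    l = c = steps = 0
    matches = []
    while c < len(db):
        steps += 1
        if db[l:c] + db[c] in nodes:     # successful transition: c advances
            c += 1
            if db[l:c] in words:
                matches.append((l, db[l:c]))
        elif l == c:                     # failed at the root: both advance
            c += 1
            l += 1
        else:                            # failure link: only l advances
            l += 1
            while db[l:c] not in nodes:
                l += 1
    return matches, steps

db = "POTASTPOTATO"
matches, steps = scan_with_pointers(db, ["POTATO", "POTASSIUM", "TASTE"])
print(matches, steps, "<=", 2 * len(db))
```

Every loop iteration moves l or c forward by at least one position, and neither exceeds n, which is the argument behind the bound.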

Blast: Putting it all together
  Input: Query of length m, database of size n
  Select word-size, scoring matrix, gap penalties, E-value cutoff
  Blast

Blast Steps
  1. Generate an automaton of all query keywords.
  2. Scan database using a Dictionary Matching algorithm (O(n) time). Identify all hits.
  3. Extend each hit using a variant of local alignment algorithm. Use the scoring matrix and gap penalties.
  4. For each alignment with score S, compute E-value, and the P-value. Sort according to increasing E-value until the cut-off is reached.
  5. Output results.

BLAST output

Distant hits

Family assignment question
  Query A has a distant match to B and C from the database.
  Is A similar to B, or to C?
  Should A inherit the function of B, or of C?
[Figure: query A between database sequences B and C]

Silly Quiz Skin patterns Facial Features Fa 07 CSE182

Not all features (residues) are important: skin patterns, facial features

Diverged family members provide key features

Protein sequence motifs
  Premise: The sequence of a protein gives clues about its structure and function.
  Not all residues are equally important in determining function.
  Suppose we knew the key residues of a family. If our query matches in those residues, it is a member. Otherwise, it is not.
  How can we identify these key residues?
[Figure: query A and database sequences B and C; Fam(B) is the family of B]

Regular expressions as Protein sequence motifs
  C-X-[DE]-X{10,12}-C-X-C--[STYLV]
[Figure: the motif highlighted on the members of Fam(B)]

The sequence analysis perspective
  Zinc Finger motif (Prosite database):
    C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
  2 conserved C, and 2 conserved H
  How can we search a database using these motifs?
  The motif is described using a regular expression. What is a regular expression?
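As a rough sketch, a Prosite-style pattern such as the zinc finger motif can be translated into a Python `re` pattern. The converter below is hypothetical and covers only the elements seen here (x, bracketed residue sets, {..} exclusions and repeat counts). It does not handle anchors or other Prosite syntax, and the test sequence is synthetic.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple Prosite-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        # Split off an optional repeat count such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        residue, count = m.group(1), m.group(2)
        if residue in ("x", "X"):
            residue = "."                        # any residue
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"  # excluded residues
        parts.append(residue + ("{" + count + "}" if count else ""))
    return "".join(parts)

zinc_finger = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(zinc_finger)
sequence = "MK" + "CAAC" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"  # synthetic
hit = re.search(regex, sequence)
print(regex)
print(hit.span() if hit else None)
```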

End of L7 10-07 CSE182

Regular Expressions
  Concise representation of a set of strings over an alphabet Σ.
  Described by a string over { Σ, ·, *, + }.
  R is a r.e. if and only if
    R = {ε}              (base case)
    R = {σ}, σ ∈ Σ       (base case)
    R = R_1 + R_2        (union of strings)
    R = R_1 · R_2        (concatenation)
    R = R_1*             (0 or more repetitions)

Regular Expression
  Q: Let Σ = {A,C,E}.
    Is (A+C)*EEC* a regular expression?
    *(A+C)?
    AC*..E?
  Q: When is a string s in a regular expression?
    R = (A+C)*EEC*
    Is CEEC in R? AEC? ACEE?
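The membership questions can be checked with Python's `re` module, writing the union `+` of the slide as `|`. This is only a quick check using a general-purpose regex engine, not the automaton method developed in the lecture.

```python
import re

# R = (A+C)*EEC* in the slide's notation.
R = re.compile(r"(A|C)*EEC*")

# fullmatch asks whether the whole string s belongs to R.
results = {s: bool(R.fullmatch(s)) for s in ["CEEC", "AEC", "ACEE"]}
print(results)
```

CEEC and ACEE are in R; AEC is not, because R requires two consecutive E's.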

Regular Expression & Automata
  Every R.E. can be expressed by an automaton (a directed graph) with the following properties:
    The automaton has a start and an end node.
    Each edge is labeled with a symbol from Σ, or ε.
  Suppose R is described by automaton A.
  s ∈ R if and only if there is a path from start to end in A, labeled with s.

Examples: Regular Expression & Automata
  (A+C)*EEC*
[Figure: automaton with loops labeled A and C at the start node, two consecutive E edges, and a loop labeled C at the end node]

Constructing automata from R.E.
    R = {ε}
    R = {σ}, σ ∈ Σ
    R = R_1 + R_2
    R = R_1 R_2
    R = R_1*
[Figure: one automaton per case; the sub-automata are joined with ε-edges]
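The figure for this slide is not recoverable from the transcript, so the sketch below uses the standard Thompson-style construction, which may differ in detail from the gadgets drawn in the lecture. The class and method names are illustrative. Each method returns a (start, end) pair for the sub-automaton it builds.

```python
import itertools

class NFA:
    def __init__(self):
        self.edges = []                        # (u, label, v); label None means ε
        self._ids = itertools.count()

    def _node(self):
        return next(self._ids)

    def symbol(self, sigma):                   # R = {σ}; sigma=None gives R = {ε}
        s, e = self._node(), self._node()
        self.edges.append((s, sigma, e))
        return s, e

    def union(self, f1, f2):                   # R = R_1 + R_2
        s, e = self._node(), self._node()
        for fs, fe in (f1, f2):
            self.edges += [(s, None, fs), (fe, None, e)]
        return s, e

    def concat(self, f1, f2):                  # R = R_1 R_2
        self.edges.append((f1[1], None, f2[0]))
        return f1[0], f2[1]

    def star(self, f):                         # R = R_1*
        s, e = self._node(), self._node()
        self.edges += [(s, None, f[0]), (f[1], None, e),
                       (f[1], None, f[0]), (s, None, e)]
        return s, e

    def accepts(self, frag, text):
        def closure(nodes):                    # follow ε-edges to a fixed point
            seen, stack = set(nodes), list(nodes)
            while stack:
                u = stack.pop()
                for a, lab, b in self.edges:
                    if a == u and lab is None and b not in seen:
                        seen.add(b)
                        stack.append(b)
            return seen
        current = closure({frag[0]})
        for ch in text:
            current = closure({b for a, lab, b in self.edges
                               if a in current and lab == ch})
        return frag[1] in current

nfa = NFA()
ac_star = nfa.star(nfa.union(nfa.symbol("A"), nfa.symbol("C")))
ee = nfa.concat(nfa.symbol("E"), nfa.symbol("E"))
R = nfa.concat(nfa.concat(ac_star, ee), nfa.star(nfa.symbol("C")))
print([nfa.accepts(R, s) for s in ["CEEC", "AEC", "ACEE"]])
```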

Matching Regular expressions
  A string s belongs to R if and only if there is a path from START to END in R_A, labeled by s.
  Given a regular expression R (automaton R_A), and a database D, is there a string D[b..c] that matches R_A (D[b..c] ∈ R)?
  Simpler Q: Is D[1..c] accepted by the automaton of R?

Alg. for matching R.E.
  If D[1..c] is accepted by the automaton R_A:
    There is a path labeled D[1]...D[c] that goes from START to END in R_A.
    There is a path labeled D[1]..D[c-1] from START to node u, and a path labeled D[c] from u to the END.
[Figure: path from START through edges D[1], ε, D[2], ..., D[c-1] to node u, then D[c] to END]

D.P. to match regular expression
  Define:
    A[u,σ] = automaton node reached from u after reading σ
    Eps(u) = set of all nodes reachable from node u using epsilon transitions
    N[c] = subset of nodes reachable from START node after reading D[1..c]
  Q: when is v ∈ N[c]?
[Figure: node u with a σ-edge to v, and ε-edges from u to the nodes of Eps(u)]

D.P. to match regular expression
  Q: when is v ∈ N[c]?
  A: If for some u ∈ N[c-1], w = A[u,D[c]], and v ∈ {w} ∪ Eps(w)
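The recurrence can be sketched on a small hand-coded automaton for (A+C)*EEC*. The node numbers, the single ε-edge and the table names are assumptions chosen for this example, not the automaton drawn in the lecture. N[0] is taken to be START together with Eps(START).

```python
START, END = 0, 3

# A[(u, sigma)] = node reached from u after reading sigma (hypothetical numbering).
A = {(0, "A"): 0, (0, "C"): 0, (0, "E"): 1, (1, "E"): 2, (3, "C"): 3}
EPS_EDGES = {2: {3}}                     # a single ε-edge from node 2 to END

def eps(u):
    """Eps(u): all nodes reachable from u using only ε-transitions."""
    seen, stack = set(), [u]
    while stack:
        x = stack.pop()
        for y in EPS_EDGES.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen

def reachable_sets(D):
    """N[c] = nodes reachable from START after reading D[1..c], for c = 0..len(D)."""
    N = [{START} | eps(START)]
    for ch in D:
        nxt = set()
        for u in N[-1]:
            w = A.get((u, ch))
            if w is not None:
                nxt |= {w} | eps(w)      # v ∈ {w} ∪ Eps(w)
        N.append(nxt)
    return N

N = reachable_sets("CEEC")
print([END in s for s in N])             # which prefixes D[1..c] are accepted
```

Both CEE and CEEC are accepted prefixes here, since C* may match the empty string.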

Algorithm 10-07 CSE182

The final step
  We have answered the question: Is D[1..c] accepted by R?
    Yes, if END ∈ N[c]
  We need to answer: Is D[l..c] (for some l, and some c) accepted by R?
    D[l..c] ∈ R  ⇔  D[1..c] ∈ Σ*R
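One way to realize the Σ*R trick is to put START back into the reachable set before every character, so that a match may begin at any position. The sketch below assumes the same kind of tables as the D.P. slides, with `eps[u]` already holding the full set Eps(u). The function name, the node numbering and the example database are hypothetical.

```python
def match_end_positions(D, A, eps, start, end):
    """Return all c (1-based) such that some D[l..c] is accepted, i.e. D[1..c] ∈ Σ*R."""
    def close(nodes):
        out = set(nodes)
        for u in nodes:
            out |= eps.get(u, set())     # eps[u] is assumed to be the full Eps(u)
        return out

    ends = []
    N = close({start})
    for c, ch in enumerate(D, start=1):
        N = close({A[(u, ch)] for u in N if (u, ch) in A})
        if end in N:
            ends.append(c)
        N |= close({start})              # Σ*: a new match may begin at the next position
    return ends

# Hand-coded automaton for (A+C)*EEC* (hypothetical node numbering).
A = {(0, "A"): 0, (0, "C"): 0, (0, "E"): 1, (1, "E"): 2, (3, "C"): 3}
eps = {2: {3}}
ends = match_end_positions("EAEECE", A, eps, start=0, end=3)
print(ends)
```

The scan stays a single left-to-right pass over D; only the set of active nodes changes.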

END of L7 10-07 CSE182