CSE182-L7. Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding CSE182

1 CSE182-L7: Protein Sequence Analysis. Patterns (regular expressions), Profiles, HMM, Gene Finding. CSE182

2 Bell Labs Honors Pattern matching CSE182

3 Just the Facts: Consider the set of all substrings of the query string of fixed length W. The probability of an exact match to a random database string is very low. The probability of an exact match to a true homolog is very high. Keyword search (exact matches) is MUCH faster than sequence alignment. 10/28/14 CSE182

4 Speeding up via an exact-match heuristic: Consider a query string of length m and a database string of length n. Start by looking for exact matches of keywords of length W between the query and the database string. Wherever there is an exact match, perform a Smith-Waterman (SW) local alignment.

5 Why is BLAST fast? Assume that keyword searching does not consume any time and that alignment computation is the expensive step. Query $m = 1000$, random database $n = 10^{7}$, no TP. SW: $O(nm) = 1000 \times 10^{7} = 10^{10}$ computations. BLAST, $W = 11$: $E(\#\,\text{11-mer hits}) = 1000 \times (1/4)^{11} \times 10^{7} \approx 2384$. Number of computations $= 2384 \times 100 \times 100 = 2.384 \times 10^{7}$. Ratio $= 10^{10} / (2.384 \times 10^{7}) \approx 420$. Further speed improvements are possible.
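The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch of the slide's estimate: the variable names are illustrative, and the factor 100 x 100 is read here as the size of the alignment computed around each hit, which is an assumption about what the slide intends.

```python
# Back-of-the-envelope comparison using the numbers on the slide.
m = 1000        # query length
n = 10**7       # database length
W = 11          # keyword (word) size
window = 100    # assumed side length of the alignment done around each hit

sw_cost = m * n                               # full Smith-Waterman: O(nm)
expected_hits = m * n * (1 / 4) ** W          # chance 11-mer matches in random DNA
blast_cost = expected_hits * window * window  # one small alignment per hit
ratio = sw_cost / blast_cost

print(f"SW computations:      {sw_cost:.3e}")
print(f"expected 11-mer hits: {expected_hits:.0f}")
print(f"BLAST computations:   {blast_cost:.3e}")
print(f"speed-up ratio:       {ratio:.1f}")
```

The ratio simplifies to $4^{11} / 100^{2}$, so under this model it depends only on the word size and the per-hit window, not on m or n.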

6 Keyword (Dictionary) Matching: How fast can we match keywords? Hash table/db index? What is the size of the hash table for m = 11? Suffix trees? What is the size of the suffix trees? Trie-based search: we will do this in class. [Figure: example index entry, keyword AATCA with value 567.]

7 The last step in BLAST: We have discussed alignments, database filtering using keywords, scoring matrices, and E-values and P-values. The last step: database filtering requires us to scan a large sequence fast for matching keywords.

8 Dictionary Matching. Dictionary: 1: POTATO, 2: POTASSIUM, 3: TASTE. Database: POTASTPOTATO. Q: Given k words (word $s_i$ has length $l_i$) and a database of size n, find all matches to these words in the database string. How fast can this be done?

9 Dictionary matching and string matching. How fast can you do it if you only had one word of length m? The trivial algorithm takes $O(nm)$ time; with $O(m)$ pre-processing, the search takes $O(n)$ time. Dictionary matching: the trivial algorithm takes $O((l_1 + l_2 + l_3)\,n)$ time. Using a keyword tree, it takes $O(l_p\,n)$ time, where $l_p$ is the length of the longest pattern. Aho-Corasick: $O(n)$ after $O(l_1 + l_2 + \dots)$ preprocessing. We will consider the most general case.
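As a baseline, the trivial dictionary-matching algorithm can be sketched as below. The function name and the 1-based reporting convention are illustrative choices, not taken from the slides.

```python
def naive_dictionary_match(words, db):
    """Try every word at every database offset: O((l_1 + ... + l_k) * n) time.

    Returns (word_index, start) pairs, both 1-based.
    """
    hits = []
    for idx, word in enumerate(words, start=1):
        for start in range(len(db) - len(word) + 1):
            if db[start:start + len(word)] == word:
                hits.append((idx, start + 1))
    return hits


words = ["POTATO", "POTASSIUM", "TASTE"]
db = "POTASTPOTATO"
print(naive_dictionary_match(words, db))
```

On the running example only POTATO occurs, starting at position 7 of the database.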

10 Direct Algorithm. [Figure: the database POTASTPOTATO with the pattern POTATO tried at successive offsets.] Observations: When we mismatch, we (should) know something about where the next match will be. When there is a mismatch, we (should) know something about the other patterns in the dictionary as well.

11 The Trie Automaton: Construct an automaton A from the dictionary. A[v,x] describes the transition from node v to a node w upon reading x; for example, A[u,T] = v and A[u,S] = w. There is a special root node r. Some nodes are terminal, and are labeled with the index of the dictionary word. [Figure: trie with root r for the dictionary 1: POTATO, 2: POTASSIUM, 3: TASTE; node u has children v (on T) and w (on S); terminal nodes carry the labels 1, 2, 3.]
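A minimal sketch of this construction follows. The list-of-dictionaries representation and the names `build_trie` and `terminal` are assumptions made for illustration; the slide only fixes the notation A[v,x].

```python
def build_trie(words):
    """Build the automaton as a list of dicts, so that A[v][x] = w.

    Node 0 is the root r. terminal[v] is the 1-based index of the
    dictionary word that ends at node v.
    """
    A = [{}]
    terminal = {}
    for idx, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})           # create a new node
                A[v][x] = len(A) - 1
            v = A[v][x]
        terminal[v] = idx
    return A, terminal


A, terminal = build_trie(["POTATO", "POTASSIUM", "TASTE"])

# Walk to the node reached by "POTA"; it branches on T and on S,
# matching A[u,T] = v and A[u,S] = w on the slide.
u = 0
for x in "POTA":
    u = A[u][x]
print(len(A), sorted(A[u]), terminal)
```

Construction takes time proportional to the total length of the dictionary words.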

12 An $O(l_p\,n)$ algorithm for keyword matching
Start with the first position in the db, and the root node.
If successful transition:
  Increment current pointer
  Move to a new node
  If terminal node: success
Else:
  Retract current pointer
  Increment start pointer
  Move to root & repeat
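The steps above can be sketched as follows, with `l` as the start pointer and `c` as the current pointer. The nested-dictionary trie and the `"$"` terminal marker are illustrative choices, and the sketch assumes `"$"` never occurs in the database.

```python
def build_trie(words):
    """Nested-dict trie; the key "$" stores the index of a word ending here."""
    root = {}
    for idx, word in enumerate(words, start=1):
        node = root
        for x in word:
            node = node.setdefault(x, {})
        node["$"] = idx
    return root


def keyword_tree_search(words, db):
    """Restart from the root at every start position: O(l_p * n) time."""
    root = build_trie(words)
    hits = []
    for l in range(len(db)):            # start pointer
        node, c = root, l               # current pointer is retracted to l
        while c < len(db) and db[c] in node:
            node = node[db[c]]          # successful transition
            c += 1
            if "$" in node:             # terminal node: report (word, 1-based start)
                hits.append((node["$"], l + 1))
    return hits


print(keyword_tree_search(["POTATO", "POTASSIUM", "TASTE"], "POTASTPOTATO"))
```

Each start position walks at most $l_p$ edges before failing or running off the trie, which gives the $O(l_p\,n)$ bound.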

13 Illustration. [Figure: the database POTASTPOTATO with start pointer l and current pointer c, and the trie for 1: POTATO, 2: POTASSIUM, 3: TASTE with the current node v marked.]

14 Idea for improving the time: Suppose we have partially matched pattern i (indicated by l and c), but fail subsequently. If some other pattern j is to match, then a prefix of pattern j must equal a suffix of the first c-l characters of pattern i. [Figure: the database POTASTPOTATO with pointers l and c, pattern i and pattern j aligned beneath it; dictionary 1: POTATO, 2: POTASSIUM, 3: TASTE.]

15 An $O(n)$ algorithm for keyword matching
Start with the first position in the db, and the root node.
If successful transition:
  Increment current pointer
  Move to a new node
  If terminal node: success
Else (if at root):
  Increment current pointer
  Move start pointer
  Move to root
Else:
  Move start pointer forward
  Move to failure node

16 Failure function: Every node v corresponds to a string $s_v$ that is a prefix of some pattern. Define F[v] to be the node u such that $s_u$ is the longest proper suffix of $s_v$ among all nodes of the trie. If we fail to match at v, we should jump to F[v], and commence matching from there. Let $lp[v] = |s_u|$. [Figure: the trie for POTATO, POTASSIUM, TASTE with nodes labeled $n_1, \dots, n_{10}$ and the current node v.]
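One common way to compute F is a breadth-first pass over the trie, as in the Aho-Corasick construction. The sketch below is written under that assumption; the slide does not say how F is computed, and nodes are identified here by their strings $s_v$ rather than by the figure's $n_i$ labels.

```python
from collections import deque


def build_trie(words):
    """Trie as a list of dicts, plus label[v] = the string s_v of node v."""
    A, label = [{}], [""]
    for word in words:
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})
                label.append(label[v] + x)
                A[v][x] = len(A) - 1
            v = A[v][x]
    return A, label


def failure_function(A):
    """F[w] = node whose string is the longest proper suffix of s_w in the trie."""
    F = [0] * len(A)
    queue = deque(A[0].values())        # depth-1 nodes fail to the root
    while queue:
        v = queue.popleft()
        for x, w in A[v].items():
            u = F[v]
            while u and x not in A[u]:  # follow failure links until x can be read
                u = F[u]
            F[w] = A[u].get(x, 0)
            queue.append(w)
    return F


A, label = build_trie(["POTATO", "POTASSIUM", "TASTE"])
F = failure_function(A)
node = {s: v for v, s in enumerate(label)}
for s in ["POTA", "POTAS", "POTAT", "TAST"]:
    u = F[node[s]]
    print(f"F[{s}] = {label[u] or 'root'}, lp = {len(label[u])}")
```

For instance, failing after POTAS jumps to the node for TAS, so three characters of the match are kept.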

17 Illustration: What is $F(n_{10})$? What is $F(n_5)$? $F(n_3)$? $lp(n_{10})$? [Figure: the trie for POTATO, POTASSIUM, TASTE with nodes labeled $n_1, \dots, n_{10}$.]

18 Illustration: database POTASTPOTATO, l = 1, c = 1. [Figure: the trie with the current node v marked.]

19 Illustration: database POTASTPOTATO, l = 1, c = 2. [Figure: the trie with the current node v marked.]

20 Illustration: database POTASTPOTATO, l = 1, c = 6. [Figure: the trie with the current node v marked.]

21 Illustration: database POTASTPOTATO, l = 3, c = 6. [Figure: the trie with the current node v marked.]

22 Illustration: database POTASTPOTATO, l = 3, c = 7. [Figure: the trie with the current node v marked.]

23 Illustration: database POTASTPOTATO, l = 7, c = 7. [Figure: the trie with the current node v marked.]

24 Illustration: database POTASTPOTATO, l = 7, c = 8. [Figure: the trie with the current node v marked.]

25 Illustration: database POTASTPOTATO, l = 7, c = 7. [Figure: the trie with the current node v marked.]

26 Time analysis: In each step, either c is incremented, or l is incremented. Neither pointer is ever decremented ($lp[v] < c - l$). l and c do not exceed n. Total time $\le 2n$. [Figure: the database POTASTPOTATO with pointers l and c.]
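The $O(n)$ search and this counting argument can be sketched together. This is an illustration under stated assumptions, not the lecture's code: the helper names are made up, failure links are built breadth-first as in Aho-Corasick, and matches are reported only at terminal nodes, so a pattern that is a proper substring of another pattern could be missed.

```python
from collections import deque


def build_automaton(words):
    """Return trie A, terminal labels, failure links F, and node depths."""
    A, depth, terminal = [{}], [0], {}
    for idx, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})
                depth.append(depth[v] + 1)
                A[v][x] = len(A) - 1
            v = A[v][x]
        terminal[v] = idx
    F = [0] * len(A)
    queue = deque(A[0].values())
    while queue:
        v = queue.popleft()
        for x, w in A[v].items():
            u = F[v]
            while u and x not in A[u]:
                u = F[u]
            F[w] = A[u].get(x, 0)
            queue.append(w)
    return A, terminal, F, depth


def linear_search(words, db):
    """Scan db once; return (hits, steps) with hits as (word, 1-based start)."""
    A, terminal, F, depth = build_automaton(words)
    hits, steps = [], 0
    v, l, c = 0, 0, 0                  # node, start pointer, current pointer
    while c < len(db):
        steps += 1
        if db[c] in A[v]:              # successful transition: c advances
            v = A[v][db[c]]
            c += 1
            if v in terminal:
                hits.append((terminal[v], l + 1))
        elif v == 0:                   # failure at the root: c and l advance
            c += 1
            l = c
        else:                          # failure elsewhere: only l advances
            v = F[v]
            l = c - depth[v]           # depth[v] plays the role of lp
    return hits, steps


hits, steps = linear_search(["POTATO", "POTASSIUM", "TASTE"], "POTASTPOTATO")
print(hits, steps)
```

Every iteration advances c or l and neither moves backward, so the step count stays at or below 2n.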

27 BLAST: Putting it all together. Input: query of length m, database of size n. Select word size, scoring matrix, gap penalties, and E-value cutoff.

28 BLAST Steps
1. Generate an automaton of all query keywords.
2. Scan the database using a dictionary matching algorithm ($O(n)$ time). Identify all hits.
3. Extend each hit using a variant of the local alignment algorithm. Use the scoring matrix and gap penalties.
4. For each alignment with score S, compute the E-value and the P-value. Sort according to increasing E-value until the cut-off is reached.
5. Output results.
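A toy seed-and-extend sketch of steps 1 to 3 is shown below. It deliberately simplifies: a Python dictionary stands in for the keyword automaton, extension is ungapped with fixed match and mismatch scores and a simple drop-off cutoff, and no E-values are computed. All names, parameters, and scores are illustrative assumptions, not BLAST's actual implementation.

```python
def extend(query, db, i, j, step, match, mismatch, x_drop):
    """Ungapped extension from (i, j) in direction step (+1 or -1).

    Returns (best_score, best_length); stops once the running score
    falls more than x_drop below the best score seen so far.
    """
    score = best = best_len = k = 0
    while 0 <= i + k * step < len(query) and 0 <= j + k * step < len(db):
        score += match if query[i + k * step] == db[j + k * step] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            break
    return best, best_len


def seed_and_extend(query, db, W=4, match=1, mismatch=-1, x_drop=3):
    # Step 1: index all length-W keywords of the query.
    index = {}
    for i in range(len(query) - W + 1):
        index.setdefault(query[i:i + W], []).append(i)
    # Step 2: scan the database for exact keyword hits.
    results = set()
    for j in range(len(db) - W + 1):
        for i in index.get(db[j:j + W], []):
            # Step 3: extend each hit to the right and to the left.
            right, right_len = extend(query, db, i + W, j + W, +1, match, mismatch, x_drop)
            left, left_len = extend(query, db, i - 1, j - 1, -1, match, mismatch, x_drop)
            score = W * match + left + right
            length = left_len + W + right_len
            results.add((score, i - left_len, j - left_len, length))
    # Steps 4-5, simplified: rank by raw score instead of E-value.
    return sorted(results, reverse=True)


results = seed_and_extend("ACGTACGT", "TTACGTACGTTT")
print(results[0])   # (score, query start, db start, length), 0-based starts
```

Storing results in a set merges seeds that extend to the same segment, which is why several overlapping keyword hits yield a single top-scoring entry.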

29 BLAST output

30 Distant hits

31 Family assignment question: Query A has a distant match to B and C from the database. Is A similar to B, or to C? Should A inherit the function of B, or of C? [Figure: A, B, and C.]

32 Silly Quiz Skin patterns Facial Features Fa 07 CSE182

33 Not all features (residues) are important. [Figure labels: Skin patterns, Facial Features.]

34 Diverged family members provide key features.

35 Protein sequence motifs. Premise: The sequence of a protein gives clues about its structure and function. Not all residues are equally important in determining function. Suppose we knew the key residues of a family. If our query matches in those residues, it is a member. Otherwise, it is not. How can we identify these key residues? [Figure: query A, database sequences B and C, and the family Fam(B).]

36 Regular expressions as protein sequence motifs: C-X-[DE]-X{10,12}-C-X-C--[STYLV] [Figure: the family Fam(B).]

37 The sequence analysis perspective. Zinc finger motif (Prosite database): C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H, with 2 conserved C and 2 conserved H. How can we search a database using these motifs? The motif is described using a regular expression. What is a regular expression?

38 End of L CSE182

39 Regular Expressions: A concise representation of a set of strings over an alphabet $\Sigma$, described by a string over $\{\Sigma, \cdot, *, +\}$. R is a regular expression if and only if one of the following holds. Base cases: $R = \{\varepsilon\}$, or $R = \{\sigma\}$ with $\sigma \in \Sigma$. Union of strings: $R = R_1 + R_2$. Concatenation: $R = R_1 \cdot R_2$. 0 or more repetitions: $R = R_1^{*}$.

40 Regular Expression. Q: Let $\Sigma = \{A, C, E\}$. Is (A+C)*EEC* a regular expression? *(A+C)? AC*..E? Q: When is a string s in a regular expression? For R = (A+C)*EEC*, is CEEC in R? AEC? ACEE?
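The membership questions for R = (A+C)*EEC* can be checked with Python's `re` module, writing the union A+C as the character class `[AC]`. This is only a convenience check using a library matcher, not the automaton-based method developed in the following slides.

```python
import re

# (A+C)*EEC* in Python syntax: '+' (union) becomes a character class.
R = re.compile(r"[AC]*EEC*")


def in_R(s):
    """True if the whole string s belongs to the set described by R."""
    return R.fullmatch(s) is not None


for s in ["CEEC", "AEC", "ACEE"]:
    print(s, in_R(s))
```

`fullmatch` is used because membership requires the entire string to match; AEC fails since R demands two consecutive E's.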

41 Regular Expression & Automata: Every R.E. can be expressed by an automaton (a directed graph) with the following properties. The automaton has a start node and an end node. Each edge is labeled with a symbol from $\Sigma$, or with $\varepsilon$. Suppose R is described by automaton A. Then $s \in R$ if and only if there is a path from start to end in A, labeled with s.

42 Examples: Regular Expression & Automata. (A+C)*EEC* [Figure: automaton from start to end with edges labeled A, C, E, E, and C.]

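43 Constructing automata from R.E.: one construction for each case, $R = \{\varepsilon\}$; $R = \{\sigma\}$, $\sigma \in \Sigma$; $R = R_1 + R_2$; $R = R_1 R_2$; $R = R_1^{*}$. [Figure: the automaton for each case, with sub-automata joined by $\varepsilon$ edges.]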

44 Matching Regular expressions: A string s belongs to R if and only if there is a path from START to END in $R_A$, labeled by s. Given a regular expression R (automaton $R_A$) and a database D, is there a string D[b..c] that matches $R_A$ (that is, $D[b..c] \in R$)? Simpler Q: Is D[1..c] accepted by the automaton of R?

45 Algorithm for matching R.E.: If D[1..c] is accepted by the automaton $R_A$, there is a path labeled D[1] ... D[c] that goes from START to END in $R_A$. [Figure: a path with edges labeled D[1], $\varepsilon$, D[2], ..., D[c].]

46 Algorithm for matching R.E.: If D[1..c] is accepted by the automaton $R_A$, there is a path labeled D[1] ... D[c] that goes from START to END in $R_A$. Equivalently, there is a path labeled D[1]..D[c-1] from START to a node u, and a path labeled D[c] from u to END. [Figure: a path labeled D[1]..D[c-1] into u, followed by an edge labeled D[c].]

47 D.P. to match regular expression. Define: A[u,σ] = automaton node reached from u after reading σ. Eps(u) = set of all nodes reachable from node u using epsilon transitions. N[c] = subset of nodes reachable from the START node after reading D[1..c]. Q: when is $v \in N[c]$? [Figure: node u with a σ edge to v, and its ε-reachable set Eps(u).]

48 D.P. to match regular expression. Q: when is $v \in N[c]$? A: If for some $u \in N[c-1]$, $w = A[u, D[c]]$ and $v \in \{w\} \cup \mathrm{Eps}(w)$.
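The recurrence for N[c] can be sketched on a small hand-built automaton for (A+C)*EEC*. The node numbering, the `delta` and `eps` tables, and the use of sets of target nodes (so that nondeterministic automata also work) are assumptions made for illustration; they are not read off the lecture's figure.

```python
# Hand-built automaton for (A+C)*EEC*; node numbers are illustrative.
START, END = 0, 3
delta = {                      # delta[u][sigma] = nodes reached from u on sigma
    0: {"A": {0}, "C": {0}, "E": {1}},
    1: {"E": {2}},
    2: {"C": {2}},
    3: {},
}
eps = {2: {3}}                 # epsilon edges: node 2 can move to END for free


def eps_closure(nodes):
    """The given nodes plus everything reachable via epsilon edges (Eps)."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        u = stack.pop()
        for v in eps.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def reachable_sets(D):
    """Return the list N, where N[c] = nodes reachable after reading D[1..c]."""
    N = [eps_closure({START})]
    for ch in D:
        nxt = set()
        for u in N[-1]:                      # u in N[c-1]
            nxt |= delta[u].get(ch, set())   # w = A[u, D[c]]
        N.append(eps_closure(nxt))           # v in {w} + Eps(w)
    return N


for D in ["CEEC", "AEC"]:
    N = reachable_sets(D)
    print(D, END in N[-1], [sorted(s) for s in N])
```

Each step touches every node and edge at most once, so the running time is proportional to the database length times the size of the automaton.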

49 Algorithm CSE182

50 The final step: We have answered the question: Is D[1..c] accepted by R? Yes, if $\mathrm{END} \in N[c]$. We need to answer: Is D[l..c] (for some l, and some c) accepted by R? Observe that $D[l..c] \in R$ for some l if and only if $D[1..c] \in \Sigma^{*} R$.
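Matching against $\Sigma^{*} R$ amounts to letting a match begin at any position, which can be done by adding START back into the node set after every character. The sketch below uses a hand-built automaton for (A+C)*EEC*; the tables and names are illustrative assumptions.

```python
# Hand-built automaton for (A+C)*EEC*; node numbers are illustrative.
START, END = 0, 3
delta = {
    0: {"A": {0}, "C": {0}, "E": {1}},
    1: {"E": {2}},
    2: {"C": {2}},
    3: {},
}
eps = {2: {3}}


def eps_closure(nodes):
    """The given nodes plus everything reachable via epsilon edges."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        u = stack.pop()
        for v in eps.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def match_end_positions(D):
    """1-based positions c such that D[l..c] is in R for some l."""
    current = eps_closure({START})
    ends = []
    for c, ch in enumerate(D, start=1):
        nxt = set()
        for u in current:
            nxt |= delta[u].get(ch, set())
        # Re-adding START simulates the Sigma* prefix: a match may start anywhere.
        current = eps_closure(nxt | {START})
        if END in current:
            ends.append(c)
    return ends


print(match_end_positions("EAEECCA"))
```

This reports where matches end; recovering the start l of a match would need extra bookkeeping that the slide does not cover.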

51 END of L CSE182


More information

Languages, regular languages, finite automata

Languages, regular languages, finite automata Notes on Computer Theory Last updated: January, 2018 Languages, regular languages, finite automata Content largely taken from Richards [1] and Sipser [2] 1 Languages An alphabet is a finite set of characters,

More information

Advanced Automata Theory 7 Automatic Functions

Advanced Automata Theory 7 Automatic Functions Advanced Automata Theory 7 Automatic Functions Frank Stephan Department of Computer Science Department of Mathematics National University of Singapore fstephan@comp.nus.edu.sg Advanced Automata Theory

More information

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline

More information

CS 455/555: Finite automata

CS 455/555: Finite automata CS 455/555: Finite automata Stefan D. Bruda Winter 2019 AUTOMATA (FINITE OR NOT) Generally any automaton Has a finite-state control Scans the input one symbol at a time Takes an action based on the currently

More information

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17: Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:50 5001 5 Multiple Sequence Alignment The first part of this exposition is based on the following sources, which are recommended reading:

More information

{a, b, c} {a, b} {a, c} {b, c} {a}

{a, b, c} {a, b} {a, c} {b, c} {a} Section 4.3 Order Relations A binary relation is an partial order if it transitive and antisymmetric. If R is a partial order over the set S, we also say, S is a partially ordered set or S is a poset.

More information

Bioinformatics and BLAST

Bioinformatics and BLAST Bioinformatics and BLAST Overview Recap of last time Similarity discussion Algorithms: Needleman-Wunsch Smith-Waterman BLAST Implementation issues and current research Recap from Last Time Genome consists

More information

Algorithms: COMP3121/3821/9101/9801

Algorithms: COMP3121/3821/9101/9801 NEW SOUTH WALES Algorithms: COMP3121/3821/9101/9801 Aleks Ignjatović School of Computer Science and Engineering University of New South Wales LECTURE 8: STRING MATCHING ALGORITHMS COMP3121/3821/9101/9801

More information

Deterministic Finite Automata (DFAs)

Deterministic Finite Automata (DFAs) CS/ECE 374: Algorithms & Models of Computation, Fall 28 Deterministic Finite Automata (DFAs) Lecture 3 September 4, 28 Chandra Chekuri (UIUC) CS/ECE 374 Fall 28 / 33 Part I DFA Introduction Chandra Chekuri

More information

Efficient High-Similarity String Comparison: The Waterfall Algorithm

Efficient High-Similarity String Comparison: The Waterfall Algorithm Efficient High-Similarity String Comparison: The Waterfall Algorithm Alexander Tiskin Department of Computer Science University of Warwick http://go.warwick.ac.uk/alextiskin Alexander Tiskin (Warwick)

More information

OpenFst: An Open-Source, Weighted Finite-State Transducer Library and its Applications to Speech and Language. Part I. Theory and Algorithms

OpenFst: An Open-Source, Weighted Finite-State Transducer Library and its Applications to Speech and Language. Part I. Theory and Algorithms OpenFst: An Open-Source, Weighted Finite-State Transducer Library and its Applications to Speech and Language Part I. Theory and Algorithms Overview. Preliminaries Semirings Weighted Automata and Transducers.

More information

CISC 4090: Theory of Computation Chapter 1 Regular Languages. Section 1.1: Finite Automata. What is a computer? Finite automata

CISC 4090: Theory of Computation Chapter 1 Regular Languages. Section 1.1: Finite Automata. What is a computer? Finite automata CISC 4090: Theory of Computation Chapter Regular Languages Xiaolan Zhang, adapted from slides by Prof. Werschulz Section.: Finite Automata Fordham University Department of Computer and Information Sciences

More information

Peter Wood. Department of Computer Science and Information Systems Birkbeck, University of London Automata and Formal Languages

Peter Wood. Department of Computer Science and Information Systems Birkbeck, University of London Automata and Formal Languages and and Department of Computer Science and Information Systems Birkbeck, University of London ptw@dcs.bbk.ac.uk Outline and Doing and analysing problems/languages computability/solvability/decidability

More information

Sequence Bioinformatics. Multiple Sequence Alignment Waqas Nasir

Sequence Bioinformatics. Multiple Sequence Alignment Waqas Nasir Sequence Bioinformatics Multiple Sequence Alignment Waqas Nasir 2010-11-12 Multiple Sequence Alignment One amino acid plays coy; a pair of homologous sequences whisper; many aligned sequences shout out

More information

Algorithms for Molecular Biology

Algorithms for Molecular Biology Algorithms for Molecular Biology BioMed Central Research A polynomial time biclustering algorithm for finding approximate expression patterns in gene expression time series Sara C Madeira* 1,2,3 and Arlindo

More information

CSE : Computational Issues in Molecular Biology. Lecture 6. Spring 2004

CSE : Computational Issues in Molecular Biology. Lecture 6. Spring 2004 CSE 397-497: Computational Issues in Molecular Biology Lecture 6 Spring 2004-1 - Topics for today Based on premise that algorithms we've studied are too slow: Faster method for global comparison when sequences

More information

Lecture 4 : Adaptive source coding algorithms

Lecture 4 : Adaptive source coding algorithms Lecture 4 : Adaptive source coding algorithms February 2, 28 Information Theory Outline 1. Motivation ; 2. adaptive Huffman encoding ; 3. Gallager and Knuth s method ; 4. Dictionary methods : Lempel-Ziv

More information

CFG PSA Algorithm. Sequence Alignment Guided By Common Motifs Described By Context Free Grammars

CFG PSA Algorithm. Sequence Alignment Guided By Common Motifs Described By Context Free Grammars FG PS lgorithm Sequence lignment Guided By ommon Motifs Described By ontext Free Grammars motivation Find motifs- conserved regions that indicate a biological function or signature. Other algorithm do

More information

Deterministic Finite Automata (DFAs)

Deterministic Finite Automata (DFAs) Algorithms & Models of Computation CS/ECE 374, Fall 27 Deterministic Finite Automata (DFAs) Lecture 3 Tuesday, September 5, 27 Sariel Har-Peled (UIUC) CS374 Fall 27 / 36 Part I DFA Introduction Sariel

More information

Single alignment: Substitution Matrix. 16 march 2017

Single alignment: Substitution Matrix. 16 march 2017 Single alignment: Substitution Matrix 16 march 2017 BLOSUM Matrix BLOSUM Matrix [2] (Blocks Amino Acid Substitution Matrices ) It is based on the amino acids substitutions observed in ~2000 conserved block

More information

INF 4130 / /8-2014

INF 4130 / /8-2014 INF 4130 / 9135 26/8-2014 Mandatory assignments («Oblig-1», «-2», and «-3»): All three must be approved Deadlines around: 25. sept, 25. oct, and 15. nov Other courses on similar themes: INF-MAT 3370 INF-MAT

More information

Hashing Techniques For Finite Automata

Hashing Techniques For Finite Automata Hashing Techniques For Finite Automata Hady Zeineddine Logic Synthesis Course Project - Spring 2007 Professor Adnan Aziz 1. Abstract This report presents two hashing techniques - Alphabet and Power-Set

More information

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity.

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity. A global picture of the protein universe will help us to understand

More information

CS21 Decidability and Tractability

CS21 Decidability and Tractability CS21 Decidability and Tractability Lecture 2 January 5, 2018 January 5, 2018 CS21 Lecture 2 1 Outline Finite Automata Nondeterministic Finite Automata Closure under regular operations NFA, FA equivalence

More information

Mining Emerging Substrings

Mining Emerging Substrings Mining Emerging Substrings Sarah Chan Ben Kao C.L. Yip Michael Tang Department of Computer Science and Information Systems The University of Hong Kong {wyschan, kao, clyip, fmtang}@csis.hku.hk Abstract.

More information

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

More information

CSE 311: Foundations of Computing. Lecture 23: Finite State Machine Minimization & NFAs

CSE 311: Foundations of Computing. Lecture 23: Finite State Machine Minimization & NFAs CSE : Foundations of Computing Lecture : Finite State Machine Minimization & NFAs State Minimization Many different FSMs (DFAs) for the same problem Take a given FSM and try to reduce its state set by

More information

Bio nformatics. Lecture 3. Saad Mneimneh

Bio nformatics. Lecture 3. Saad Mneimneh Bio nformatics Lecture 3 Sequencing As before, DNA is cut into small ( 0.4KB) fragments and a clone library is formed. Biological experiments allow to read a certain number of these short fragments per

More information

Optimizing Finite Automata

Optimizing Finite Automata Optimizing Finite Automata We can improve the DFA created by MakeDeterministic. Sometimes a DFA will have more states than necessary. For every DFA there is a unique smallest equivalent DFA (fewest states

More information

Converting SLP to LZ78 in almost Linear Time

Converting SLP to LZ78 in almost Linear Time CPM 2013 Converting SLP to LZ78 in almost Linear Time Hideo Bannai 1, Paweł Gawrychowski 2, Shunsuke Inenaga 1, Masayuki Takeda 1 1. Kyushu University 2. Max-Planck-Institut für Informatik Recompress SLP

More information

Hierarchical Overlap Graph

Hierarchical Overlap Graph Hierarchical Overlap Graph B. Cazaux and E. Rivals LIRMM & IBC, Montpellier 8. Feb. 2018 arxiv:1802.04632 2018 B. Cazaux & E. Rivals 1 / 29 Overlap Graph for a set of words Consider the set P := {abaa,

More information

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between

More information