Approximate Matching

Size: px
Start display at page:

Download "Approximate Matching"

Transcription

1 Approximate Matching Ben Langmead Department of Computer Science lease sign guestbook ( to tell me briefly how you are using the slides. For original Keynote files, me

2 Approximate matching Strings Online algorithms Naive exact matching Offline algorithms Index-assisted Boyer-Moore k-mer index We have focused on exact matching in reality, we have to deal with differences

3 Read CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC Reference GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTT CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTC GCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT ACAGGCGAACATACTTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATA ACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATCATAACAAAAAATTTCCACCA AACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCAAACCCCAAAA ACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCAC TTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAAT CTCATCAATACAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATA CCCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAA GCAATACACTGACCCGCTCAAACTCCTGGATTTTGGATCCACCCAGCGCCTTGGCCTAAA CTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAAGCATCCCCGTTCCAGTGAGT TCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGCAATGCAGCTC AAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAATAA ACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCGTGCCAGCCACCGC GGTCACACGATTAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCC TCCCCAATAAAGCTAAAACTCACCTGAGTTGTAAAAAACTCCAGTTGACACAAAATAGAC TACGAAAGTGGCTTTAACATATCTGAACACACAATAGCTAAGACCCAAACTGGGATTAGA TACCCCACTATGCTTAGCCCTAAACCTCAACAGTTAAATCAACAAAACTGCTCGCCAGAA CACTACGAGCCACAGCTTAAAACTCAAAGGACCTGGCGGTGCTTCATATCCCTCTAGAGG AGCCTGTTCTGTAATCGATAAACCCCGATCAACCTCACCACCTCTTGCTCAGCCTATATA CCGCCATCTTCAGCAAACCCTGATGAAGGCTACAAAGTAAGCGCAAGTACCCACGTAAAG ACGTTAGGTCAAGGTGTAGCCCATGAGGTGGCAAGAAATGGGCTACATTTTCTACCCCAG AAAACTACGATAGCCCTTATGAAACTTAAGGGTCGAAGGTGGATTTAGCAGTAAACTAAG AGTAGAGTGCTTAGTTGAACAGGGCCCTGAAGCGCGTACACACCGCCCGTCACCCTCCTC AAGTATACTTCAAAGGACATTTAACTAAAACCCCTACGCATTTATATAGAGGAGACAAGT CGTAACCTCAAACTCCTGCCTTTGGTGATCCACCCGCCTTGGCCTACCTGCATAATGAAG AAGCACCCAACTTACACTTAGGAGATTTCAACTTAACTTGACCGCTCTGAGCTAAACCTA GCCCCAAACCCACTCCACCTTACTACCAGACAACCTTAGCCAAACCATTTACCCAAATAA AGTATAGGCGATAGAAATTGAAACCTGGCGCAATAGATATAGTACCGCAAGGGAAAGATG AAAAATTATAACCAAGCATAATATAGCAAGGACTAACCCCTATACCTTCTGCATAATGAA TTAACTAGAAATAACTTTGCAAGGAGAGCCAAAGCTAAGACCCCCGAAACCAGACGAGCT ACCTAAGAACAGCTAAAAGAGCACACCCGTCTATGTAGCAAAATAGTGGGAAGATTTATA GGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGGTTGTCCAAGATAGAATCTTAG TTCAACTTTAAATTTGCCCACAGAACCCTCTAAATCCCCTTGTAAATTTAACTGTTAGTC CAAAGAGGAACAGCTCTTTGGACACTAGGAAAAAACCTTGTAGAGAGAGTAAAAAATTTA ACACCCATAGTAGGCCTAAAAGCAGCCACCAATTAAGAAAGCGTTCAAGCTCAACACCCA CTACCTAAAAAATCCCAAACATATAACTGAACTCCTCACACCCAATTGGACCAATCTATC ACCCTATAGAAGAACTAATGTTAGTATAAGTAACATGAAAACATTCTCCTCCGCATAAGC CTGCGTCAGATTAAAACACTGAACTGACAATTAACAGCCCAATATCTACAATCAACCAAC AAGTCATTATTACCCTCACTGTCAACCCAACACAGGCATGCTCATAAGGAAAGGTTAAAA AAAGTAAAAGGAACTCGGCAAATCTTACCCCGCCTGTTTACCAAAAACATCACCTCTAGC ATCACCAGTATTAGAGGCACCGCCTGCCCAGTGACACATGTTTAACGGCCGCGGTACCCT Sequence differences occur because of Sequencing error 2. Genetic variation

4 Approximate matching T: GGAAAAAGAGGTAGCGGCGTTTAACAGTAG : GTAACGGCG Mismatch (Substitution)

5 Approximate matching T: GGAAAAAGAGGTAGC-GCGTTTAACAGTAG GTAGCGGCG : Insertion

6 Approximate matching T: GGAAAAAGAGGTAGCGGCGTTTAACAGTAG GT-GCGGCG : Deletion

7 Hamming distance For X & Y where X = Y, hamming distance = minimum # substitutions needed to turn one into the other X: Y: GAGGTAGCGGCGTT GGTAG GTT GTGGTAACGGGGTT GGTAA GTT Hamming distance = 3

8 Edit distance (AKA Levenshtein distance) For X & Y, edit distance = minimum # edits (substitutions, insertions, deletions) needed to turn one into the other X: Y: X: Y: TGGCCGCGCAAAAACAGC CAGC TGACCGCGCAAAACAGC CCGCGCAAAA-CAGC GCGTATGCGGCTAACGC TATGCGGCTA-ACGC GCTATGCGGCTATACGC GC- ACGC Edit distance = 2 Edit distance = 2

9 Approximate matching Like exact matching, but pattern may be within a certain distance (usually Hamming or edit) of T. Each such place is an approximate match. Allowing edits is more challenging than just allowing mismatches We ll return to edits

10 Approximate matching def naive(p, t): occurrences = [] for i in range(len(t) - len(p) + 1): # loop over alignments match = True for j in range(len(p)): # loop over characters if t[i+j]!= p[j]: # compare characters match = False break if match: occurrences.append(i) return occurrences # mismatch; reject alignment # all chars matched; record

11 Approximate matching def naive_approx_hamming(p, t, maxdistance): occurrences = [] for i in range(len(t) - len(p) + 1): # loop over alignments nmm = 0 for j in range(len(p)): # loop over characters if t[i+j]!= p[j]: # compare characters nmm += 1 # mismatch if nmm > maxdistance: break # exceeded max hamming dist if nmm <= maxdistance: occurrences.append(i) # approximate match return occurrences

12 Approximate matching Wanted: way to apply exact matching algorithms to approximate matching problems

13 Approximate matching u v If occurs in T with 1 edit, then u or v appears with no edits u v u v : T: CGC GCA CGA GCA : AATTGG T: AATTAG

14 Approximate matching... p 1 p 2 p 3 p 4 p k+1 If occurs in T with up to k edits

15 Approximate matching... p 1 p 2 p 3 p 4 p k+1 If occurs in T with up to k edits, then at least one of p 1, p 2,..., p k+1 must appear with 0 edits

16 Approximate matching XXX X 5 partitions 4 edits (X) XX X X X X X X

17 Approximate matching igeonhole principle: k+1 pigeons, k holes. At least one has >1 pigeon!

18 Approximate matching We have k pigeons, k+1 holes, at least one......is empty

19 igeonhole principle p 1 p 2 p 3 p 4 p 5 Exact matching T What algorithm can we use? Any exact matching algorithm If we have a k-mer index, we can use that Naive exact matching Boyer-Moore

20 igeonhole principle p 4 match T

21 igeonhole principle p 1 p 2 p 3 p 4 p 5 match verify verify verify verify T For Hamming distance, verification is essentially just the inner loop of naive_approx_hamming from before

22 igeonhole principle... p 1 p 2 p 3 p 4 p k+1 Advantages Reuse favorite exact matching algos; fast and easy Flexible; works for Hamming and edit distance* Disadvantages Large k yields small partitions matching many times by chance; lots of verification work k+1 exact matching problems, one per partition * we don t know how to do edit distance verification yet

23 Implementation of pigeonhole principle with Boyer-Moore as exact matching algorithm: Boyer-Moore, exact Boyer-Moore, 1 mismatch with pigeonhole Boyer-Moore, 2 mismatches with pigeonhole # character comparisons wall clock time # matches # character comparisons wall clock time # matches # character comparisons wall clock time # matches : tomorrow T: Shakespeare s complete works 786 K 1.91s M 7.73 s M s 382 : 50 nt string from Alu repeat* T: Human reference (hg19) chromosome M s M 209 s 1, M 328 s 2,798 * GCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGG

24 Generalizing pigeonhole, part 1 If occurs in T with up to k edits, then at least one of p 1, p 2,..., p k+1 must appear with 0 edits... p 1 p 2 p 3 p 4 p k+1 But doesn't have to be at least one of what would we have to change for at least two of? If occurs in T with up to k edits, then at least two of must appear with 0 edits

25 Generalizing pigeonhole, part 1 If occurs in T with up to k edits, then at least one of p 1, p 2,..., p k+1 must appear with 0 edits... p 1 p 2 p 3 p 4 p k+1 But doesn't have to be at least one of what would we have to change for at least two of? If occurs in T with up to k edits, then at least two of p 1, p 2,..., p k+2 must appear with 0 edits

26 Generalizing pigeonhole, part 2 If occurs in T with up to k edits, then at least one of p 1, p 2,..., p k+1 must appear with 0 edits... p 1 p 2 p 3 p 4 p k+1 Let p1, p2,..., pj be a partitioning of. If occurs with up to k edits, then at least one of p1, p2,..., pj must occur with floor(k / j) edits.???... p 1 p 2 p 3 p 4 p j

27 Generalizing pigeonhole, part 2 If occurs in T with up to k edits, then at least one of p 1, p 2,..., p k+1 must appear with 0 edits... p 1 p 2 p 3 p 4 p k+1 Let p1, p2,..., pj be a partitioning of. If occurs with up to k edits, then at least one of p1, p2,..., pj must occur with floor(k / j) edits.... p 1 p 2 p 3 p 4 p j

28 Generalizing pigeonhole, part 2 At least one of p1, p2,..., p5 occurs with... X X X X k = 4 edits 0 edits X X X X X k = 5 edits 1 edits X X X X X X X X X k = 9 edits 1 edits X X X X X X X X X X k = 10 edits 2 edits etc

29 Generalizing pigeonhole, part 2 Let j = k + 1 Specific General igeonhole principle If occurs in T with up to k edits, then at least one of p 1, p 2,..., p j must appear with floor(k / j) edits igeonhole principle with j = k + 1 If occurs in T with up to k edits, then at least one of p 1, p 2,..., p k+1 must appear with 0 edits Why? Smallest value s.t. floor(k / j) = 0 Why make floor(k / j) = 0? So we can use exact matching Why is smaller j good? Yields fewer, longer partitions Why are long partitions good? Makes exact-matching filter more specific, minimizing # candidates

30 A different principle We partitioned into non-overlapping substrings p1 p2 p3 p4... pk+1 Now consider overlapping substrings 3 6 n-l n-l 1 4 n-l-1

31 Approximate string matching: more principles q x x x x... n n - q + 1 of these Say substrings are length q. There are n - q + 1 such substrings. 1 edit to changes at most q substrings kq is worst case; could be < kq Minimum # of length-q substrings unedited after k edits? n - q kq q-gram lemma: if occurs in T with up to k edits, alignment must contain t exact matches of length q, where t n - q kq

32 Approximate string matching: more principles If occurs in T with up to k edits, alignment contains an exact match of length q, where q floor(n / (k + 1)) Obtained by solving for q: n - q kq 1 Exact matching filter: find matches of length floor(n / (k + 1)) between T and any substring of. Check vicinity for full match.

33 Approximate matching principles Non-overlapping substrings igeonhole principle q-gram lemma Overlapping substrings Specific General p1, p2,..., pj is a partitioning of. If occurs with k edits, at least one partition matches with floor(k / j) edits. p1, p2,..., pk+1 is a partitioning of. If occurrs in T with k edits, at least one partition matches exactly. If occurs with k edits, alignment contains t exact matches of length q, where t n - q kq igeonhole principle with j = k + 1 q-gram lemma with t = 1 If occurs with k edits, alignment contains an exact match of length q where q floor(n / (k + 1))....

34 Sensitivity Sensitivity = fraction of true approximate matches discovered by the algorithm Lossless algorithm finds all of them, lossy algorithm doesn t necessarily We ve seen lossless algorithms. Most everyday tools are lossy. Lossy algorithms are usually much speedier & still acceptably sensitive... Example lossy algorithm: pick q > floor(n / (k + 1))

Introduction to second-generation sequencing

Introduction to second-generation sequencing Introduction to second-generation sequencing CMSC858B Spring 2012 Many slides courtesy of Ben Langmead Second-Generation Sequencing 2 1 2 Human Epigenome Project ENCODE project Methylation 3 http://www.genome.gov/10005107

More information

Dynamic programming in less time and space

Dynamic programming in less time and space Dynamic programming in less time and space Ben Langmead Department of omputer Science You are free to use these slides. If you do, please sign the guestbook (www.langmead-lab.org/teaching-materials), or

More information

Hidden Markov Models

Hidden Markov Models s Ben Langmead Department of Computer Science Please sign guestbook (www.langmead-lab.org/teaching-materials) to tell me briefly how you are using the slides. For original Keynote files, email me (ben.langmead@gmail.com).

More information

Optimal spaced seeds for faster approximate string matching

Optimal spaced seeds for faster approximate string matching Optimal spaced seeds for faster approximate string matching Martin Farach-Colton Gad M. Landau S. Cenk Sahinalp Dekel Tsur Abstract Filtering is a standard technique for fast approximate string matching

More information

Searching Sear ( Sub- (Sub )Strings Ulf Leser

Searching Sear ( Sub- (Sub )Strings Ulf Leser Searching (Sub-)Strings Ulf Leser This Lecture Exact substring search Naïve Boyer-Moore Searching with profiles Sequence profiles Ungapped approximate search Statistical evaluation of search results Ulf

More information

Optimal spaced seeds for faster approximate string matching

Optimal spaced seeds for faster approximate string matching Optimal spaced seeds for faster approximate string matching Martin Farach-Colton Gad M. Landau S. Cenk Sahinalp Dekel Tsur Abstract Filtering is a standard technique for fast approximate string matching

More information

FORMAL LANGUAGES, AUTOMATA AND COMPUTATION

FORMAL LANGUAGES, AUTOMATA AND COMPUTATION FORMAL LANGUAGES, AUTOMATA AND COMPUTATION IDENTIFYING NONREGULAR LANGUAGES PUMPING LEMMA Carnegie Mellon University in Qatar (CARNEGIE MELLON UNIVERSITY IN QATAR) SLIDES FOR 15-453 LECTURE 5 SPRING 2011

More information

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching 1

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching 1 Pattern Matching a b a c a a b 1 4 3 2 Pattern Matching 1 Outline and Reading Strings ( 9.1.1) Pattern matching algorithms Brute-force algorithm ( 9.1.2) Boyer-Moore algorithm ( 9.1.3) Knuth-Morris-Pratt

More information

Implementing Approximate Regularities

Implementing Approximate Regularities Implementing Approximate Regularities Manolis Christodoulakis Costas S. Iliopoulos Department of Computer Science King s College London Kunsoo Park School of Computer Science and Engineering, Seoul National

More information

CSE 549 Lecture 3: Sequence Similarity & Alignment. slides (w/*) courtesy of Carl Kingsford

CSE 549 Lecture 3: Sequence Similarity & Alignment. slides (w/*) courtesy of Carl Kingsford CSE 549 Lecture 3: Sequence Similarity & Alignment slides (w/*) courtesy of Carl Kingsford Relatedness of Biological Sequence https://en.wikipedia.org/wiki/phylogenetic_tree Relatedness of Biological Sequence

More information

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching Goodrich, Tamassia

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching Goodrich, Tamassia Pattern Matching a b a c a a b 1 4 3 2 Pattern Matching 1 Brute-Force Pattern Matching ( 11.2.1) The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift

More information

Proving languages to be nonregular

Proving languages to be nonregular Proving languages to be nonregular We already know that there exist languages A Σ that are nonregular, for any choice of an alphabet Σ. This is because there are uncountably many languages in total and

More information

Mechanized Operational Semantics

Mechanized Operational Semantics Mechanized Operational Semantics J Strother Moore Department of Computer Sciences University of Texas at Austin Marktoberdorf Summer School 2008 (Lecture 5: Boyer-Moore Fast String Searching) 1 The Problem

More information

6.6 Sequence Alignment

6.6 Sequence Alignment 6.6 Sequence Alignment String Similarity How similar are two strings? ocurrance o c u r r a n c e - occurrence First model the problem Q. How can we measure the distance? o c c u r r e n c e 6 mismatches,

More information

Lecture 5: The Shift-And Method

Lecture 5: The Shift-And Method Biosequence Algorithms, Spring 2005 Lecture 5: The Shift-And Method Pekka Kilpeläinen University of Kuopio Department of Computer Science BSA Lecture 5: Shift-And p.1/19 Seminumerical String Matching Most

More information

Sequence comparison by compression

Sequence comparison by compression Sequence comparison by compression Motivation similarity as a marker for homology. And homology is used to infer function. Sometimes, we are only interested in a numerical distance between two sequences.

More information

Chapter 1. Pigeonhole Principle

Chapter 1. Pigeonhole Principle Chapter 1. Pigeonhole Principle Prof. Tesler Math 184A Fall 2017 Prof. Tesler Ch. 1. Pigeonhole Principle Math 184A / Fall 2017 1 / 16 Pigeonhole principle https://commons.wikimedia.org/wiki/file:toomanypigeons.jpg

More information

Non-regular languages

Non-regular languages Non-regular languages (Pumping Lemma) Fall 2017 Costas Busch - RPI 1 Non-regular languages { a n n : n 0} { vv R : v{ a, }*} Regular languages a* * c a c( a )* etc... Fall 2017 Costas Busch - RPI 2 How

More information

Algorithms and Data S tructures Structures Complexity Complexit of Algorithms Ulf Leser

Algorithms and Data S tructures Structures Complexity Complexit of Algorithms Ulf Leser Algorithms and Data Structures Complexity of Algorithms Ulf Leser Content of this Lecture Efficiency of Algorithms Machine Model Complexity Examples Multiplication of two binary numbers (unit cost?) Exact

More information

Foundations of Natural Language Processing Lecture 6 Spelling correction, edit distance, and EM

Foundations of Natural Language Processing Lecture 6 Spelling correction, edit distance, and EM Foundations of Natural Language Processing Lecture 6 Spelling correction, edit distance, and EM Alex Lascarides (Slides from Alex Lascarides and Sharon Goldwater) 2 February 2019 Alex Lascarides FNLP Lecture

More information

Samson Zhou. Pattern Matching over Noisy Data Streams

Samson Zhou. Pattern Matching over Noisy Data Streams Samson Zhou Pattern Matching over Noisy Data Streams Finding Structure in Data Pattern Matching Finding all instances of a pattern within a string ABCD ABCAABCDAACAABCDBCABCDADDDEAEABCDA Knuth-Morris-Pratt

More information

Non-context-Free Languages. CS215, Lecture 5 c

Non-context-Free Languages. CS215, Lecture 5 c Non-context-Free Languages CS215, Lecture 5 c 2007 1 The Pumping Lemma Theorem. (Pumping Lemma) Let be context-free. There exists a positive integer divided into five pieces, Proof for for each, and..

More information

Module 9: Tries and String Matching

Module 9: Tries and String Matching Module 9: Tries and String Matching CS 240 - Data Structures and Data Management Sajed Haque Veronika Irvine Taylor Smith Based on lecture notes by many previous cs240 instructors David R. Cheriton School

More information

Multiple Pattern Matching

Multiple Pattern Matching Multiple Pattern Matching Stephen Fulwider and Amar Mukherjee College of Engineering and Computer Science University of Central Florida Orlando, FL USA Email: {stephen,amar}@cs.ucf.edu Abstract In this

More information

String Matching II. Algorithm : Design & Analysis [19]

String Matching II. Algorithm : Design & Analysis [19] String Matching II Algorithm : Design & Analysis [19] In the last class Simple String Matching KMP Flowchart Construction Jump at Fail KMP Scan String Matching II Boyer-Moore s heuristics Skipping unnecessary

More information

Algorithm Theory. 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore. Christian Schindelhauer

Algorithm Theory. 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore. Christian Schindelhauer Algorithm Theory 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore Institut für Informatik Wintersemester 2007/08 Text Search Scenarios Static texts Literature databases Library systems Gene databases

More information

CSE 105 Theory of Computation Professor Jeanne Ferrante

CSE 105 Theory of Computation   Professor Jeanne Ferrante CSE 105 Theory of Computation http://www.jflap.org/jflaptmp/ Professor Jeanne Ferrante 1 Announcement & Reminders HW 2: Grades not available yet, will be soon Solutions posted, read even if you did well

More information

Analysis of Algorithms Prof. Karen Daniels

Analysis of Algorithms Prof. Karen Daniels UMass Lowell Computer Science 91.503 Analysis of Algorithms Prof. Karen Daniels Spring, 2012 Tuesday, 4/24/2012 String Matching Algorithms Chapter 32* * Pseudocode uses 2 nd edition conventions 1 Chapter

More information

Overview. Knuth-Morris-Pratt & Boyer-Moore Algorithms. Notation Review (2) Notation Review (1) The Kunth-Morris-Pratt (KMP) Algorithm

Overview. Knuth-Morris-Pratt & Boyer-Moore Algorithms. Notation Review (2) Notation Review (1) The Kunth-Morris-Pratt (KMP) Algorithm Knuth-Morris-Pratt & s by Robert C. St.Pierre Overview Notation review Knuth-Morris-Pratt algorithm Discussion of the Algorithm Example Boyer-Moore algorithm Discussion of the Algorithm Example Applications

More information

Fast String Searching. J Strother Moore Department of Computer Sciences University of Texas at Austin

Fast String Searching. J Strother Moore Department of Computer Sciences University of Texas at Austin Fast String Searching J Strother Moore Department of Computer Sciences University of Texas at Austin 1 The Problem One of the classic problems in computing is string searching: find the first occurrence

More information

Bio nformatics. Lecture 16. Saad Mneimneh

Bio nformatics. Lecture 16. Saad Mneimneh Bio nformatics Lecture 16 DNA sequencing To sequence a DNA is to obtain the string of bases that it contains. It is impossible to sequence the whole DNA molecule directly. We may however obtain a piece

More information

String Matching. Jayadev Misra The University of Texas at Austin December 5, 2003

String Matching. Jayadev Misra The University of Texas at Austin December 5, 2003 String Matching Jayadev Misra The University of Texas at Austin December 5, 2003 Contents 1 Introduction 1 2 Rabin-Karp Algorithm 3 3 Knuth-Morris-Pratt Algorithm 5 3.1 Informal Description.........................

More information

State Complexity of Neighbourhoods and Approximate Pattern Matching

State Complexity of Neighbourhoods and Approximate Pattern Matching State Complexity of Neighbourhoods and Approximate Pattern Matching Timothy Ng, David Rappaport, and Kai Salomaa School of Computing, Queen s University, Kingston, Ontario K7L 3N6, Canada {ng, daver, ksalomaa}@cs.queensu.ca

More information

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007.

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007. Proofs, Strings, and Finite Automata CS154 Chris Pollett Feb 5, 2007. Outline Proofs and Proof Strategies Strings Finding proofs Example: For every graph G, the sum of the degrees of all the nodes in G

More information

CSC 421: Algorithm Design & Analysis. Spring 2018

CSC 421: Algorithm Design & Analysis. Spring 2018 CSC 421: Algorithm Design & Analysis Spring 2018 Complexity & Computability complexity theory tractability, decidability P vs. NP, Turing machines NP-complete, reductions approximation algorithms, genetic

More information

String Search. 6th September 2018

String Search. 6th September 2018 String Search 6th September 2018 Search for a given (short) string in a long string Search problems have become more important lately The amount of stored digital information grows steadily (rapidly?)

More information

Knuth-Morris-Pratt Algorithm

Knuth-Morris-Pratt Algorithm Knuth-Morris-Pratt Algorithm Jayadev Misra June 5, 2017 The Knuth-Morris-Pratt string matching algorithm (KMP) locates all occurrences of a pattern string in a text string in linear time (in the combined

More information

INF 4130 / /8-2017

INF 4130 / /8-2017 INF 4130 / 9135 28/8-2017 Algorithms, efficiency, and complexity Problem classes Problems can be divided into sets (classes). Problem classes are defined by the type of algorithm that can (or cannot) solve

More information

String Matching. Thanks to Piotr Indyk. String Matching. Simple Algorithm. for s 0 to n-m. Match 0. for j 1 to m if T[s+j] P[j] then

String Matching. Thanks to Piotr Indyk. String Matching. Simple Algorithm. for s 0 to n-m. Match 0. for j 1 to m if T[s+j] P[j] then String Matching Thanks to Piotr Indyk String Matching Input: Two strings T[1 n] and P[1 m], containing symbols from alphabet Σ Goal: find all shifts 0 s n-m such that T[s+1 s+m]=p Example: Σ={,a,b,,z}

More information

Outline. Approximation: Theory and Algorithms. Motivation. Outline. The String Edit Distance. Nikolaus Augsten. Unit 2 March 6, 2009

Outline. Approximation: Theory and Algorithms. Motivation. Outline. The String Edit Distance. Nikolaus Augsten. Unit 2 March 6, 2009 Outline Approximation: Theory and Algorithms The Nikolaus Augsten Free University of Bozen-Bolzano Faculty of Computer Science DIS Unit 2 March 6, 2009 1 Nikolaus Augsten (DIS) Approximation: Theory and

More information

The Pumping Lemma for Context Free Grammars

The Pumping Lemma for Context Free Grammars The Pumping Lemma for Context Free Grammars Chomsky Normal Form Chomsky Normal Form (CNF) is a simple and useful form of a CFG Every rule of a CNF grammar is in the form A BC A a Where a is any terminal

More information

Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts

Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts Domenico Cantone Simone Faro Emanuele Giaquinta Department of Mathematics and Computer Science, University of Catania, Italy 1 /

More information

CSE 202 Dynamic Programming II

CSE 202 Dynamic Programming II CSE 202 Dynamic Programming II Chapter 6 Dynamic Programming Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved. 1 Algorithmic Paradigms Greed. Build up a solution incrementally,

More information

Before we show how languages can be proven not regular, first, how would we show a language is regular?

Before we show how languages can be proven not regular, first, how would we show a language is regular? CS35 Proving Languages not to be Regular Before we show how languages can be proven not regular, first, how would we show a language is regular? Although regular languages and automata are quite powerful

More information

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler STATC141 Spring 2005 The materials are from Pairise Sequence Alignment by Robert Giegerich and David Wheeler Lecture 6, 02/08/05 The analysis of multiple DNA or protein sequences (I) Sequence similarity

More information

CISC4090: Theory of Computation

CISC4090: Theory of Computation CISC4090: Theory of Computation Chapter 2 Context-Free Languages Courtesy of Prof. Arthur G. Werschulz Fordham University Department of Computer and Information Sciences Spring, 2014 Overview In Chapter

More information

Automata & languages. A primer on the Theory of Computation. Laurent Vanbever. ETH Zürich (D-ITET) October,

Automata & languages. A primer on the Theory of Computation. Laurent Vanbever.   ETH Zürich (D-ITET) October, Automata & languages A primer on the Theory of Computation Laurent Vanbever www.vanbever.eu ETH Zürich (D-ITET) October, 5 2017 Part 3 out of 5 Last week, we learned about closure and equivalence of regular

More information

Part 3 out of 5. Automata & languages. A primer on the Theory of Computation. Last week, we learned about closure and equivalence of regular languages

Part 3 out of 5. Automata & languages. A primer on the Theory of Computation. Last week, we learned about closure and equivalence of regular languages Automata & languages A primer on the Theory of Computation Laurent Vanbever www.vanbever.eu Part 3 out of 5 ETH Zürich (D-ITET) October, 5 2017 Last week, we learned about closure and equivalence of regular

More information

A Multiple Sliding Windows Approach to Speed Up String Matching Algorithms

A Multiple Sliding Windows Approach to Speed Up String Matching Algorithms A Multiple Sliding Windows Approach to Speed Up String Matching Algorithms Simone Faro Thierry Lecroq University of Catania, Italy University of Rouen, LITIS EA 4108, France Symposium on Eperimental Algorithms

More information

Approximation: Theory and Algorithms

Approximation: Theory and Algorithms Approximation: Theory and Algorithms The String Edit Distance Nikolaus Augsten Free University of Bozen-Bolzano Faculty of Computer Science DIS Unit 2 March 6, 2009 Nikolaus Augsten (DIS) Approximation:

More information

Outline DP paradigm Discrete optimisation Viterbi algorithm DP: 0 1 Knapsack. Dynamic Programming. Georgy Gimel farb

Outline DP paradigm Discrete optimisation Viterbi algorithm DP: 0 1 Knapsack. Dynamic Programming. Georgy Gimel farb Outline DP paradigm Discrete optimisation Viterbi algorithm DP: Knapsack Dynamic Programming Georgy Gimel farb (with basic contributions by Michael J. Dinneen) COMPSCI 69 Computational Science / Outline

More information

Lecture 4: More on Regexps, Non-Regular Languages

Lecture 4: More on Regexps, Non-Regular Languages 6.045 Lecture 4: More on Regexps, Non-Regular Languages 6.045 Announcements: - Pset 1 is on piazza (as of last night) - If you don t have piazza access but are registered for 6.045, send email to TAs with

More information

Text Analytics. Searching Terms. Ulf Leser

Text Analytics. Searching Terms. Ulf Leser Text Analytics Searching Terms Ulf Leser A Probabilistic Interpretation of Relevance We want to compute the probability that a doc d is relevant to query q The probabilistic model determines this probability

More information

15 Text search. P.D. Dr. Alexander Souza. Winter term 11/12

15 Text search. P.D. Dr. Alexander Souza. Winter term 11/12 Algorithms Theory 15 Text search P.D. Dr. Alexander Souza Text search Various scenarios: Dynamic texts Text editors Symbol manipulators Static texts Literature databases Library systems Gene databases

More information

Online Computation of Abelian Runs

Online Computation of Abelian Runs Online Computation of Abelian Runs Gabriele Fici 1, Thierry Lecroq 2, Arnaud Lefebvre 2, and Élise Prieur-Gaston2 1 Dipartimento di Matematica e Informatica, Università di Palermo, Italy Gabriele.Fici@unipa.it

More information

An Adaptive Finite-State Automata Application to the problem of Reducing the Number of States in Approximate String Matching

An Adaptive Finite-State Automata Application to the problem of Reducing the Number of States in Approximate String Matching An Adaptive Finite-State Automata Application to the problem of Reducing the Number of States in Approximate String Matching Ricardo Luis de Azevedo da Rocha 1, João José Neto 1 1 Laboratório de Linguagens

More information

Efficient Sequential Algorithms, Comp309

Efficient Sequential Algorithms, Comp309 Efficient Sequential Algorithms, Comp309 University of Liverpool 2010 2011 Module Organiser, Igor Potapov Part 2: Pattern Matching References: T. H. Cormen, C. E. Leiserson, R. L. Rivest Introduction to

More information

PS2 - Comments. University of Virginia - cs3102: Theory of Computation Spring 2010

PS2 - Comments. University of Virginia - cs3102: Theory of Computation Spring 2010 University of Virginia - cs3102: Theory of Computation Spring 2010 PS2 - Comments Average: 77.4 (full credit for each question is 100 points) Distribution (of 54 submissions): 90, 12; 80 89, 11; 70-79,

More information

More on Regular Languages and Non-Regular Languages

More on Regular Languages and Non-Regular Languages More on Regular Languages and Non-Regular Languages CSCE A35 Decision Properties of Regular Languages Given a (representation, e.g., RE, FA, of a) regular language L, what can we tell about L? Since there

More information

SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS. Prokaryotes and Eukaryotes. DNA and RNA

SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS. Prokaryotes and Eukaryotes. DNA and RNA SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS 1 Prokaryotes and Eukaryotes 2 DNA and RNA 3 4 Double helix structure Codons Codons are triplets of bases from the RNA sequence. Each triplet defines an amino-acid.

More information

THE KNUTH RELATIONS HUAN VO

THE KNUTH RELATIONS HUAN VO THE KNUTH RELATIONS HUAN VO Abstract We define three equivalence relations on the set of words using different means It turns out that they are the same relation Motivation Recall the algorithm of row

More information

Languages, regular languages, finite automata

Languages, regular languages, finite automata Notes on Computer Theory Last updated: January, 2018 Languages, regular languages, finite automata Content largely taken from Richards [1] and Sipser [2] 1 Languages An alphabet is a finite set of characters,

More information

INF 4130 / /8-2014

INF 4130 / /8-2014 INF 4130 / 9135 26/8-2014 Mandatory assignments («Oblig-1», «-2», and «-3»): All three must be approved Deadlines around: 25. sept, 25. oct, and 15. nov Other courses on similar themes: INF-MAT 3370 INF-MAT

More information

Theory of LSH. Distance Measures LS Families of Hash Functions S-Curves

Theory of LSH. Distance Measures LS Families of Hash Functions S-Curves Theory of LSH Distance Measures LS Families of Hash Functions S-Curves 1 Distance Measures Generalized LSH is based on some kind of distance between points. Similar points are close. Two major classes

More information

Sequence Comparison. mouse human

Sequence Comparison. mouse human Sequence Comparison Sequence Comparison mouse human Why Compare Sequences? The first fact of biological sequence analysis In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity

More information

Finding similar items

Finding similar items Finding similar items CSE 344, section 10 June 2, 2011 In this section, we ll go through some examples of finding similar item sets. We ll directly compare all pairs of sets being considered using the

More information

CPSC 340: Machine Learning and Data Mining. Regularization Fall 2017

CPSC 340: Machine Learning and Data Mining. Regularization Fall 2017 CPSC 340: Machine Learning and Data Mining Regularization Fall 2017 Assignment 2 Admin 2 late days to hand in tonight, answers posted tomorrow morning. Extra office hours Thursday at 4pm (ICICS 246). Midterm

More information

Objec&ves. Review. Dynamic Programming. What is the knapsack problem? What is our solu&on? Ø Review Knapsack Ø Sequence Alignment 3/28/18

Objec&ves. Review. Dynamic Programming. What is the knapsack problem? What is our solu&on? Ø Review Knapsack Ø Sequence Alignment 3/28/18 /8/8 Objec&ves Dynamic Programming Ø Review Knapsack Ø Sequence Alignment Mar 8, 8 CSCI - Sprenkle Review What is the knapsack problem? What is our solu&on? Mar 8, 8 CSCI - Sprenkle /8/8 Dynamic Programming:

More information

N/4 + N/2 + N = 2N 2.

N/4 + N/2 + N = 2N 2. CS61B Summer 2006 Instructor: Erin Korber Lecture 24, 7 Aug. 1 Amortized Analysis For some of the data structures we ve discussed (namely hash tables and splay trees), it was claimed that the average time

More information

Lecture 23: Rice Theorem and Turing machine behavior properties 21 April 2009

Lecture 23: Rice Theorem and Turing machine behavior properties 21 April 2009 CS 373: Theory of Computation Sariel Har-Peled and Madhusudan Parthasarathy Lecture 23: Rice Theorem and Turing machine behavior properties 21 April 2009 This lecture covers Rice s theorem, as well as

More information

Trace Reconstruction Revisited

Trace Reconstruction Revisited Trace Reconstruction Revisited Andrew McGregor 1, Eric Price 2, Sofya Vorotnikova 1 1 University of Massachusetts Amherst 2 IBM Almaden Research Center Problem Description Take original string x of length

More information

Compressing Kinetic Data From Sensor Networks. Sorelle A. Friedler (Swat 04) Joint work with David Mount University of Maryland, College Park

Compressing Kinetic Data From Sensor Networks. Sorelle A. Friedler (Swat 04) Joint work with David Mount University of Maryland, College Park Compressing Kinetic Data From Sensor Networks Sorelle A. Friedler (Swat 04) Joint work with David Mount University of Maryland, College Park Motivation Motivation Computer Science Graphics: Image and video

More information

Similarity Search. Stony Brook University CSE545, Fall 2016

Similarity Search. Stony Brook University CSE545, Fall 2016 Similarity Search Stony Brook University CSE545, Fall 20 Finding Similar Items Applications Document Similarity: Mirrored web-pages Plagiarism; Similar News Recommendations: Online purchases Movie ratings

More information

Lecture 2: Pairwise Alignment. CG Ron Shamir

Lecture 2: Pairwise Alignment. CG Ron Shamir Lecture 2: Pairwise Alignment 1 Main source 2 Why compare sequences? Human hexosaminidase A vs Mouse hexosaminidase A 3 www.mathworks.com/.../jan04/bio_genome.html Sequence Alignment עימוד רצפים The problem:

More information

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 Today we ll talk about a topic that is both very old (as far as computer science

More information

(In)approximability Results for Pattern Matching Problems

(In)approximability Results for Pattern Matching Problems (In)approximability Results for Pattern Matching Problems Raphaël Clifford and Alexandru Popa Department of Computer Science University of Bristol Merchant Venturer s Building Woodland Road, Bristol, BS8

More information

Theoretical Computer Science. Efficient string-matching allowing for non-overlapping inversions

Theoretical Computer Science. Efficient string-matching allowing for non-overlapping inversions Theoretical Computer Science 483 (2013) 85 95 Contents lists available at SciVerse ScienceDirect Theoretical Computer Science journal homepage: www.elsevier.com/locate/tcs Efficient string-matching allowing

More information

1. Determine (with proof) the number of ordered triples (A 1, A 2, A 3 ) of sets which satisfy

1. Determine (with proof) the number of ordered triples (A 1, A 2, A 3 ) of sets which satisfy UT Putnam Prep Problems, Oct 19 2016 I was very pleased that, between the whole gang of you, you solved almost every problem this week! Let me add a few comments here. 1. Determine (with proof) the number

More information

P is the class of problems for which there are algorithms that solve the problem in time O(n k ) for some constant k.

P is the class of problems for which there are algorithms that solve the problem in time O(n k ) for some constant k. Complexity Theory Problems are divided into complexity classes. Informally: So far in this course, almost all algorithms had polynomial running time, i.e., on inputs of size n, worst-case running time

More information

Analysis and Design of Algorithms Dynamic Programming

Analysis and Design of Algorithms Dynamic Programming Analysis and Design of Algorithms Dynamic Programming Lecture Notes by Dr. Wang, Rui Fall 2008 Department of Computer Science Ocean University of China November 6, 2009 Introduction 2 Introduction..................................................................

More information

Automata Theory. Lecture on Discussion Course of CS120. Runzhe SJTU ACM CLASS

Automata Theory. Lecture on Discussion Course of CS120. Runzhe SJTU ACM CLASS Automata Theory Lecture on Discussion Course of CS2 This Lecture is about Mathematical Models of Computation. Why Should I Care? - Ways of thinking. - Theory can drive practice. - Don t be an Instrumentalist.

More information

Expressions that always have the same value. The Identity Property of Addition states that For any value a; a + 0 = a so = 3

Expressions that always have the same value. The Identity Property of Addition states that For any value a; a + 0 = a so = 3 Name Key Words/Topic 2.1 Identity and Zero Properties Topic 2 Guided Notes Equivalent Expressions Identity Property of Addition Identity Property of Multiplication Zero Property of Multiplication The sum

More information

Average Complexity of Exact and Approximate Multiple String Matching

Average Complexity of Exact and Approximate Multiple String Matching Average Complexity of Exact and Approximate Multiple String Matching Gonzalo Navarro Department of Computer Science University of Chile gnavarro@dcc.uchile.cl Kimmo Fredriksson Department of Computer Science

More information

Compilers. Lexical analysis. Yannis Smaragdakis, U. Athens (original slides by Sam

Compilers. Lexical analysis. Yannis Smaragdakis, U. Athens (original slides by Sam Compilers Lecture 3 Lexical analysis Yannis Smaragdakis, U. Athens (original slides by Sam Guyer@Tufts) Big picture Source code Front End IR Back End Machine code Errors Front end responsibilities Check

More information

Finding Consensus Strings With Small Length Difference Between Input and Solution Strings

Finding Consensus Strings With Small Length Difference Between Input and Solution Strings Finding Consensus Strings With Small Length Difference Between Input and Solution Strings Markus L. Schmid Trier University, Fachbereich IV Abteilung Informatikwissenschaften, D-54286 Trier, Germany, MSchmid@uni-trier.de

More information

Efficient Parallel Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints

Efficient Parallel Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints Efficient Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints Yu Jiang,, Jiannan Wang, Guoliang Li, and Jianhua Feng Tsinghua University Similarity Search&Join Competition

More information

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

More information

UNIT-III REGULAR LANGUAGES

UNIT-III REGULAR LANGUAGES Syllabus R9 Regulation REGULAR EXPRESSIONS UNIT-III REGULAR LANGUAGES Regular expressions are useful for representing certain sets of strings in an algebraic fashion. In arithmetic we can use the operations

More information

Exact vs approximate search

Exact vs approximate search Text Algorithms (4AP) Lecture: Approximate matching Jaak Vilo 2008 fall Jaak Vilo MTAT.03.190 Text Algorithms 1 Exact vs approximate search In exact search we searched for a string or set of strings in

More information

1 Randomized reduction: a first start

1 Randomized reduction: a first start 6.S078 Fine-Grained Algorithms and Complexity MIT Lectures 3 and 4 February 14 and 21, 2018 Today: In the previous two lectures, we mentioned ETH and SETH, about the time complexity of SAT. Today we want

More information

Sequence analysis and Genomics

Sequence analysis and Genomics Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute

More information

Computational Biology

Computational Biology Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,

More information

Lecture 5. 1 Review (Pairwise Independence and Derandomization)

Lecture 5. 1 Review (Pairwise Independence and Derandomization) 6.842 Randomness and Computation September 20, 2017 Lecture 5 Lecturer: Ronitt Rubinfeld Scribe: Tom Kolokotrones 1 Review (Pairwise Independence and Derandomization) As we discussed last time, we can

More information

Faster Algorithms for String Matching with k Mismatches

Faster Algorithms for String Matching with k Mismatches 794 Faster Algorithms for String Matching with k Mismatches Amihood Amir *t Bar-Ilan University and Georgia Tech Moshe Lewenstein* * Bar-Ilan University Ely Porat* Bar Ilan University and Weizmann Institute

More information

Nondeterministic finite automata

Nondeterministic finite automata Lecture 3 Nondeterministic finite automata This lecture is focused on the nondeterministic finite automata (NFA) model and its relationship to the DFA model. Nondeterminism is an important concept in the

More information

Lecture 15. Saad Mneimneh

Lecture 15. Saad Mneimneh Computat onal Biology Lecture 15 DNA sequencing Shortest common superstring SCS An elegant theoretical abstraction, but fundamentally flawed R. Karp Given a set of fragments F, Find the shortest string

More information

DNA sequencing. Bad example (repeats) Lecture 15. Shortest common superstring SCS. Given a set of fragments F,

DNA sequencing. Bad example (repeats) Lecture 15. Shortest common superstring SCS. Given a set of fragments F, Computat onal Biology Lecture 15 DNA sequencing Shortest common superstring SCS An elegant theoretical abstraction, but fundamentally flawed R. Karp Given a set of fragments F, Find the shortest string

More information

Graduate Algorithms CS F-20 String Matching

Graduate Algorithms CS F-20 String Matching Graduate Algorithms CS673-2016F-20 String Matching David Galles Department of Computer Science University of San Francisco 20-0: String Matching Given a source text, and a string to match, where does the

More information

Algorithm Design and Analysis

Algorithm Design and Analysis Algorithm Design and Analysis LECTURE 8 Greedy Algorithms V Huffman Codes Adam Smith Review Questions Let G be a connected undirected graph with distinct edge weights. Answer true or false: Let e be the

More information

Solutions to In Class Problems Week 4, Mon.

Solutions to In Class Problems Week 4, Mon. Massachusetts Institute of Technology 6.042J/18.062J, Fall 05: Mathematics for Computer Science September 26 Prof. Albert R. Meyer and Prof. Ronitt Rubinfeld revised September 26, 2005, 1050 minutes Solutions

More information