Approximate Matching
|
|
- Brendan Chapman
- 5 years ago
- Views:
Transcription
1 Approximate Matching Ben Langmead Department of Computer Science lease sign guestbook ( to tell me briefly how you are using the slides. For original Keynote files, me
2 Approximate matching Strings Online algorithms Naive exact matching Offline algorithms Index-assisted Boyer-Moore k-mer index We have focused on exact matching in reality, we have to deal with differences
3 Read CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC Reference GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTT CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTC GCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT ACAGGCGAACATACTTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATA ACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATCATAACAAAAAATTTCCACCA AACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCAAACCCCAAAA ACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCAC TTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAAT CTCATCAATACAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATA CCCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAA GCAATACACTGACCCGCTCAAACTCCTGGATTTTGGATCCACCCAGCGCCTTGGCCTAAA CTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAAGCATCCCCGTTCCAGTGAGT TCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGCAATGCAGCTC AAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAATAA ACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCGTGCCAGCCACCGC GGTCACACGATTAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCC TCCCCAATAAAGCTAAAACTCACCTGAGTTGTAAAAAACTCCAGTTGACACAAAATAGAC TACGAAAGTGGCTTTAACATATCTGAACACACAATAGCTAAGACCCAAACTGGGATTAGA TACCCCACTATGCTTAGCCCTAAACCTCAACAGTTAAATCAACAAAACTGCTCGCCAGAA CACTACGAGCCACAGCTTAAAACTCAAAGGACCTGGCGGTGCTTCATATCCCTCTAGAGG AGCCTGTTCTGTAATCGATAAACCCCGATCAACCTCACCACCTCTTGCTCAGCCTATATA CCGCCATCTTCAGCAAACCCTGATGAAGGCTACAAAGTAAGCGCAAGTACCCACGTAAAG ACGTTAGGTCAAGGTGTAGCCCATGAGGTGGCAAGAAATGGGCTACATTTTCTACCCCAG AAAACTACGATAGCCCTTATGAAACTTAAGGGTCGAAGGTGGATTTAGCAGTAAACTAAG AGTAGAGTGCTTAGTTGAACAGGGCCCTGAAGCGCGTACACACCGCCCGTCACCCTCCTC AAGTATACTTCAAAGGACATTTAACTAAAACCCCTACGCATTTATATAGAGGAGACAAGT CGTAACCTCAAACTCCTGCCTTTGGTGATCCACCCGCCTTGGCCTACCTGCATAATGAAG AAGCACCCAACTTACACTTAGGAGATTTCAACTTAACTTGACCGCTCTGAGCTAAACCTA GCCCCAAACCCACTCCACCTTACTACCAGACAACCTTAGCCAAACCATTTACCCAAATAA AGTATAGGCGATAGAAATTGAAACCTGGCGCAATAGATATAGTACCGCAAGGGAAAGATG AAAAATTATAACCAAGCATAATATAGCAAGGACTAACCCCTATACCTTCTGCATAATGAA TTAACTAGAAATAACTTTGCAAGGAGAGCCAAAGCTAAGACCCCCGAAACCAGACGAGCT ACCTAAGAACAGCTAAAAGAGCACACCCGTCTATGTAGCAAAATAGTGGGAAGATTTATA GGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGGTTGTCCAAGATAGAATCTTAG TTCAACTTTAAATTTGCCCACAGAACCCTCTAAATCCCCTTGTAAATTTAACTGTTAGTC CAAAGAGGAACAGCTCTTTGGACACTAGGAAAAAACCTTGTAGAGAGAGTAAAAAATTTA ACACCCATAGTAGGCCTAAAAGCAGCCACCAATTAAGAAAGCGTTCAAGCTCAACACCCA CTACCTAAAAAATCCCAAACATATAACTGAACTCCTCACACCCAATTGGACCAATCTATC ACCCTATAGAAGAACTAATGTTAGTATAAGTAACATGAAAACATTCTCCTCCGCATAAGC CTGCGTCAGATTAAAACACTGAACTGACAATTAACAGCCCAATATCTACAATCAACCAAC AAGTCATTATTACCCTCACTGTCAACCCAACACAGGCATGCTCATAAGGAAAGGTTAAAA AAAGTAAAAGGAACTCGGCAAATCTTACCCCGCCTGTTTACCAAAAACATCACCTCTAGC ATCACCAGTATTAGAGGCACCGCCTGCCCAGTGACACATGTTTAACGGCCGCGGTACCCT Sequence differences occur because of Sequencing error 2. Genetic variation
4 Approximate matching T: GGAAAAAGAGGTAGCGGCGTTTAACAGTAG : GTAACGGCG Mismatch (Substitution)
5 Approximate matching T: GGAAAAAGAGGTAGC-GCGTTTAACAGTAG GTAGCGGCG : Insertion
6 Approximate matching T: GGAAAAAGAGGTAGCGGCGTTTAACAGTAG GT-GCGGCG : Deletion
7 Hamming distance For X & Y where X = Y, hamming distance = minimum # substitutions needed to turn one into the other X: Y: GAGGTAGCGGCGTT GGTAG GTT GTGGTAACGGGGTT GGTAA GTT Hamming distance = 3
8 Edit distance (AKA Levenshtein distance) For X & Y, edit distance = minimum # edits (substitutions, insertions, deletions) needed to turn one into the other X: Y: X: Y: TGGCCGCGCAAAAACAGC CAGC TGACCGCGCAAAACAGC CCGCGCAAAA-CAGC GCGTATGCGGCTAACGC TATGCGGCTA-ACGC GCTATGCGGCTATACGC GC- ACGC Edit distance = 2 Edit distance = 2
9 Approximate matching Like exact matching, but pattern may be within a certain distance (usually Hamming or edit) of T. Each such place is an approximate match. Allowing edits is more challenging than just allowing mismatches We ll return to edits
10 Approximate matching def naive(p, t): occurrences = [] for i in range(len(t) - len(p) + 1): # loop over alignments match = True for j in range(len(p)): # loop over characters if t[i+j]!= p[j]: # compare characters match = False break if match: occurrences.append(i) return occurrences # mismatch; reject alignment # all chars matched; record
11 Approximate matching def naive_approx_hamming(p, t, maxdistance): occurrences = [] for i in range(len(t) - len(p) + 1): # loop over alignments nmm = 0 for j in range(len(p)): # loop over characters if t[i+j]!= p[j]: # compare characters nmm += 1 # mismatch if nmm > maxdistance: break # exceeded max hamming dist if nmm <= maxdistance: occurrences.append(i) # approximate match return occurrences
12 Approximate matching Wanted: way to apply exact matching algorithms to approximate matching problems
13 Approximate matching u v If occurs in T with 1 edit, then u or v appears with no edits u v u v : T: CGC GCA CGA GCA : AATTGG T: AATTAG
14 Approximate matching... p 1 p 2 p 3 p 4 p k+1 If occurs in T with up to k edits
15 Approximate matching... p 1 p 2 p 3 p 4 p k+1 If occurs in T with up to k edits, then at least one of p 1, p 2,..., p k+1 must appear with 0 edits
16 Approximate matching XXX X 5 partitions 4 edits (X) XX X X X X X X
17 Approximate matching igeonhole principle: k+1 pigeons, k holes. At least one has >1 pigeon!
18 Approximate matching We have k pigeons, k+1 holes, at least one......is empty
19 igeonhole principle p 1 p 2 p 3 p 4 p 5 Exact matching T What algorithm can we use? Any exact matching algorithm If we have a k-mer index, we can use that Naive exact matching Boyer-Moore
20 igeonhole principle p 4 match T
21 igeonhole principle p 1 p 2 p 3 p 4 p 5 match verify verify verify verify T For Hamming distance, verification is essentially just the inner loop of naive_approx_hamming from before
22 igeonhole principle... p 1 p 2 p 3 p 4 p k+1 Advantages Reuse favorite exact matching algos; fast and easy Flexible; works for Hamming and edit distance* Disadvantages Large k yields small partitions matching many times by chance; lots of verification work k+1 exact matching problems, one per partition * we don t know how to do edit distance verification yet
23 Implementation of pigeonhole principle with Boyer-Moore as exact matching algorithm: Boyer-Moore, exact Boyer-Moore, 1 mismatch with pigeonhole Boyer-Moore, 2 mismatches with pigeonhole # character comparisons wall clock time # matches # character comparisons wall clock time # matches # character comparisons wall clock time # matches : tomorrow T: Shakespeare s complete works 786 K 1.91s M 7.73 s M s 382 : 50 nt string from Alu repeat* T: Human reference (hg19) chromosome M s M 209 s 1, M 328 s 2,798 * GCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGG
24 Generalizing pigeonhole, part 1 If occurs in T with up to k edits, then at least one of p 1, p 2,..., p k+1 must appear with 0 edits... p 1 p 2 p 3 p 4 p k+1 But doesn't have to be at least one of what would we have to change for at least two of? If occurs in T with up to k edits, then at least two of must appear with 0 edits
25 Generalizing pigeonhole, part 1 If occurs in T with up to k edits, then at least one of p 1, p 2,..., p k+1 must appear with 0 edits... p 1 p 2 p 3 p 4 p k+1 But doesn't have to be at least one of what would we have to change for at least two of? If occurs in T with up to k edits, then at least two of p 1, p 2,..., p k+2 must appear with 0 edits
26 Generalizing pigeonhole, part 2 If occurs in T with up to k edits, then at least one of p 1, p 2,..., p k+1 must appear with 0 edits... p 1 p 2 p 3 p 4 p k+1 Let p1, p2,..., pj be a partitioning of. If occurs with up to k edits, then at least one of p1, p2,..., pj must occur with floor(k / j) edits.???... p 1 p 2 p 3 p 4 p j
27 Generalizing pigeonhole, part 2 If occurs in T with up to k edits, then at least one of p 1, p 2,..., p k+1 must appear with 0 edits... p 1 p 2 p 3 p 4 p k+1 Let p1, p2,..., pj be a partitioning of. If occurs with up to k edits, then at least one of p1, p2,..., pj must occur with floor(k / j) edits.... p 1 p 2 p 3 p 4 p j
28 Generalizing pigeonhole, part 2 At least one of p1, p2,..., p5 occurs with... X X X X k = 4 edits 0 edits X X X X X k = 5 edits 1 edits X X X X X X X X X k = 9 edits 1 edits X X X X X X X X X X k = 10 edits 2 edits etc
29 Generalizing pigeonhole, part 2 Let j = k + 1 Specific General igeonhole principle If occurs in T with up to k edits, then at least one of p 1, p 2,..., p j must appear with floor(k / j) edits igeonhole principle with j = k + 1 If occurs in T with up to k edits, then at least one of p 1, p 2,..., p k+1 must appear with 0 edits Why? Smallest value s.t. floor(k / j) = 0 Why make floor(k / j) = 0? So we can use exact matching Why is smaller j good? Yields fewer, longer partitions Why are long partitions good? Makes exact-matching filter more specific, minimizing # candidates
30 A different principle We partitioned into non-overlapping substrings p1 p2 p3 p4... pk+1 Now consider overlapping substrings 3 6 n-l n-l 1 4 n-l-1
31 Approximate string matching: more principles q x x x x... n n - q + 1 of these Say substrings are length q. There are n - q + 1 such substrings. 1 edit to changes at most q substrings kq is worst case; could be < kq Minimum # of length-q substrings unedited after k edits? n - q kq q-gram lemma: if occurs in T with up to k edits, alignment must contain t exact matches of length q, where t n - q kq
32 Approximate string matching: more principles If occurs in T with up to k edits, alignment contains an exact match of length q, where q floor(n / (k + 1)) Obtained by solving for q: n - q kq 1 Exact matching filter: find matches of length floor(n / (k + 1)) between T and any substring of. Check vicinity for full match.
33 Approximate matching principles Non-overlapping substrings igeonhole principle q-gram lemma Overlapping substrings Specific General p1, p2,..., pj is a partitioning of. If occurs with k edits, at least one partition matches with floor(k / j) edits. p1, p2,..., pk+1 is a partitioning of. If occurrs in T with k edits, at least one partition matches exactly. If occurs with k edits, alignment contains t exact matches of length q, where t n - q kq igeonhole principle with j = k + 1 q-gram lemma with t = 1 If occurs with k edits, alignment contains an exact match of length q where q floor(n / (k + 1))....
34 Sensitivity Sensitivity = fraction of true approximate matches discovered by the algorithm Lossless algorithm finds all of them, lossy algorithm doesn t necessarily We ve seen lossless algorithms. Most everyday tools are lossy. Lossy algorithms are usually much speedier & still acceptably sensitive... Example lossy algorithm: pick q > floor(n / (k + 1))
Introduction to second-generation sequencing
Introduction to second-generation sequencing CMSC858B Spring 2012 Many slides courtesy of Ben Langmead Second-Generation Sequencing 2 1 2 Human Epigenome Project ENCODE project Methylation 3 http://www.genome.gov/10005107
More informationDynamic programming in less time and space
Dynamic programming in less time and space Ben Langmead Department of omputer Science You are free to use these slides. If you do, please sign the guestbook (www.langmead-lab.org/teaching-materials), or
More informationHidden Markov Models
s Ben Langmead Department of Computer Science Please sign guestbook (www.langmead-lab.org/teaching-materials) to tell me briefly how you are using the slides. For original Keynote files, email me (ben.langmead@gmail.com).
More informationOptimal spaced seeds for faster approximate string matching
Optimal spaced seeds for faster approximate string matching Martin Farach-Colton Gad M. Landau S. Cenk Sahinalp Dekel Tsur Abstract Filtering is a standard technique for fast approximate string matching
More informationSearching Sear ( Sub- (Sub )Strings Ulf Leser
Searching (Sub-)Strings Ulf Leser This Lecture Exact substring search Naïve Boyer-Moore Searching with profiles Sequence profiles Ungapped approximate search Statistical evaluation of search results Ulf
More informationOptimal spaced seeds for faster approximate string matching
Optimal spaced seeds for faster approximate string matching Martin Farach-Colton Gad M. Landau S. Cenk Sahinalp Dekel Tsur Abstract Filtering is a standard technique for fast approximate string matching
More informationFORMAL LANGUAGES, AUTOMATA AND COMPUTATION
FORMAL LANGUAGES, AUTOMATA AND COMPUTATION IDENTIFYING NONREGULAR LANGUAGES PUMPING LEMMA Carnegie Mellon University in Qatar (CARNEGIE MELLON UNIVERSITY IN QATAR) SLIDES FOR 15-453 LECTURE 5 SPRING 2011
More informationPattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching 1
Pattern Matching a b a c a a b 1 4 3 2 Pattern Matching 1 Outline and Reading Strings ( 9.1.1) Pattern matching algorithms Brute-force algorithm ( 9.1.2) Boyer-Moore algorithm ( 9.1.3) Knuth-Morris-Pratt
More informationImplementing Approximate Regularities
Implementing Approximate Regularities Manolis Christodoulakis Costas S. Iliopoulos Department of Computer Science King s College London Kunsoo Park School of Computer Science and Engineering, Seoul National
More informationCSE 549 Lecture 3: Sequence Similarity & Alignment. slides (w/*) courtesy of Carl Kingsford
CSE 549 Lecture 3: Sequence Similarity & Alignment slides (w/*) courtesy of Carl Kingsford Relatedness of Biological Sequence https://en.wikipedia.org/wiki/phylogenetic_tree Relatedness of Biological Sequence
More informationPattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching Goodrich, Tamassia
Pattern Matching a b a c a a b 1 4 3 2 Pattern Matching 1 Brute-Force Pattern Matching ( 11.2.1) The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift
More informationProving languages to be nonregular
Proving languages to be nonregular We already know that there exist languages A Σ that are nonregular, for any choice of an alphabet Σ. This is because there are uncountably many languages in total and
More informationMechanized Operational Semantics
Mechanized Operational Semantics J Strother Moore Department of Computer Sciences University of Texas at Austin Marktoberdorf Summer School 2008 (Lecture 5: Boyer-Moore Fast String Searching) 1 The Problem
More information6.6 Sequence Alignment
6.6 Sequence Alignment String Similarity How similar are two strings? ocurrance o c u r r a n c e - occurrence First model the problem Q. How can we measure the distance? o c c u r r e n c e 6 mismatches,
More informationLecture 5: The Shift-And Method
Biosequence Algorithms, Spring 2005 Lecture 5: The Shift-And Method Pekka Kilpeläinen University of Kuopio Department of Computer Science BSA Lecture 5: Shift-And p.1/19 Seminumerical String Matching Most
More informationSequence comparison by compression
Sequence comparison by compression Motivation similarity as a marker for homology. And homology is used to infer function. Sometimes, we are only interested in a numerical distance between two sequences.
More informationChapter 1. Pigeonhole Principle
Chapter 1. Pigeonhole Principle Prof. Tesler Math 184A Fall 2017 Prof. Tesler Ch. 1. Pigeonhole Principle Math 184A / Fall 2017 1 / 16 Pigeonhole principle https://commons.wikimedia.org/wiki/file:toomanypigeons.jpg
More informationNon-regular languages
Non-regular languages (Pumping Lemma) Fall 2017 Costas Busch - RPI 1 Non-regular languages { a n n : n 0} { vv R : v{ a, }*} Regular languages a* * c a c( a )* etc... Fall 2017 Costas Busch - RPI 2 How
More informationAlgorithms and Data S tructures Structures Complexity Complexit of Algorithms Ulf Leser
Algorithms and Data Structures Complexity of Algorithms Ulf Leser Content of this Lecture Efficiency of Algorithms Machine Model Complexity Examples Multiplication of two binary numbers (unit cost?) Exact
More informationFoundations of Natural Language Processing Lecture 6 Spelling correction, edit distance, and EM
Foundations of Natural Language Processing Lecture 6 Spelling correction, edit distance, and EM Alex Lascarides (Slides from Alex Lascarides and Sharon Goldwater) 2 February 2019 Alex Lascarides FNLP Lecture
More informationSamson Zhou. Pattern Matching over Noisy Data Streams
Samson Zhou Pattern Matching over Noisy Data Streams Finding Structure in Data Pattern Matching Finding all instances of a pattern within a string ABCD ABCAABCDAACAABCDBCABCDADDDEAEABCDA Knuth-Morris-Pratt
More informationNon-context-Free Languages. CS215, Lecture 5 c
Non-context-Free Languages CS215, Lecture 5 c 2007 1 The Pumping Lemma Theorem. (Pumping Lemma) Let be context-free. There exists a positive integer divided into five pieces, Proof for for each, and..
More informationModule 9: Tries and String Matching
Module 9: Tries and String Matching CS 240 - Data Structures and Data Management Sajed Haque Veronika Irvine Taylor Smith Based on lecture notes by many previous cs240 instructors David R. Cheriton School
More informationMultiple Pattern Matching
Multiple Pattern Matching Stephen Fulwider and Amar Mukherjee College of Engineering and Computer Science University of Central Florida Orlando, FL USA Email: {stephen,amar}@cs.ucf.edu Abstract In this
More informationString Matching II. Algorithm : Design & Analysis [19]
String Matching II Algorithm : Design & Analysis [19] In the last class Simple String Matching KMP Flowchart Construction Jump at Fail KMP Scan String Matching II Boyer-Moore s heuristics Skipping unnecessary
More informationAlgorithm Theory. 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore. Christian Schindelhauer
Algorithm Theory 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore Institut für Informatik Wintersemester 2007/08 Text Search Scenarios Static texts Literature databases Library systems Gene databases
More informationCSE 105 Theory of Computation Professor Jeanne Ferrante
CSE 105 Theory of Computation http://www.jflap.org/jflaptmp/ Professor Jeanne Ferrante 1 Announcement & Reminders HW 2: Grades not available yet, will be soon Solutions posted, read even if you did well
More informationAnalysis of Algorithms Prof. Karen Daniels
UMass Lowell Computer Science 91.503 Analysis of Algorithms Prof. Karen Daniels Spring, 2012 Tuesday, 4/24/2012 String Matching Algorithms Chapter 32* * Pseudocode uses 2 nd edition conventions 1 Chapter
More informationOverview. Knuth-Morris-Pratt & Boyer-Moore Algorithms. Notation Review (2) Notation Review (1) The Kunth-Morris-Pratt (KMP) Algorithm
Knuth-Morris-Pratt & s by Robert C. St.Pierre Overview Notation review Knuth-Morris-Pratt algorithm Discussion of the Algorithm Example Boyer-Moore algorithm Discussion of the Algorithm Example Applications
More informationFast String Searching. J Strother Moore Department of Computer Sciences University of Texas at Austin
Fast String Searching J Strother Moore Department of Computer Sciences University of Texas at Austin 1 The Problem One of the classic problems in computing is string searching: find the first occurrence
More informationBio nformatics. Lecture 16. Saad Mneimneh
Bio nformatics Lecture 16 DNA sequencing To sequence a DNA is to obtain the string of bases that it contains. It is impossible to sequence the whole DNA molecule directly. We may however obtain a piece
More informationString Matching. Jayadev Misra The University of Texas at Austin December 5, 2003
String Matching Jayadev Misra The University of Texas at Austin December 5, 2003 Contents 1 Introduction 1 2 Rabin-Karp Algorithm 3 3 Knuth-Morris-Pratt Algorithm 5 3.1 Informal Description.........................
More informationState Complexity of Neighbourhoods and Approximate Pattern Matching
State Complexity of Neighbourhoods and Approximate Pattern Matching Timothy Ng, David Rappaport, and Kai Salomaa School of Computing, Queen s University, Kingston, Ontario K7L 3N6, Canada {ng, daver, ksalomaa}@cs.queensu.ca
More informationProofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007.
Proofs, Strings, and Finite Automata CS154 Chris Pollett Feb 5, 2007. Outline Proofs and Proof Strategies Strings Finding proofs Example: For every graph G, the sum of the degrees of all the nodes in G
More informationCSC 421: Algorithm Design & Analysis. Spring 2018
CSC 421: Algorithm Design & Analysis Spring 2018 Complexity & Computability complexity theory tractability, decidability P vs. NP, Turing machines NP-complete, reductions approximation algorithms, genetic
More informationString Search. 6th September 2018
String Search 6th September 2018 Search for a given (short) string in a long string Search problems have become more important lately The amount of stored digital information grows steadily (rapidly?)
More informationKnuth-Morris-Pratt Algorithm
Knuth-Morris-Pratt Algorithm Jayadev Misra June 5, 2017 The Knuth-Morris-Pratt string matching algorithm (KMP) locates all occurrences of a pattern string in a text string in linear time (in the combined
More informationINF 4130 / /8-2017
INF 4130 / 9135 28/8-2017 Algorithms, efficiency, and complexity Problem classes Problems can be divided into sets (classes). Problem classes are defined by the type of algorithm that can (or cannot) solve
More informationString Matching. Thanks to Piotr Indyk. String Matching. Simple Algorithm. for s 0 to n-m. Match 0. for j 1 to m if T[s+j] P[j] then
String Matching Thanks to Piotr Indyk String Matching Input: Two strings T[1 n] and P[1 m], containing symbols from alphabet Σ Goal: find all shifts 0 s n-m such that T[s+1 s+m]=p Example: Σ={,a,b,,z}
More informationOutline. Approximation: Theory and Algorithms. Motivation. Outline. The String Edit Distance. Nikolaus Augsten. Unit 2 March 6, 2009
Outline Approximation: Theory and Algorithms The Nikolaus Augsten Free University of Bozen-Bolzano Faculty of Computer Science DIS Unit 2 March 6, 2009 1 Nikolaus Augsten (DIS) Approximation: Theory and
More informationThe Pumping Lemma for Context Free Grammars
The Pumping Lemma for Context Free Grammars Chomsky Normal Form Chomsky Normal Form (CNF) is a simple and useful form of a CFG Every rule of a CNF grammar is in the form A BC A a Where a is any terminal
More informationAdapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts
Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts Domenico Cantone Simone Faro Emanuele Giaquinta Department of Mathematics and Computer Science, University of Catania, Italy 1 /
More informationCSE 202 Dynamic Programming II
CSE 202 Dynamic Programming II Chapter 6 Dynamic Programming Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved. 1 Algorithmic Paradigms Greed. Build up a solution incrementally,
More informationBefore we show how languages can be proven not regular, first, how would we show a language is regular?
CS35 Proving Languages not to be Regular Before we show how languages can be proven not regular, first, how would we show a language is regular? Although regular languages and automata are quite powerful
More informationSTATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler
STATC141 Spring 2005 The materials are from Pairise Sequence Alignment by Robert Giegerich and David Wheeler Lecture 6, 02/08/05 The analysis of multiple DNA or protein sequences (I) Sequence similarity
More informationCISC4090: Theory of Computation
CISC4090: Theory of Computation Chapter 2 Context-Free Languages Courtesy of Prof. Arthur G. Werschulz Fordham University Department of Computer and Information Sciences Spring, 2014 Overview In Chapter
More informationAutomata & languages. A primer on the Theory of Computation. Laurent Vanbever. ETH Zürich (D-ITET) October,
Automata & languages A primer on the Theory of Computation Laurent Vanbever www.vanbever.eu ETH Zürich (D-ITET) October, 5 2017 Part 3 out of 5 Last week, we learned about closure and equivalence of regular
More informationPart 3 out of 5. Automata & languages. A primer on the Theory of Computation. Last week, we learned about closure and equivalence of regular languages
Automata & languages A primer on the Theory of Computation Laurent Vanbever www.vanbever.eu Part 3 out of 5 ETH Zürich (D-ITET) October, 5 2017 Last week, we learned about closure and equivalence of regular
More informationA Multiple Sliding Windows Approach to Speed Up String Matching Algorithms
A Multiple Sliding Windows Approach to Speed Up String Matching Algorithms Simone Faro Thierry Lecroq University of Catania, Italy University of Rouen, LITIS EA 4108, France Symposium on Eperimental Algorithms
More informationApproximation: Theory and Algorithms
Approximation: Theory and Algorithms The String Edit Distance Nikolaus Augsten Free University of Bozen-Bolzano Faculty of Computer Science DIS Unit 2 March 6, 2009 Nikolaus Augsten (DIS) Approximation:
More informationOutline DP paradigm Discrete optimisation Viterbi algorithm DP: 0 1 Knapsack. Dynamic Programming. Georgy Gimel farb
Outline DP paradigm Discrete optimisation Viterbi algorithm DP: Knapsack Dynamic Programming Georgy Gimel farb (with basic contributions by Michael J. Dinneen) COMPSCI 69 Computational Science / Outline
More informationLecture 4: More on Regexps, Non-Regular Languages
6.045 Lecture 4: More on Regexps, Non-Regular Languages 6.045 Announcements: - Pset 1 is on piazza (as of last night) - If you don t have piazza access but are registered for 6.045, send email to TAs with
More informationText Analytics. Searching Terms. Ulf Leser
Text Analytics Searching Terms Ulf Leser A Probabilistic Interpretation of Relevance We want to compute the probability that a doc d is relevant to query q The probabilistic model determines this probability
More information15 Text search. P.D. Dr. Alexander Souza. Winter term 11/12
Algorithms Theory 15 Text search P.D. Dr. Alexander Souza Text search Various scenarios: Dynamic texts Text editors Symbol manipulators Static texts Literature databases Library systems Gene databases
More informationOnline Computation of Abelian Runs
Online Computation of Abelian Runs Gabriele Fici 1, Thierry Lecroq 2, Arnaud Lefebvre 2, and Élise Prieur-Gaston2 1 Dipartimento di Matematica e Informatica, Università di Palermo, Italy Gabriele.Fici@unipa.it
More informationAn Adaptive Finite-State Automata Application to the problem of Reducing the Number of States in Approximate String Matching
An Adaptive Finite-State Automata Application to the problem of Reducing the Number of States in Approximate String Matching Ricardo Luis de Azevedo da Rocha 1, João José Neto 1 1 Laboratório de Linguagens
More informationEfficient Sequential Algorithms, Comp309
Efficient Sequential Algorithms, Comp309 University of Liverpool 2010 2011 Module Organiser, Igor Potapov Part 2: Pattern Matching References: T. H. Cormen, C. E. Leiserson, R. L. Rivest Introduction to
More informationPS2 - Comments. University of Virginia - cs3102: Theory of Computation Spring 2010
University of Virginia - cs3102: Theory of Computation Spring 2010 PS2 - Comments Average: 77.4 (full credit for each question is 100 points) Distribution (of 54 submissions): 90, 12; 80 89, 11; 70-79,
More informationMore on Regular Languages and Non-Regular Languages
More on Regular Languages and Non-Regular Languages CSCE A35 Decision Properties of Regular Languages Given a (representation, e.g., RE, FA, of a) regular language L, what can we tell about L? Since there
More informationSEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS. Prokaryotes and Eukaryotes. DNA and RNA
SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS 1 Prokaryotes and Eukaryotes 2 DNA and RNA 3 4 Double helix structure Codons Codons are triplets of bases from the RNA sequence. Each triplet defines an amino-acid.
More informationTHE KNUTH RELATIONS HUAN VO
THE KNUTH RELATIONS HUAN VO Abstract We define three equivalence relations on the set of words using different means It turns out that they are the same relation Motivation Recall the algorithm of row
More informationLanguages, regular languages, finite automata
Notes on Computer Theory Last updated: January, 2018 Languages, regular languages, finite automata Content largely taken from Richards [1] and Sipser [2] 1 Languages An alphabet is a finite set of characters,
More informationINF 4130 / /8-2014
INF 4130 / 9135 26/8-2014 Mandatory assignments («Oblig-1», «-2», and «-3»): All three must be approved Deadlines around: 25. sept, 25. oct, and 15. nov Other courses on similar themes: INF-MAT 3370 INF-MAT
More informationTheory of LSH. Distance Measures LS Families of Hash Functions S-Curves
Theory of LSH Distance Measures LS Families of Hash Functions S-Curves 1 Distance Measures Generalized LSH is based on some kind of distance between points. Similar points are close. Two major classes
More informationSequence Comparison. mouse human
Sequence Comparison Sequence Comparison mouse human Why Compare Sequences? The first fact of biological sequence analysis In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity
More informationFinding similar items
Finding similar items CSE 344, section 10 June 2, 2011 In this section, we ll go through some examples of finding similar item sets. We ll directly compare all pairs of sets being considered using the
More informationCPSC 340: Machine Learning and Data Mining. Regularization Fall 2017
CPSC 340: Machine Learning and Data Mining Regularization Fall 2017 Assignment 2 Admin 2 late days to hand in tonight, answers posted tomorrow morning. Extra office hours Thursday at 4pm (ICICS 246). Midterm
More informationObjec&ves. Review. Dynamic Programming. What is the knapsack problem? What is our solu&on? Ø Review Knapsack Ø Sequence Alignment 3/28/18
/8/8 Objec&ves Dynamic Programming Ø Review Knapsack Ø Sequence Alignment Mar 8, 8 CSCI - Sprenkle Review What is the knapsack problem? What is our solu&on? Mar 8, 8 CSCI - Sprenkle /8/8 Dynamic Programming:
More informationN/4 + N/2 + N = 2N 2.
CS61B Summer 2006 Instructor: Erin Korber Lecture 24, 7 Aug. 1 Amortized Analysis For some of the data structures we ve discussed (namely hash tables and splay trees), it was claimed that the average time
More informationLecture 23: Rice Theorem and Turing machine behavior properties 21 April 2009
CS 373: Theory of Computation Sariel Har-Peled and Madhusudan Parthasarathy Lecture 23: Rice Theorem and Turing machine behavior properties 21 April 2009 This lecture covers Rice s theorem, as well as
More informationTrace Reconstruction Revisited
Trace Reconstruction Revisited Andrew McGregor 1, Eric Price 2, Sofya Vorotnikova 1 1 University of Massachusetts Amherst 2 IBM Almaden Research Center Problem Description Take original string x of length
More informationCompressing Kinetic Data From Sensor Networks. Sorelle A. Friedler (Swat 04) Joint work with David Mount University of Maryland, College Park
Compressing Kinetic Data From Sensor Networks Sorelle A. Friedler (Swat 04) Joint work with David Mount University of Maryland, College Park Motivation Motivation Computer Science Graphics: Image and video
More informationSimilarity Search. Stony Brook University CSE545, Fall 2016
Similarity Search Stony Brook University CSE545, Fall 20 Finding Similar Items Applications Document Similarity: Mirrored web-pages Plagiarism; Similar News Recommendations: Online purchases Movie ratings
More informationLecture 2: Pairwise Alignment. CG Ron Shamir
Lecture 2: Pairwise Alignment 1 Main source 2 Why compare sequences? Human hexosaminidase A vs Mouse hexosaminidase A 3 www.mathworks.com/.../jan04/bio_genome.html Sequence Alignment עימוד רצפים The problem:
More information15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018
15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 Today we ll talk about a topic that is both very old (as far as computer science
More information(In)approximability Results for Pattern Matching Problems
(In)approximability Results for Pattern Matching Problems Raphaël Clifford and Alexandru Popa Department of Computer Science University of Bristol Merchant Venturer s Building Woodland Road, Bristol, BS8
More informationTheoretical Computer Science. Efficient string-matching allowing for non-overlapping inversions
Theoretical Computer Science 483 (2013) 85 95 Contents lists available at SciVerse ScienceDirect Theoretical Computer Science journal homepage: www.elsevier.com/locate/tcs Efficient string-matching allowing
More information1. Determine (with proof) the number of ordered triples (A 1, A 2, A 3 ) of sets which satisfy
UT Putnam Prep Problems, Oct 19 2016 I was very pleased that, between the whole gang of you, you solved almost every problem this week! Let me add a few comments here. 1. Determine (with proof) the number
More informationP is the class of problems for which there are algorithms that solve the problem in time O(n k ) for some constant k.
Complexity Theory Problems are divided into complexity classes. Informally: So far in this course, almost all algorithms had polynomial running time, i.e., on inputs of size n, worst-case running time
More informationAnalysis and Design of Algorithms Dynamic Programming
Analysis and Design of Algorithms Dynamic Programming Lecture Notes by Dr. Wang, Rui Fall 2008 Department of Computer Science Ocean University of China November 6, 2009 Introduction 2 Introduction..................................................................
More informationAutomata Theory. Lecture on Discussion Course of CS120. Runzhe SJTU ACM CLASS
Automata Theory Lecture on Discussion Course of CS2 This Lecture is about Mathematical Models of Computation. Why Should I Care? - Ways of thinking. - Theory can drive practice. - Don t be an Instrumentalist.
More informationExpressions that always have the same value. The Identity Property of Addition states that For any value a; a + 0 = a so = 3
Name Key Words/Topic 2.1 Identity and Zero Properties Topic 2 Guided Notes Equivalent Expressions Identity Property of Addition Identity Property of Multiplication Zero Property of Multiplication The sum
More informationAverage Complexity of Exact and Approximate Multiple String Matching
Average Complexity of Exact and Approximate Multiple String Matching Gonzalo Navarro Department of Computer Science University of Chile gnavarro@dcc.uchile.cl Kimmo Fredriksson Department of Computer Science
More informationCompilers. Lexical analysis. Yannis Smaragdakis, U. Athens (original slides by Sam
Compilers Lecture 3 Lexical analysis Yannis Smaragdakis, U. Athens (original slides by Sam Guyer@Tufts) Big picture Source code Front End IR Back End Machine code Errors Front end responsibilities Check
More informationFinding Consensus Strings With Small Length Difference Between Input and Solution Strings
Finding Consensus Strings With Small Length Difference Between Input and Solution Strings Markus L. Schmid Trier University, Fachbereich IV Abteilung Informatikwissenschaften, D-54286 Trier, Germany, MSchmid@uni-trier.de
More informationEfficient Parallel Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints
Efficient Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints Yu Jiang,, Jiannan Wang, Guoliang Li, and Jianhua Feng Tsinghua University Similarity Search&Join Competition
More informationSequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University
Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of
More informationUNIT-III REGULAR LANGUAGES
Syllabus R9 Regulation REGULAR EXPRESSIONS UNIT-III REGULAR LANGUAGES Regular expressions are useful for representing certain sets of strings in an algebraic fashion. In arithmetic we can use the operations
More informationExact vs approximate search
Text Algorithms (4AP) Lecture: Approximate matching Jaak Vilo 2008 fall Jaak Vilo MTAT.03.190 Text Algorithms 1 Exact vs approximate search In exact search we searched for a string or set of strings in
More information1 Randomized reduction: a first start
6.S078 Fine-Grained Algorithms and Complexity MIT Lectures 3 and 4 February 14 and 21, 2018 Today: In the previous two lectures, we mentioned ETH and SETH, about the time complexity of SAT. Today we want
More informationSequence analysis and Genomics
Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute
More informationComputational Biology
Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,
More informationLecture 5. 1 Review (Pairwise Independence and Derandomization)
6.842 Randomness and Computation September 20, 2017 Lecture 5 Lecturer: Ronitt Rubinfeld Scribe: Tom Kolokotrones 1 Review (Pairwise Independence and Derandomization) As we discussed last time, we can
More informationFaster Algorithms for String Matching with k Mismatches
794 Faster Algorithms for String Matching with k Mismatches Amihood Amir *t Bar-Ilan University and Georgia Tech Moshe Lewenstein* * Bar-Ilan University Ely Porat* Bar Ilan University and Weizmann Institute
More informationNondeterministic finite automata
Lecture 3 Nondeterministic finite automata This lecture is focused on the nondeterministic finite automata (NFA) model and its relationship to the DFA model. Nondeterminism is an important concept in the
More informationLecture 15. Saad Mneimneh
Computat onal Biology Lecture 15 DNA sequencing Shortest common superstring SCS An elegant theoretical abstraction, but fundamentally flawed R. Karp Given a set of fragments F, Find the shortest string
More informationDNA sequencing. Bad example (repeats) Lecture 15. Shortest common superstring SCS. Given a set of fragments F,
Computat onal Biology Lecture 15 DNA sequencing Shortest common superstring SCS An elegant theoretical abstraction, but fundamentally flawed R. Karp Given a set of fragments F, Find the shortest string
More informationGraduate Algorithms CS F-20 String Matching
Graduate Algorithms CS673-2016F-20 String Matching David Galles Department of Computer Science University of San Francisco 20-0: String Matching Given a source text, and a string to match, where does the
More informationAlgorithm Design and Analysis
Algorithm Design and Analysis LECTURE 8 Greedy Algorithms V Huffman Codes Adam Smith Review Questions Let G be a connected undirected graph with distinct edge weights. Answer true or false: Let e be the
More informationSolutions to In Class Problems Week 4, Mon.
Massachusetts Institute of Technology 6.042J/18.062J, Fall 05: Mathematics for Computer Science September 26 Prof. Albert R. Meyer and Prof. Ronitt Rubinfeld revised September 26, 2005, 1050 minutes Solutions
More information