Algorithms for Approximate String Matching
|
|
- Joan Elizabeth Melton
- 6 years ago
- Views:
Transcription
1 Levenshtein Distance Algorithms for Approximate String Matching Part I Levenshtein Distance Hamming Distance Approximate String Matching with k Differences Longest Common Subsequences Part II A Fast and Practical Bit-Vector Algorithm for the Longest Common Subsequences Problem Yoan Pinzón Universidad Nacional de Colombia Facultad de Ingeniería Departamento de Ingeniería de Sistemas e Industrial ypinzon@unal.edu.co August 26 Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 965. If you cannot spell or pronounce Levenshtein, the metric is also called edit distance. The edit distance δ(p, t) between two strings p (pattern) and t (text) (m = p, n = t ) is the minimum number of insertions, deletions and replacements to make p equal to t. [Insertion] insert a new letter a into x. An insertion operation on the string x = vw consists in adding a letter a, converting x into x = vaw. [Deletion] delete a letter a from x. A deletion operation on the string x = vaw consists in removing a letter, converting x into x = vw. [Replacement] replace a letter a in x. A replacement operation on the string x = vaw consists in replacing a letter for another, converting x into x = vbw.
2 Example : p ="approximate matching" t ="appropiate meaning" String t: a p p r o p r i a t e m ǫ e a n i n g String p: a p p r o x i m a t e m a t c h i n g δ(t, p) = 7 Solution: Using Dynamic Programing (DP): We need to compute a matrix D[..m,..n], where D i,j represents the minimum number of operations needed to match p..i to t..j. This is computed as follows: D[i,] =i D[, j]=j D[i, j] =min{d[i, j] +, D[i, j ] +, D[i, j] + δ(p i, t j )} δ(p, t) = D[m, n] Pseudo-code: Example 2: p ="surgery" t ="survey" δ(t, p) = String t: s u r v e ǫ y String p: s u r g e r y procedure ED(p, t) {m = p, n = t } 2 begin 3 for i to m do D[i,] i 4 for j to n do D[, j] j 5 for i to m do 6 for j to n do 7 if p i = t j then D[i, j] D[i, j ] 8 else 9 D[i, j] min(d[i, j ], D[i, j], D[i, j ]) + od od 2 return D[m, n] 3 end Running time: O(nm) 2 3
3 A Graph Reformulation for the Edit Distance problem Example: p="survey", t="surgery" ǫ s u r g e r y ǫ s u r v e y String t: String p: s u r g e r y s u r v e ǫ y The Dynamic Programming matrix D can be seen as a graph where the nodes are the cells and the edges represent the operations. The cost (weight) of the edges corresponds to the cost of the operations. δ(t, p) = shortest-path from node [,] to the node [n,m]. Running time: O(nm log(nm)) Example: p="survey", t="surgery" [,] s u r v e y s u r g e r y [6,7] 4 5
4 Hamming Distance The Hamming distance H is defined only for strings of the same length. For two strings p and t, H(p, t) is the number of places in which the two strings differ, i.e., have different characters. Examples: H("pinzon", "pinion") = H("josh", "jose") = H("here", "hear") = 2 H("kelly", "belly") = H("AAT", "TAA") = 2 H("AGCAA", "ACATA") = 3 H("AGCACACA", "ACACACTA") = 6 Pseudo-code: too easy!! Approximate String Matching with k Differences Problem: The k-differences approximate string matching problem is to find all occurrences of the pattern string p in the text string t with at most k differences (substitution, insertions, deletions). Solution: Using DP D[i,] =i D[, j]= D[i, j] =min{d[i, j] +, D[i, j ] +, D[i, j] + δ(p i, t j )} if D[m, j] k then we say that p occurs at position j of t. Running time: O(n) 6 7
5 Example: Pseudo-code: procedure KDifferences(p, t, k) {m = p, n = t } 2 begin 3 for i to m do D[i,] i 4 for j to n do D[, j] 5 for i to m do 6 for j to n do 7 if p i = t j then D[i, j] D[i, j ] 8 else 9 D[i, j] min(d[i, j ], D[i, j], D[i, j ]) + od od 2 for j to n do 3 if D[m][j] k then Output(j) 4 od 5 end Running time: O(nm) p ="CDDA", t ="CADDACDACDBACBA" k = ǫ C A D D A C D A C D B A C B A ǫ C 2 D D A p occurs in t ending at positions 5, 8 and
6 Longest Common Subsequence Preliminaries For two sequences x = x x m and y y n (n m) we say that x is a subsequence of y and equivalently, y is a supersequence of x, if for some i < < i p, x j = y ij. Given a finite set of sequences, S, a longest common subsequence (LCS) of S is a longest possible sequence s such that each sequence in S is a supersequence of s. Example: y="longest", x="large" LCS(y,x)="lge" String y: l o n g e s t String x: l a r g e Longest Common Subsequence Problem: The Longest Common Subsequence (LCS) of two strings, p and t, is a subsequence of both p and of t of maximum possible length. Solution: Using Dynamic Programing: We need to compute a matrix L[..m,..n], where L i,j represent the LCS for p..i and t..j. This is computed as follows:, if either i = or j = L[i, j] = L[i, j ] +, if p i = t j max{l[i, j], L[i, j ]}, if p i t j Pseudo-code: procedure LCS(p, t) {m = p, n = t } 2 begin 3 for i to m do L[i,] 4 for j to n do L[, j] 5 for i to m do 6 for j to n do 7 if p i = t j then L[i, j] L[i, j ] + 8 else 9 if L[i, j ] > L[i, j] then L[i, j] L[i, j ] else L[i, j] L[i, j] od 2 od 3 return L[m][n] 4 end Running time: O(nm)
7 Example : p ="survey" and t ="surgery" ǫ s u r g e r y ǫ s 2 u r v e y String t: String p: s u r g e r y s u r v e y Example 2: p ="ttgatacatt" t ="gaataagacc" ǫ g a a t a a g a c c ǫ t 2 t 3 g a t a c a t LCS(p, t) ="surey" LLCS(p, t) = L[6,7] = 5 LCS(p, t) =? LLCS(p, t) = L[9,] = 5 2 3
8 Some More Definitions Part II A Fast and Practical Bit-Vector Algorithm for the Longest Common Subsequence Problem The ordered pair of positions i and j of L, denoted [i, j], is a match iff x i = y j. If [i, j] is a match, and an LCS s i,j of x x 2...x i and y y 2...y j has length k, then k is the rank of [i, j]. The match [i, j] is k-dominant if it has rank k and for any other pair [i, j ] of rank k, either i > i and j j or i i and j > j. Computing the k-dominant matches is all that is needed to solve the LCS problem, since the LCS of x and y has length p iff the maximum rank attained by a dominant match is p. A match [i, j] precedes a match [i, j ] if i < i and j < j. 4 5
9 k= k=2 k=3 k=4 k=5 Let r be the total number of matches points, and d be the total number of dominant points (all ranks). Then p d r nm. Let R denote a partial order relation on the set of matches in L. A set of matches such that in any pair one of the matches always precedes the other in R constitutes a chain relative to the partial order relation R. A set of matches such that in any pair neither element of the pair precedes the other in R constitutes an antichain. Sankoff and Sellers (973) observed that the LCS problem translates to finding a longest chain in the poset of matches induced by R. A decomposition of a poset into antichains partitions the poset into the minimum possible number of antichains. "aaagtgacctagcccg" "tccagatg" t = p = t a a a g t g a c c t a g c c c g t c c a g a t g p match chain k-dominat match anti-chain (LCS) k-dominat match & LCS symbol "tccagg" LCS(p,t) = k=
10 "aaagtgacctagcccg" "tccagatg" t a a a g t g a c c t a g c c c g t c c a g a t g A Simple Bit-Vector Algorithm Here we will make use of word-level parallelism in order to compute the matrix L more efficiently. The algorithm is based on the O()-time computation of each column in L by using a bitparallel formula under the assumption that m w, where w is the number of bits in a machine word or O(nm/w)-time for the general case. An interesting property of the LCS allows to represent each column in L by using O()-space. That is, the values in the columns (rows) of L increase by at most one. i.e. L[i, j] = L[i, j] L[i, j] {,} for any (i, j) {..m} {..n}. In other words L will use the relative encoding of the dynamic programming table L. L is defined as NOT L. p t = p =
11 Example: x= ttgatacatt and y= gaataagacc ǫ G A A T A A G A C C ǫ T 2 T 3 G A T A C A T (a) Matrix L G A A T A A G A C C T 2 T 3 G 4 A 5 T 6 A 7 C 8 A 9 T Matrix L G A A T A A G A C C ǫ G A A T A A G A C C ǫ T 2 T 3 G 4 A 5 T 6 A 7 C 8 A 9 T (b) Matrix L T 2 T 3 G 4 A 5 T 6 A 7 C 8 A 9 T L 6 =< > 2 2
12 First we compute the array M of the vectors that result for each possible text character. If both the strings x and y range over the alphabet Σ then M[Σ] is defined as M[α] i = if y i = α else. Example: x= ttgatacatt and y= gaataagacc. A C G T T 2 T 3 G 4 A 5 T 6 A 7 C 8 A 9 T A C G T T 2 T 3 G 4 A 5 T 6 A 7 C 8 A 9 T Basic steps of the algorithm. Computation of M and M 2. Computation of matrix L (L) as follows: { 2 L = m, for j = (L j + (L j AND M(y j ))) OR (L j AND M (y j )), for j {..n} 3. Let LLCS be the number of times a carry took place. (a) Matrix M (a) Matrix M 22 23
13 Illustration of L 4 Computation for x= gaataagacc and y= ttgatacatt. Pseudo-code LLCS(x, y) n = y, m = x, p = begin 2 Preprocessing 3 for i until m do 4 M[α](i) y i = α 5 M [α](i) y i α 6 Initialization 7 L = 2 m 8 TheMainStep 9 for j until n do L j (L j + (L j AND M[y j ])) OR (L j AND M [y j ]) if L j (m + ) = then p++ 2 return p 3 end ǫ G A A T A A G A C C T 2 T 3 G 4 A 5 T 6 A 7 C 8 A 9 T Bit-Carry (a) Matrix L 9 T A C A T A G T T A C G T T 2 T 3 G 4 A 5 T 6 A 7 C 8 A 9 T (b) Matrix M L' 3 M[x =T] 4 L'
14 (4) {}}{ (3) {}}{ () (2) L 4 ( L 3 + {}}{ (L 3 &M T ) ) {}}{ (L 3 &M T ) (4) {}}{ (3) {}}{ () (2) L 4 ( L 3 + {}}{ (L 3 &M T ) ) {}}{ (L 3 &M T ) L 3 & M T () L 3 & M T (2) L 3 + () (3) (3) (2) (4) L 3 & M T () L 3 & M T (2) L 3 + () (3) (3) (2) (4) 26 27
15 Automata for Addition Experimental Results carry states,, 8 6 AllisonDix This,, carry,, carry Time (in micro secs.) , 4, Text Length 28 29
Maximum sum contiguous subsequence Longest common subsequence Matrix chain multiplication All pair shortest path Kna. Dynamic Programming
Dynamic Programming Arijit Bishnu arijit@isical.ac.in Indian Statistical Institute, India. August 31, 2015 Outline 1 Maximum sum contiguous subsequence 2 Longest common subsequence 3 Matrix chain multiplication
More informationOutline. Approximation: Theory and Algorithms. Motivation. Outline. The String Edit Distance. Nikolaus Augsten. Unit 2 March 6, 2009
Outline Approximation: Theory and Algorithms The Nikolaus Augsten Free University of Bozen-Bolzano Faculty of Computer Science DIS Unit 2 March 6, 2009 1 Nikolaus Augsten (DIS) Approximation: Theory and
More informationOutline. Similarity Search. Outline. Motivation. The String Edit Distance
Outline Similarity Search The Nikolaus Augsten nikolaus.augsten@sbg.ac.at Department of Computer Sciences University of Salzburg 1 http://dbresearch.uni-salzburg.at WS 2017/2018 Version March 12, 2018
More informationA Multiobjective Approach to the Weighted Longest Common Subsequence Problem
A Multiobjective Approach to the Weighted Longest Common Subsequence Problem David Becerra, Juan Mendivelso, and Yoan Pinzón Universidad Nacional de Colombia Facultad de Ingeniería Department of Computer
More informationApproximation: Theory and Algorithms
Approximation: Theory and Algorithms The String Edit Distance Nikolaus Augsten Free University of Bozen-Bolzano Faculty of Computer Science DIS Unit 2 March 6, 2009 Nikolaus Augsten (DIS) Approximation:
More informationSimilarity Search. The String Edit Distance. Nikolaus Augsten. Free University of Bozen-Bolzano Faculty of Computer Science DIS. Unit 2 March 8, 2012
Similarity Search The String Edit Distance Nikolaus Augsten Free University of Bozen-Bolzano Faculty of Computer Science DIS Unit 2 March 8, 2012 Nikolaus Augsten (DIS) Similarity Search Unit 2 March 8,
More information2. Exact String Matching
2. Exact String Matching Let T = T [0..n) be the text and P = P [0..m) the pattern. We say that P occurs in T at position j if T [j..j + m) = P. Example: P = aine occurs at position 6 in T = karjalainen.
More informationSimilarity Search. The String Edit Distance. Nikolaus Augsten.
Similarity Search The String Edit Distance Nikolaus Augsten nikolaus.augsten@sbg.ac.at Dept. of Computer Sciences University of Salzburg http://dbresearch.uni-salzburg.at Version October 18, 2016 Wintersemester
More informationEfficient (δ, γ)-pattern-matching with Don t Cares
fficient (δ, γ)-pattern-matching with Don t Cares Yoan José Pinzón Ardila Costas S. Iliopoulos Manolis Christodoulakis Manal Mohamed King s College London, Department of Computer Science, London WC2R 2LS,
More informationData Structures in Java
Data Structures in Java Lecture 20: Algorithm Design Techniques 12/2/2015 Daniel Bauer 1 Algorithms and Problem Solving Purpose of algorithms: find solutions to problems. Data Structures provide ways of
More informationIncluding Interval Encoding into Edit Distance Based Music Comparison and Retrieval
Including Interval Encoding into Edit Distance Based Music Comparison and Retrieval Kjell Lemström; Esko Ukkonen Department of Computer Science; P.O.Box 6, FIN-00014 University of Helsinki, Finland {klemstro,ukkonen}@cs.helsinki.fi
More informationTwo Algorithms for LCS Consecutive Suffix Alignment
Two Algorithms for LCS Consecutive Suffix Alignment Gad M. Landau Eugene Myers Michal Ziv-Ukelson Abstract The problem of aligning two sequences A and to determine their similarity is one of the fundamental
More informationInformation Processing Letters. A fast and simple algorithm for computing the longest common subsequence of run-length encoded strings
Information Processing Letters 108 (2008) 360 364 Contents lists available at ScienceDirect Information Processing Letters www.vier.com/locate/ipl A fast and simple algorithm for computing the longest
More information3515ICT: Theory of Computation. Regular languages
3515ICT: Theory of Computation Regular languages Notation and concepts concerning alphabets, strings and languages, and identification of languages with problems (H, 1.5). Regular expressions (H, 3.1,
More informationChapter 5. Finite Automata
Chapter 5 Finite Automata 5.1 Finite State Automata Capable of recognizing numerous symbol patterns, the class of regular languages Suitable for pattern-recognition type applications, such as the lexical
More informationEfficient Longest Common Subsequence Computation using Bulk-Synchronous Parallelism
Efficient Longest Common Subsequence Computation using Bulk-Synchronous Parallelism Peter Krusche Department of Computer Science University of Warwick June 2006 Outline 1 Introduction Motivation The BSP
More informationDefinition: A binary relation R from a set A to a set B is a subset R A B. Example:
Chapter 9 1 Binary Relations Definition: A binary relation R from a set A to a set B is a subset R A B. Example: Let A = {0,1,2} and B = {a,b} {(0, a), (0, b), (1,a), (2, b)} is a relation from A to B.
More informationFinding all covers of an indeterminate string in O(n) time on average
Finding all covers of an indeterminate string in O(n) time on average Md. Faizul Bari, M. Sohel Rahman, and Rifat Shahriyar Department of Computer Science and Engineering Bangladesh University of Engineering
More informationSection Summary. Relations and Functions Properties of Relations. Combining Relations
Chapter 9 Chapter Summary Relations and Their Properties n-ary Relations and Their Applications (not currently included in overheads) Representing Relations Closures of Relations (not currently included
More information7 Distances. 7.1 Metrics. 7.2 Distances L p Distances
7 Distances We have mainly been focusing on similarities so far, since it is easiest to explain locality sensitive hashing that way, and in particular the Jaccard similarity is easy to define in regards
More informationText and multimedia languages and properties
Text and multimedia languages and properties (Modern IR, Ch. 6) I VP R 1 Introduction Metadata Text Markup Languages Multimedia Trends and Research Issues Bibliographical Discussion I VP R 2 Introduction
More informationPATTERN MATCHING WITH SWAPS IN PRACTICE
International Journal of Foundations of Computer Science c World Scientific Publishing Company PATTERN MATCHING WITH SWAPS IN PRACTICE MATTEO CAMPANELLI Università di Catania, Scuola Superiore di Catania
More informationDefine M to be a binary n by m matrix such that:
The Shift-And Method Define M to be a binary n by m matrix such that: M(i,j) = iff the first i characters of P exactly match the i characters of T ending at character j. M(i,j) = iff P[.. i] T[j-i+.. j]
More informationThe Longest Common Subsequence Problem
Advanced Studies in Biology, Vol. 4, 2012, no. 8, 373-380 The Longest Common Subsequence Problem Anna Gorbenko Department of Intelligent Systems and Robotics Ural Federal University 620083 Ekaterinburg,
More informationProofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007.
Proofs, Strings, and Finite Automata CS154 Chris Pollett Feb 5, 2007. Outline Proofs and Proof Strategies Strings Finding proofs Example: For every graph G, the sum of the degrees of all the nodes in G
More informationModule 9: Tries and String Matching
Module 9: Tries and String Matching CS 240 - Data Structures and Data Management Sajed Haque Veronika Irvine Taylor Smith Based on lecture notes by many previous cs240 instructors David R. Cheriton School
More informationAn Algorithm for Computing the Invariant Distance from Word Position
An Algorithm for Computing the Invariant Distance from Word Position Sergio Luán-Mora 1 Departamento de Lenguaes y Sistemas Informáticos, Universidad de Alicante, Campus de San Vicente del Raspeig, Ap.
More informationAn Adaptive Finite-State Automata Application to the problem of Reducing the Number of States in Approximate String Matching
An Adaptive Finite-State Automata Application to the problem of Reducing the Number of States in Approximate String Matching Ricardo Luis de Azevedo da Rocha 1, João José Neto 1 1 Laboratório de Linguagens
More informationDeterministic Finite Automata (DFAs)
CS/ECE 374: Algorithms & Models of Computation, Fall 28 Deterministic Finite Automata (DFAs) Lecture 3 September 4, 28 Chandra Chekuri (UIUC) CS/ECE 374 Fall 28 / 33 Part I DFA Introduction Chandra Chekuri
More informationOn Antichains in Product Posets
On Antichains in Product Posets Sergei L. Bezrukov Department of Math and Computer Science University of Wisconsin - Superior Superior, WI 54880, USA sb@mcs.uwsuper.edu Ian T. Roberts School of Engineering
More informationDeterministic Finite Automata (DFAs)
Algorithms & Models of Computation CS/ECE 374, Fall 27 Deterministic Finite Automata (DFAs) Lecture 3 Tuesday, September 5, 27 Sariel Har-Peled (UIUC) CS374 Fall 27 / 36 Part I DFA Introduction Sariel
More informationPattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching 1
Pattern Matching a b a c a a b 1 4 3 2 Pattern Matching 1 Outline and Reading Strings ( 9.1.1) Pattern matching algorithms Brute-force algorithm ( 9.1.2) Boyer-Moore algorithm ( 9.1.3) Knuth-Morris-Pratt
More information6 Distances. 6.1 Metrics. 6.2 Distances L p Distances
6 Distances We have mainly been focusing on similarities so far, since it is easiest to explain locality sensitive hashing that way, and in particular the Jaccard similarity is easy to define in regards
More informationEECS730: Introduction to Bioinformatics
EECS730: Introduction to Bioinformatics Lecture 03: Edit distance and sequence alignment Slides adapted from Dr. Shaojie Zhang (University of Central Florida) KUMC visit How many of you would like to attend
More information1 Introduction to information theory
1 Introduction to information theory 1.1 Introduction In this chapter we present some of the basic concepts of information theory. The situations we have in mind involve the exchange of information through
More informationAlgorithm Design and Analysis
Algorithm Design and Analysis LECTURE 8 Greedy Algorithms V Huffman Codes Adam Smith Review Questions Let G be a connected undirected graph with distinct edge weights. Answer true or false: Let e be the
More informationSequence Comparison. mouse human
Sequence Comparison Sequence Comparison mouse human Why Compare Sequences? The first fact of biological sequence analysis In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity
More informationExact vs approximate search
Text Algorithms (4AP) Lecture: Approximate matching Jaak Vilo 2008 fall Jaak Vilo MTAT.03.190 Text Algorithms 1 Exact vs approximate search In exact search we searched for a string or set of strings in
More informationDFA to Regular Expressions
DFA to Regular Expressions Proposition: If L is regular then there is a regular expression r such that L = L(r). Proof Idea: Let M = (Q,Σ, δ, q 1, F) be a DFA recognizing L, with Q = {q 1,q 2,...q n },
More informationEnumeration Schemes for Words Avoiding Permutations
Enumeration Schemes for Words Avoiding Permutations Lara Pudwell November 27, 2007 Abstract The enumeration of permutation classes has been accomplished with a variety of techniques. One wide-reaching
More information{a, b, c} {a, b} {a, c} {b, c} {a}
Section 4.3 Order Relations A binary relation is an partial order if it transitive and antisymmetric. If R is a partial order over the set S, we also say, S is a partially ordered set or S is a poset.
More informationAverage Complexity of Exact and Approximate Multiple String Matching
Average Complexity of Exact and Approximate Multiple String Matching Gonzalo Navarro Department of Computer Science University of Chile gnavarro@dcc.uchile.cl Kimmo Fredriksson Department of Computer Science
More informationA Survey of the Longest Common Subsequence Problem and Its. Related Problems
Survey of the Longest Common Subsequence Problem and Its Related Problems Survey of the Longest Common Subsequence Problem and Its Related Problems Thesis Submitted to the Faculty of Department of Computer
More informationChapter 2: Finite Automata
Chapter 2: Finite Automata Peter Cappello Department of Computer Science University of California, Santa Barbara Santa Barbara, CA 93106 cappello@cs.ucsb.edu Please read the corresponding chapter before
More informationMARKOV CHAINS A finite state Markov chain is a sequence of discrete cv s from a finite alphabet where is a pmf on and for
MARKOV CHAINS A finite state Markov chain is a sequence S 0,S 1,... of discrete cv s from a finite alphabet S where q 0 (s) is a pmf on S 0 and for n 1, Q(s s ) = Pr(S n =s S n 1 =s ) = Pr(S n =s S n 1
More informationSA-REPC - Sequence Alignment with a Regular Expression Path Constraint
SA-REPC - Sequence Alignment with a Regular Expression Path Constraint Nimrod Milo Tamar Pinhas Michal Ziv-Ukelson Ben-Gurion University of the Negev, Be er Sheva, Israel Graduate Seminar, BGU 2010 Milo,
More informationLecture 4 : Adaptive source coding algorithms
Lecture 4 : Adaptive source coding algorithms February 2, 28 Information Theory Outline 1. Motivation ; 2. adaptive Huffman encoding ; 3. Gallager and Knuth s method ; 4. Dictionary methods : Lempel-Ziv
More informationAlgorithm Design CS 515 Fall 2015 Sample Final Exam Solutions
Algorithm Design CS 515 Fall 2015 Sample Final Exam Solutions Copyright c 2015 Andrew Klapper. All rights reserved. 1. For the functions satisfying the following three recurrences, determine which is the
More informationImplementing Approximate Regularities
Implementing Approximate Regularities Manolis Christodoulakis Costas S. Iliopoulos Department of Computer Science King s College London Kunsoo Park School of Computer Science and Engineering, Seoul National
More informationUNIT-II. NONDETERMINISTIC FINITE AUTOMATA WITH ε TRANSITIONS: SIGNIFICANCE. Use of ε-transitions. s t a r t. ε r. e g u l a r
Syllabus R9 Regulation UNIT-II NONDETERMINISTIC FINITE AUTOMATA WITH ε TRANSITIONS: In the automata theory, a nondeterministic finite automaton (NFA) or nondeterministic finite state machine is a finite
More informationChapter 3 Source Coding. 3.1 An Introduction to Source Coding 3.2 Optimal Source Codes 3.3 Shannon-Fano Code 3.4 Huffman Code
Chapter 3 Source Coding 3. An Introduction to Source Coding 3.2 Optimal Source Codes 3.3 Shannon-Fano Code 3.4 Huffman Code 3. An Introduction to Source Coding Entropy (in bits per symbol) implies in average
More informationSource Coding. Master Universitario en Ingeniería de Telecomunicación. I. Santamaría Universidad de Cantabria
Source Coding Master Universitario en Ingeniería de Telecomunicación I. Santamaría Universidad de Cantabria Contents Introduction Asymptotic Equipartition Property Optimal Codes (Huffman Coding) Universal
More informationPattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching Goodrich, Tamassia
Pattern Matching a b a c a a b 1 4 3 2 Pattern Matching 1 Brute-Force Pattern Matching ( 11.2.1) The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift
More informationCS 154, Lecture 3: DFA NFA, Regular Expressions
CS 154, Lecture 3: DFA NFA, Regular Expressions Homework 1 is coming out Deterministic Finite Automata Computation with finite memory Non-Deterministic Finite Automata Computation with finite memory and
More informationAlgorithm Theory. 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore. Christian Schindelhauer
Algorithm Theory 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore Institut für Informatik Wintersemester 2007/08 Text Search Scenarios Static texts Literature databases Library systems Gene databases
More informationSamson Zhou. Pattern Matching over Noisy Data Streams
Samson Zhou Pattern Matching over Noisy Data Streams Finding Structure in Data Pattern Matching Finding all instances of a pattern within a string ABCD ABCAABCDAACAABCDBCABCDADDDEAEABCDA Knuth-Morris-Pratt
More informationSome useful tasks involving language. Finite-State Machines and Regular Languages. More useful tasks involving language. Regular expressions
Some useful tasks involving language Finite-State Machines and Regular Languages Find all phone numbers in a text, e.g., occurrences such as When you call (614) 292-8833, you reach the fax machine. Find
More informationSTATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler
STATC141 Spring 2005 The materials are from Pairise Sequence Alignment by Robert Giegerich and David Wheeler Lecture 6, 02/08/05 The analysis of multiple DNA or protein sequences (I) Sequence similarity
More informationSmall-Space Dictionary Matching (Dissertation Proposal)
Small-Space Dictionary Matching (Dissertation Proposal) Graduate Center of CUNY 1/24/2012 Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d patterns. Text T of length
More informationData Compression. Limit of Information Compression. October, Examples of codes 1
Data Compression Limit of Information Compression Radu Trîmbiţaş October, 202 Outline Contents Eamples of codes 2 Kraft Inequality 4 2. Kraft Inequality............................ 4 2.2 Kraft inequality
More informationNP and Computational Intractability
NP and Computational Intractability 1 Polynomial-Time Reduction Desiderata'. Suppose we could solve X in polynomial-time. What else could we solve in polynomial time? don't confuse with reduces from Reduction.
More informationLongest Common Prefixes
Longest Common Prefixes The standard ordering for strings is the lexicographical order. It is induced by an order over the alphabet. We will use the same symbols (,
More informationA New Dominant Point-Based Parallel Algorithm for Multiple Longest Common Subsequence Problem
A New Dominant Point-Based Parallel Algorithm for Multiple Longest Common Subsequence Problem Dmitry Korkin This work introduces a new parallel algorithm for computing a multiple longest common subsequence
More informationCS:4330 Theory of Computation Spring Regular Languages. Finite Automata and Regular Expressions. Haniel Barbosa
CS:4330 Theory of Computation Spring 2018 Regular Languages Finite Automata and Regular Expressions Haniel Barbosa Readings for this lecture Chapter 1 of [Sipser 1996], 3rd edition. Sections 1.1 and 1.3.
More informationRobustness Analysis of Networked Systems
Robustness Analysis of Networked Systems Roopsha Samanta The University of Texas at Austin Joint work with Jyotirmoy V. Deshmukh and Swarat Chaudhuri January, 0 Roopsha Samanta Robustness Analysis of Networked
More informationMore Dynamic Programming
CS 374: Algorithms & Models of Computation, Spring 2017 More Dynamic Programming Lecture 14 March 9, 2017 Chandra Chekuri (UIUC) CS374 1 Spring 2017 1 / 42 What is the running time of the following? Consider
More informationComputing Longest Common Substrings Using Suffix Arrays
Computing Longest Common Substrings Using Suffix Arrays Maxim A. Babenko, Tatiana A. Starikovskaya Moscow State University Computer Science in Russia, 2008 Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing
More informationFinite Automata - Deterministic Finite Automata. Deterministic Finite Automaton (DFA) (or Finite State Machine)
Finite Automata - Deterministic Finite Automata Deterministic Finite Automaton (DFA) (or Finite State Machine) M = (K, Σ, δ, s, A), where K is a finite set of states Σ is an input alphabet s K is a distinguished
More informationTheory of Computation
Thomas Zeugmann Hokkaido University Laboratory for Algorithmics http://www-alg.ist.hokudai.ac.jp/ thomas/toc/ Lecture 3: Finite State Automata Motivation In the previous lecture we learned how to formalize
More informationCS6901: review of Theory of Computation and Algorithms
CS6901: review of Theory of Computation and Algorithms Any mechanically (automatically) discretely computation of problem solving contains at least three components: - problem description - computational
More informationMore Dynamic Programming
Algorithms & Models of Computation CS/ECE 374, Fall 2017 More Dynamic Programming Lecture 14 Tuesday, October 17, 2017 Sariel Har-Peled (UIUC) CS374 1 Fall 2017 1 / 48 What is the running time of the following?
More informationPermutation distance measures for memetic algorithms with population management
MIC2005: The Sixth Metaheuristics International Conference??-1 Permutation distance measures for memetic algorithms with population management Marc Sevaux Kenneth Sörensen University of Valenciennes, CNRS,
More informationTowards a Theory of Information Flow in the Finitary Process Soup
Towards a Theory of in the Finitary Process Department of Computer Science and Complexity Sciences Center University of California at Davis June 1, 2010 Goals Analyze model of evolutionary self-organization
More informationX-MA2C01-1: Partial Worked Solutions
X-MAC01-1: Partial Worked Solutions David R. Wilkins May 013 1. (a) Let A, B and C be sets. Prove that (A \ (B C)) (B \ C) = (A B) \ C. [Venn Diagrams, by themselves without an accompanying logical argument,
More informationState Complexity of Neighbourhoods and Approximate Pattern Matching
State Complexity of Neighbourhoods and Approximate Pattern Matching Timothy Ng, David Rappaport, and Kai Salomaa School of Computing, Queen s University, Kingston, Ontario K7L 3N6, Canada {ng, daver, ksalomaa}@cs.queensu.ca
More informationTheoretical Computer Science
Theoretical Computer Science Zdeněk Sawa Department of Computer Science, FEI, Technical University of Ostrava 17. listopadu 15, Ostrava-Poruba 708 33 Czech republic September 22, 2017 Z. Sawa (TU Ostrava)
More informationDynamic Programming ACM Seminar in Algorithmics
Dynamic Programming ACM Seminar in Algorithmics Marko Genyk-Berezovskyj berezovs@fel.cvut.cz Tomáš Tunys tunystom@fel.cvut.cz CVUT FEL, K13133 February 27, 2013 Problem: Longest Increasing Subsequence
More informationEfficient High-Similarity String Comparison: The Waterfall Algorithm
Efficient High-Similarity String Comparison: The Waterfall Algorithm Alexander Tiskin Department of Computer Science University of Warwick http://go.warwick.ac.uk/alextiskin Alexander Tiskin (Warwick)
More informationComputing repetitive structures in indeterminate strings
Computing repetitive structures in indeterminate strings Pavlos Antoniou 1, Costas S. Iliopoulos 1, Inuka Jayasekera 1, Wojciech Rytter 2,3 1 Dept. of Computer Science, King s College London, London WC2R
More informationMassachusetts Institute of Technology 6.042J/18.062J, Fall 02: Mathematics for Computer Science Professor Albert Meyer and Dr.
Massachusetts Institute of Technology 6.042J/18.062J, Fall 02: Mathematics for Computer Science Professor Albert Meyer and Dr. Radhika Nagpal Quiz 1 Appendix Appendix Contents 1 Induction 2 2 Relations
More informationINF4130: Dynamic Programming September 2, 2014 DRAFT version
INF4130: Dynamic Programming September 2, 2014 DRAFT version In the textbook: Ch. 9, and Section 20.5 Chapter 9 can also be found at the home page for INF4130 These slides were originally made by Petter
More informationCSE 431/531: Analysis of Algorithms. Dynamic Programming. Lecturer: Shi Li. Department of Computer Science and Engineering University at Buffalo
CSE 431/531: Analysis of Algorithms Dynamic Programming Lecturer: Shi Li Department of Computer Science and Engineering University at Buffalo Paradigms for Designing Algorithms Greedy algorithm Make a
More informationLecture 3: Finite Automata
Administrivia Lecture 3: Finite Automata Everyone should now be registered electronically using the link on our webpage. If you haven t, do so today! I d like to have teams formed by next Monday at the
More informationDeterministic Finite Automata (DFAs)
Algorithms & Models of Computation CS/ECE 374, Spring 29 Deterministic Finite Automata (DFAs) Lecture 3 Tuesday, January 22, 29 L A TEXed: December 27, 28 8:25 Chan, Har-Peled, Hassanieh (UIUC) CS374 Spring
More informationEfficient Polynomial-Time Algorithms for Variants of the Multiple Constrained LCS Problem
Efficient Polynomial-Time Algorithms for Variants of the Multiple Constrained LCS Problem Hsing-Yen Ann National Center for High-Performance Computing Tainan 74147, Taiwan Chang-Biau Yang and Chiou-Ting
More information4.8 Huffman Codes. These lecture slides are supplied by Mathijs de Weerd
4.8 Huffman Codes These lecture slides are supplied by Mathijs de Weerd Data Compression Q. Given a text that uses 32 symbols (26 different letters, space, and some punctuation characters), how can we
More information20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming
20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, 2008 4 Pairwise alignment We will discuss: 1. Strings 2. Dot matrix method for comparing sequences 3. Edit distance 4. Global and local alignment
More informationEntropy Rate of Stochastic Processes
Entropy Rate of Stochastic Processes Timo Mulder tmamulder@gmail.com Jorn Peters jornpeters@gmail.com February 8, 205 The entropy rate of independent and identically distributed events can on average be
More informationIntroduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information.
L65 Dept. of Linguistics, Indiana University Fall 205 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission rate
More informationCSC Discrete Math I, Spring Relations
CSC 125 - Discrete Math I, Spring 2017 Relations Binary Relations Definition: A binary relation R from a set A to a set B is a subset of A B Note that a relation is more general than a function Example:
More informationDept. of Linguistics, Indiana University Fall 2015
L645 Dept. of Linguistics, Indiana University Fall 2015 1 / 28 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission
More informationFINITE MEMORY DETERMINACY
p. 1/? FINITE MEMORY DETERMINACY Alexander Rabinovich Department of Computer Science Tel-Aviv University p. 2/? Plan 1. Finite Memory Strategies. 2. Finite Memory Determinacy of Muller games. 3. Latest
More informationComplexity Theory of Polynomial-Time Problems
Complexity Theory of Polynomial-Time Problems Lecture 1: Introduction, Easy Examples Karl Bringmann and Sebastian Krinninger Audience no formal requirements, but: NP-hardness, satisfiability problem, how
More informationPrimary classes of compositions of numbers
Annales Mathematicae et Informaticae 41 (2013) pp. 193 204 Proceedings of the 15 th International Conference on Fibonacci Numbers and Their Applications Institute of Mathematics and Informatics, Eszterházy
More informationMidterm 2 for CS 170
UC Berkeley CS 170 Midterm 2 Lecturer: Gene Myers November 9 Midterm 2 for CS 170 Print your name:, (last) (first) Sign your name: Write your section number (e.g. 101): Write your sid: One page of notes
More informationOn Pattern Matching With Swaps
On Pattern Matching With Swaps Fouad B. Chedid Dhofar University, Salalah, Oman Notre Dame University - Louaize, Lebanon P.O.Box: 2509, Postal Code 211 Salalah, Oman Tel: +968 23237200 Fax: +968 23237720
More informationString Search. 6th September 2018
String Search 6th September 2018 Search for a given (short) string in a long string Search problems have become more important lately The amount of stored digital information grows steadily (rapidly?)
More informationSolutions to Homework Set #3 Channel and Source coding
Solutions to Homework Set #3 Channel and Source coding. Rates (a) Channels coding Rate: Assuming you are sending 4 different messages using usages of a channel. What is the rate (in bits per channel use)
More informationImage Registration Lecture 2: Vectors and Matrices
Image Registration Lecture 2: Vectors and Matrices Prof. Charlene Tsai Lecture Overview Vectors Matrices Basics Orthogonal matrices Singular Value Decomposition (SVD) 2 1 Preliminary Comments Some of this
More informationDeterministic Finite Automata (DFAs)
Algorithms & Models of Computation CS/ECE 374, Fall 27 Deterministic Finite Automata (DFAs) Lecture 3 Tuesday, September 5, 27 Part I DFA Introduction Sariel Har-Peled (UIUC) CS374 Fall 27 / 36 Sariel
More information