Algorithms for Approximate String Matching

Size: px
Start display at page:

Download "Algorithms for Approximate String Matching"

Transcription

1 Levenshtein Distance Algorithms for Approximate String Matching Part I Levenshtein Distance Hamming Distance Approximate String Matching with k Differences Longest Common Subsequences Part II A Fast and Practical Bit-Vector Algorithm for the Longest Common Subsequences Problem Yoan Pinzón Universidad Nacional de Colombia Facultad de Ingeniería Departamento de Ingeniería de Sistemas e Industrial ypinzon@unal.edu.co August 26 Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 965. If you cannot spell or pronounce Levenshtein, the metric is also called edit distance. The edit distance δ(p, t) between two strings p (pattern) and t (text) (m = p, n = t ) is the minimum number of insertions, deletions and replacements to make p equal to t. [Insertion] insert a new letter a into x. An insertion operation on the string x = vw consists in adding a letter a, converting x into x = vaw. [Deletion] delete a letter a from x. A deletion operation on the string x = vaw consists in removing a letter, converting x into x = vw. [Replacement] replace a letter a in x. A replacement operation on the string x = vaw consists in replacing a letter for another, converting x into x = vbw.

2 Example : p ="approximate matching" t ="appropiate meaning" String t: a p p r o p r i a t e m ǫ e a n i n g String p: a p p r o x i m a t e m a t c h i n g δ(t, p) = 7 Solution: Using Dynamic Programing (DP): We need to compute a matrix D[..m,..n], where D i,j represents the minimum number of operations needed to match p..i to t..j. This is computed as follows: D[i,] =i D[, j]=j D[i, j] =min{d[i, j] +, D[i, j ] +, D[i, j] + δ(p i, t j )} δ(p, t) = D[m, n] Pseudo-code: Example 2: p ="surgery" t ="survey" δ(t, p) = String t: s u r v e ǫ y String p: s u r g e r y procedure ED(p, t) {m = p, n = t } 2 begin 3 for i to m do D[i,] i 4 for j to n do D[, j] j 5 for i to m do 6 for j to n do 7 if p i = t j then D[i, j] D[i, j ] 8 else 9 D[i, j] min(d[i, j ], D[i, j], D[i, j ]) + od od 2 return D[m, n] 3 end Running time: O(nm) 2 3

3 A Graph Reformulation for the Edit Distance problem Example: p="survey", t="surgery" ǫ s u r g e r y ǫ s u r v e y String t: String p: s u r g e r y s u r v e ǫ y The Dynamic Programming matrix D can be seen as a graph where the nodes are the cells and the edges represent the operations. The cost (weight) of the edges corresponds to the cost of the operations. δ(t, p) = shortest-path from node [,] to the node [n,m]. Running time: O(nm log(nm)) Example: p="survey", t="surgery" [,] s u r v e y s u r g e r y [6,7] 4 5

4 Hamming Distance The Hamming distance H is defined only for strings of the same length. For two strings p and t, H(p, t) is the number of places in which the two strings differ, i.e., have different characters. Examples: H("pinzon", "pinion") = H("josh", "jose") = H("here", "hear") = 2 H("kelly", "belly") = H("AAT", "TAA") = 2 H("AGCAA", "ACATA") = 3 H("AGCACACA", "ACACACTA") = 6 Pseudo-code: too easy!! Approximate String Matching with k Differences Problem: The k-differences approximate string matching problem is to find all occurrences of the pattern string p in the text string t with at most k differences (substitution, insertions, deletions). Solution: Using DP D[i,] =i D[, j]= D[i, j] =min{d[i, j] +, D[i, j ] +, D[i, j] + δ(p i, t j )} if D[m, j] k then we say that p occurs at position j of t. Running time: O(n) 6 7

5 Example: Pseudo-code: procedure KDifferences(p, t, k) {m = p, n = t } 2 begin 3 for i to m do D[i,] i 4 for j to n do D[, j] 5 for i to m do 6 for j to n do 7 if p i = t j then D[i, j] D[i, j ] 8 else 9 D[i, j] min(d[i, j ], D[i, j], D[i, j ]) + od od 2 for j to n do 3 if D[m][j] k then Output(j) 4 od 5 end Running time: O(nm) p ="CDDA", t ="CADDACDACDBACBA" k = ǫ C A D D A C D A C D B A C B A ǫ C 2 D D A p occurs in t ending at positions 5, 8 and

6 Longest Common Subsequence Preliminaries For two sequences x = x x m and y y n (n m) we say that x is a subsequence of y and equivalently, y is a supersequence of x, if for some i < < i p, x j = y ij. Given a finite set of sequences, S, a longest common subsequence (LCS) of S is a longest possible sequence s such that each sequence in S is a supersequence of s. Example: y="longest", x="large" LCS(y,x)="lge" String y: l o n g e s t String x: l a r g e Longest Common Subsequence Problem: The Longest Common Subsequence (LCS) of two strings, p and t, is a subsequence of both p and of t of maximum possible length. Solution: Using Dynamic Programing: We need to compute a matrix L[..m,..n], where L i,j represent the LCS for p..i and t..j. This is computed as follows:, if either i = or j = L[i, j] = L[i, j ] +, if p i = t j max{l[i, j], L[i, j ]}, if p i t j Pseudo-code: procedure LCS(p, t) {m = p, n = t } 2 begin 3 for i to m do L[i,] 4 for j to n do L[, j] 5 for i to m do 6 for j to n do 7 if p i = t j then L[i, j] L[i, j ] + 8 else 9 if L[i, j ] > L[i, j] then L[i, j] L[i, j ] else L[i, j] L[i, j] od 2 od 3 return L[m][n] 4 end Running time: O(nm)

7 Example : p ="survey" and t ="surgery" ǫ s u r g e r y ǫ s 2 u r v e y String t: String p: s u r g e r y s u r v e y Example 2: p ="ttgatacatt" t ="gaataagacc" ǫ g a a t a a g a c c ǫ t 2 t 3 g a t a c a t LCS(p, t) ="surey" LLCS(p, t) = L[6,7] = 5 LCS(p, t) =? LLCS(p, t) = L[9,] = 5 2 3

8 Some More Definitions Part II A Fast and Practical Bit-Vector Algorithm for the Longest Common Subsequence Problem The ordered pair of positions i and j of L, denoted [i, j], is a match iff x i = y j. If [i, j] is a match, and an LCS s i,j of x x 2...x i and y y 2...y j has length k, then k is the rank of [i, j]. The match [i, j] is k-dominant if it has rank k and for any other pair [i, j ] of rank k, either i > i and j j or i i and j > j. Computing the k-dominant matches is all that is needed to solve the LCS problem, since the LCS of x and y has length p iff the maximum rank attained by a dominant match is p. A match [i, j] precedes a match [i, j ] if i < i and j < j. 4 5

9 k= k=2 k=3 k=4 k=5 Let r be the total number of matches points, and d be the total number of dominant points (all ranks). Then p d r nm. Let R denote a partial order relation on the set of matches in L. A set of matches such that in any pair one of the matches always precedes the other in R constitutes a chain relative to the partial order relation R. A set of matches such that in any pair neither element of the pair precedes the other in R constitutes an antichain. Sankoff and Sellers (973) observed that the LCS problem translates to finding a longest chain in the poset of matches induced by R. A decomposition of a poset into antichains partitions the poset into the minimum possible number of antichains. "aaagtgacctagcccg" "tccagatg" t = p = t a a a g t g a c c t a g c c c g t c c a g a t g p match chain k-dominat match anti-chain (LCS) k-dominat match & LCS symbol "tccagg" LCS(p,t) = k=

10 "aaagtgacctagcccg" "tccagatg" t a a a g t g a c c t a g c c c g t c c a g a t g A Simple Bit-Vector Algorithm Here we will make use of word-level parallelism in order to compute the matrix L more efficiently. The algorithm is based on the O()-time computation of each column in L by using a bitparallel formula under the assumption that m w, where w is the number of bits in a machine word or O(nm/w)-time for the general case. An interesting property of the LCS allows to represent each column in L by using O()-space. That is, the values in the columns (rows) of L increase by at most one. i.e. L[i, j] = L[i, j] L[i, j] {,} for any (i, j) {..m} {..n}. In other words L will use the relative encoding of the dynamic programming table L. L is defined as NOT L. p t = p =

11 Example: x= ttgatacatt and y= gaataagacc ǫ G A A T A A G A C C ǫ T 2 T 3 G A T A C A T (a) Matrix L G A A T A A G A C C T 2 T 3 G 4 A 5 T 6 A 7 C 8 A 9 T Matrix L G A A T A A G A C C ǫ G A A T A A G A C C ǫ T 2 T 3 G 4 A 5 T 6 A 7 C 8 A 9 T (b) Matrix L T 2 T 3 G 4 A 5 T 6 A 7 C 8 A 9 T L 6 =< > 2 2

12 First we compute the array M of the vectors that result for each possible text character. If both the strings x and y range over the alphabet Σ then M[Σ] is defined as M[α] i = if y i = α else. Example: x= ttgatacatt and y= gaataagacc. A C G T T 2 T 3 G 4 A 5 T 6 A 7 C 8 A 9 T A C G T T 2 T 3 G 4 A 5 T 6 A 7 C 8 A 9 T Basic steps of the algorithm. Computation of M and M 2. Computation of matrix L (L) as follows: { 2 L = m, for j = (L j + (L j AND M(y j ))) OR (L j AND M (y j )), for j {..n} 3. Let LLCS be the number of times a carry took place. (a) Matrix M (a) Matrix M 22 23

13 Illustration of L 4 Computation for x= gaataagacc and y= ttgatacatt. Pseudo-code LLCS(x, y) n = y, m = x, p = begin 2 Preprocessing 3 for i until m do 4 M[α](i) y i = α 5 M [α](i) y i α 6 Initialization 7 L = 2 m 8 TheMainStep 9 for j until n do L j (L j + (L j AND M[y j ])) OR (L j AND M [y j ]) if L j (m + ) = then p++ 2 return p 3 end ǫ G A A T A A G A C C T 2 T 3 G 4 A 5 T 6 A 7 C 8 A 9 T Bit-Carry (a) Matrix L 9 T A C A T A G T T A C G T T 2 T 3 G 4 A 5 T 6 A 7 C 8 A 9 T (b) Matrix M L' 3 M[x =T] 4 L'

14 (4) {}}{ (3) {}}{ () (2) L 4 ( L 3 + {}}{ (L 3 &M T ) ) {}}{ (L 3 &M T ) (4) {}}{ (3) {}}{ () (2) L 4 ( L 3 + {}}{ (L 3 &M T ) ) {}}{ (L 3 &M T ) L 3 & M T () L 3 & M T (2) L 3 + () (3) (3) (2) (4) L 3 & M T () L 3 & M T (2) L 3 + () (3) (3) (2) (4) 26 27

15 Automata for Addition Experimental Results carry states,, 8 6 AllisonDix This,, carry,, carry Time (in micro secs.) , 4, Text Length 28 29

Maximum sum contiguous subsequence Longest common subsequence Matrix chain multiplication All pair shortest path Kna. Dynamic Programming

Maximum sum contiguous subsequence Longest common subsequence Matrix chain multiplication All pair shortest path Kna. Dynamic Programming Dynamic Programming Arijit Bishnu arijit@isical.ac.in Indian Statistical Institute, India. August 31, 2015 Outline 1 Maximum sum contiguous subsequence 2 Longest common subsequence 3 Matrix chain multiplication

More information

Outline. Approximation: Theory and Algorithms. Motivation. Outline. The String Edit Distance. Nikolaus Augsten. Unit 2 March 6, 2009

Outline. Approximation: Theory and Algorithms. Motivation. Outline. The String Edit Distance. Nikolaus Augsten. Unit 2 March 6, 2009 Outline Approximation: Theory and Algorithms The Nikolaus Augsten Free University of Bozen-Bolzano Faculty of Computer Science DIS Unit 2 March 6, 2009 1 Nikolaus Augsten (DIS) Approximation: Theory and

More information

Outline. Similarity Search. Outline. Motivation. The String Edit Distance

Outline. Similarity Search. Outline. Motivation. The String Edit Distance Outline Similarity Search The Nikolaus Augsten nikolaus.augsten@sbg.ac.at Department of Computer Sciences University of Salzburg 1 http://dbresearch.uni-salzburg.at WS 2017/2018 Version March 12, 2018

More information

A Multiobjective Approach to the Weighted Longest Common Subsequence Problem

A Multiobjective Approach to the Weighted Longest Common Subsequence Problem A Multiobjective Approach to the Weighted Longest Common Subsequence Problem David Becerra, Juan Mendivelso, and Yoan Pinzón Universidad Nacional de Colombia Facultad de Ingeniería Department of Computer

More information

Approximation: Theory and Algorithms

Approximation: Theory and Algorithms Approximation: Theory and Algorithms The String Edit Distance Nikolaus Augsten Free University of Bozen-Bolzano Faculty of Computer Science DIS Unit 2 March 6, 2009 Nikolaus Augsten (DIS) Approximation:

More information

Similarity Search. The String Edit Distance. Nikolaus Augsten. Free University of Bozen-Bolzano Faculty of Computer Science DIS. Unit 2 March 8, 2012

Similarity Search. The String Edit Distance. Nikolaus Augsten. Free University of Bozen-Bolzano Faculty of Computer Science DIS. Unit 2 March 8, 2012 Similarity Search The String Edit Distance Nikolaus Augsten Free University of Bozen-Bolzano Faculty of Computer Science DIS Unit 2 March 8, 2012 Nikolaus Augsten (DIS) Similarity Search Unit 2 March 8,

More information

2. Exact String Matching

2. Exact String Matching 2. Exact String Matching Let T = T [0..n) be the text and P = P [0..m) the pattern. We say that P occurs in T at position j if T [j..j + m) = P. Example: P = aine occurs at position 6 in T = karjalainen.

More information

Similarity Search. The String Edit Distance. Nikolaus Augsten.

Similarity Search. The String Edit Distance. Nikolaus Augsten. Similarity Search The String Edit Distance Nikolaus Augsten nikolaus.augsten@sbg.ac.at Dept. of Computer Sciences University of Salzburg http://dbresearch.uni-salzburg.at Version October 18, 2016 Wintersemester

More information

Efficient (δ, γ)-pattern-matching with Don t Cares

Efficient (δ, γ)-pattern-matching with Don t Cares fficient (δ, γ)-pattern-matching with Don t Cares Yoan José Pinzón Ardila Costas S. Iliopoulos Manolis Christodoulakis Manal Mohamed King s College London, Department of Computer Science, London WC2R 2LS,

More information

Data Structures in Java

Data Structures in Java Data Structures in Java Lecture 20: Algorithm Design Techniques 12/2/2015 Daniel Bauer 1 Algorithms and Problem Solving Purpose of algorithms: find solutions to problems. Data Structures provide ways of

More information

Including Interval Encoding into Edit Distance Based Music Comparison and Retrieval

Including Interval Encoding into Edit Distance Based Music Comparison and Retrieval Including Interval Encoding into Edit Distance Based Music Comparison and Retrieval Kjell Lemström; Esko Ukkonen Department of Computer Science; P.O.Box 6, FIN-00014 University of Helsinki, Finland {klemstro,ukkonen}@cs.helsinki.fi

More information

Two Algorithms for LCS Consecutive Suffix Alignment

Two Algorithms for LCS Consecutive Suffix Alignment Two Algorithms for LCS Consecutive Suffix Alignment Gad M. Landau Eugene Myers Michal Ziv-Ukelson Abstract The problem of aligning two sequences A and to determine their similarity is one of the fundamental

More information

Information Processing Letters. A fast and simple algorithm for computing the longest common subsequence of run-length encoded strings

Information Processing Letters. A fast and simple algorithm for computing the longest common subsequence of run-length encoded strings Information Processing Letters 108 (2008) 360 364 Contents lists available at ScienceDirect Information Processing Letters www.vier.com/locate/ipl A fast and simple algorithm for computing the longest

More information

3515ICT: Theory of Computation. Regular languages

3515ICT: Theory of Computation. Regular languages 3515ICT: Theory of Computation Regular languages Notation and concepts concerning alphabets, strings and languages, and identification of languages with problems (H, 1.5). Regular expressions (H, 3.1,

More information

Chapter 5. Finite Automata

Chapter 5. Finite Automata Chapter 5 Finite Automata 5.1 Finite State Automata Capable of recognizing numerous symbol patterns, the class of regular languages Suitable for pattern-recognition type applications, such as the lexical

More information

Efficient Longest Common Subsequence Computation using Bulk-Synchronous Parallelism

Efficient Longest Common Subsequence Computation using Bulk-Synchronous Parallelism Efficient Longest Common Subsequence Computation using Bulk-Synchronous Parallelism Peter Krusche Department of Computer Science University of Warwick June 2006 Outline 1 Introduction Motivation The BSP

More information

Definition: A binary relation R from a set A to a set B is a subset R A B. Example:

Definition: A binary relation R from a set A to a set B is a subset R A B. Example: Chapter 9 1 Binary Relations Definition: A binary relation R from a set A to a set B is a subset R A B. Example: Let A = {0,1,2} and B = {a,b} {(0, a), (0, b), (1,a), (2, b)} is a relation from A to B.

More information

Finding all covers of an indeterminate string in O(n) time on average

Finding all covers of an indeterminate string in O(n) time on average Finding all covers of an indeterminate string in O(n) time on average Md. Faizul Bari, M. Sohel Rahman, and Rifat Shahriyar Department of Computer Science and Engineering Bangladesh University of Engineering

More information

Section Summary. Relations and Functions Properties of Relations. Combining Relations

Section Summary. Relations and Functions Properties of Relations. Combining Relations Chapter 9 Chapter Summary Relations and Their Properties n-ary Relations and Their Applications (not currently included in overheads) Representing Relations Closures of Relations (not currently included

More information

7 Distances. 7.1 Metrics. 7.2 Distances L p Distances

7 Distances. 7.1 Metrics. 7.2 Distances L p Distances 7 Distances We have mainly been focusing on similarities so far, since it is easiest to explain locality sensitive hashing that way, and in particular the Jaccard similarity is easy to define in regards

More information

Text and multimedia languages and properties

Text and multimedia languages and properties Text and multimedia languages and properties (Modern IR, Ch. 6) I VP R 1 Introduction Metadata Text Markup Languages Multimedia Trends and Research Issues Bibliographical Discussion I VP R 2 Introduction

More information

PATTERN MATCHING WITH SWAPS IN PRACTICE

PATTERN MATCHING WITH SWAPS IN PRACTICE International Journal of Foundations of Computer Science c World Scientific Publishing Company PATTERN MATCHING WITH SWAPS IN PRACTICE MATTEO CAMPANELLI Università di Catania, Scuola Superiore di Catania

More information

Define M to be a binary n by m matrix such that:

Define M to be a binary n by m matrix such that: The Shift-And Method Define M to be a binary n by m matrix such that: M(i,j) = iff the first i characters of P exactly match the i characters of T ending at character j. M(i,j) = iff P[.. i] T[j-i+.. j]

More information

The Longest Common Subsequence Problem

The Longest Common Subsequence Problem Advanced Studies in Biology, Vol. 4, 2012, no. 8, 373-380 The Longest Common Subsequence Problem Anna Gorbenko Department of Intelligent Systems and Robotics Ural Federal University 620083 Ekaterinburg,

More information

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007.

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007. Proofs, Strings, and Finite Automata CS154 Chris Pollett Feb 5, 2007. Outline Proofs and Proof Strategies Strings Finding proofs Example: For every graph G, the sum of the degrees of all the nodes in G

More information

Module 9: Tries and String Matching

Module 9: Tries and String Matching Module 9: Tries and String Matching CS 240 - Data Structures and Data Management Sajed Haque Veronika Irvine Taylor Smith Based on lecture notes by many previous cs240 instructors David R. Cheriton School

More information

An Algorithm for Computing the Invariant Distance from Word Position

An Algorithm for Computing the Invariant Distance from Word Position An Algorithm for Computing the Invariant Distance from Word Position Sergio Luán-Mora 1 Departamento de Lenguaes y Sistemas Informáticos, Universidad de Alicante, Campus de San Vicente del Raspeig, Ap.

More information

An Adaptive Finite-State Automata Application to the problem of Reducing the Number of States in Approximate String Matching

An Adaptive Finite-State Automata Application to the problem of Reducing the Number of States in Approximate String Matching An Adaptive Finite-State Automata Application to the problem of Reducing the Number of States in Approximate String Matching Ricardo Luis de Azevedo da Rocha 1, João José Neto 1 1 Laboratório de Linguagens

More information

Deterministic Finite Automata (DFAs)

Deterministic Finite Automata (DFAs) CS/ECE 374: Algorithms & Models of Computation, Fall 28 Deterministic Finite Automata (DFAs) Lecture 3 September 4, 28 Chandra Chekuri (UIUC) CS/ECE 374 Fall 28 / 33 Part I DFA Introduction Chandra Chekuri

More information

On Antichains in Product Posets

On Antichains in Product Posets On Antichains in Product Posets Sergei L. Bezrukov Department of Math and Computer Science University of Wisconsin - Superior Superior, WI 54880, USA sb@mcs.uwsuper.edu Ian T. Roberts School of Engineering

More information

Deterministic Finite Automata (DFAs)

Deterministic Finite Automata (DFAs) Algorithms & Models of Computation CS/ECE 374, Fall 27 Deterministic Finite Automata (DFAs) Lecture 3 Tuesday, September 5, 27 Sariel Har-Peled (UIUC) CS374 Fall 27 / 36 Part I DFA Introduction Sariel

More information

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching 1

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching 1 Pattern Matching a b a c a a b 1 4 3 2 Pattern Matching 1 Outline and Reading Strings ( 9.1.1) Pattern matching algorithms Brute-force algorithm ( 9.1.2) Boyer-Moore algorithm ( 9.1.3) Knuth-Morris-Pratt

More information

6 Distances. 6.1 Metrics. 6.2 Distances L p Distances

6 Distances. 6.1 Metrics. 6.2 Distances L p Distances 6 Distances We have mainly been focusing on similarities so far, since it is easiest to explain locality sensitive hashing that way, and in particular the Jaccard similarity is easy to define in regards

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 03: Edit distance and sequence alignment Slides adapted from Dr. Shaojie Zhang (University of Central Florida) KUMC visit How many of you would like to attend

More information

1 Introduction to information theory

1 Introduction to information theory 1 Introduction to information theory 1.1 Introduction In this chapter we present some of the basic concepts of information theory. The situations we have in mind involve the exchange of information through

More information

Algorithm Design and Analysis

Algorithm Design and Analysis Algorithm Design and Analysis LECTURE 8 Greedy Algorithms V Huffman Codes Adam Smith Review Questions Let G be a connected undirected graph with distinct edge weights. Answer true or false: Let e be the

More information

Sequence Comparison. mouse human

Sequence Comparison. mouse human Sequence Comparison Sequence Comparison mouse human Why Compare Sequences? The first fact of biological sequence analysis In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity

More information

Exact vs approximate search

Exact vs approximate search Text Algorithms (4AP) Lecture: Approximate matching Jaak Vilo 2008 fall Jaak Vilo MTAT.03.190 Text Algorithms 1 Exact vs approximate search In exact search we searched for a string or set of strings in

More information

DFA to Regular Expressions

DFA to Regular Expressions DFA to Regular Expressions Proposition: If L is regular then there is a regular expression r such that L = L(r). Proof Idea: Let M = (Q,Σ, δ, q 1, F) be a DFA recognizing L, with Q = {q 1,q 2,...q n },

More information

Enumeration Schemes for Words Avoiding Permutations

Enumeration Schemes for Words Avoiding Permutations Enumeration Schemes for Words Avoiding Permutations Lara Pudwell November 27, 2007 Abstract The enumeration of permutation classes has been accomplished with a variety of techniques. One wide-reaching

More information

{a, b, c} {a, b} {a, c} {b, c} {a}

{a, b, c} {a, b} {a, c} {b, c} {a} Section 4.3 Order Relations A binary relation is an partial order if it transitive and antisymmetric. If R is a partial order over the set S, we also say, S is a partially ordered set or S is a poset.

More information

Average Complexity of Exact and Approximate Multiple String Matching

Average Complexity of Exact and Approximate Multiple String Matching Average Complexity of Exact and Approximate Multiple String Matching Gonzalo Navarro Department of Computer Science University of Chile gnavarro@dcc.uchile.cl Kimmo Fredriksson Department of Computer Science

More information

A Survey of the Longest Common Subsequence Problem and Its. Related Problems

A Survey of the Longest Common Subsequence Problem and Its. Related Problems Survey of the Longest Common Subsequence Problem and Its Related Problems Survey of the Longest Common Subsequence Problem and Its Related Problems Thesis Submitted to the Faculty of Department of Computer

More information

Chapter 2: Finite Automata

Chapter 2: Finite Automata Chapter 2: Finite Automata Peter Cappello Department of Computer Science University of California, Santa Barbara Santa Barbara, CA 93106 cappello@cs.ucsb.edu Please read the corresponding chapter before

More information

MARKOV CHAINS A finite state Markov chain is a sequence of discrete cv s from a finite alphabet where is a pmf on and for

MARKOV CHAINS A finite state Markov chain is a sequence of discrete cv s from a finite alphabet where is a pmf on and for MARKOV CHAINS A finite state Markov chain is a sequence S 0,S 1,... of discrete cv s from a finite alphabet S where q 0 (s) is a pmf on S 0 and for n 1, Q(s s ) = Pr(S n =s S n 1 =s ) = Pr(S n =s S n 1

More information

SA-REPC - Sequence Alignment with a Regular Expression Path Constraint

SA-REPC - Sequence Alignment with a Regular Expression Path Constraint SA-REPC - Sequence Alignment with a Regular Expression Path Constraint Nimrod Milo Tamar Pinhas Michal Ziv-Ukelson Ben-Gurion University of the Negev, Be er Sheva, Israel Graduate Seminar, BGU 2010 Milo,

More information

Lecture 4 : Adaptive source coding algorithms

Lecture 4 : Adaptive source coding algorithms Lecture 4 : Adaptive source coding algorithms February 2, 28 Information Theory Outline 1. Motivation ; 2. adaptive Huffman encoding ; 3. Gallager and Knuth s method ; 4. Dictionary methods : Lempel-Ziv

More information

Algorithm Design CS 515 Fall 2015 Sample Final Exam Solutions

Algorithm Design CS 515 Fall 2015 Sample Final Exam Solutions Algorithm Design CS 515 Fall 2015 Sample Final Exam Solutions Copyright c 2015 Andrew Klapper. All rights reserved. 1. For the functions satisfying the following three recurrences, determine which is the

More information

Implementing Approximate Regularities

Implementing Approximate Regularities Implementing Approximate Regularities Manolis Christodoulakis Costas S. Iliopoulos Department of Computer Science King s College London Kunsoo Park School of Computer Science and Engineering, Seoul National

More information

UNIT-II. NONDETERMINISTIC FINITE AUTOMATA WITH ε TRANSITIONS: SIGNIFICANCE. Use of ε-transitions. s t a r t. ε r. e g u l a r

UNIT-II. NONDETERMINISTIC FINITE AUTOMATA WITH ε TRANSITIONS: SIGNIFICANCE. Use of ε-transitions. s t a r t. ε r. e g u l a r Syllabus R9 Regulation UNIT-II NONDETERMINISTIC FINITE AUTOMATA WITH ε TRANSITIONS: In the automata theory, a nondeterministic finite automaton (NFA) or nondeterministic finite state machine is a finite

More information

Chapter 3 Source Coding. 3.1 An Introduction to Source Coding 3.2 Optimal Source Codes 3.3 Shannon-Fano Code 3.4 Huffman Code

Chapter 3 Source Coding. 3.1 An Introduction to Source Coding 3.2 Optimal Source Codes 3.3 Shannon-Fano Code 3.4 Huffman Code Chapter 3 Source Coding 3. An Introduction to Source Coding 3.2 Optimal Source Codes 3.3 Shannon-Fano Code 3.4 Huffman Code 3. An Introduction to Source Coding Entropy (in bits per symbol) implies in average

More information

Source Coding. Master Universitario en Ingeniería de Telecomunicación. I. Santamaría Universidad de Cantabria

Source Coding. Master Universitario en Ingeniería de Telecomunicación. I. Santamaría Universidad de Cantabria Source Coding Master Universitario en Ingeniería de Telecomunicación I. Santamaría Universidad de Cantabria Contents Introduction Asymptotic Equipartition Property Optimal Codes (Huffman Coding) Universal

More information

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching Goodrich, Tamassia

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching Goodrich, Tamassia Pattern Matching a b a c a a b 1 4 3 2 Pattern Matching 1 Brute-Force Pattern Matching ( 11.2.1) The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift

More information

CS 154, Lecture 3: DFA NFA, Regular Expressions

CS 154, Lecture 3: DFA NFA, Regular Expressions CS 154, Lecture 3: DFA NFA, Regular Expressions Homework 1 is coming out Deterministic Finite Automata Computation with finite memory Non-Deterministic Finite Automata Computation with finite memory and

More information

Algorithm Theory. 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore. Christian Schindelhauer

Algorithm Theory. 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore. Christian Schindelhauer Algorithm Theory 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore Institut für Informatik Wintersemester 2007/08 Text Search Scenarios Static texts Literature databases Library systems Gene databases

More information

Samson Zhou. Pattern Matching over Noisy Data Streams

Samson Zhou. Pattern Matching over Noisy Data Streams Samson Zhou Pattern Matching over Noisy Data Streams Finding Structure in Data Pattern Matching Finding all instances of a pattern within a string ABCD ABCAABCDAACAABCDBCABCDADDDEAEABCDA Knuth-Morris-Pratt

More information

Some useful tasks involving language. Finite-State Machines and Regular Languages. More useful tasks involving language. Regular expressions

Some useful tasks involving language. Finite-State Machines and Regular Languages. More useful tasks involving language. Regular expressions Some useful tasks involving language Finite-State Machines and Regular Languages Find all phone numbers in a text, e.g., occurrences such as When you call (614) 292-8833, you reach the fax machine. Find

More information

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler STATC141 Spring 2005 The materials are from Pairise Sequence Alignment by Robert Giegerich and David Wheeler Lecture 6, 02/08/05 The analysis of multiple DNA or protein sequences (I) Sequence similarity

More information

Small-Space Dictionary Matching (Dissertation Proposal)

Small-Space Dictionary Matching (Dissertation Proposal) Small-Space Dictionary Matching (Dissertation Proposal) Graduate Center of CUNY 1/24/2012 Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d patterns. Text T of length

More information

Data Compression. Limit of Information Compression. October, Examples of codes 1

Data Compression. Limit of Information Compression. October, Examples of codes 1 Data Compression Limit of Information Compression Radu Trîmbiţaş October, 202 Outline Contents Eamples of codes 2 Kraft Inequality 4 2. Kraft Inequality............................ 4 2.2 Kraft inequality

More information

NP and Computational Intractability

NP and Computational Intractability NP and Computational Intractability 1 Polynomial-Time Reduction Desiderata'. Suppose we could solve X in polynomial-time. What else could we solve in polynomial time? don't confuse with reduces from Reduction.

More information

Longest Common Prefixes

Longest Common Prefixes Longest Common Prefixes The standard ordering for strings is the lexicographical order. It is induced by an order over the alphabet. We will use the same symbols (,

More information

A New Dominant Point-Based Parallel Algorithm for Multiple Longest Common Subsequence Problem

A New Dominant Point-Based Parallel Algorithm for Multiple Longest Common Subsequence Problem A New Dominant Point-Based Parallel Algorithm for Multiple Longest Common Subsequence Problem Dmitry Korkin This work introduces a new parallel algorithm for computing a multiple longest common subsequence

More information

CS:4330 Theory of Computation Spring Regular Languages. Finite Automata and Regular Expressions. Haniel Barbosa

CS:4330 Theory of Computation Spring Regular Languages. Finite Automata and Regular Expressions. Haniel Barbosa CS:4330 Theory of Computation Spring 2018 Regular Languages Finite Automata and Regular Expressions Haniel Barbosa Readings for this lecture Chapter 1 of [Sipser 1996], 3rd edition. Sections 1.1 and 1.3.

More information

Robustness Analysis of Networked Systems

Robustness Analysis of Networked Systems Robustness Analysis of Networked Systems Roopsha Samanta The University of Texas at Austin Joint work with Jyotirmoy V. Deshmukh and Swarat Chaudhuri January, 0 Roopsha Samanta Robustness Analysis of Networked

More information

More Dynamic Programming

More Dynamic Programming CS 374: Algorithms & Models of Computation, Spring 2017 More Dynamic Programming Lecture 14 March 9, 2017 Chandra Chekuri (UIUC) CS374 1 Spring 2017 1 / 42 What is the running time of the following? Consider

More information

Computing Longest Common Substrings Using Suffix Arrays

Computing Longest Common Substrings Using Suffix Arrays Computing Longest Common Substrings Using Suffix Arrays Maxim A. Babenko, Tatiana A. Starikovskaya Moscow State University Computer Science in Russia, 2008 Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing

More information

Finite Automata - Deterministic Finite Automata. Deterministic Finite Automaton (DFA) (or Finite State Machine)

Finite Automata - Deterministic Finite Automata. Deterministic Finite Automaton (DFA) (or Finite State Machine) Finite Automata - Deterministic Finite Automata Deterministic Finite Automaton (DFA) (or Finite State Machine) M = (K, Σ, δ, s, A), where K is a finite set of states Σ is an input alphabet s K is a distinguished

More information

Theory of Computation

Theory of Computation Thomas Zeugmann Hokkaido University Laboratory for Algorithmics http://www-alg.ist.hokudai.ac.jp/ thomas/toc/ Lecture 3: Finite State Automata Motivation In the previous lecture we learned how to formalize

More information

CS6901: review of Theory of Computation and Algorithms

CS6901: review of Theory of Computation and Algorithms CS6901: review of Theory of Computation and Algorithms Any mechanically (automatically) discretely computation of problem solving contains at least three components: - problem description - computational

More information

More Dynamic Programming

More Dynamic Programming Algorithms & Models of Computation CS/ECE 374, Fall 2017 More Dynamic Programming Lecture 14 Tuesday, October 17, 2017 Sariel Har-Peled (UIUC) CS374 1 Fall 2017 1 / 48 What is the running time of the following?

More information

Permutation distance measures for memetic algorithms with population management

Permutation distance measures for memetic algorithms with population management MIC2005: The Sixth Metaheuristics International Conference??-1 Permutation distance measures for memetic algorithms with population management Marc Sevaux Kenneth Sörensen University of Valenciennes, CNRS,

More information

Towards a Theory of Information Flow in the Finitary Process Soup

Towards a Theory of Information Flow in the Finitary Process Soup Towards a Theory of in the Finitary Process Department of Computer Science and Complexity Sciences Center University of California at Davis June 1, 2010 Goals Analyze model of evolutionary self-organization

More information

X-MA2C01-1: Partial Worked Solutions

X-MA2C01-1: Partial Worked Solutions X-MAC01-1: Partial Worked Solutions David R. Wilkins May 013 1. (a) Let A, B and C be sets. Prove that (A \ (B C)) (B \ C) = (A B) \ C. [Venn Diagrams, by themselves without an accompanying logical argument,

More information

State Complexity of Neighbourhoods and Approximate Pattern Matching

State Complexity of Neighbourhoods and Approximate Pattern Matching State Complexity of Neighbourhoods and Approximate Pattern Matching Timothy Ng, David Rappaport, and Kai Salomaa School of Computing, Queen s University, Kingston, Ontario K7L 3N6, Canada {ng, daver, ksalomaa}@cs.queensu.ca

More information

Theoretical Computer Science

Theoretical Computer Science Theoretical Computer Science Zdeněk Sawa Department of Computer Science, FEI, Technical University of Ostrava 17. listopadu 15, Ostrava-Poruba 708 33 Czech republic September 22, 2017 Z. Sawa (TU Ostrava)

More information

Dynamic Programming ACM Seminar in Algorithmics

Dynamic Programming ACM Seminar in Algorithmics Dynamic Programming ACM Seminar in Algorithmics Marko Genyk-Berezovskyj berezovs@fel.cvut.cz Tomáš Tunys tunystom@fel.cvut.cz CVUT FEL, K13133 February 27, 2013 Problem: Longest Increasing Subsequence

More information

Efficient High-Similarity String Comparison: The Waterfall Algorithm

Efficient High-Similarity String Comparison: The Waterfall Algorithm Efficient High-Similarity String Comparison: The Waterfall Algorithm Alexander Tiskin Department of Computer Science University of Warwick http://go.warwick.ac.uk/alextiskin Alexander Tiskin (Warwick)

More information

Computing repetitive structures in indeterminate strings

Computing repetitive structures in indeterminate strings Computing repetitive structures in indeterminate strings Pavlos Antoniou 1, Costas S. Iliopoulos 1, Inuka Jayasekera 1, Wojciech Rytter 2,3 1 Dept. of Computer Science, King s College London, London WC2R

More information

Massachusetts Institute of Technology 6.042J/18.062J, Fall 02: Mathematics for Computer Science Professor Albert Meyer and Dr.

Massachusetts Institute of Technology 6.042J/18.062J, Fall 02: Mathematics for Computer Science Professor Albert Meyer and Dr. Massachusetts Institute of Technology 6.042J/18.062J, Fall 02: Mathematics for Computer Science Professor Albert Meyer and Dr. Radhika Nagpal Quiz 1 Appendix Appendix Contents 1 Induction 2 2 Relations

More information

INF4130: Dynamic Programming September 2, 2014 DRAFT version

INF4130: Dynamic Programming September 2, 2014 DRAFT version INF4130: Dynamic Programming September 2, 2014 DRAFT version In the textbook: Ch. 9, and Section 20.5 Chapter 9 can also be found at the home page for INF4130 These slides were originally made by Petter

More information

CSE 431/531: Analysis of Algorithms. Dynamic Programming. Lecturer: Shi Li. Department of Computer Science and Engineering University at Buffalo

CSE 431/531: Analysis of Algorithms. Dynamic Programming. Lecturer: Shi Li. Department of Computer Science and Engineering University at Buffalo CSE 431/531: Analysis of Algorithms Dynamic Programming Lecturer: Shi Li Department of Computer Science and Engineering University at Buffalo Paradigms for Designing Algorithms Greedy algorithm Make a

More information

Lecture 3: Finite Automata

Lecture 3: Finite Automata Administrivia Lecture 3: Finite Automata Everyone should now be registered electronically using the link on our webpage. If you haven t, do so today! I d like to have teams formed by next Monday at the

More information

Deterministic Finite Automata (DFAs)

Deterministic Finite Automata (DFAs) Algorithms & Models of Computation CS/ECE 374, Spring 29 Deterministic Finite Automata (DFAs) Lecture 3 Tuesday, January 22, 29 L A TEXed: December 27, 28 8:25 Chan, Har-Peled, Hassanieh (UIUC) CS374 Spring

More information

Efficient Polynomial-Time Algorithms for Variants of the Multiple Constrained LCS Problem

Efficient Polynomial-Time Algorithms for Variants of the Multiple Constrained LCS Problem Efficient Polynomial-Time Algorithms for Variants of the Multiple Constrained LCS Problem Hsing-Yen Ann National Center for High-Performance Computing Tainan 74147, Taiwan Chang-Biau Yang and Chiou-Ting

More information

4.8 Huffman Codes. These lecture slides are supplied by Mathijs de Weerd

4.8 Huffman Codes. These lecture slides are supplied by Mathijs de Weerd 4.8 Huffman Codes These lecture slides are supplied by Mathijs de Weerd Data Compression Q. Given a text that uses 32 symbols (26 different letters, space, and some punctuation characters), how can we

More information

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming 20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, 2008 4 Pairwise alignment We will discuss: 1. Strings 2. Dot matrix method for comparing sequences 3. Edit distance 4. Global and local alignment

More information

Entropy Rate of Stochastic Processes

Entropy Rate of Stochastic Processes Entropy Rate of Stochastic Processes Timo Mulder tmamulder@gmail.com Jorn Peters jornpeters@gmail.com February 8, 205 The entropy rate of independent and identically distributed events can on average be

More information

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information.

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information. L65 Dept. of Linguistics, Indiana University Fall 205 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission rate

More information

CSC Discrete Math I, Spring Relations

CSC Discrete Math I, Spring Relations CSC 125 - Discrete Math I, Spring 2017 Relations Binary Relations Definition: A binary relation R from a set A to a set B is a subset of A B Note that a relation is more general than a function Example:

More information

Dept. of Linguistics, Indiana University Fall 2015

Dept. of Linguistics, Indiana University Fall 2015 L645 Dept. of Linguistics, Indiana University Fall 2015 1 / 28 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission

More information

FINITE MEMORY DETERMINACY

FINITE MEMORY DETERMINACY p. 1/? FINITE MEMORY DETERMINACY Alexander Rabinovich Department of Computer Science Tel-Aviv University p. 2/? Plan 1. Finite Memory Strategies. 2. Finite Memory Determinacy of Muller games. 3. Latest

More information

Complexity Theory of Polynomial-Time Problems

Complexity Theory of Polynomial-Time Problems Complexity Theory of Polynomial-Time Problems Lecture 1: Introduction, Easy Examples Karl Bringmann and Sebastian Krinninger Audience no formal requirements, but: NP-hardness, satisfiability problem, how

More information

Primary classes of compositions of numbers

Primary classes of compositions of numbers Annales Mathematicae et Informaticae 41 (2013) pp. 193 204 Proceedings of the 15 th International Conference on Fibonacci Numbers and Their Applications Institute of Mathematics and Informatics, Eszterházy

More information

Midterm 2 for CS 170

Midterm 2 for CS 170 UC Berkeley CS 170 Midterm 2 Lecturer: Gene Myers November 9 Midterm 2 for CS 170 Print your name:, (last) (first) Sign your name: Write your section number (e.g. 101): Write your sid: One page of notes

More information

On Pattern Matching With Swaps

On Pattern Matching With Swaps On Pattern Matching With Swaps Fouad B. Chedid Dhofar University, Salalah, Oman Notre Dame University - Louaize, Lebanon P.O.Box: 2509, Postal Code 211 Salalah, Oman Tel: +968 23237200 Fax: +968 23237720

More information

String Search. 6th September 2018

String Search. 6th September 2018 String Search 6th September 2018 Search for a given (short) string in a long string Search problems have become more important lately The amount of stored digital information grows steadily (rapidly?)

More information

Solutions to Homework Set #3 Channel and Source coding

Solutions to Homework Set #3 Channel and Source coding Solutions to Homework Set #3 Channel and Source coding. Rates (a) Channels coding Rate: Assuming you are sending 4 different messages using usages of a channel. What is the rate (in bits per channel use)

More information

Image Registration Lecture 2: Vectors and Matrices

Image Registration Lecture 2: Vectors and Matrices Image Registration Lecture 2: Vectors and Matrices Prof. Charlene Tsai Lecture Overview Vectors Matrices Basics Orthogonal matrices Singular Value Decomposition (SVD) 2 1 Preliminary Comments Some of this

More information

Deterministic Finite Automata (DFAs)

Deterministic Finite Automata (DFAs) Algorithms & Models of Computation CS/ECE 374, Fall 27 Deterministic Finite Automata (DFAs) Lecture 3 Tuesday, September 5, 27 Part I DFA Introduction Sariel Har-Peled (UIUC) CS374 Fall 27 / 36 Sariel

More information