An Algorithm for Computing the Invariant Distance from Word Position

Sergio Luján-Mora 1

Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante,
Campus de San Vicente del Raspeig, Ap. Correos 99, E-03080 Alicante, Spain
{slujan, mpalomar}@dlsi.ua.es

Abstract. There are many problems involving string matching. String matching is based on a number of similarity or distance measures, and many of them are special cases or generalisations of the Levenshtein distance. In this paper, we focus on the problem of evaluating an invariant distance from word position between two strings: e.g., the strings "Universidad de Alicante" and "Alicante, Universidad de" refer to the same concept, so the distance ought to be low in order to consider them very similar strings.

Keywords: String Matching, String Similarity, Approximate String Matching, Edit Distance, Pattern Matching

1 Phone: (+34) 965 90 34 00 ext. 2514; Fax: (+34) 965 90 93 26

1 Introduction

The problem of determining the differences between two sequences of symbols has been studied extensively [1, 3, 4, 5, 6, 10, 12]. Algorithms for this problem have numerous applications, including approximate string matching, spelling error detection and correction systems, phonetic string matching, file comparison tools, and the study of genetic evolution [7, 9, 13, 14].

String matching is based on a number of similarity or distance measures, and many of them are special cases or generalisations of the Levenshtein distance [8]. This distance is not useful when we want to compute the similarity between two strings of words: although the edit distance is robust to spelling variants, it can be completely useless when permutations of words occur. Consider the seven strings s1, ..., s7 in Table 1. They belong to three classes, {s1, s2, s3, s4}, {s5, s6} and {s7}. Each of these classes represents a separate institution, and we would like the strings within a class to be considered very similar.

In Section 2, we present the Levenshtein distance and show how this distance is useless when a permuted word order exists. In Section 3 we outline the invariant distance from word position as a mechanism for overcoming the problem.

Table 1. Strings

      String                                   Length
  s1  Universidad de Alicante                    23
  s2  Universitat d'Alacant                      21
  s3  University of Alicante                     22
  s4  Alicante University                        19
  s5  Ciencias, Universidad de Valencia          33
  s6  Universitat de València, Ciències          33
  s7  Universidad Politécnica de Valencia        35

2 String Similarity

The similarity between any two strings can be evaluated by the edit distance or Levenshtein distance (LD) [8]. This distance has traditionally been used in approximate string searching and in spelling error detection and correction. The LD of two strings x and y is defined as the minimal number of simple editing operations required to transform x into y. The simple editing operations considered are the insertion of a character, the deletion of a character, and the substitution of one character by another, with costs δ(λ, σ), δ(σ, λ) and δ(σ1, σ2) that are functions of the character(s) involved. Extended Levenshtein distances also consider the transposition of two adjacent characters. In the examples of this paper, we have taken a unitary cost function for all the operations and all the characters.

The LD of two strings of lengths m and n, respectively, can be calculated by a dynamic programming algorithm. The algorithm requires Θ(mn) time and space, although refinements of the algorithm require only Θ(ND) time and space [11], where N is the sum of m and n and D is the distance between the strings. The Levenshtein distance between x[1:m] and y[1:n], LD(m, n), can be calculated by the following recurrence relation [5]:

LD(i, j) = \begin{cases}
  0 & \text{if } i = 0 \text{ and } j = 0 \\
  LD(0, j-1) + \delta(\lambda, y_j) & \text{if } i = 0 \text{ and } j > 0 \\
  LD(i-1, 0) + \delta(x_i, \lambda) & \text{if } i > 0 \text{ and } j = 0 \\
  LD(i-1, j-1) & \text{if } x_i = y_j \\
  \min\{ LD(i, j-1) + \delta(\lambda, y_j),\; LD(i-1, j) + \delta(x_i, \lambda),\; LD(i-1, j-1) + \delta(x_i, y_j) \} & \text{if } x_i \neq y_j
\end{cases} \qquad (1)
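For reference, recurrence (1) can be implemented directly. The following short Python sketch is ours (the function name levenshtein and the example calls are not part of the paper); it assumes the unitary cost function used in the examples of this paper and keeps only two rows of the dynamic programming table.

def levenshtein(x, y):
    """Levenshtein distance between x and y with unit costs, computed by
    dynamic programming over recurrence (1); only two rows are kept."""
    m, n = len(x), len(y)
    prev = list(range(n + 1))            # prev[j] = LD(i-1, j); row 0 is 0..n
    for i in range(1, m + 1):
        curr = [i] + [0] * n             # LD(i, 0) = i
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                curr[j] = prev[j - 1]                    # x_i = y_j
            else:
                curr[j] = 1 + min(curr[j - 1],           # insertion
                                  prev[j],               # deletion
                                  prev[j - 1])           # substitution
        prev = curr
    return prev[n]


print(levenshtein("universidad de alicante",
                  "university of alicante"))     # -> 5, as in Table 2 (s1-s3)
print(levenshtein("abc def", "a bcd ef"))        # -> 3 (cf. the example in Section 3.1)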

Table 2 shows the pairwise edit distance (LD) for the strings of Table 1 (all the strings are converted to lower case before the distance is calculated). As the matrix is symmetric, only its upper triangle is shown. As we can see, the strings s1, s2 and s3 (type of institution, name) have a low LD between them: 6 (s1-s2), 5 (s1-s3) and 7 (s2-s3). However, if we compare these strings with the string s4 (name, type of institution), the distances increase considerably: 19 (s1-s4), 18 (s2-s4) and 18 (s3-s4). The difference is due to word permutation: the four strings represent the same institution and contain similar words, but the string s4 has a different word order. If two strings contain the same words but in a permuted order (variant forms of the same term), the LD will be large. Therefore, the LD is not a useful distance when a permuted word order exists.

Table 2. LD matrix

        s1   s2   s3   s4   s5   s6   s7
  s1     0    6    5   19   17   14   19
  s2          0    7   18   20   15   21
  s3               0   18   22   16   21
  s4                    0   23   25   28
  s5                         0   22   16
  s6                              0   19
  s7                                   0

3 Invariant Distance from Word Position

The problem shown in the previous section can be resolved if we use a different representation of the strings. The conventional representation of a string is a sequence of characters. A more useful representation is to think of a string as a set of words, where each word is a sequence of characters (letters and digits). Using this representation, the invariant distance from word position (IDWP) is calculated. This distance is based on the approximate word matching referred to in [2].

Table 3 shows the strings of Table 1 broken up into words. The words have been converted to lower case and the punctuation (commas and apostrophes) has been removed.

Table 3. Strings broken up into words

      String                                 Number of words   Words
  s1  Universidad de Alicante                       3          universidad, de, alicante
  s2  Universitat d'Alacant                         3          universitat, d, alacant
  s3  University of Alicante                        3          university, of, alicante
  s4  Alicante University                           2          alicante, university
  s5  Ciencias, Universidad de Valencia             4          ciencias, universidad, de, valencia
  s6  Universitat de València, Ciències             4          universitat, de, valencia, ciencies
  s7  Universidad Politécnica de Valencia           4          universidad, politecnica, de, valencia
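As a small illustration, this preprocessing step can be sketched as follows. The function name split_into_words is ours, and the removal of diacritics is an assumption added so that the output matches the accent-free words listed in Table 3, which the prose above does not mention explicitly.

import re
import unicodedata

def split_into_words(s):
    """Break a string into lower-case words, treating punctuation
    (commas, apostrophes, ...) as separators, as described above."""
    # Strip diacritics (assumption: Table 3 lists the words without accents).
    s = unicodedata.normalize("NFKD", s)
    s = "".join(c for c in s if not unicodedata.combining(c))
    # Lower-case and keep only runs of letters and digits as words.
    return re.findall(r"[a-z0-9]+", s.lower())


print(split_into_words("Universitat d'Alacant"))
# -> ['universitat', 'd', 'alacant']               (cf. Table 3, s2)
print(split_into_words("Universitat de València, Ciències"))
# -> ['universitat', 'de', 'valencia', 'ciencies'] (cf. Table 3, s6)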

3.1 The algorithm

To calculate the IDWP of two strings, they are broken up into words. The idea is to pair up the words so that the sum of the LDs of the paired words is minimised. If the strings contain a different number of words, the cost of each word in excess is the length of that word. It is almost always the case that IDWP(x, y) < LD(x, y), although this is not always true: for instance, for the strings "abc def" and "a bcd ef", the LD and IDWP values are 3 and 4, respectively.

Table 4 shows the IDWP algorithm. The core of the algorithm is the recursive method matching, shown in Table 5, which calculates the best matching of the words by means of a branch and bound scheme.

Table 4. IDWP algorithm

Input:     S: array of words (s_1 ... s_m)
           m: integer
           T: array of words (t_1 ... t_n)
           n: integer
Output:    Idwp: integer
Variables: i, j, aux: integer
           D: matrix (m + 1) x (n + 1) of integer

* Fill D with zeros:
For i = 0 To m
    For j = 0 To n
        D[i][j] = 0

* Fill the first column and the first row of D with the lengths of s_i and t_j, respectively:
For i = 1 To m
    D[i][0] = length(s_i)
For j = 1 To n
    D[0][j] = length(t_j)

* Fill the rest of D with the LD between every pair of words:
For i = 1 To m
    For j = 1 To n
        D[i][j] = LD(s_i, t_j)

* Mark the words that match exactly, so that their rows and columns are excluded from the search:
For i = 1 To m
    For j = 1 To n
        If D[i][j] = 0 Then
            For aux = 0 To m
                D[aux][j] = -1
            For aux = 0 To n
                D[i][aux] = -1

* Call the matching method:
Idwp = -1
matching(D, m, n, 1, 0, Idwp)

Table 5. Matching method

Input:     D: matrix (m + 1) x (n + 1) of integer
           m: integer
           n: integer
           i: integer
           cost: integer
           Idwp: integer (passed by reference)
Output:    Idwp: integer
Variables: j, aux: integer

If i <= m Then
    If D[i][0] <> -1 Then
        * The word i of S has not been matched yet:
        For j = 1 To n
            * If the word j of T has not been matched yet:
            If D[i][j] <> -1 And D[0][j] > 0 Then
                aux = cost + D[i][j]
                If aux < Idwp Or Idwp = -1 Then
                    * Mark the word j of T as matched:
                    D[0][j] = -D[0][j]
                    matching(D, m, n, i + 1, aux, Idwp)
                    * Remove the mark of the word j of T:
                    D[0][j] = -D[0][j]
        * The method also considers that the word i of S is not matched, so its length is added to the final distance:
        aux = cost + D[i][0]
        If aux < Idwp Or Idwp = -1 Then
            matching(D, m, n, i + 1, aux, Idwp)
    Else
        * The word i of S has already been matched; the method goes on to the next word:
        matching(D, m, n, i + 1, cost, Idwp)
Else
    * All the words of S have been processed; if any word of T has not been matched, its length is added to the final distance:
    aux = cost
    For j = 1 To n
        If D[0][j] > 0 Then
            aux = aux + D[0][j]
    If aux < Idwp Or Idwp = -1 Then
        Idwp = aux
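For concreteness, here is a minimal Python sketch of the same branch and bound search, written as a plain recursive function instead of the in/out parameter of Table 5; the names idwp, matching and best are ours, and the word-level distances are assumed to come from the levenshtein function sketched in Section 2.

def idwp(words_s, words_t):
    """Invariant distance from word position: pair up the words of the two
    strings so that the sum of the word-level edit distances is minimised;
    every unpaired word contributes its own length to the distance."""
    # Word-level edit distances (the interior of matrix D in Table 4);
    # levenshtein() is the sketch from Section 2.
    d = [[levenshtein(s, t) for t in words_t] for s in words_s]
    best = None                      # best complete matching found so far

    def matching(i, cost, used):
        nonlocal best
        # Branch and bound: abandon branches that cannot improve on best.
        if best is not None and cost >= best:
            return
        if i == len(words_s):
            # Words of T left unmatched contribute their lengths.
            total = cost + sum(len(t) for t, u in zip(words_t, used) if not u)
            if best is None or total < best:
                best = total
            return
        # Either pair word i of S with some unused word of T...
        for j in range(len(words_t)):
            if not used[j]:
                used[j] = True
                matching(i + 1, cost + d[i][j], used)
                used[j] = False
        # ...or leave word i of S unmatched (it costs its own length).
        matching(i + 1, cost + len(words_s[i]), used)

    matching(0, 0, [False] * len(words_t))
    return best


# Example from Section 3.1: for "abc def" and "a bcd ef", LD is 3 while IDWP is 4.
print(idwp("abc def".split(), "a bcd ef".split()))    # -> 4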

Table 6 shows the pairwise IDWP for the strings of Table 1. As the matrix is symmetric, only its upper triangle is shown.

Table 6. IDWP matrix

        s1   s2   s3   s4   s5   s6   s7
  s1     0    5    5    5   15   17   16
  s2          0    6    5   17   15   20
  s3               0    2   20   19   21
  s4                    0   20   19   21
  s5                         0    3    8
  s6                              0   10
  s7                                   0
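The distance matrices can be recomputed from the sketches in the previous sections; the helper below is ours and simply applies levenshtein and idwp (with split_into_words) to every pair of strings. Under the normalisation assumptions stated in those sketches it should reproduce the values of Tables 2 and 6, up to the exact normalisation applied by the author, which the paper only partially specifies.

STRINGS = [
    "Universidad de Alicante",               # s1
    "Universitat d'Alacant",                 # s2
    "University of Alicante",                # s3
    "Alicante University",                   # s4
    "Ciencias, Universidad de Valencia",     # s5
    "Universitat de València, Ciències",     # s6
    "Universidad Politécnica de Valencia",   # s7
]

def pairwise(distance):
    """Upper triangle of the pairwise distance matrix for STRINGS."""
    n = len(STRINGS)
    return [[distance(STRINGS[i], STRINGS[j]) for j in range(i, n)]
            for i in range(n)]

# LD on the lower-cased strings (cf. the note in Section 2).
ld_rows = pairwise(lambda a, b: levenshtein(a.lower(), b.lower()))
# IDWP on the word lists of Table 3.
idwp_rows = pairwise(lambda a, b: idwp(split_into_words(a),
                                       split_into_words(b)))

for row in idwp_rows:
    print(row)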

Table 7 compares the LD matrix with the IDWP matrix. As we can see, the greatest changes between the two distances are between s4 and {s1, s2, s3} and between s5 and s6. The IDWP values among the strings of {s1, s2, s3, s4} and among those of {s5, s6} are more homogeneous than the corresponding LD values. IDWP is lower than or equal to LD in all cases except s1-s6 and s3-s6.

Table 7. LD matrix vs. IDWP matrix (LD / IDWP)

        s1       s2       s3       s4        s5        s6        s7
  s1     0     6 / 5    5 / 5   19 / 5    17 / 15   14 / 17   19 / 16
  s2              0     7 / 6   18 / 5    20 / 17   15 / 15   21 / 20
  s3                       0    18 / 2    22 / 20   16 / 19   21 / 21
  s4                                0     23 / 20   25 / 19   28 / 21
  s5                                          0     22 / 3    16 / 8
  s6                                                    0     19 / 10
  s7                                                              0

References

1. A.V. Aho, D.S. Hirschberg, J.D. Ullman. Bounds on the Complexity of the Longest Common Subsequence Problem. Journal of the ACM, 23(1):1-12, 1976.
2. J.C. French, A.L. Powell, E. Schulman. Applications of Approximate Word Matching in Information Retrieval. In F. Golshani, K. Makki, editors, Proceedings of the Sixth International Conference on Information and Knowledge Management (CIKM 1997), pages 9-15, Las Vegas (USA), November 1997.
3. D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
4. D.S. Hirschberg. Algorithms for the Longest Common Subsequence Problem. Journal of the ACM, 24(4):664-675, 1977.
5. D.S. Hirschberg. Serial Computations of Levenshtein Distances. In A. Apostolico, Z. Galil, editors, Pattern Matching Algorithms. Oxford University Press, 1997.
6. J.W. Hunt, T.G. Szymanski. A Fast Algorithm for Computing Longest Common Subsequences. Communications of the ACM, 20(5):350-353, 1977.
7. K. Kukich. Techniques for Automatically Correcting Words in Text. Computing Surveys, 24(4):377-440, 1992.
8. V.I. Levenshtein. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Cybernetics and Control Theory, 10:707-710, 1966.
9. R. Lowrance, R.A. Wagner. An Extension of the String-to-String Correction Problem. Journal of the ACM, 22(2):177-183, 1975.
10. W.J. Masek, M.S. Paterson. A Faster Algorithm for Computing String Edit Distances. Journal of Computer and System Sciences, 20(1):18-31, 1980.
11. E. Myers. An O(ND) Difference Algorithm and Its Variations. Algorithmica, 1(2):251-266, 1986.
12. N. Nakatsu, Y. Kambayashi, S. Yajima. A Longest Common Subsequence Algorithm Suitable for Similar Text Strings. Acta Informatica, 18:171-179, 1982.
13. W. Tichy. The String-to-String Correction Problem with Block Moves. ACM Transactions on Computer Systems, 2(4):309-321, 1984.
14. R.A. Wagner, M.J. Fischer. The String-to-String Correction Problem. Journal of the ACM, 21(1):168-173, 1974.