Codes in the Damerau Distance for Deletion and Adjacent Transposition Correction


Codes in the Damerau Distance for Deletion and Adjacent Transposition Correction

Ryan Gabrys, Eitan Yaakobi, and Olgica Milenkovic
ECE Department, University of Illinois, Urbana-Champaign; Technion University
arXiv: v3 [cs.IT] 30 Apr 2018

Abstract: Motivated by applications in DNA-based storage, we introduce the new problem of code design in the Damerau metric. The Damerau metric is a generalization of the Levenshtein distance which, in addition to deletion, insertion and substitution errors, also accounts for adjacent transposition edits. We first provide constructions for codes that may correct either a single deletion or a single adjacent transposition and then proceed to extend these results to codes that can simultaneously correct a single deletion and multiple adjacent transpositions. We conclude with constructions for joint block deletion and adjacent block transposition error-correcting codes.^1

I. INTRODUCTION

The edit distance is a measure of similarity between two strings, evaluated based on the minimum number of operations required to transform one string into the other. If the operations are confined to symbol deletions, insertions and substitutions, the distance of interest is the Levenshtein (edit) distance [15]. The Levenshtein distance has found numerous applications in bioinformatics, where a weighted version of this metric is used to assess the similarity of DNA strings and reconstruct phylogenetic trees [13], and in natural language processing, where the distance is used to model spelling errors and provide automated word correction [3]. In parallel to the work on developing efficient algorithms for computing the edit distance and performing alignments of large numbers of strings, a long line of results has been reported on the topic of designing codes for this distance function. Codes in the edit distance are of particular importance for communication in the presence of synchronization errors, a type of error encountered in almost all modern storage and data transmission systems. Classical derivations of upper bounds on code sizes by Levenshtein [15] and single-deletion-correcting code constructions by Varshamov and Tenengolts [21], [22] have established the framework for studying many challenging problems in optimal code design for this metric [2], [6], [11], [18], [20].

The Damerau distance is an extension of the Levenshtein distance that also allows for edits in the form of adjacent symbol transpositions [3]. Despite the apparent interest in coding for edit channels, the problem of designing codes in the Damerau distance has not been studied before. A possible reason for this lack of interest in the Damerau distance may be attributed to the fact that not many practical channel models involve adjacent transposition errors, and even if they do, they tend not to allow for user-selected messages^2.

Our motivating application for studying codes in the Damerau distance is the emerging paradigm of DNA-based storage [1], [5], [9], [25]-[27]. In DNA-based storage systems, media degradation arises due to DNA aging caused by metabolic and hydrolytic processes, or more precisely, by exposure to standard or increased levels of radiation, humidity, and high temperatures. As an example, human cellular DNA undergoes a number of breakages in every cell cycle [23].
These DNA breakages or symbol/block deletions result in changed structures of the string. If a string breaks in two places, which is the most likely scenario, either the sequence reattaches itself without resulting in structural damage, reattaches itself in the opposite direction, resulting in what is called a reversal error, or the broken string degrades, resulting in a bursty (block) deletion. If a string breaks in three positions, which is the second most likely breakage scenario, either the adjacent broken blocks exchange positions or one or both blocks disintegrate, leading to a bursty deletion. It is the latter scenario that motivates the study of channels in which adjacent blocks of symbols may be exchanged or individual blocks deleted. It is straightforward to see that this editing scenario corresponds to a block version of the Damerau editing process. The block editing process is hard to analyze directly, so we first study the symbol-level Damerau editing process and then proceed to analyze the block model. Also, for simplicity of exposition, we focus our attention on deletion and adjacent transposition errors and delegate the more complex analysis of all four edit operations to future work.

Our contributions are two-fold. First, we introduce the Damerau distance code design problem and describe the first known scheme for correcting one deletion or one adjacent transposition. The scheme has near-optimal redundancy. We then proceed to extend and generalize this construction so as to obtain codes capable of correcting one deletion and one adjacent transposition that also have near-optimal redundancy. Our results also shed light on the new problems of mismatched Varshamov-Tenengolts (VT) decoding and run-length-limited VT codes. Second, we describe significantly more involved code constructions for the correction of multiple adjacent transposition errors and proceed to introduce codes capable of correcting a block deletion and adjacent block

^1 Parts of the results were presented at the International Symposium on Information Theory in Barcelona.
^2 We note that an adjacent transposition may be viewed as a deletion/insertion pair. However, the locations of the deletion and insertion are adjacent, and hence correlated; correcting for two random indel errors is in this case suboptimal. Codes in the Damerau distance address this problem by handling a combination of random deletions and correlated (adjacent) indels.

transposition. In the derivation process, we improve upon the best known constructions for block deletion-correcting codes (i.e., codes capable of correcting a block of consecutive deletions).

The paper is organized as follows. Section II contains the problem statement and relevant notation. Section III contains an analysis of the code design procedure for single deletion or single adjacent transposition correction. Section IV contains an order-optimal code construction for correcting a single deletion and a single adjacent transposition, as well as a low-redundancy construction for codes correcting a single deletion and multiple adjacent transpositions. Sections V and VI are devoted to our main findings: the best known code construction for single block deletion correction, and codes capable of correcting a single block deletion and a single adjacent block transposition.

II. TERMINOLOGY AND NOTATION

We start by defining the Damerau-Levenshtein distance, which arose in the works of Damerau [7] and Levenshtein [15], and by introducing codes in this metric. We then proceed to extend the underlying coding problem so that it applies to blocks, rather than individual symbol errors.

Definition 1. The Damerau-Levenshtein distance is a string metric which, for two strings of possibly different lengths over some (finite) alphabet, equals the minimum number of insertions, deletions, substitutions and adjacent transposition edits needed to transform one string into the other. The block Damerau-Levenshtein distance with block length b is a string metric which, for two strings of possibly different lengths over some (finite) alphabet, equals the minimum number of insertions, deletions, substitutions and adjacent transposition edits of blocks of length at most b needed to transform one string into the other.

For simplicity, we focus on edits involving deletions and adjacent transpositions only, and with slight abuse of terminology refer to the underlying sequence comparison function as the Damerau metric^3. Furthermore, we restrict our attention to binary alphabets only. Generalizations to larger alphabet sizes may potentially be accomplished by a careful use of Tenengolts' up-down encoding, described in [14], [16], but this problem will be discussed elsewhere.

For a vector x ∈ F_2^n, let B_{T∨D}(x) denote the set of vectors that may be obtained from x by either at most one single adjacent transposition (T) or at most one single deletion (D). Note that the size of B_{T∨D}(x) is 2r(x), where r(x) is the number of runs in x, i.e., the smallest number of nonoverlapping substrings of identical symbols that cover the sequence.

Example 1. Suppose that x = (0,0,1,1,0) ∈ F_2^5. Then B_{T∨D}(x) = {(0,1,1,0), (0,0,1,0), (0,0,1,1), (0,0,1,1,0), (0,1,0,1,0), (0,0,1,0,1)}. In particular, B_{T∨D}(x) = B_D(x) ∪ B_T(x), where B_D(x) is the set of words obtained by deleting at most one element in x, while B_T(x) is the set of words obtained from at most one adjacent transposition in x.

The derivative of x, denoted by ∂(x) = ∂x, is the vector defined as ∂x = (x_1, x_2 + x_1, x_3 + x_2, ..., x_n + x_{n−1}). Clearly, the mapping between x and ∂x is a bijection. Hence, the integral ∂^{−1}(x) = ∫x is well-defined for all x ∈ F_2^n. Observe that ∂^{−1}(x) = (∫x_1, ∫x_2, ..., ∫x_n) ∈ F_2^n, where ∫x_i = Σ_{j=1}^{i} x_j for all i ∈ [n]. For a set X ⊆ F_2^n, we use ∂X to denote the set of derivatives of vectors in X, and similarly, we use ∫X to denote the set of integrals of vectors in X. For two vectors x, y ∈ F_2^n, we let d_H(x,y) denote their Hamming distance.
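The derivative/integral maps and the ball B_{T∨D}(x) are easy to experiment with. The following minimal Python sketch (the helper names are our own, not the paper's) implements the two maps over F_2 and enumerates the ball, reproducing the count 2r(x) = 6 of Example 1.

```python
# A minimal sketch (our own helpers) of the derivative/integral maps over F_2 and of
# the ball B_{T or D}(x) of words reachable by at most one deletion or at most one
# adjacent transposition.

def derivative(x):
    # d(x) = (x_1, x_2 + x_1, ..., x_n + x_{n-1}) over F_2
    return [x[0]] + [x[i] ^ x[i - 1] for i in range(1, len(x))]

def integral(x):
    # inverse map: prefix sums modulo 2
    out, acc = [], 0
    for bit in x:
        acc ^= bit
        out.append(acc)
    return out

def ball_T_or_D(x):
    words = {tuple(x)}
    for i in range(len(x)):
        words.add(tuple(x[:i] + x[i + 1:]))                     # at most one deletion
    for i in range(len(x) - 1):
        words.add(tuple(x[:i] + [x[i + 1], x[i]] + x[i + 2:]))  # at most one transposition
    return words

x = [0, 0, 1, 1, 0]
assert integral(derivative(x)) == x
print(len(ball_T_or_D(x)))   # 6 = 2 * r(x), since x has r(x) = 3 runs (Example 1)
```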
Furthermore, we let C_H(n,d) stand for any code of length n with minimum Hamming distance d, and similarly, we let C_D(n) stand for any single-deletion-correcting code of length n. Similar notation will be used for other types of editing errors, balls, distances and codes, with their meaning apparent from the context. Furthermore, for the convenience of the reader, relevant notation and terminology referred to throughout the paper is summarized in Table I.

III. SINGLE TRANSPOSITION OR DELETION-CORRECTING CODES

We start by describing a general construction for single transposition or deletion-correcting codes. We then show how to use this construction in order to devise codes with near-optimal redundancy. Let C_H(n,3) be a single-error-correcting code and, as before, let C_D(n) be a single-deletion-correcting code. We define a code C_{T∨D}(n), which we show in Lemma 2 is capable of correcting one transposition (T) or (∨) one deletion (D), as follows:

    C_{T∨D}(n) = {x ∈ F_2^n : x ∈ C_D(n), ∫x ∈ C_H(n,3)}.    (1)

The code C_{T∨D}(n) consists of codewords that belong to a single deletion error-correcting code and have integrals that belong to a single substitution error-correcting code.

Lemma 2. The code C_{T∨D}(n) described in (1) can correct a single adjacent transposition or a single deletion.

^3 Since we only consider deletions, what we refer to as the Damerau distance is strictly speaking not a metric, but we use the terminology as it is customary to do so.
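Before the proof (given after Table I), the decoding strategy behind (1) can be summarized in a few lines. The sketch below is our own high-level wrapper, not pseudocode from the paper; `decode_deletion` and `decode_hamming` stand for decoders of the constituent codes C_D(n) and C_H(n,3), and `derivative`/`integral` are the helpers from the sketch above.

```python
# A high-level sketch of a decoder for C_{T or D}(n) in (1): dispatch on the length of
# the received word z. The constituent decoders are passed in as assumptions.

def decode_T_or_D(z, n, decode_deletion, decode_hamming):
    if len(z) == n - 1:                        # a deletion occurred: use the C_D(n) decoder
        return decode_deletion(z)
    # otherwise at most one adjacent transposition occurred; by Lemma 2 the integral of z
    # is within Hamming distance 1 of the integral of the codeword, so decode it in C_H(n,3)
    return derivative(decode_hamming(integral(z)))
```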

Notation: Description (Position in the manuscript)

B_D(x): The set of words that may be obtained from at most one single deletion in a vector x. (End of Section II)
B_T(x): The set of words that may be obtained from at most one single adjacent transposition in a vector x. (End of Section II)
B_{T∨D}(x): B_{T∨D}(x) = B_D(x) ∪ B_T(x). (End of Section II)
∂x, ∂(x): The derivative of x. (End of Section II)
∫x, ∂^{−1}(x): The integral of x. (End of Section II)
C_H(n,d): A code of minimum Hamming distance d. (End of Section II)
C_D(n): A code that can correct a single deletion error. (End of Section II)
C_{T∨D}(n): A code that can correct a single adjacent transposition or deletion. (Section III, preceding Lemma 2)
X_D(n,a): A code that can correct a single deletion error. (Section III, preceding Claim 1)
X_H(n,a): A code that can correct a single substitution error. (Section III, preceding Claim 1)
B_{(T,l)}(x): The set of words obtained from x via l adjacent transpositions. (Section IV, preceding Example 2)
B_{(T,l),D}(x): The set of words obtained from x via l adjacent transpositions and a single deletion. (Section IV, preceding Example 2)
C_VT(n,a,l): A VT-type code taken with modulus given by the parameter l. (Section IV, following Lemma 6)
C_VT(n,a,b,l): A subset of the codewords in C_VT(n,a,l) dictated by the parameter b. (Section IV, following Lemma 6)
D_{VT,n,l}: A decoder for C_VT(n,a,l). (Section IV, following Lemma 6)
D_{VT,n,b,l}: A decoder for C_VT(n,a,b,l). (Section IV, following Lemma 8)
C_{(T,l)D}(n,a,b): A code which may correct a single deletion and up to l adjacent transpositions; C_{(T,l)D}(n,a,b) is a subset of the words in C_VT(n,a,b,l). (Section IV, before Theorem 11)
Y_{TD}(n,a_1,a_2): A code used in the definition of C_{TD}(n,a_1,a_2). (Section IV, following Corollary 12)
C_{TD}(n,a_1,a_2): A code that may correct one adjacent transposition and one deletion. (Section IV, following Corollary 12)
B_{D,≤b}(x): The set of words that may be obtained from x via a burst of consecutive deletions of length at most b. (Section V-A, Part 1)
B_{D,b}(x): The set of words that may be obtained from x via a burst of consecutive deletions of length exactly b. (Section V-A, Part 1)
C_par(n,b,d): A code used to determine the weight of a deleted substring. (Section V-A, Part 1)
I(y,v,k_I): A vector obtained by inserting v into y at position k_I. (Section V-A, preceding Claim 3)
D(y,b,k_D): A vector obtained by deleting b consecutive bits from y starting at position k_D. (Section V-A, preceding Claim 3)
Bal(n,b): A (balanced) set of words in which any sufficiently long substring has roughly half ones and half zeros. (Section V-A, preceding Claim 4)
C_odd(n,a,d): A code for determining the approximate location of a burst of deletions. (Section V-A, following Claim 4)
SVT_{c,d}(n,m): A code for determining the exact location of a deletion given an approximate location for the same. (Section V-A, Part 3)
C_odd(n,a,c,d): A code which may correct a burst of deletions of odd length; constructed using the codes C_odd(n,a,d) and SVT_{c,d}(n,m). (Section V-A, preceding Theorem 17)
C_{≤b}(n,a,C,D): A code capable of correcting a burst of deletions of any length; constructed using the code C_odd(n,a,c,d). (Section V-B, following Example 9)
B_{BT,≤b}(x): The set of words obtained from x via one adjacent block transposition. (Section VI, preceding Example 11)
B_{BT∨D,≤b}(x): The set of words obtained from x via one adjacent block transposition and one block deletion. (Section VI, following Example 12)
T(x,k_T): The vector resulting from transposing the symbols at positions k_T and k_T+1 in x. (Section VI, preceding Lemma 22)
C^(1)_{TD,b}(n,a,c,d): A code for determining the approximate location of a block of deletions and adjacent transposition. (Section VI, following Lemma 22)
C(n,m;t_1,t_2): A code for correcting special types of burst errors. (Section VI, following Definition 24)
C_{Odd,B}(n,a,C,D): A code for correcting an odd-length block of deletions and adjacent block transposition. (Section VI, following Lemma 22)
C_{TD,≤b}(n,a,C,D): A code for correcting one block of deletions and one adjacent block transposition. (Section VI, before Theorem 27)

TABLE I: RELEVANT NOTATION AND TERMINOLOGY.

Proof: We prove this claim by showing that for all x ∈ C_{T∨D}(n), one can uniquely recover x from any z ∈ B_{T∨D}(x). Assume first that z ∈ F_2^{n−1}, so that z is the result of a single deletion occurring in x. Since x ∈ C_D(n), one may apply the decoder of the code C_D(n) to successfully recover x ∈ C_{T∨D}(n). Assume next that z ∈ F_2^n, so that z is the result of at most one single transposition in x. We show that d_H(∫x, ∫z) ≤ 1. When this inequality holds, since ∫x belongs to a code with minimum Hamming distance 3, the vector ∫x can be uniquely determined based on ∫z. Note that since the mapping ∫ is injective, d_H(∫x, ∫z) = 0 if and only if x = z. Let the transmitted word x be subjected to one adjacent transposition involving the ith and (i+1)th bits, so that x_i ≠ x_{i+1}

and z = (x_1, ..., x_{i−1}, x_{i+1}, x_i, x_{i+2}, ..., x_n). First, we compute the integral ∫z as ∫z = (z_1, z_2+z_1, z_3+z_2+z_1, ..., Σ_{j=1}^{n} z_j) = (∫z_1, ..., ∫z_n). Let ∫x = (∫x_1, ..., ∫x_n). Then, clearly (∫x_1, ..., ∫x_{i−1}) = (∫z_1, ..., ∫z_{i−1}). Furthermore,

    ∫z_i = Σ_{j=1}^{i−1} x_j + x_{i+1} = Σ_{j=1}^{i−1} x_j + (1 + x_i) = 1 + ∫x_i,

and for any k ≥ i+1,

    ∫z_k = Σ_{j=1}^{i−1} x_j + x_{i+1} + x_i + Σ_{j=i+2}^{k} x_j = ∫x_k,

so that d_H(∫x, ∫z) = 1, as desired.

Observe that we did not explicitly state the choices of codes in (1). A natural choice would be a single substitution-correcting Hamming code, for which one requires that n = 2^m − 1 for some positive integer m, and the single deletion-correcting Varshamov-Tenengolts (VT) code [15], or some cosets of these codes. Since the cosets of the codes cover F_2^n, one can see that there exists a code with redundancy at most 2·log(n+1). We show next how to improve this result by constructing one code that may serve both as a single deletion-correcting code for x and a single substitution-correcting code for ∫x. The redundancy of this code is at most log n + log 6.

Our choice of codes is as follows. Let a be a non-negative integer such that 0 ≤ a ≤ 6n−4. For the single deletion code, we use

    X_D(n,a) = {x ∈ F_2^n : Σ_{i=1}^{n−1} i·x_i + (2n−1)·x_n ≡ a (mod 6n−3)}.

For the code C_H(n,3), we choose

    X_H(n,a) = {x ∈ F_2^n : Σ_{i=1}^{n−2} (2i+1)·x_i + (3n−2)·x_{n−1} + (2n−1)·x_n ≡ a (mod 6n−3)}.

Claim 1. For any vector x ∈ F_2^n, if x ∈ X_D(n,a) then ∫x ∈ X_H(n,a), and thus if ∂x ∈ X_D(n,a) then x ∈ X_H(n,a).

Proof: Suppose that x ∈ X_D(n,a). By definition, Σ_{i=1}^{n−1} i·x_i + (2n−1)·x_n ≡ a (mod 6n−3). Therefore, since x = ∂(∫x) = (∫x_1, ∫x_1+∫x_2, ∫x_2+∫x_3, ..., ∫x_{n−1}+∫x_n), we have

    ∫x_1 + Σ_{i=2}^{n−1} i·(∫x_i + ∫x_{i−1}) + (2n−1)·(∫x_{n−1} + ∫x_n) ≡ a (mod 6n−3),

which implies that ∫x ∈ X_H(n,a). This proves the claim.

According to Claim 1 and Lemma 2, in order to show that the code C_{T∨D}(n) = X_D(n,a) is a single transposition or deletion-correcting code, we only have to show that the codes X_D(n,a) and X_H(n,a) have the desired error-correcting properties.

Lemma 3. The code X_H(n,a) is a single substitution error-correcting code.

Proof: Let H = (3, 5, 7, ..., 2n−3, 3n−2, 2n−1), so that x ∈ X_H(n,a) if and only if H·x^T ≡ a (mod 6n−3). Assume, on the contrary, that X_H(n,a) is not a single substitution error-correcting code. Then there exist two different codewords x_1, x_2 ∈ X_H(n,a) and two vectors e_j, e_k such that x_1 + e_j = x_2 + e_k, where both e_j and e_k have at most one non-zero entry of value either 1 or −1. This would imply H·(x_1+e_j)^T ≡ H·(x_2+e_k)^T (mod 6n−3), and H·e_j^T ≡ H·e_k^T (mod 6n−3), which holds if and only if e_j = e_k. Therefore, we must have x_1 = x_2, a contradiction.
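The two syndrome constraints are straightforward to check in code. The sketch below (our own helper names) evaluates membership in X_D(n,a) and X_H(n,a) directly from the definitions above; it is meant only as an illustration of the weight vectors, not as a decoder.

```python
# A minimal sketch of the syndrome checks defining X_D(n, a) and X_H(n, a).

def in_X_D(x, a):
    n = len(x)
    weights = list(range(1, n)) + [2 * n - 1]            # (1, 2, ..., n-1, 2n-1)
    return sum(w * v for w, v in zip(weights, x)) % (6 * n - 3) == a

def in_X_H(x, a):
    n = len(x)
    weights = [2 * i + 1 for i in range(1, n - 1)] + [3 * n - 2, 2 * n - 1]
    return sum(w * v for w, v in zip(weights, x)) % (6 * n - 3) == a
```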

Lemma 4. The code X_D(n,a) can correct a single deletion.

Proof: By definition, if x ∈ X_D(n,a), we may write H·x^T ≡ a (mod 6n−3), where H = (1, 2, 3, ..., n−1, 2n−1). The result follows by observing that (1, 2, 3, ..., n−1, 2n−1) is a Helberg sequence as defined in Definition III.2 of [10]. Thus, according to Theorem III.4 of the same paper, the code X_D(n,a) can correct a single deletion.

The following corollary summarizes the main result of this section.

Corollary 5. There exists a single transposition or deletion-correcting code whose redundancy is at most log(6n−3) bits.

Proof: Using the pigeon-hole principle considered in [21], one may easily show that, for some choice of a, |C_{T∨D}(n,a)| = |X_D(n,a)| ≥ 2^n/(6n−3), since the codes C_{T∨D}(n,a) partition the ambient space F_2^n into 6n−3 codes, one of which has to have a size at least as large as the right-hand side of the inequality.

Note that every single transposition or deletion-correcting code is also a single deletion error-correcting code. Hence, a lower bound on the redundancy of the latter code is log n [12], so that the difference between the redundancy of our deletion/adjacent transposition codes and the redundancy of an optimal single deletion code is at most log 6 bits. We also note that improving the lower bound on a single transposition or deletion-correcting code is left as an open problem.

IV. CODES CORRECTING DELETIONS AND ADJACENT TRANSPOSITIONS

We now turn our attention to the significantly more challenging task of constructing codes that can correct both deletions and adjacent transpositions simultaneously. Our main result is a construction of a code capable of correcting a single deletion along with multiple adjacent transpositions. At the end of this section, we present an improved construction for the special case of a single deletion and a single transposition.

We start by introducing some useful notation. Let B_{(T,l)}(x) denote the set of vectors that may be obtained by applying at most l adjacent transpositions (T) to x. Hence,

    B_{(T,l)}(x) = B_{(T,1)}( ... (B_{(T,1)}(x)) ... ),   with B_{(T,1)} applied l times.

Let B_{(T,l),D}(x) denote the set of vectors that may be obtained from x by at most l adjacent transpositions followed by at most one single deletion. As before, let B_D(x) be the set of words that may be obtained by introducing at most one deletion into x. With a slight abuse of notation, we use the same symbol B independently of whether the argument is a single word or a collection of words. In the latter case, the set B equals the union of the corresponding sets of the individual words in the argument. The next example illustrates the relevant notation.

Example 2. Suppose that x = (0,0,1,1,0). Then,

    B_{(T,1)}(x) = {(0,0,1,1,0), (0,1,0,1,0), (0,0,1,0,1)},
    B_D(x) = {(0,0,1,1,0), (0,1,1,0), (0,0,1,0), (0,0,1,1)},
    B_{(T,1),D}(x) = {(0,0,1,1,0), (0,1,1,0), (0,0,1,0), (0,0,1,1), (1,0,1,0), (0,1,0,1), (0,1,0,0), (0,0,0,1), (0,1,0,1,0), (0,0,1,0,1)}.

Lemma 6. For any x ∈ F_2^n, B_{(T,l),D}(x) = B_D(B_{(T,l)}(x)) = B_{(T,l)}(B_D(x)).

Proof: The proof is by induction on l. For the base case l = 1, we show that B_D(B_{(T,1)}(x)) = B_{(T,1)}(B_D(x)) by demonstrating that if y ∈ B_{(T,1)}(B_D(x)), then y ∈ B_D(B_{(T,1)}(x)), and conversely, if y ∈ B_D(B_{(T,1)}(x)), then y ∈ B_{(T,1)}(B_D(x)). Suppose that y^(d) = (x_1, ..., x_{i−1}, x_{i+1}, ..., x_n) is the result of deleting the symbol at position i, where i ∈ [n]. Also, assume that y = y^(d,t) is obtained from y^(d) by transposing the symbol in position j with the symbol in position j+1 in y^(d), where j ∈ [n−2]. One needs to consider two different scenarios: 1) j ∈ [n−2]\{i−1}; and 2) j = i−1.
First, we show that if j ∈ [n−2]\{i−1}, then y ∈ B_D(B_{(T,1)}(x)). To see why this claim holds, note that if j < i−1 then y may be generated by first transposing the symbols in positions j, j+1 in x to obtain y^(t), and then deleting the symbol in position i. Otherwise, if j ≥ i, one may first transpose the symbols in positions j+1, j+2, and then delete the symbol in position i. Suppose now that j = i−1. Then x_{i−1} ≠ x_{i+1}, and so x_i equals either x_{i−1} or x_{i+1}. Suppose that x_i = x_{i−1}. Then

y may be generated by first transposing x_i and x_{i+1}, and then deleting the symbol in position i−1. Otherwise, if x_i = x_{i+1}, y may be obtained by first transposing x_{i−1} and x_i, and then deleting the symbol in position i+1. Using a similar argument, it can be shown that if y ∈ B_D(B_{(T,1)}(x)), then y ∈ B_{(T,1)}(B_D(x)). This establishes the base case B_D(B_{(T,1)}(x)) = B_{(T,1)}(B_D(x)).

We now prove the inductive step. Suppose that B_D(B_{(T,l)}(x)) = B_{(T,l)}(B_D(x)) holds for all l < L. We show that B_D(B_{(T,L)}(x)) = B_{(T,L)}(B_D(x)) holds as well. This may be seen from the following chain of equalities:

    B_D(B_{(T,L)}(x)) = B_D(B_{(T,L−1)}(B_{(T,1)}(x)))
                      = B_{(T,L−1)}(B_D(B_{(T,1)}(x)))
                      = B_{(T,L−1)}(B_{(T,1)}(B_D(x)))
                      = B_{(T,L)}(B_D(x)),

where the second line follows from the inductive hypothesis, which is applied to each vector in the set, and where the third line is a consequence of the base case, which showed that B_D(B_{(T,1)}(x)) = B_{(T,1)}(B_D(x)).

As a consequence of the previous lemma, we may henceforth assume that the deletion always occurs after the adjacent transposition(s). We then say that a code C can correct l adjacent transpositions and a single deletion, and refer to it as an l-TD code, if for any two different codewords u, v ∈ C, B_{(T,l),D}(u) ∩ B_{(T,l),D}(v) = ∅.

Our code construction and the ideas behind the coding approach are best explained through the decoding procedure. Suppose that the code C_{TD}(n,l) is an l-TD code which is a subset of codewords of a single deletion-correcting code. Assume also that x ∈ C_{TD}(n,l) was transmitted and that the vector y was received, where y is the result of at most l transpositions followed by at most one single deletion in x. The simplest idea to pursue is to try to correct the single deletion by naively applying the decoder for the chosen constituent single-deletion code. Clearly, such a decoder may produce an erroneous result due to the presence of the adjacent transposition errors. It is therefore important to construct the code C_{TD}(n,l) in such a way that the result of the mismatched deletion correction, x̃, obtained from y, is easy to characterize and contains only a limited number of errors that may be corrected to recover x ∈ C_{TD}(n,l) from x̃. To this end, define the following code:

    C_VT(n,a,l) = {x ∈ F_2^n : Σ_{i=1}^{n} i·x_i ≡ a (mod n+2l+1)}.

Since the code is a VT code, the decoder D_{VT,n,l} for C_VT(n,a,l) can correct a single deletion occurring in any codeword in C_VT(n,a,l) [21]. Note that the standard definition of a single deletion-correcting code entails setting Σ_{i=1}^{n} i·x_i to be equal to some a modulo n+1 [21]. Our construction fixes Σ_{i=1}^{n} i·x_i to a modulo n+2l+1 instead. As we demonstrate in Claim 2, this change is needed due to the fact that adjacent transpositions may change the value of the syndrome a by at most ±l.

As before, and for the special case of VT codes, assume that x̃ is the result of VT decoding the vector y, where y ∈ B_{(T,l),D}(x). Our first aim is to characterize the difference between x and x̃, and for this purpose we use an intermediary word y^(l) that is generated from at most l adjacent transpositions in x, i.e., a word such that y ∈ B_D(y^(l)). More precisely, we demonstrate that if both x, y^(l) ∈ C_VT(n,a,l), then the decoder outputs D_{VT,n,l}(a,y) and D_{VT,n,l}(a,y^(l)) differ only in the transpositions that actually occurred in x. On the other hand, if x and y^(l) belong to two different VT codes (i.e., they have different values of the VT syndrome parameter a), then x and x̃ differ by at most 2l adjacent transpositions.
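The enlarged modulus n+2l+1 and the effect of a transposition on the syndrome are easy to verify numerically. The following short sketch (our own helper, using the codeword of Example 3 below) computes the VT-type syndrome of C_VT(n,a,l) and shows that one adjacent transposition of unequal bits shifts it by exactly 1, as formalized in Claim 2 below.

```python
# A minimal sketch of the VT-type syndrome used in C_VT(n, a, l), modulus n + 2l + 1.

def vt_syndrome(x, l):
    n = len(x)
    return sum(i * v for i, v in enumerate(x, start=1)) % (n + 2 * l + 1)

l = 3
x = [0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0]          # n = 12, syndrome 3: x is in C_VT(12, 3, 3)
print(vt_syndrome(x, l))                           # 3

y = x[:]
y[5], y[6] = y[6], y[5]                            # transpose the unequal bits at positions 6, 7
print(abs(vt_syndrome(y, l) - vt_syndrome(x, l)))  # 1: at most l after l adjacent transpositions
```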
The following simple claim is a consequence of the fact that an adjacent transposition changes the VT syndrome by at most one.

Claim 2. Suppose that y^(l) = (y^(l)_1, ..., y^(l)_n) ∈ B_{(T,l)}(x), where x ∈ F_2^n. Then one has |Σ_{i=1}^{n} i·x_i − Σ_{i=1}^{n} i·y^(l)_i| ≤ l.

Proof: The proof is by induction on l. For the base case, suppose y^(1) ∈ B_{(T,1)}(x). The result clearly holds if y^(1) = x, and so assume y^(1) is the result of transposing the symbols in positions j and j+1 in x. Then

    |Σ_{i=1}^{n} i·x_i − Σ_{i=1}^{n} i·y^(1)_i| = |(j·x_j + (j+1)·x_{j+1}) − (j·x_{j+1} + (j+1)·x_j)| = |x_{j+1} − x_j| = 1,

since x_j ≠ x_{j+1}. For the inductive step, suppose that the result holds for all l < L and consider the case l = L. Let y^(L) ∈ B_{(T,L)}(x)

and let y^(L−1) ∈ B_{(T,L−1)}(x) be such that y^(L) and y^(L−1) differ by at most one single adjacent transposition. Then,

    |Σ_i i·x_i − Σ_i i·y^(L)_i| = |Σ_i i·x_i − Σ_i i·y^(L−1)_i + Σ_i i·y^(L−1)_i − Σ_i i·y^(L)_i|
                                ≤ |Σ_i i·x_i − Σ_i i·y^(L−1)_i| + |Σ_i i·y^(L−1)_i − Σ_i i·y^(L)_i|
                                ≤ (L−1) + 1 = L.

As a consequence of the previous claim, if x ∈ C_VT(n,a,l) and y^(l) ∈ B_{(T,l)}(x), then y^(l) ∈ C_VT(n,â,l) for some â, where |a − â| ≤ l. The next lemma summarizes the previous discussion.

Lemma 7. Suppose that y^(l) ∈ B_{(T,l)}(x), where x ∈ C_VT(n,a,l), and let y ∈ B_D(y^(l)). Then D_{VT,n,l}(â,y) = y^(l) for some â such that |a − â| ≤ l.

Example 3. Suppose that x = (0,1,1,0,0,1,0,0,0,0,1,0) ∈ C_VT(12,3,3) was transmitted and that the vector y = (0,1,1,0,0,1,0,0,1,0,0) was received after at most three adjacent transpositions and a single deletion. For y^(3) = (0,1,1,0,0,1,0,0,0,1,0,0) (where y ∈ B_D(y^(3))), we have Σ_{i=1}^{n} i·y^(3)_i ≡ 2 (mod 19). Thus, since a = 3 and â = 2, we get that |a − â| = 1 ≤ l = 3, as desired. Note that if we use the decoder D_{VT,12,3} we arrive at x̃ = D_{VT,12,3}(3,y) = (0,1,1,0,0,0,1,0,0,1,0,0). Hence, we have x̃ = (0,1,1,0,0,0,1,0,0,1,0,0), while x = (0,1,1,0,0,1,0,0,0,0,1,0).

We next characterize the difference between D_{VT,n,l}(a,y) and D_{VT,n,l}(â,y) for the case that |a − â| ≤ l, as the value â is not known beforehand. Our main result may be intuitively described as follows. Suppose that y ∈ B_D(x), where x ∈ C_VT(n,a,l) and where y is obtained by deleting the kth bit, x_k, from x. Also, assume that the value of x_k is known to the decoder and that x̃ = D_{VT,n,l}(a+v,y), for some offset v, is obtained by inserting the bit x_k into y at some position determined by the decoder. Then, if x_k = 0, we may obtain x from x̃ by sliding the inserted bit to the left/right, using a series of adjacent transposition operations, past at most |v| ones. Otherwise, if x_k = 1, then we can obtain x from x̃ by sliding the inserted bit to the left/right past at most |v| zeros. The next lemma rigorously summarizes this observation.

Lemma 8. Suppose that y is the result of a single deletion occurring in x ∈ C_VT(n,a,l) at position k. Given k, let v_L = |{j ∈ [n] : j < k, x_j = 1}| and v_R = |{j ∈ [n] : j > k, x_j = 1}|. Then,
1) If x_k = 0, then for all v ∈ {−v_R, −v_R+1, ..., v_L}, one may obtain D_{VT,n,l}(a+v,y) by inserting the symbol 0 into y immediately after the (v_L − v)-th one.
2) If x_k = 1, then for all v ∈ {−(k−1)+v_L, −(k−1)+v_L+1, ..., (n−k)−v_R}, one may obtain D_{VT,n,l}(a+v,y) by inserting the symbol 1 into y immediately after the (v+k−v_L−1)-th zero.

Example 4. Suppose that x = (0,1,1,0,0,1,0,0,0,0,1,0) ∈ C_VT(12,3,3), and that x̂ = D_{VT,n,l}(3,y) was obtained by VT decoding y = (0,1,1,0,1,0,0,0,0,1,0). For v = 2, one has D_{VT,n,l}(5,y) = (0,0,1,1,0,1,0,0,0,0,1,0), whereas for v = −1, one has D_{VT,n,l}(2,y) = (0,1,1,0,1,0,0,0,0,0,1,0). Next, suppose that y = (0,1,1,0,0,0,0,0,0,1,0), where y is the result of deleting the third 1, at position k = 6, from x = (0,1,1,0,0,1,0,0,0,0,1,0). In this case, choosing v = 3 gives D_{VT,n,l}(6,y) = (0,1,1,0,0,0,0,0,1,0,1,0), while v = −2 gives D_{VT,n,l}(1,y) = (0,1,1,1,0,0,0,0,0,0,1,0).

Proof of Lemma 8: Suppose first that y is the result of deleting a zero from x ∈ C_VT(n,a,l). Let a′ ≡ a − Σ_{i=1}^{n−1} i·y_i (mod n+2l+1). The decoder D_{VT,n,l} for C_VT(n,a,l) produces the vector x′ ∈ C_VT(n,a,l) by inserting a zero into the first position that has a′ ones to the right of it. If x_k = 0, then clearly a′ = v_R, and the decoder correctly outputs x, so that x′ = x. If the decoder D_{VT,n,l} for C_VT(n,a+v,l) were applied to y instead, one would have

    a″ ≡ a + v − Σ_{i=1}^{n−1} i·y_i ≡ a′ + v (mod n+2l+1).
Hence, the decoder D_{VT,n,l} for C_VT(n,a+v,l) would insert a zero into the vector y at the first position that has a′+v ones to the right of it. The claim follows by observing that the first position with a′+v ones to its right lies in the same run of y as the position immediately following the (v_L − v)-th one.

Suppose next that y is the result of deleting a one from x ∈ C_VT(n,a,l). Let a′ ≡ a − Σ_{i=1}^{n−1} i·y_i (mod n+2l+1). The decoder D_{VT,n,l} for C_VT(n,a,l) produces the vector x′ ∈ C_VT(n,a,l) by inserting a one into the first position k′ with a′ − k′ ones to its right. If x_k = 1, then clearly k′ = k and the decoder correctly outputs x, so that x′ = x. Note that position k appears before v_R = a′ − k ones and after k − 1 − v_L zeros (i.e., position k has a′ − k ones on its right and k − 1 − v_L zeros to its left). Furthermore, the total number of ones in x is v_L + v_R + 1 = v_L + a′ − k + 1, which implies that

    v_L + v_R = v_L + a′ − k.    (2)

If the decoder D_{VT,n,l} for C_VT(n,a+v,l) were applied to y instead, then one would have a″ ≡ a′ + v (mod n+2l+1), as before. The decoder D_{VT,n,l} for C_VT(n,a+v,l) would insert a one into the vector y at the first position k″ preceding a′+v−k″ ones (i.e., with a′+v−k″ ones to its right). This produces a vector x̃. Given (2), since the total number of ones in x̃ is v_L + v_R + 1, we know that the number of ones preceding position k″ (i.e., to its left) is

    v_L + a′ − k − (a′ + v − k″) = v_L − k − v + k″.

Thus, the number of zeros preceding k″ (i.e., to its left) is

    (k″ − 1) − (v_L − k − v + k″) = k + v − v_L − 1,

which proves the claim of the lemma.

The previous lemma motivates the introduction of a modification of VT codes, which will be used as a constituent component in a construction of codes capable of correcting a deletion and multiple adjacent transpositions. This modified code structure also leads to a straightforward decoding procedure for the underlying codes. The code is defined as follows:

    C_VT(n,a,b,l) = {x ∈ F_2^n : Σ_{i=1}^{n} i·x_i ≡ a (mod n+2l+1), Σ_{i=1}^{n} x_i ≡ b (mod 2)}.    (3)

The code C_VT(n,a,b,l) allows one to first determine the value of the deleted bit using the second, parity constraint, and to subsequently determine the location of the deleted bit using the VT-type constraint. The decoder for C_VT(n,a,b,l), denoted by D_{VT,n,b,l}, operates as follows. Suppose that x ∈ C_VT(n,a,b,l) is transmitted and that y ∈ B_{(T,l),D}(x) is received. Let n_1 denote the number of ones in y. Then, for a ∈ Z_{n+2l+1} and b ∈ F_2, D_{VT,n,b,l}(a,y) executes the following steps:

1) Set the value of the deleted bit to Σ_{i=1}^{n−1} y_i + b ≡ n_1 + b (mod 2).
2) Compute a′ ≡ a − Σ_{i=1}^{n−1} i·y_i (mod n+2l+1).
3) If the deleted bit value is 0 and a′ ∈ {0, 1, ..., n_1}, insert a zero into the first position in y that has a′ ones on its right. If a′ ∈ {n_1+1, n_1+2, ..., n_1+l}, insert a zero in the first position of y. If a′ ∈ {n+l+1, n+l+2, ..., n+2l}, insert a zero in the last position of y.
4) If the deleted bit value is 1 and a′ ∈ {n_1+1, n_1+2, ..., n}, insert a one into the first position k′ of y that has a′ − k′ ones to its right. Otherwise, if a′ ∈ {n+1, n+2, ..., n+l}, insert a one in the last position of y. If a′ ∈ {n_1−l+1, n_1−l+2, ..., n_1}, insert a one in the first position of y.

Note that the VT decoder discussed so far aims to correct a single deletion only, but potentially in a mismatched fashion, as additional adjacent transposition errors may have been incurred during deletion correction. The output of the deletion-correcting decoder has to be fed into the input of a transposition error-correcting code, and we describe how this subsequent decoding is accomplished after providing an illustration of the VT decoding process.

Example 5. Suppose that x = (0,1,1,0,0,1,0,0,0,0,1,0) ∈ C_VT(12,3,0,3), and that y = (0,1,1,0,1,0,0,0,1,0,0) is the received word, which is the result of a single deletion and a single transposition. We first apply the decoder D_{VT,12,0,3} to y. In the first step of the procedure, we conclude that the deleted bit has value 0. In the second step of decoding, we compute a′ = 3.
Since 0 ≤ a′ ≤ 4, we have x̃ = (0,1,0,1,0,1,0,0,0,1,0,0). Note that x̃ = (0,1,0,1,0,1,0,0,0,1,0,0) and x = (0,1,1,0,0,1,0,0,0,0,1,0) differ in two adjacent transpositions.

The previous example illustrates that x and x̃ differ in a limited number of transpositions, which depends on the original number of transposition errors. In particular, for the given example, the two vectors differ in two adjacent transpositions, as y is the result of a single deletion and a single transposition in x. The next lemma gives a more precise characterization of the distance between x and x̃.
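Before stating the lemma, the first two steps of D_{VT,n,b,l}, together with the main branch of step 3, can be sketched compactly. The code below is our own partial illustration (only the first branch of step 3 is implemented; the remaining branches of steps 3 and 4 are omitted), and it reproduces the computation of Example 5.

```python
# A partial sketch of the decoder D_{VT,n,b,l}: step 1 (deleted-bit value), step 2
# (residual syndrome a'), and the main branch of step 3 (insert a 0 at the first
# position with a' ones to its right). Other branches are omitted here.

def decode_vt_partial(y, n, a, b, l):
    n1 = sum(y)                                                  # number of ones in y
    deleted_bit = (n1 + b) % 2                                   # step 1
    a_prime = (a - sum(i * v for i, v in enumerate(y, 1))) % (n + 2 * l + 1)   # step 2
    if deleted_bit == 0 and a_prime <= n1:                       # main branch of step 3
        for pos in range(len(y) + 1):
            if sum(y[pos:]) == a_prime:
                return y[:pos] + [0] + y[pos:]
    return None                                                  # remaining branches omitted

y = [0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0]       # received word of Example 5
print(decode_vt_partial(y, 12, 3, 0, 3))    # [0,1,0,1,0,1,0,0,0,1,0,0], as in Example 5
```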

Lemma 9. Suppose that y^(l) ∈ B_{(T,l)}(x), where x ∈ C_VT(n,a,b,l), and where y ∈ B_D(y^(l)). Let x̃ = D_{VT,n,b,l}(a,y). Then the following statements are true:
1) If x̃ is the result of inserting a zero into y in a position with v^(1)_R ones to the right of the inserted bit, then y^(l) can be obtained from y by inserting a zero into y in the first position that precedes j ones, where j ∈ {v^(1)_R − l, v^(1)_R − l + 1, ..., v^(1)_R + l}.
2) If x̃ is the result of inserting a one into y in a position k with v^(0)_R zeros to the right of the inserted bit, then y^(l) can be obtained from y by inserting a one into y in the first position that precedes j zeros, where j ∈ {v^(0)_R − l, v^(0)_R − l + 1, ..., v^(0)_R + l}.

Proof: Suppose that x̃ = D_{VT,n,b,l}(a,y) is the result of inserting a zero into y. According to Claim 2, y^(l) ∈ C_VT(n,a+v,b,l) for some v, where |v| ≤ l. Suppose next that y is the result of deleting a zero from y^(l) at position k, where position k precedes ṽ^(1)_R ones in y^(l) and follows ṽ^(1)_L ones. Clearly, y^(l) = D_{VT,n,b,l}(a+v,y). According to Lemma 8, x̃ = D_{VT,n,b,l}((a+v)−v,y) is obtained from y by inserting a zero into the first position with ṽ^(1)_L + v ones to its left and ṽ^(1)_R − v ones to its right, which proves the first statement in the lemma.

Suppose next that x̃ = D_{VT,n,b,l}(a,y) is the result of inserting a one into y. Based on the same reasoning as in the first part of the proof, we have y^(l) ∈ C_VT(n,a+v,b,l) for some v, where |v| ≤ l. Suppose y is the result of deleting a one from y^(l) at position k, where k is such that there are ṽ^(1)_R ones to the right of this position and ṽ^(1)_L ones to the left of this position. Furthermore, we assume there are ṽ^(0)_R zeros to the right of position k and ṽ^(0)_L zeros to the left of position k. Then, y^(l) = D_{VT,n,b,l}(a+v,y). According to Lemma 8, x̃ = D_{VT,n,b,l}((a+v)−v,y) is obtained from y by inserting a one into y after the (k − ṽ^(1)_L − 1 − v)-th zero. Equivalently, we can obtain x̃ by inserting a one into y in the first position with ṽ^(0)_L − v zeros to its left and ṽ^(0)_R + v zeros to its right, since ṽ^(0)_L = (k−1) − ṽ^(1)_L.

The following corollary summarizes one of the main results of this section.

Corollary 10. Suppose that y ∈ B_{(T,l),D}(x), where x ∈ C_VT(n,a,b,l), and let x̃ = D_{VT,n,b,l}(a,y). Then x ∈ B_{(T,2l)}(x̃).

Consequently, the mismatched VT decoder increases the number of adjacent transposition errors by at most a factor of two. Based on the results on mismatched VT decoding and Corollary 10, we are now ready to define a family of codes capable of correcting a single deletion and multiple adjacent transposition errors. Recall that given a binary word x, its derivative ∂(x) = ∂x is defined as ∂x = (x_1, x_2+x_1, x_3+x_2, ..., x_n+x_{n−1}) and its inverse (integral) as ∂^{−1}(x) = ∫x = (x_1, x_1+x_2, ..., Σ_{i=1}^{n} x_i). We claim that the code C_{(T,l)D}(n,a,b) ⊆ F_2^n,

    C_{(T,l)D}(n,a,b) = {x ∈ F_2^n : ∫x ∈ C_H(n,4l+1), x ∈ C_VT(n,a,b,l)},    (4)

is an l-TD code (i.e., a code capable of correcting l adjacent transpositions (T,l) and one deletion (D)). This result intuitively follows from the fact that the coupling of a VT-type constraint and a substitution error-correcting code with sufficiently large distance can handle a single deletion along with a number of adjacent transpositions, akin to what was established in the previous sections for the case of a single adjacent transposition.

Theorem 11. The code C_{(T,l)D}(n,a,b) is an l-TD code.

Proof: Suppose that y ∈ B_{(T,l),D}(x). We show how to recover x from y. First, we determine x̃ = D_{VT,n,b,l}(a,y). From Corollary 10, we have that x ∈ B_{(T,2l)}(x̃).
Since x ∈ B_{(T,2l)}(x̃), we have d_H(∂^{−1}(x̃), ∂^{−1}(x)) ≤ 2l. Because ∂^{−1}(x) belongs to a code with minimum Hamming distance 4l+1, we can uniquely recover x from ∂^{−1}(x̃).

The following bound follows by noting the existence of binary codes of length n and minimum distance 4l+1 which have 2l·log n bits of redundancy (see [17], Problem 8.12).

Corollary 12. There exists an l-TD code with redundancy at most 2l·log n + log(n+2l+1) bits.

Next, we improve upon this result for the case when l = 1. Let a_1, a_2 ∈ Z_{n+2L+1}. Define Y_{TD}(n,a_1,a_2) ⊆ F_2^n according to

    Y_{TD}(n,a_1,a_2) = {x : x_n = 0, Σ_{i=1}^{n−1} (2i+1)·x_i ≡ a_1 (mod n+2L+1), Σ_{i=1}^{n−1} (2i+1)^3·x_i ≡ a_2 (mod n+2L+1)},

where L ≥ 1 is chosen so that n+2L+1 is a prime number greater than 2n−1. Let C_{TD}(n,a_1,a_2) = ∂Y_{TD}(n,a_1,a_2),

where ∂Y stands for the collection of all derivatives of words in Y. As we show next, the first VT-type constraint in the preceding code Y may be used to approximately correct the deletion and the adjacent transposition. Given that the approximate correction may be erroneous, the second VT-type constraint is used to perform exact correction. We have the following lemma.

Lemma 13. For all a_1, a_2 ∈ Z_{n+2L+1}, the code C_{TD}(n,a_1,a_2) is a 1-TD code.

Proof: We use the same approach as the one outlined in the proof of Claim 1. Since Σ_{i=1}^{n−1} (2i+1)·x_i ≡ a_1 (mod n+2L+1) and x_n = 0, we have that

    Σ_{i=1}^{n} i·(∂x)_i ≡ a_1 (mod n+2L+1).    (5)

Furthermore, since x_n = 0,

    Σ_{i=1}^{n} (∂x)_i ≡ 0 (mod 2).    (6)

From (5) and (6), it is clear that if x ∈ Y_{TD}(n,a_1,a_2), then ∂x ∈ C_VT(n,a_1,0,L). Similarly to what was done in Theorem 11, it can be shown that if L ≥ 1 and Y_{TD}(n,a_1,a_2) has Hamming distance at least 5, then C_{TD}(n,a_1,a_2) is a 1-TD code. By design, L ≥ 1, and so we turn our attention to showing that Y_{TD}(n,a_1,a_2) has Hamming distance at least 5.

We claim that the vectors in Y_{TD}(n,a_1,a_2) represent a coset of a Berlekamp code [17, Chapter 10.6] with Lee distance 5, which implies the desired result. To prove the claim, note that the binary code Y_{TD}(n,0,0) has a parity-check matrix of the form

    H = [ 3    5    7    ...  2n−1
          3^3  5^3  7^3  ...  (2n−1)^3 ].

According to [17, Chapter 10.6], in order for Y_{TD}(n,0,0) to have minimum Lee distance 5, the following statement has to be true: for any two columns of H, say h_i and h_j, it has to hold that h_i + h_j ≠ [0, c]^T for any possible choice of c ∈ F_{n+2L+1}. Clearly, this condition is true, since n+2L+1 is an odd prime and the sum of two odd numbers cannot equal another odd number. Thus, Y_{TD}(n,0,0) has minimum Lee distance at least 5, and so Y_{TD}(n,a_1,a_2) has minimum Lee distance at least 5, as claimed.

The above construction improves upon the general construction described by (4) in terms of log n bits of redundancy.

Remark 1. It has been a long-standing open problem to find extensions of the single-deletion VT code construction which would have order-optimal redundancy and impose syndrome constraints of the form Σ_i f_k(i)·x_i ≡ a (mod n+1), for some judiciously chosen functions f_k(i). Attempts based on using this approach have failed so far [2]. On the other hand, the result of Lemma 13 shows that syndrome constraints of the form described above can accommodate combinations of one deletion and other forms of errors, such as adjacent transpositions.

Corollary 14. There exists a 1-TD code with redundancy at most 2·log n + c bits, for some absolute constant c.

In the next section, we turn our attention to the problem of constructing codes capable of correcting transposition and deletion errors in the form of blocks of bits. First, we analyze the problem of constructing codes capable of correcting a single block of adjacent deletions. Then, we focus on constructing codes capable of correcting a single transposition of adjacent blocks in addition to handling one block deletion.

V. CODES FOR CORRECTING A BLOCK OF DELETIONS

We describe next a new family of codes capable of correcting one block of at most b consecutive deletions; the codes require log b·log n + o(b^2·log b·log log n) bits of redundancy, and hence improve upon the state-of-the-art scheme which requires at least (b−1)·log n bits of redundancy [19].
The proposed block-deletion codes will subsequently be used in Section VI to construct codes capable of correcting both a block of deletions (which we alternatively refer to as a burst of deletions) and an adjacent transposition of two blocks of consecutive symbols. To explain the intuition behind our approach, we start with a short overview of existing code constructions for correcting a block of consecutive deletions, where the length of the block is fixed. It will be helpful to think of codewords of length n = c·b, c ≥ 1, as two-dimensional arrays formed by writing the bits of the codeword column-wise, i.e., by placing the bits (x_1, x_2, ..., x_b)

in an orderly fashion within the first column of the array, the bits (x_{b+1}, x_{b+2}, ..., x_{2b}) within the second column, and so on. As an example, for c = n/b, the codeword x = (x_1, ..., x_n) would read as follows:

    x_1   x_{b+1}   x_{2b+1}   ...   x_{(c−1)b+1}
    x_2   x_{b+2}   x_{2b+2}   ...   x_{(c−1)b+2}
    ...   ...       ...        ...   ...
    x_b   x_{2b}    x_{3b}     ...   x_n            (7)

For simplicity, throughout the remainder of this section, we use the term interleaved sequence to refer to a row in the array. Note that in this setting, a block of b consecutive deletions in a codeword x leads to one deletion within each interleaved sequence, and that the locations of the deletions in the interleaved sequences are correlated. As an example, the block may cause the same deletion location in the first interleaved sequence, but affect the symbols in the other interleaved sequences differently; arrays (8)-(10) depict the array of (7) for three different alignments of the deleted block of symbols (the deleted symbols are underlined).

As a result, finding the location of the deletion in the first interleaved sequence does not automatically allow one to determine the shift of the block with respect to that location. Furthermore, deletion-correcting codes such as VT codes only identify the run of symbols in which the deletion occurred and not its exact position, as the goal is to reconstruct the correct codeword and not to precisely determine the location of the error. As a result, further uncertainty exists about the locations of the deletions in the second, third, etc. interleaved sequences of the codeword.

To mitigate these problems, the authors of [4] proposed a construction of codes capable of correcting a block of consecutive deletions of length exactly b, based on imposing simple constraints on the interleaved sequences of a codeword. A construction with redundancy of approximately b·log n bits requires all the interleaved sequences of (7) to belong to a VT code. The main drawback of this construction is that each interleaved sequence is treated independently of the others and that, consequently, the redundancy of the codes is too high. To address this problem, one should use the position of the deletion in the first interleaved sequence to approximately determine the location of the deletion in the second row, and similarly for all other subsequent rows. In [4], the authors also proposed a code which has an alternating sequence (i.e., a sequence of the form 0,1,0,1,0,1,...) as its first interleaved sequence, with all the remaining interleaved sequences satisfying a constraint that requires roughly b·log 3 bits of redundancy. The proposed code may be easily decoded by first determining the location of the deletion in the first row through a reference to the alternating sequence structure. Then, this location is used by the remaining rows to correct the remaining b−1 deletions. This approach requires at least n/b bits of redundancy, due to the fact that one has to fix the first row of the codeword array. Thus, the redundancy of this approach is actually higher than that of the code that uses individual VT code constraints for each interleaved sequence. The alternating sequence approach was improved and generalized in [19], where the authors constructed block deletion-correcting codes with a significantly more relaxed constraint placed on the first interleaved sequence. Their idea was to combine constrained coding with a variant of VT codes, which we explain in detail in what follows.
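The array view in (7) is the structural workhorse of this section. The short Python sketch below (our own helpers, with integers standing in for bits) extracts the interleaved sequences and checks that a burst of b consecutive deletions removes exactly one symbol from each of them.

```python
# A minimal sketch of the array view in (7): row j is the interleaved sequence
# x_j, x_{j+b}, x_{j+2b}, ...; a block of b consecutive deletions deletes exactly one
# (correlated) position in every interleaved sequence.

def interleaved_sequences(x, b):
    return [x[j::b] for j in range(b)]

def delete_block(x, start, b):
    return x[:start] + x[start + b:]        # delete b consecutive symbols from index `start`

b = 3
x = list(range(1, 13))                       # positions 1..12 stand in for bits, n = 12, c = 4
rows_before = interleaved_sequences(x, b)
rows_after = interleaved_sequences(delete_block(x, 4, b), b)
for before, after in zip(rows_before, rows_after):
    assert len(before) - len(after) == 1     # one deletion per interleaved sequence
```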
The relaxed constraints allow one to approximately determine the locations of the remaining deletions in x after decoding the first interleaved sequence of the array. The constrained coding and VT-type constraints imposed on the higher-index rows nevertheless allow for unique recovery of the codeword x, by using VT codes confined to the suspect range predicted to harbor the deletions. The codes constructed in [19] require approximately log n bits of redundancy for the constraint in the first row of the array, and log log n bits of redundancy for each of the remaining rows. This results in a total redundancy of roughly log n + (b−1)·log log n bits for correcting a block of consecutive deletions of length exactly b (compared to the redundancy of [4], which equals b·log n bits). To allow for correcting any single block of length at most b, the codes from [19] have to be changed so as to include nested redundant bits that capture multiple coding constraints and may allow for correcting a range of block lengths. Which of the constraints to use is apparent upon observing the length of the received word: to correct one block of any possible length at most b,

the decoder for the underlying code locates the position of the block of consecutive deletions differently for each possible block length. For instance, if x experiences a block error of length b_1, then the code uses one VT-type constraint, say K_{VT,1}. However, if x experiences an error burst of length b_2 with b_2 < b_1, then the code effectively uses a different VT-type constraint, say K_{VT,2}. Note that since each of the constraints K_{VT,i}, 2 ≤ i ≤ b, is de facto a VT-type constraint, one requires roughly (b−1)·log n + b^2·log log n bits of redundancy, compared to the b^2·log n redundancy which would have been required by the scheme in [4]. Our approach in this work for a further improvement is to reuse the same VT-type constraint for multiple possible block lengths, in which case the redundancy amounts to roughly log b·log n + b^2·log b·log log n bits. To describe this method, we start with a construction that allows for correcting one odd-length block of consecutive deletions of length at most b, and then proceed to extend the result to even-length blocks.

A. Odd Length Blocks

Our code construction is centered around three main ideas:
1) The use of VT codes (12).
2) The use of a running sum constraint (14).
3) The use of a sequence of Shifted VT codes [19], defined in (17), i.e., codes that enforce multiple modular VT-type constraints with parameter values smaller than n+1.

As discussed in more detail in what follows, since our choice of the Shifted VT codes requires approximately b^2·log log n bits of redundancy and the constrained coding constraint requires a single bit of redundancy, the proposed construction introduces roughly log n + b^2·log log n bits of redundancy.

The decoder operates as follows. Suppose that y is the result of deleting t consecutive bits from x, with t ≤ b and t odd. Then,
1) The decoder computes a number of parities and decides on the appropriate Shifted VT code (17) to use in determining the Hamming weight of the bits deleted from x.
2) Using both the VT-type constraint (12) and the constraint (14), the decoder determines an approximate location for the block deletion in x that resulted in y.
3) Given the approximate location of the block of deletions, the decoder uses a series of Shifted VT codes (17) to determine the exact locations and values of the bits deleted from x that led to y.

Part 1. Determining the weight of the deleted substring. We start with some relevant terminology and notation. For a word x ∈ F_2^n, let B_{D,≤b}(x) denote the set of all words that may be obtained from x by deleting at most b consecutive bits. For example, for x = (0,1,1,0,0,1) ∈ F_2^6, we have

    B_{D,≤2}(x) = {(0,1,1,0,0,1), (1,1,0,0,1), (0,1,0,0,1), (0,1,1,0,1), (0,1,1,0,0), (1,0,0,1), (0,0,0,1), (0,1,0,1), (0,1,1,1), (0,1,1,0)}.

Similarly, let B_{D,b}(x) denote the set of words that may be obtained from x by deleting exactly b consecutive bits. Furthermore, given a vector d ∈ F_2^b, define the code C_par(n,b,d) as follows^4:

    C_par(n,b,d) = {x ∈ F_2^n : for all j ∈ [b], Σ_{i=0}^{⌊(n−j)/b⌋} x_{j+ib} ≡ d_j (mod 2)}.

It is straightforward to see that the code imposes a single parity-check constraint on each of the interleaved sequences of x, which suffices to determine the weight of the deleted block. In addition, we observe that we used a set of parameters d_j for the weight constraints, rather than the classical even parity constraints, for reasons that will become apparent in the subsequent exposition. In a nutshell, the resulting codes of this section will be nonlinear, and averaging arguments for the size of the codes require the use of a range of parameter values.

Example 6.
Suppose that x = (0,1,1,0,0,1,0,1,0,1,0,1) ∈ C_par(12,2,(1,1)) was transmitted and y = (0,1,1,0,0,1,0,1,0,1) ∈ B_{D,2}(x) was received instead. Since x ∈ C_par(12,2,(1,1)), it is straightforward to determine that the bits 0, 1 were deleted from x to obtain y. Notice, however, that we cannot infer the order in which the deleted bits {0,1} appeared in x from the constraints of the code C_par(12,2,(1,1)), nor their exact location.

Part 2. Imposing the generalized VT conditions.

^4 We use the subscript "par" to refer to the function of the code, which is to recover the weight of the deleted block (substring) by using a parity check.
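The parity bookkeeping of Example 6 amounts to one parity check per interleaved sequence. The sketch below (our own helper names) checks membership in C_par(n,b,d) and recovers the parity of the weight of a deleted block of exactly b bits, which for b = 2 already pins down the deleted bits up to order, as in the example.

```python
# A minimal sketch of C_par(n, b, d): each interleaved sequence must have parity d_j.

def in_C_par(x, b, d):
    return all(sum(x[j::b]) % 2 == d[j] for j in range(b))

def deleted_weight_parity(y, d):
    # parity of the total weight of the deleted block; for b = 2 deleted bits this
    # determines the weight itself (parity 1 means the deleted bits were {0, 1})
    return (sum(d) - sum(y)) % 2

x = [0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1]
d = (1, 1)
assert in_C_par(x, 2, d)
y = x[:4] + x[6:]                      # delete the block (x_5, x_6) = (0, 1)
print(deleted_weight_parity(y, d))     # 1: the two deleted bits have weight 1, i.e., {0, 1}
```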


More information

Determinants of generalized binary band matrices

Determinants of generalized binary band matrices Determinants of generalized inary and matrices Dmitry Efimov arxiv:17005655v1 [mathra] 18 Fe 017 Department of Mathematics, Komi Science Centre UrD RAS, Syktyvkar, Russia Astract Under inary matrices we

More information

arxiv: v1 [math.co] 27 Aug 2015

arxiv: v1 [math.co] 27 Aug 2015 P-positions in Modular Extensions to Nim arxiv:1508.07054v1 [math.co] 7 Aug 015 Tanya Khovanova August 31, 015 Abstract Karan Sarkar In this paper, we consider a modular extension to the game of Nim, which

More information

Robust Network Codes for Unicast Connections: A Case Study

Robust Network Codes for Unicast Connections: A Case Study Robust Network Codes for Unicast Connections: A Case Study Salim Y. El Rouayheb, Alex Sprintson, and Costas Georghiades Department of Electrical and Computer Engineering Texas A&M University College Station,

More information

Linear Programming Decoding of Binary Linear Codes for Symbol-Pair Read Channels

Linear Programming Decoding of Binary Linear Codes for Symbol-Pair Read Channels 1 Linear Programming Decoding of Binary Linear Codes for Symbol-Pair Read Channels Shunsuke Horii, Toshiyasu Matsushima, and Shigeichi Hirasawa arxiv:1508.01640v2 [cs.it] 29 Sep 2015 Abstract In this paper,

More information

Transposition as a permutation: a tale of group actions and modular arithmetic

Transposition as a permutation: a tale of group actions and modular arithmetic Transposition as a permutation: a tale of group actions and modular arithmetic Jeff Hooper Franklin Mendivil Department of Mathematics and Statistics Acadia University Abstract Converting a matrix from

More information

Zeroing the baseball indicator and the chirality of triples

Zeroing the baseball indicator and the chirality of triples 1 2 3 47 6 23 11 Journal of Integer Sequences, Vol. 7 (2004), Article 04.1.7 Zeroing the aseall indicator and the chirality of triples Christopher S. Simons and Marcus Wright Department of Mathematics

More information

THE BALANCED DECOMPOSITION NUMBER AND VERTEX CONNECTIVITY

THE BALANCED DECOMPOSITION NUMBER AND VERTEX CONNECTIVITY THE BALANCED DECOMPOSITION NUMBER AND VERTEX CONNECTIVITY SHINYA FUJITA AND HENRY LIU Astract The alanced decomposition numer f(g) of a graph G was introduced y Fujita and Nakamigawa [Discr Appl Math,

More information

Generalized Reed-Solomon Codes

Generalized Reed-Solomon Codes Chapter 5 Generalized Reed-Solomon Codes In 1960, I.S. Reed and G. Solomon introduced a family of error-correcting codes that are douly lessed. The codes and their generalizations are useful in practice,

More information

Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Discussion 6A Solution

Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Discussion 6A Solution CS 70 Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Discussion 6A Solution 1. Polynomial intersections Find (and prove) an upper-bound on the number of times two distinct degree

More information

Cyclic Redundancy Check Codes

Cyclic Redundancy Check Codes Cyclic Redundancy Check Codes Lectures No. 17 and 18 Dr. Aoife Moloney School of Electronics and Communications Dublin Institute of Technology Overview These lectures will look at the following: Cyclic

More information

EE512: Error Control Coding

EE512: Error Control Coding EE51: Error Control Coding Solution for Assignment on BCH and RS Codes March, 007 1. To determine the dimension and generator polynomial of all narrow sense binary BCH codes of length n = 31, we have to

More information

MATH 291T CODING THEORY

MATH 291T CODING THEORY California State University, Fresno MATH 291T CODING THEORY Spring 2009 Instructor : Stefaan Delcroix Chapter 1 Introduction to Error-Correcting Codes It happens quite often that a message becomes corrupt

More information

A Division Algorithm Approach to p-adic Sylvester Expansions

A Division Algorithm Approach to p-adic Sylvester Expansions arxiv:1508.01503v1 [math.nt] 6 Aug 2015 A Division Algorithm Approach to p-adic Sylvester Expansions Eric Errthum Department of Mathematics and Statistics Winona State University Winona, MN E-mail: eerrthum@winona.edu

More information

Branching Bisimilarity with Explicit Divergence

Branching Bisimilarity with Explicit Divergence Branching Bisimilarity with Explicit Divergence Ro van Glaeek National ICT Australia, Sydney, Australia School of Computer Science and Engineering, University of New South Wales, Sydney, Australia Bas

More information

At first numbers were used only for counting, and 1, 2, 3,... were all that was needed. These are called positive integers.

At first numbers were used only for counting, and 1, 2, 3,... were all that was needed. These are called positive integers. 1 Numers One thread in the history of mathematics has een the extension of what is meant y a numer. This has led to the invention of new symols and techniques of calculation. When you have completed this

More information

The Capacity Region of 2-Receiver Multiple-Input Broadcast Packet Erasure Channels with Channel Output Feedback

The Capacity Region of 2-Receiver Multiple-Input Broadcast Packet Erasure Channels with Channel Output Feedback IEEE TRANSACTIONS ON INFORMATION THEORY, ONLINE PREPRINT 2014 1 The Capacity Region of 2-Receiver Multiple-Input Broadcast Packet Erasure Channels with Channel Output Feedack Chih-Chun Wang, Memer, IEEE,

More information

Support weight enumerators and coset weight distributions of isodual codes

Support weight enumerators and coset weight distributions of isodual codes Support weight enumerators and coset weight distributions of isodual codes Olgica Milenkovic Department of Electrical and Computer Engineering University of Colorado, Boulder March 31, 2003 Abstract In

More information

3. Coding theory 3.1. Basic concepts

3. Coding theory 3.1. Basic concepts 3. CODING THEORY 1 3. Coding theory 3.1. Basic concepts In this chapter we will discuss briefly some aspects of error correcting codes. The main problem is that if information is sent via a noisy channel,

More information

Exploring Lucas s Theorem. Abstract: Lucas s Theorem is used to express the remainder of the binomial coefficient of any two

Exploring Lucas s Theorem. Abstract: Lucas s Theorem is used to express the remainder of the binomial coefficient of any two Delia Ierugan Exploring Lucas s Theorem Astract: Lucas s Theorem is used to express the remainder of the inomial coefficient of any two integers m and n when divided y any prime integer p. The remainder

More information

Vector Spaces. EXAMPLE: Let R n be the set of all n 1 matrices. x 1 x 2. x n

Vector Spaces. EXAMPLE: Let R n be the set of all n 1 matrices. x 1 x 2. x n Vector Spaces DEFINITION: A vector space is a nonempty set V of ojects, called vectors, on which are defined two operations, called addition and multiplication y scalars (real numers), suject to the following

More information

Isolated Toughness and Existence of [a, b]-factors in Graphs

Isolated Toughness and Existence of [a, b]-factors in Graphs Isolated Toughness and Existence of [a, ]-factors in Graphs Yinghong Ma 1 and Qinglin Yu 23 1 Department of Computing Science Shandong Normal University, Jinan, Shandong, China 2 Center for Cominatorics,

More information

New Minimal Weight Representations for Left-to-Right Window Methods

New Minimal Weight Representations for Left-to-Right Window Methods New Minimal Weight Representations for Left-to-Right Window Methods James A. Muir 1 and Douglas R. Stinson 2 1 Department of Combinatorics and Optimization 2 School of Computer Science University of Waterloo

More information

Lecture 12. Block Diagram

Lecture 12. Block Diagram Lecture 12 Goals Be able to encode using a linear block code Be able to decode a linear block code received over a binary symmetric channel or an additive white Gaussian channel XII-1 Block Diagram Data

More information

ERASMUS UNIVERSITY ROTTERDAM Information concerning the Entrance examination Mathematics level 2 for International Business Administration (IBA)

ERASMUS UNIVERSITY ROTTERDAM Information concerning the Entrance examination Mathematics level 2 for International Business Administration (IBA) ERASMUS UNIVERSITY ROTTERDAM Information concerning the Entrance examination Mathematics level 2 for International Business Administration (IBA) General information Availale time: 2.5 hours (150 minutes).

More information

ALGEBRA. 1. Some elementary number theory 1.1. Primes and divisibility. We denote the collection of integers

ALGEBRA. 1. Some elementary number theory 1.1. Primes and divisibility. We denote the collection of integers ALGEBRA CHRISTIAN REMLING 1. Some elementary number theory 1.1. Primes and divisibility. We denote the collection of integers by Z = {..., 2, 1, 0, 1,...}. Given a, b Z, we write a b if b = ac for some

More information

Chapter 3 Linear Block Codes

Chapter 3 Linear Block Codes Wireless Information Transmission System Lab. Chapter 3 Linear Block Codes Institute of Communications Engineering National Sun Yat-sen University Outlines Introduction to linear block codes Syndrome and

More information

Correcting Localized Deletions Using Guess & Check Codes

Correcting Localized Deletions Using Guess & Check Codes 55th Annual Allerton Conference on Communication, Control, and Computing Correcting Localized Deletions Using Guess & Check Codes Salim El Rouayheb Rutgers University Joint work with Serge Kas Hanna and

More information

Berlekamp-Massey decoding of RS code

Berlekamp-Massey decoding of RS code IERG60 Coding for Distributed Storage Systems Lecture - 05//06 Berlekamp-Massey decoding of RS code Lecturer: Kenneth Shum Scribe: Bowen Zhang Berlekamp-Massey algorithm We recall some notations from lecture

More information

Solutions of Exam Coding Theory (2MMC30), 23 June (1.a) Consider the 4 4 matrices as words in F 16

Solutions of Exam Coding Theory (2MMC30), 23 June (1.a) Consider the 4 4 matrices as words in F 16 Solutions of Exam Coding Theory (2MMC30), 23 June 2016 (1.a) Consider the 4 4 matrices as words in F 16 2, the binary vector space of dimension 16. C is the code of all binary 4 4 matrices such that the

More information

Finite Automata and Regular Languages (part II)

Finite Automata and Regular Languages (part II) Finite Automata and Regular Languages (part II) Prof. Dan A. Simovici UMB 1 / 25 Outline 1 Nondeterministic Automata 2 / 25 Definition A nondeterministic finite automaton (ndfa) is a quintuple M = (A,

More information

Discrete Applied Mathematics

Discrete Applied Mathematics Discrete Applied Mathematics 194 (015) 37 59 Contents lists available at ScienceDirect Discrete Applied Mathematics journal homepage: wwwelseviercom/locate/dam Loopy, Hankel, and combinatorially skew-hankel

More information

RED. Name: Math 290 Fall 2016 Sample Exam 3

RED. Name: Math 290 Fall 2016 Sample Exam 3 RED Name: Math 290 Fall 2016 Sample Exam 3 Note that the first 10 questions are true false. Mark A for true, B for false. Questions 11 through 20 are multiple choice. Mark the correct answer on your ule

More information

Linear Cyclic Codes. Polynomial Word 1 + x + x x 4 + x 5 + x x + x f(x) = q(x)h(x) + r(x),

Linear Cyclic Codes. Polynomial Word 1 + x + x x 4 + x 5 + x x + x f(x) = q(x)h(x) + r(x), Coding Theory Massoud Malek Linear Cyclic Codes Polynomial and Words A polynomial of degree n over IK is a polynomial p(x) = a 0 + a 1 + + a n 1 x n 1 + a n x n, where the coefficients a 1, a 2,, a n are

More information

Lecture 2 Linear Codes

Lecture 2 Linear Codes Lecture 2 Linear Codes 2.1. Linear Codes From now on we want to identify the alphabet Σ with a finite field F q. For general codes, introduced in the last section, the description is hard. For a code of

More information

Module 9: Further Numbers and Equations. Numbers and Indices. The aim of this lesson is to enable you to: work with rational and irrational numbers

Module 9: Further Numbers and Equations. Numbers and Indices. The aim of this lesson is to enable you to: work with rational and irrational numbers Module 9: Further Numers and Equations Lesson Aims The aim of this lesson is to enale you to: wor with rational and irrational numers wor with surds to rationalise the denominator when calculating interest,

More information

Math 581 Problem Set 9

Math 581 Problem Set 9 Math 581 Prolem Set 9 1. Let m and n e relatively prime positive integers. (a) Prove that Z/mnZ = Z/mZ Z/nZ as RINGS. (Hint: First Isomorphism Theorem) Proof: Define ϕz Z/mZ Z/nZ y ϕ(x) = ([x] m, [x] n

More information

Hybrid Noncoherent Network Coding

Hybrid Noncoherent Network Coding Hybrid Noncoherent Network Coding Vitaly Skachek, Olgica Milenkovic, Angelia Nedić University of Illinois, Urbana-Champaign 1308 W. Main Street, Urbana, IL 61801, USA Abstract We describe a novel extension

More information

THE JACOBI SYMBOL AND A METHOD OF EISENSTEIN FOR CALCULATING IT

THE JACOBI SYMBOL AND A METHOD OF EISENSTEIN FOR CALCULATING IT THE JACOBI SYMBOL AND A METHOD OF EISENSTEIN FOR CALCULATING IT STEVEN H. WEINTRAUB ABSTRACT. We present an exposition of the asic properties of the Jacoi symol, with a method of calculating it due to

More information

Information redundancy

Information redundancy Information redundancy Information redundancy add information to date to tolerate faults error detecting codes error correcting codes data applications communication memory p. 2 - Design of Fault Tolerant

More information

Mathematics Course 111: Algebra I Part I: Algebraic Structures, Sets and Permutations

Mathematics Course 111: Algebra I Part I: Algebraic Structures, Sets and Permutations Mathematics Course 111: Algebra I Part I: Algebraic Structures, Sets and Permutations D. R. Wilkins Academic Year 1996-7 1 Number Systems and Matrix Algebra Integers The whole numbers 0, ±1, ±2, ±3, ±4,...

More information

Lecture 1 : Data Compression and Entropy

Lecture 1 : Data Compression and Entropy CPS290: Algorithmic Foundations of Data Science January 8, 207 Lecture : Data Compression and Entropy Lecturer: Kamesh Munagala Scribe: Kamesh Munagala In this lecture, we will study a simple model for

More information

Correcting Bursty and Localized Deletions Using Guess & Check Codes

Correcting Bursty and Localized Deletions Using Guess & Check Codes Correcting Bursty and Localized Deletions Using Guess & Chec Codes Serge Kas Hanna, Salim El Rouayheb ECE Department, Rutgers University serge..hanna@rutgers.edu, salim.elrouayheb@rutgers.edu Abstract

More information

CS 4120 Lecture 3 Automating lexical analysis 29 August 2011 Lecturer: Andrew Myers. 1 DFAs

CS 4120 Lecture 3 Automating lexical analysis 29 August 2011 Lecturer: Andrew Myers. 1 DFAs CS 42 Lecture 3 Automating lexical analysis 29 August 2 Lecturer: Andrew Myers A lexer generator converts a lexical specification consisting of a list of regular expressions and corresponding actions into

More information

SUFFIX TREE. SYNONYMS Compact suffix trie

SUFFIX TREE. SYNONYMS Compact suffix trie SUFFIX TREE Maxime Crochemore King s College London and Université Paris-Est, http://www.dcs.kcl.ac.uk/staff/mac/ Thierry Lecroq Université de Rouen, http://monge.univ-mlv.fr/~lecroq SYNONYMS Compact suffix

More information

Edge Isoperimetric Theorems for Integer Point Arrays

Edge Isoperimetric Theorems for Integer Point Arrays Edge Isoperimetric Theorems for Integer Point Arrays R. Ahlswede, S.L. Bezrukov Universität Bielefeld, Fakultät für Mathematik Postfach 100131, 33501 Bielefeld, Germany Abstract We consider subsets of

More information

Error Detection and Correction: Hamming Code; Reed-Muller Code

Error Detection and Correction: Hamming Code; Reed-Muller Code Error Detection and Correction: Hamming Code; Reed-Muller Code Greg Plaxton Theory in Programming Practice, Spring 2005 Department of Computer Science University of Texas at Austin Hamming Code: Motivation

More information

Math 3450 Homework Solutions

Math 3450 Homework Solutions Math 3450 Homework Solutions I have decided to write up all the solutions to prolems NOT assigned from the textook first. There are three more sets to write up and I am doing those now. Once I get the

More information

Scheduling Two Agents on a Single Machine: A Parameterized Analysis of NP-hard Problems

Scheduling Two Agents on a Single Machine: A Parameterized Analysis of NP-hard Problems Scheduling Two Agents on a Single Machine: A Parameterized Analysis of NP-hard Prolems Danny Hermelin 1, Judith-Madeleine Kuitza 2, Dvir Shatay 1, Nimrod Talmon 3, and Gerhard Woeginger 4 arxiv:1709.04161v1

More information

HW2 Solutions Problem 1: 2.22 Find the sign and inverse of the permutation shown in the book (and below).

HW2 Solutions Problem 1: 2.22 Find the sign and inverse of the permutation shown in the book (and below). Teddy Einstein Math 430 HW Solutions Problem 1:. Find the sign and inverse of the permutation shown in the book (and below). Proof. Its disjoint cycle decomposition is: (19)(8)(37)(46) which immediately

More information

Robot Position from Wheel Odometry

Robot Position from Wheel Odometry Root Position from Wheel Odometry Christopher Marshall 26 Fe 2008 Astract This document develops equations of motion for root position as a function of the distance traveled y each wheel as a function

More information

Long non-crossing configurations in the plane

Long non-crossing configurations in the plane Long non-crossing configurations in the plane Adrian Dumitrescu Csaa D. Tóth July 4, 00 Astract We revisit some maximization prolems for geometric networks design under the non-crossing constraint, first

More information

Information-Theoretic Lower Bounds on the Storage Cost of Shared Memory Emulation

Information-Theoretic Lower Bounds on the Storage Cost of Shared Memory Emulation Information-Theoretic Lower Bounds on the Storage Cost of Shared Memory Emulation Viveck R. Cadambe EE Department, Pennsylvania State University, University Park, PA, USA viveck@engr.psu.edu Nancy Lynch

More information

9 THEORY OF CODES. 9.0 Introduction. 9.1 Noise

9 THEORY OF CODES. 9.0 Introduction. 9.1 Noise 9 THEORY OF CODES Chapter 9 Theory of Codes After studying this chapter you should understand what is meant by noise, error detection and correction; be able to find and use the Hamming distance for a

More information

A Polynomial-Time Algorithm for Pliable Index Coding

A Polynomial-Time Algorithm for Pliable Index Coding 1 A Polynomial-Time Algorithm for Pliable Index Coding Linqi Song and Christina Fragouli arxiv:1610.06845v [cs.it] 9 Aug 017 Abstract In pliable index coding, we consider a server with m messages and n

More information

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32 CS 473: Algorithms Ruta Mehta University of Illinois, Urbana-Champaign Spring 2018 Ruta (UIUC) CS473 1 Spring 2018 1 / 32 CS 473: Algorithms, Spring 2018 Universal Hashing Lecture 10 Feb 15, 2018 Most

More information

Laplacian Integral Graphs with Maximum Degree 3

Laplacian Integral Graphs with Maximum Degree 3 Laplacian Integral Graphs with Maximum Degree Steve Kirkland Department of Mathematics and Statistics University of Regina Regina, Saskatchewan, Canada S4S 0A kirkland@math.uregina.ca Submitted: Nov 5,

More information

MATH Examination for the Module MATH-3152 (May 2009) Coding Theory. Time allowed: 2 hours. S = q

MATH Examination for the Module MATH-3152 (May 2009) Coding Theory. Time allowed: 2 hours. S = q MATH-315201 This question paper consists of 6 printed pages, each of which is identified by the reference MATH-3152 Only approved basic scientific calculators may be used. c UNIVERSITY OF LEEDS Examination

More information

Min-Rank Conjecture for Log-Depth Circuits

Min-Rank Conjecture for Log-Depth Circuits Min-Rank Conjecture for Log-Depth Circuits Stasys Jukna a,,1, Georg Schnitger b,1 a Institute of Mathematics and Computer Science, Akademijos 4, LT-80663 Vilnius, Lithuania b University of Frankfurt, Institut

More information

A Simple Left-to-Right Algorithm for Minimal Weight Signed Radix-r Representations

A Simple Left-to-Right Algorithm for Minimal Weight Signed Radix-r Representations A Simple Left-to-Right Algorithm for Minimal Weight Signed Radix-r Representations James A. Muir School of Computer Science Carleton University, Ottawa, Canada http://www.scs.carleton.ca/ jamuir 23 October

More information

MATH 291T CODING THEORY

MATH 291T CODING THEORY California State University, Fresno MATH 291T CODING THEORY Fall 2011 Instructor : Stefaan Delcroix Contents 1 Introduction to Error-Correcting Codes 3 2 Basic Concepts and Properties 6 2.1 Definitions....................................

More information

1Number ONLINE PAGE PROOFS. systems: real and complex. 1.1 Kick off with CAS

1Number ONLINE PAGE PROOFS. systems: real and complex. 1.1 Kick off with CAS 1Numer systems: real and complex 1.1 Kick off with CAS 1. Review of set notation 1.3 Properties of surds 1. The set of complex numers 1.5 Multiplication and division of complex numers 1.6 Representing

More information

New Constructions of Sonar Sequences

New Constructions of Sonar Sequences INTERNATIONAL JOURNAL OF BASIC & APPLIED SCIENCES IJBAS-IJENS VOL.:14 NO.:01 12 New Constructions of Sonar Sequences Diego F. Ruiz 1, Carlos A. Trujillo 1, and Yadira Caicedo 2 1 Department of Mathematics,

More information

MATH/MTHE 406 Homework Assignment 2 due date: October 17, 2016

MATH/MTHE 406 Homework Assignment 2 due date: October 17, 2016 MATH/MTHE 406 Homework Assignment 2 due date: October 17, 2016 Notation: We will use the notations x 1 x 2 x n and also (x 1, x 2,, x n ) to denote a vector x F n where F is a finite field. 1. [20=6+5+9]

More information

MATH32031: Coding Theory Part 15: Summary

MATH32031: Coding Theory Part 15: Summary MATH32031: Coding Theory Part 15: Summary 1 The initial problem The main goal of coding theory is to develop techniques which permit the detection of errors in the transmission of information and, if necessary,

More information

On Fast Bitonic Sorting Networks

On Fast Bitonic Sorting Networks On Fast Bitonic Sorting Networks Tamir Levi Ami Litman January 1, 2009 Abstract This paper studies fast Bitonic sorters of arbitrary width. It constructs such a sorter of width n and depth log(n) + 3,

More information

FinQuiz Notes

FinQuiz Notes Reading 9 A time series is any series of data that varies over time e.g. the quarterly sales for a company during the past five years or daily returns of a security. When assumptions of the regression

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION SUPPLEMENTARY INFORMATION doi:10.1038/nature11875 Method for Encoding and Decoding Arbitrary Computer Files in DNA Fragments 1 Encoding 1.1: An arbitrary computer file is represented as a string S 0 of

More information

MATH 433 Applied Algebra Lecture 21: Linear codes (continued). Classification of groups.

MATH 433 Applied Algebra Lecture 21: Linear codes (continued). Classification of groups. MATH 433 Applied Algebra Lecture 21: Linear codes (continued). Classification of groups. Binary codes Let us assume that a message to be transmitted is in binary form. That is, it is a word in the alphabet

More information

1 Maintaining a Dictionary

1 Maintaining a Dictionary 15-451/651: Design & Analysis of Algorithms February 1, 2016 Lecture #7: Hashing last changed: January 29, 2016 Hashing is a great practical tool, with an interesting and subtle theory too. In addition

More information

CS6304 / Analog and Digital Communication UNIT IV - SOURCE AND ERROR CONTROL CODING PART A 1. What is the use of error control coding? The main use of error control coding is to reduce the overall probability

More information

Berge Trigraphs. Maria Chudnovsky 1 Princeton University, Princeton NJ March 15, 2004; revised December 2, Research Fellow.

Berge Trigraphs. Maria Chudnovsky 1 Princeton University, Princeton NJ March 15, 2004; revised December 2, Research Fellow. Berge Trigraphs Maria Chudnovsky 1 Princeton University, Princeton NJ 08544 March 15, 2004; revised December 2, 2005 1 This research was partially conducted during the period the author served as a Clay

More information

Program Analysis. Lecture 5. Rayna Dimitrova WS 2016/2017

Program Analysis. Lecture 5. Rayna Dimitrova WS 2016/2017 Program Analysis Lecture 5 Rayna Dimitrova WS 2016/2017 2/21 Recap: Constant propagation analysis Goal: For each program point, determine whether a variale has a constant value whenever an execution reaches

More information

Graph coloring, perfect graphs

Graph coloring, perfect graphs Lecture 5 (05.04.2013) Graph coloring, perfect graphs Scribe: Tomasz Kociumaka Lecturer: Marcin Pilipczuk 1 Introduction to graph coloring Definition 1. Let G be a simple undirected graph and k a positive

More information

Communications II Lecture 9: Error Correction Coding. Professor Kin K. Leung EEE and Computing Departments Imperial College London Copyright reserved

Communications II Lecture 9: Error Correction Coding. Professor Kin K. Leung EEE and Computing Departments Imperial College London Copyright reserved Communications II Lecture 9: Error Correction Coding Professor Kin K. Leung EEE and Computing Departments Imperial College London Copyright reserved Outline Introduction Linear block codes Decoding Hamming

More information