Square-free Strings Over Alphabet Lists

Size: px

Start display at page:

Download "Square-free Strings Over Alphabet Lists"

Byron Riley
6 years ago
Views:

1 Square-free Strings Over Alphabet Lists [Updated: 08/10/015 11:31 - v.3] Neerja Mhaskar 1 and Michael Soltys 1 McMaster University Dept. of Computing & Software 180 Main Street West Hamilton, Ontario L8S 4K1, CANADA pophlin@mcmaster.ca California State University Channel Islands Dept. of Computer Science One University Drive Camarillo, CA 9301, USA michael.soltys@csuci.edu Abstract. We solve an open problem that was posed in [GKM10]: Given an alphabet list L = L 1,..., L n, where L i = 3 and 0 i n, can we always find a square-free string, W = W 1W... W n, where W i L i? We show that this is indeed the case. We do so using an offending suffix characterization of forced repetitions, and a counting, non-constructive, technique. We discuss future directions related to finding a constructive solution, namely a polytime algorithm for generating square-free words over such lists. Keywords: Thue words, Zimin words, repetition, square-free 1 Introduction The study of repetitions in strings is an attractive field for both theoretical and applied research. It dates back to the early 0th century and the seminal work of Axel Thue [Thu06, Thu1], who showed, using iterated morphism, how to construct non-repetitive strings over an alphabet of size three. Repetitions have many applications ranging from abstract algebra, to data compression, and to the genetic code. There are many kinds of repetitions, for example squares, cubes, etc., overlapping, non-overlapping, etc. In this paper, we focus on a problem related to the existence of squares over an alphabet list of arbitrary length, where all the alphabets are of size three, but not necessarily consisting of the same elements. This natural generalization of Thue s original problem has been posed by [GKM10]. The authors of [GKM10] showed that for every finite alphabet list, where the alphabets are of size four, there exists a non-repetitive string over such a list (the i-th symbol of this string comes from the i-th alphabet on the list). However, Research supported in part by an NSERC Discovery Grant.

2 they left open the question whether the same holds if the alphabets are restricted to be of size three. Note that Thue s original result applies to the case where all the alphabets in the list are of size three, and contain the same elements. In this paper we present a new result showing that general lists where the alphabets are of size three, but not necessarily equal, also always have non-repetitive strings. Since it is trivial to show that every binary string of length at least four has a repetition, our solution completes this area of research. The alphabet list problem has been studied by [GKM10, Sha09, MS15], among others. The outline of the paper is as follows: in Section, we give a brief introduction to the terminology. In Section 3, we define the term offending suffix, and show that having such a suffix is a necessary and sufficient condition to force repetitions in strings over alphabet lists, Lemma 1. We then use this result to prove the main theorem of the paper, Theorem 3. Background A string over a (finite) alphabet Σ is an ordered sequence of symbols from the alphabet: let W = W 1 W... W n, where for each i [n], W i Σ. In order to emphasize the array structure of W, we sometimes represent it as W [1..n]. A string V is a subword (also know as substring or factor) of W if V = W i W i+1... W j, where i j. If i = j, then V is a single symbol in W ; if i = 1 and j = n, then V = W. If i = 1, then V is a prefix of W and if j < n, V is a proper prefix of W. If j = n, then V is a suffix of W and if i > 1, V is a proper suffix of W. We can express V is a subword more succinctly as follows: V = W [i..j], and when the delimiters do not have to be expressed explicitly, we use the notation V W. A string V is a subsequence of W if V = W i1 W i... W ik, for i 1 < i < < i k. We now define a fundamental concept to this paper, namely a string over an alphabet list. Let: L = L 1, L,..., L n, be an ordered list of (finite) alphabets. We say that W is a string over the list L if W = W 1 W... W n where for all i, W i L i. Note that we impose no conditions on the L i s: they may be equal, disjoint, or have elements in common. The only condition on W is that the i-th symbol of W must be selected from the i-th alphabet, i.e., W i L i. Let Σ k denote a fixed generic alphabet of k symbols and let Σ L = L 1 L L n. Given a list L of finite alphabets, we can define the set of strings W over L with a regular expression R L := L 1 L... L n. Let L + := L(R L ) be the language of all the strings over the list L. For example, if L 0 = {{a, b, c}, {c, d, e}, {a, 1, }}, then R L0 := {a, b, c} {c, d, e} {a, 1, }, and ac1 L + 0, but ca L+ 0. Also, in this case L+ 0 = 33 = 7. We should point out that {a, b, c} is often written as (a + b + c), but we use the curly brackets since it is reminiscent of indeterminate strings, which is yet another way of looking at

3 strings over alphabet lists. See, for example, [Abr87] or [SW09] for a treatment of indeterminates. A string W is said to have a repetition (or a square) if there exists a V such that V V W and it is non-repetitive (or square-free) if no such subword exists. Given a string W over a list L = L 1, L,..., L n, we say that L n+1 forces a repetition on W if for all a L n+1, W a has a repetition. An alphabet list L is admissible if L + contains a non-repetitive string. Let L represent a class of lists; the intention is for L to denote lists with a given property. For example, we are going to use L Σk to denote the class of all lists L = L 1, L,..., L n, where for each i [n] = {1,,..., n}, L i = Σ k, and L k will denote the class of all lists L = L 1, L,..., L n, where for each i [n], L i = k, that is, those lists consisting of alphabets of size k. Note that L Σk L k. We say that a class of lists L is admissible if every list L L is admissible. Given this notation, our contribution in this paper is showing that L 3 is admissible. A border β of a string W, is a subword that is both a proper prefix and suffix of W ; see [Smy03] for the vast amount of study regarding borders. We are not going to use borders until the last section of the paper, Section 5, where we propose future directions. We now define an important concept that will play a central role in our paper: a pattern. Let Σ = {a 1, a, a 3,...} be a fixed set of generic alphabet symbols; we let Σ n = {a 1, a,..., a n } Σ. Let = {X 1, X, X 3,...} be a set of variables that take their values over Σ. A pattern is a string over ( Σ) ; for example, P = X 1 a 1 X is a pattern. We say that a word W over some alphabet Σ conforms to a pattern P if there is a homomorphism h : ( Σ) Σ such that h Σ = id, i.e., h restricted to Σ is the identity, meaning that h(a i ) = a i, and h(p ) = W. Intuitively, patterns are templates for strings. Note that some authors define patterns as being words just over variables [BP07], but we find the definition given here as more amenable to our purpose. We say that a pattern is avoidable, if strings of arbitrary length exist, such that no subword of the string conforms to the pattern, otherwise it is said to be unavoidable. For example, the pattern XX is unavoidable for all strings in {W {0, 1} : W 4}, but it is avoidable over the ternary alphabet (Thue s result). To tie it all together, we say that a list is admissible if XX is an avoidable pattern. The idea of unavoidable patterns was developed independently by Bean et. al. in [BEM79] and by Zimin in [Zim8]. Zimin words (also known as sesquipowers) constitute a certain class of unavoidable patterns. The n-th Zimin word, Z n, is defined recursively over the alphabet = {X 1,..., X n } of variables as follows: Z 1 = X 1, and for n > 1, Z n = Z n 1 X n Z n 1. (1) The understanding is that the variables X i, where 0 i n, range over words over some fixed background alphabet Σ. In [Zim8], Zimin proved that the n-th Zimin word, Z n, constitutes an unavoidable pattern for strings over Σ, where Σ = n. For a background on Zimin patterns, see [RS1, BP07, BLRS08].

4 3 Offending Suffix Pattern In this section we introduce the tools needed to prove the main theorem of the paper, Theorem 3, in Section 4. In particular, we introduce the notion of an offending suffix, and in Lemma 1 we relate such suffixes to squares in a word over a list. We define a particular pattern that is the basis for an offending suffix. Let C(n), n N, be defined recursively as follows: C(1) = X 1 a 1 X 1, and for n > 1, C(n) = X n C(n 1)a n X n C(n 1) () For example, C(3) = X 3 X X 1 a 1 X 1 a X X 1 a 1 X 1 a 3 X 3 X X 1 a 1 X 1 a X X 1 a 1 X 1. (3) Note that the definition of C(n) is similar to the definition of Zimin words mentioned in the previous section. Comparing (1) to (), one can see that mapping X i to a i in (1) yields the same string as mapping X i to ε in (). We call C(n) the offending suffix for strings over lists L L n. In particular, C(3) is the offending suffix for lists in L 3. Note that: C(3) = X X + 8 X (4) When all the variables in map to ε, i.e., the empty word, we get the pattern for the shortest possible offending suffix for a list L L n. We call this shortest pattern: C s (n) = a 1 a a 1... a n... a 1 a a 1. (5) As we are interested in lists in L 3, the shortest offending suffix for such list is, C s (3) = a 1 a a 1 a 3 a 1 a a 1 of length 7. We use this definition in the proof of Theorem 3, and to give a construction of non-repetitive words over a certain class of lists. Also note that C s (n) may be obtained from Zimin words that map X i to a i. The advantage of C(n) over Zimin words Z n is that C(n) allows for the succinct expression of the most general offending suffix possible. Finally, note that C s (n) = C s (n 1) + 1, where C s (1) = 1, and so C s (n) = n 1. Let h : ΣL, be a morphism that assigns to the variables in strings over Σ L. We say that h respects a list L = {L 1, L,..., L n }, if h yields a string over L. So, for example, an h that maps X 1, X, X 3 to ε (the empty word) yields h(c(3)) = a 1 a a 1 a 3 a 1 a a 1, and in particular such an h respects a list of length seven where all alphabets consist of {a 1, a, a 3 }. We now prove a lemma that ties together forced repetitions with the notion of an offending suffix. We show that we cannot have one without the other. Thus, it will follow that a string over a list is square-free if and only if it does not have an offending suffix. Lemma 1. Suppose that W = W 1 W... W i 1 is a square-free string over a list L = L 1, L,..., L i 1. Then, L i forces a repetition on W iff W has an offending suffix.

5 Proof. The proof is by contradiction. We assume throughout that our lists are from the class L 3, and for readability (especially in diagrams), we use the notation a = a 1, b = a, c = a 3. ( ) If W = W 1 W... W i 1 has the offending suffix, then it has a suffix that conforms to the pattern given by C(3), and defined in (3). Clearly, if we let L i = {a, b, c}, then each W a, W b, W c has a square, and hence by definition L i forces a repetition on W. ( ) Suppose, on the other hand, that L i forces a repetition on W over L = L 1, L,..., L i 1. We need to show that W must have a suffix that conforms to the pattern C(3). Let L i = {a, b, c}, and since L i forces a repetition, we know that W a, W b, W c has a square for a suffix (as W itself was square-free). Let XaXa, GbGb, HcHc be the squares created by appending a, b and c to W, respectively. Here X, G, H are treated as subwords of W. As all three squares XaXa, GbGb, HcHc are suffixes of the string, it follows that X, G, H must be of different sizes, and so we can order them, without loss of generality XaX < GbG < HcH. It also follows from the fact that all three are suffixes of W, the squares from left-to-right are suffixes of each other. Hence, while X may be empty, we know that G and H are not. We now consider different cases of the overlap of XaX, GbG, HcH, showing in each case that the resulting string has a suffix conforming to the pattern C(3). Note that it is enough to consider the interplay of GbG, HcH, as then the interplay of XaX, GbG is symmetric and follows by analogy. Also keep in mind that the assumption is that W is square-free; this eliminates some of the possibilities as can be seen below. 1. H = P GbG as shown in Figure 1, where P is a proper prefix of H and since W is square-free, P G and P b, then HcH = P GbGcP GbG. Thus, this case is possible. Fig. 1: H = P GbG. H = GbG as shown in Figure. Then, HcH = GbGcGbG. This case is also possible. 3. ch = GbG as shown in Figure 3, then G[1] = c. Let G = cs, where S is a proper suffix of G, then HcH = csbcsccsbcs. The subword cc indicates a repetition in W, which is not possible. 4. HcH = NGbG as shown in Figure 4, where N is a proper prefix of HcH and GbG > ch. Let G = P cs, where P, S are proper prefix and suffix of

6 Fig. : H = GbG Fig. 3: ch = GbG G. Therefore H = SbP cs. Since P is also a proper suffix of H, one of the following must be true: Fig. 4: HcH = NGbG (a) S = P and so S = P. Since S = P, H = SbScS and HcH = SbScScSbScS. The subword ScSc, indicates a repetition in W, and therefore, this case is not possible. (b) S > P and so S = RP, where R is a proper prefix of S. Substituting RP for S, we have H = SbP cs = RP bp crp and HcH = RP bp crpcrpbp crp. The subword crpcrp indicates a repetition in W, and therefore, this case is not possible. (c) S < P and so P = RS, where R is a proper prefix of P. Substituting RS for P, we have H = SbP cs = SbRScS and HcH = SbRScScSbRScS. The subword ScSc indicates a repetition in W, and therefore this case is not possible.

7 From the above possible cases, we can conclude that for L i to force a repetition in a square-free string W, it must be the case that H = ZGbG, where Z is a prefix (possibly empty) of H and Z G. Similarly, we get G = Y XaX, where Y is a prefix (possibly empty) of G and Y X and Z Y. Substituting values of G in H, we get H = ZY XaXbY XaX and HcH = ZY XaXbY XaXcZY XaXbY XaX. But HcH, is a suffix of the non repetitive string, W, and it conforms to the offending suffix C(3). Therefore, we have shown that if an alphabet L i forces a repetition in W, then W has a suffix conforming to the pattern C(3). Corollary. If L is a list in L 3 of length at most 7, then L is admissible. Also, C s (n) generates a square-free string of size n 1 over L Σn. Proof. It follows from Lemma 1, and the fact that the shortest possible offending suffix is of length, C s (3) = 7. (See (5) and the discussion following it.) 4 L 3 is admissible In this section, we prove the main result of the paper. Recall that L 3 is the class of lists where all alphabets are of size 3, but we place no other restrictions on them. If all the alphabets were equal, than the existence of square-free strings follows from Thue s original result. The admissibility of L 4 was shown in [GKM10]. What remained to show was that L 3 is also admissible. We show that the class of lists L 3 is admissible by showing that the number of strings conforming to an offending suffix is less than the total number of all strings over L L 3. We use Lemma 1 which characterizes repetitive strings with offending suffixes. Theorem 3. L 3 is admissible. Proof. Here is the strategy for the proof: we want to show that all lists in L 3 are admissible; we argue by contradiction, and assume that some list is not. We pick the shortest such list, and by Corollary we know that such a list is of length at least eight. Then, we remove the last alphabet from the list, and observe that every word in the resulting list must have an offending suffix by Lemma 1. We then count the number of strings over the resulting list that conform to the offending suffix, and see that it is less than the total number of strings possible over the list. Hence there is a string without an offending suffix, and so by adding a symbol from the deleted alphabet we get a square-free string of the original length, which is a contradiction. Suppose that L L 3 is an inadmissible list of minimum length, and let L = L 1, L,..., L n+1, where n 7 (by Corollary ). Let L = L 1, L,..., L n, so that L = L, L n+1. We know that the number of words over L is 3 n. We also know that every word over L has an offending suffix with respect to L n+1. This is the case because L is inadmissible, while by the minimality of L, we know that L is admissible, and so by Lemma 1, L must have an offending suffix with respect to L n+1.

8 We now count the number of strings over L that have an offending suffix. Start by counting the number of offending suffixes of length l, where 7 l n; again, by Corollary, we know that an offending suffix with respect to L n+1 must be of length at least 7. By Lemma 1, we know that the offending suffix pattern is given by C(3) (see (3)), and let i = X 1, j = X, k = X 3, so that C(3) = k + 4j + 8i + 7 = l. We assume that L n+1 = {a 1, a, a 3 }, i.e., we impose an ordering on L n+1 that corresponds to the offending suffix (this is not strictly necessary, but makes the argument cleaner). The number of offending suffixes of length l is then clearly 3 i+j+k where k + 4j + 8i + 7 = l. So, in turn, the total number of strings over L that contain an offending suffixes is bounded above by: 3 n l 3 i+j+k. (6) 7 l n k+4j+8i+7=l where the term 3 n l arises from the fact that if a string over L has an offending suffix of length l, it is then allowed to have an arbitrary prefix of length n l. We now bound from above the value of (6). First note that: 3 n l 3 i+j+k = 3 n l+i+j+k = 3 n l+(8i+4j+k+7) (7i+3j+k+7) < 3 n l/. And so it follows that (6) is strictly bounded above by n l=7 3n l/ < n 3 l=1 3l = 3 n. This is well below the total number of strings over L (3 n ), giving us the requisite contradiction. Note that this argument is non-constructive, in that we showed by counting that a square-free string must exist, but we do not have a procedure for constructing such a string in general, given an arbitrary list in L 3. 5 Conclusion and future work In this paper we showed that L 3 is admissible, meaning that given any list of alphabets, where each alphabet is of size 3, we can always construct a square-free string over such a list. Thus we solved an open problem posed in [GKM10], and in some sense this closes this area of research as the admissibility status of all lists is now known: lists where the alphabets are unary or binary are inadmissible, and all other lists are admissible. On the other hand, while we managed to solve an open problem, our proof of the admissibility of L 3 is non-constructive; can we give an algorithm for constructing a square-free string over any list in L 3 that is better than exhaustive search? Note that the result in [GKM10] showing that L 4 is admissible is also non-constructive. Since an exhaustive algorithm for finding a square-free string is exponential time (there are 3 n many strings over any list of length n in L 3 ), we want to know if a polytime algorithm exists. In the rest of this section we explore possible directions and ideas, but we do not have a strong feeling regarding an answer to this question.

9 5.1 Repetitions and borders We can characterize square-free strings using borders, as Lemma 4 below shows. As was mentioned in the Background section, there is a vast literature related to borders, and it seems plausible that we could build on that to design a polytime algorithm for constructing square-free strings. Lemma 4. A string W is non-repetitive iff every suffix S of W [1..i], where 1 i W has a border β such that 0 β S if β is odd, and 0 β < S if β is even. Proof. ( ) Suppose S is a suffix of W [1..i], where 0 i W, and it has a border β, and β > S S. As β >, a prefix (P ) of β is followed by itself (see Figure 5), thus creating a repetition in S and hence in W. This is a contradiction as W is non-repetitive. The case when β is even and β = S is a contradiction is trivially true. ( ) Suppose W has a repetition. Then there exists an i, such that 0 i W and W [1..i] has a suffix (S) of the form uu. Its clear that the suffix is an even length suffix with a border β and β = S. This contradicts our assumption that 0 β < S when S is even. Fig. 5: proof direction for Lemma 4 The construction of a non-repetitive string W, where Σ is sufficiently large, using Lemma 4 is as follows: select any symbol from Σ such that the resulting non repetitive string satisfies the border condition in Lemma 4. Note that, the sufficiently large alphabet size ensures that there always exists an element in Σ that satisfies the border condition of 4. Consider the class of lists, L mult L 3, with the property that for each list there is a parameter p so that every alphabet L ip, for i N, is guaranteed to have a new symbol that has not appeared thus far. That is, i, L ip 1 j ip 1 L j. The point is that every list is guaranteed to introduce a new symbol at fixed intervals which are multiples of p.

10 Corollary 5. L mult is admissible. Proof. By Theorem 3 we know that the lists in L mult are admissible, but it does not give us a way of constructing such lists. On the other hand, the sublists of length p 1 in between multiples of p can yield square-free strings by an exhaustive search in constant time 3 p 1, as p is fixed for each list in L mult. Given this, we can construct a square-free string over each list L L mult with the following procedure: for all i, construct W i by exhaustive search in time 3 p 1 over alphabets: L (i 1)p+1, L (i 1)p+,..., L (i 1)p+(p 1), and append new symbol n i L ip that has not appeared thus far. The string i W in i is clearly squarefree, obtained in time given by O( L ), where the constant is given by 3 p 1 and the square arises from the fact that for each L ip we have to search the strings constructed thus far, making sure that we select a symbol that has not appeared already; the existence of such a symbol is assured by the definition of L L mult. 5. Repetitions and compression We finish with an example of a simple counting technique that relies on the nature of repetitions to allow compression. We already proved our main theorem, Theorem 3, but we include this discussion as a possible future direction. Suppose that we want to encode the W s, as W, in a way that takes advantage of the repetitions in W. Assume that the L i s are ordered, and since each L i has three symbols, we can encode the contents with bits: Encoding Symbol 00 1st symbol 01 nd symbol 10 3rd symbol 11 separator Suppose now that W = W 1 V V W, where W = n, V = l, that is, W is a string over L of length n containing a repetition of length l. Then, we propose the following scheme for encoding W s: W := W 1 11 V 11 W. Given a W it does not necessarily have a unique encoding, as it may have several repetitions; but we insist that the encoding always picks a maximal repetition (in length). Note also that W encodes W over L as a string over Σ = {0, 1}. Note that W = (n l) + l + 4, where the term (n l) arises from the fact that W 1 + W have n l symbols (as strings over L), and each such symbol is encoded with two bits, hence (n l). The two separators 11, 11 take 4 bits, and the length of V is l (as a string over L), and so it takes l bits. It is clear that we can extract W out of W (uniquely), and so is a valid encoding; for completeness, let f decode strings: f : Σ L + work as follows: f( W ) = f( W 1 11 V 11 W ) = W 1 V V W, and if the input is not a well-formed encoding, say it is , then we let f output, for instance, the lexicographically first string over L +. On the other hand, : L + Σ encodes strings by finding the longest repetition V in a given W (if there are several maximal repetitions, it picks the

11 first one, i.e., the one where the index of first symbol of V V is smallest), yielding W 1 V V W, and outputting W 1 11 V 11 W. Suppose now that for a given L = L 1, L,..., L n every string has a maximal repetition of size at least l 0. We want to bound how big can l 0 be; to this end, we want to find l 0 such that: (n l0)+l0+4 < 3 n. (7) The reason is that the term on the left counts the maximal number of possible encodings given the assumption that every string over L has a repetition of size at least l 0, while the term on the right is the size of L +. The inequality expresses that if l 0 is assumed to be too big, then we won t be able to encode all the 3 n strings in L +. Since (7) can be simplified to n l0+4 < 3 n, and using log on both sides we obtain: n l 0 +4 < log 3n < 1.6n, which gives us n l 0 + < 0.8n, and so l 0 > n 0.8n + > 0.n. Thus, given any L = L 1, L,..., L n, there always is a W L + with a repetition no longer than 1 5n. Can we strengthen this technique to give a new Kolmogorov style proof of Theorem 3? References [Abr87] Karl R. Abrahamson. Generalized string matching. SIAM J. Comput., 16(6): , [BEM79] Dwight R. Bean, Andrzej Ehrenfeucht, and George F. McNulty. Avoidable patters in strings of symbols. Pacific Journal of Mathematics, 85():61 94, [BLRS08] Jean Berstel, Aaron Lauve, Christophe Reutenauer, and Franco V. Saliola. Combinatorics on Words: Christoffel Words and Repetitions in Words. American Mathematical Society, 008. [BP07] Jean Berstel and Dominique Perrin. The origins of combinatorics of words. Electronic Journal of Combinatorics, 8:996 10, 007. [GKM10] Jaros law Grytczuk, Jakub Kozik, and Pitor Micek. A new approach to nonrepetitive sequences. arxiv: , December 010. [MS15] Neerja Mhaskar and Michael Soltys. Non-repetitive strings over alphabet lists. In M.Sohel Rahman and Etsuji Tomita, editors, WALCOM: Algorithms and Computation, volume 8973 of Lecture Notes in Computer Science, pages Springer International Publishing, 015. [RS1] Narad Rampersad and Jeffrey Shallit. Repetitions in words. May 01. [Sha09] Jeffrey Shallit. A second course in formal languages and automata theory. Cambridge Univeristy Press, 009. [Smy03] Bill Smyth. Computing Patterns in Strings. Pearson Education, 003. [SW09] William F. Smyth and Shu Wang. An adaptive hybrid pattern-matching algorithm on indeterminate strings. Int. J. Found. Comput. Sci., 0(6): , 009. [Thu06] Axel Thue. Über unendliche Zeichenreichen. Norsek Vid. Selsk. Srk., I Mat. Nat. Kl., 7:1, [Thu1] Axel Thue. Über die gegenseitige lage gleicher teile gewisser Zeichenreihen. Kra. Vidensk. Selsk. Skrifter., I. Mat. Nat. Kl., 1:1 67, 191. [Zim8] A.Ĩ. Zimin. Blocking sets of terms. Mat. Sbornik, 119: , 198.

Binary words containing infinitely many overlaps

Binary words containing infinitely many overlaps arxiv:math/0511425v1 [math.co] 16 Nov 2005 James Currie Department of Mathematics University of Winnipeg Winnipeg, Manitoba R3B 2E9 (Canada) j.currie@uwinnipeg.ca