Periodicity and Unbordered Factors of Words


Dirk Nowotka
TUCS Turku Centre for Computer Science
TUCS Dissertations No 50
10th June 2004
ISBN ISSN


To be presented, with the permission of the Faculty of Mathematics and Natural Sciences of the University of Turku, for public criticism in the Auditorium XXI of the University on July 9th, 2004, at 12 noon.

University of Turku
Department of Mathematics
FIN Turku, Finland
2004

Supervisors

Professor Juhani Karhumäki
Department of Mathematics, University of Turku, FIN Turku, Finland

Academy Research Fellow Tero Harju
Department of Mathematics, University of Turku, FIN Turku, Finland

Reviewers

Professor Jeffrey Shallit
School of Computer Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada

Doctor Julien Cassaigne
Dynamique, Arithmétique, Combinatoire, Institut de Mathématiques de Luminy, CNRS, Université de la Méditerranée Aix-Marseille II, F Marseille Cedex 07, France

Opponent

Professor Volker Diekert
Department of Theoretical Computer Science, Institute of Formal Methods in Computer Science, University of Stuttgart, D Stuttgart, Germany

ISBN ISSN

To Nicole

Abstract

Several questions about relationships between borders and global and local periods of finite words are investigated in this thesis. We consider the density of critical points and applications of the critical factorization theorem. A relationship between unbordered conjugates and internal critical points is established, and a border correlation function of words is investigated. Moreover, we study the relation between the global period and the length of the longest unbordered factor of a word. In particular, we resolve a longstanding conjecture called the sharpened Duval's conjecture.

Keywords: combinatorics on words, repetition, border, periodicity, critical factorization, border correlation, Duval's conjecture.

Acknowledgements

Primarily, I would like to thank my supervisors Prof. Juhani Karhumäki and Dr. Tero Harju. Juhani has been the leader and promoter of the research group I worked in. He has been responsible for the excellent working conditions I experienced and the great atmosphere in our group. Despite his tight schedule, Juhani has always found the time to talk to me whenever I needed his help or opinion. Tero has been most responsible for the fact that I wrote my thesis in the field of combinatorics on words. He aroused my initial interest in discrete mathematics and has introduced me to many fascinating problems and techniques in combinatorics on words. Moreover, he has been a permanent source of advice, motivation, and inspiration. His way of working has constantly exhibited the joy of mathematics to me. Tero's personality and education have tremendously influenced my life in academia and beyond. I am immensely grateful for all of that.

It is my great pleasure to thank the reviewers of my thesis, Prof. Jeffrey Shallit and Dr. Julien Cassaigne, for their detailed comments and suggestions. Their work has led to many improvements which made this thesis a lot more readable and comprehensible.

The Turku Centre for Computer Science and the Department of Mathematics of the University of Turku provided an excellent working environment for me. Special thanks are therefore due to the people who run these institutes for their help and support. I would also like to thank those who do scientific work there. They have created a great work climate.

Finally, I would like to especially thank my wonderful wife, Nicole, for her great support, encouragement, and endless patience.

Notation

conjugate
$\le_p$ : prefix
$<_p$ : proper prefix
$\le_s$ : suffix
$<_s$ : proper suffix
lexicographic order
$A^*$ : finite words over $A$
$A^+$ : nonempty, finite words over $A$
$A^\omega$ : right-infinite words over $A$
$A^\infty$ : finite and right-infinite words over $A$
${}^\omega\!A$ : left-infinite words over $A$
${}^\infty\!A$ : finite and left-infinite words over $A$
${}^\omega\!A^\omega$ : bi-infinite words over $A$
${}^\infty\!A^\infty$ : finite and bi-infinite words over $A$
$\mathrm{alph}(w)$ : set of different letters in $w$
$\beta(w)$ : border correlation function
$\delta(w)$ : density function of critical points in $w$
$\varepsilon$ : the empty word
$\eta(w)$ : number of critical points in $w$
$F$ : the Fibonacci word
$F_i$ : $i$-th Fibonacci word
$f_i$ : $i$-th Fibonacci number
$\varphi$ : Fibonacci morphism
$G_\beta(n)$ : border correlation graph
$G_\beta(n)$ : border correlation graph on conjugacy classes

$\mathrm{ind}(w)$ : index of $w$
$L(A)$ : Lyndon words over $A$ w.r.t. a lexicographic order
$M$ : the Thue-Morse word
$M_i$ : $i$-th Thue-Morse word
$\mu(w)$ : length of the longest unbordered factor of $w$
$\mathbb{N}$ : nonnegative integers
$p(w)$ : the period of $w$
$p(w, p)$ : the local period of $w$ at point $p$
$\pi$ : maximum prefix w.r.t. a lexicographic order
$\psi$ : Thue-Morse morphism
$w_{(i)}$ : $i$-th letter of $w$
$[w]$ : conjugacy class containing $w$
$|w|$ : length of $w$
$|w|_u$ : number of occurrences of $u$ in $w$
$\widetilde{w}$ : reverse of $w$
$\sigma(w)$ : cyclic shift of $w$
$T$ : the Thue word
$T_i$ : $i$-th Thue word
$\vartheta$ : Thue morphism
$\tau$ : maximum suffix w.r.t. a lexicographic order
$\mathbb{Z}$ : integers
$\mathbb{Z}_+$ : positive integers
$\mathbb{Z}_-$ : nonpositive integers

Contents

Notation
1 Introduction
2 Preliminaries
  Words
  Finite Words
  Infinite Words
  Repetitions
  Border
  Global Period
  Local Period
  Conjugacy
  Orderings
  Morphisms
  Iterated Morphisms
  Avoidability
  A Classical Example
3 Critical Factorizations
  The Critical Factorization Theorem
  A Proof of the CFT
  Some Properties of Critical Factorizations
  Counting Critical Factorizations
  Words with Few Critical Factorizations
  Words with a High Density of Critical Factorizations
  An Application of Critical Factorizations
  Comments
4 Unbordered Conjugates
  Optimal Words for Border Correlation
  Unbordered Conjugates and Critical Points

  4.3 Iterations of the Border Correlation Function
  Comments
5 Unbordered Factors
  On the Maximum Length of Unbordered Factors
  Duval Extensions
  Words without Nontrivial Duval Extension
  Minimal Duval Extensions
  Maximal Duval Extensions
  Duval's Conjecture
  Preliminary Results
  A Solution
  Comments
Bibliography
Author Index
Index

Chapter 1

Introduction

"For the development of the logical sciences it will be important, without regard to possible applications, to find extensive fields for speculation about difficult problems." (Axel Thue, 1912)

A word is one of the most basic and natural structural concepts. It is simply a sequence of elements taken from some given set called an alphabet. Let us consider alphabets of finite size and call their elements letters. For example, words in written English are over a set of 26 Latin letters, DNA strands are represented as words over the four letters A, C, G, and T, and sequences of bits over a communication channel are words over the alphabet consisting of 0 and 1. We have chosen to investigate some general structural properties of words in this thesis. In order to do so, we ignore any meaning that is given to a word as well as the particular names of its letters. That reduces the examples given above to words over alphabets of size 26, 4, and 2, respectively, that do not carry any particular semantics. From this abstract point of view, noncommutativity and discreteness are the most distinguishing features of words, and they make the investigation of the properties of words an area of its own in discrete mathematics. This area is called combinatorics on words.

Combinatorics on words is a rather young field of mathematics. The first investigation of words for their own sake is usually attributed to Thue (1906; 1912). Unfortunately, his work was published in a Norwegian journal of small circulation, and hence, was practically unknown for several decades; see the translation by Berstel (1995). However, even before that time, combinatorial questions about words appeared natural enough to be asked, for example by Bernoulli (1772), Gauß ( ; 1844), and Prouhet (1851), although this was done in the context of discussions of other problems.
It was in the second half of the 20th century that a theory dedicated to words began to reach a broader audience. The first, more or less simultaneous, development of combinatorics on words is credited to Schützenberger (1956) in France and Adian (1979) and Makanin (1977) in Russia, even though they still researched this topic

under different names. Other early works on the theory of words are (Lentin and Schützenberger, 1969), (Lentin, 1972), and (Hmelevskiĭ, 1971). The first uniform presentation of this area of research was given in a book published by a group of researchers under the pseudonym Lothaire (1983). This first book dedicated to combinatorics on words grew out of the French school, and it has become a standard reference for the field. It is worth noting that the Lothaire project has been continued with (Lothaire, 2002), and efforts for a third volume are currently being undertaken; cf. (Lothaire, 2004). Even though 50 years is not a long period of time in the history of mathematics, combinatorics on words has already matured into an area of its own, which is also reflected by its own mathematics subject classification 68R15 and a biennial conference, called WORDS, dedicated to this subject.

Combinatorics on words has connections to a variety of other fields. Most notable is its influence on the solution of algorithmic problems, which appear for example in the area of string and pattern matching, see the books by Crochemore and Rytter (2002) and Gusfield (1997), or in the satisfiability problem for word equations, see a thorough presentation by Diekert (2002).

The most immediate object of interest for words is the concept of repetitiveness, that is, the question of where certain parts of a word occur more than once. Repetitiveness can be considered in many ways. Probably the most common way to consider repetitiveness is by the concepts of borderedness and global and local periodicity. This thesis is devoted to the investigation of questions concerning these fundamental word properties and their relation to each other.

The concepts of border and global period of an entire word are strongly related. A border of some word $w$ is a word $u$ that is a prefix and a suffix of $w$, where we require that $u$ is neither empty nor equal to $w$.
Consequently, we call a word bordered if it has a border, and unbordered otherwise. Let us denote the length of a word $v$ by $|v|$. If $u$ is a border of a word $w$, then a (global) period of $w$ is the difference between the lengths of $w$ and $u$, that is, $|w| - |u|$ is a period of $w$. Let $p(w)$ denote the minimal period of $w$.

Let us also introduce the notion of local periodicity. Two words $x$ and $y$ are prefix-comparable if either $x$ is a prefix of $y$ or vice versa. Suffix-comparability is defined similarly. Let an integer $p$ such that $1 \le p < |w|$ be called a point of $w$. A factorization of $w$ at a point $p$ results in two nonempty words $u$ and $v$ such that $w = uv$ and $|u| = p$. A repetition word $z$ of $w$ at a point $p$ is defined such that $z$ is prefix-comparable with $v$ and suffix-comparable with $u$. We call the length of the shortest repetition word of $w$ at a point $p$ the local period of $w$ at point $p$, denoted by $p(w, p)$. It is not hard to see that for every word $w$ we have $p(w, p) \le p(w)$ for all points $p$ of $w$. If $p(w, p) = p(w)$, then we call the point $p$ critical. Critical points and factorizations at critical points play a significant rôle in combinatorics on words due to the famous critical factorization theorem.
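The definitions of border, global period, local period, and critical point can be tried out on small words. The following is a minimal brute-force sketch in Python, not an algorithm from this thesis; the function names `borders`, `period`, `local_period`, and `critical_points` are chosen here for illustration only. Points are 1-based as in the text, while Python string indices are 0-based.

```python
def borders(w: str) -> list[str]:
    """All borders of w: nonempty proper prefixes of w that are also suffixes."""
    return [w[:k] for k in range(1, len(w)) if w[:k] == w[-k:]]


def period(w: str) -> int:
    """Minimal global period p(w); returns 0 for the empty word."""
    n = len(w)
    for p in range(1, n + 1):
        if all(w[i] == w[i + p] for i in range(n - p)):
            return p
    return 0


def local_period(w: str, p: int) -> int:
    """Length of the shortest repetition word of w at point p, 1 <= p < len(w).

    A repetition word z of length k sits once directly left of the point and
    once directly right of it; wherever both copies lie inside w, the letters
    must agree. Positions that fall outside w impose no constraint.
    """
    n = len(w)
    if not 1 <= p < n:
        raise ValueError("a point p must satisfy 1 <= p < len(w)")
    for k in range(1, n + 1):
        if all(w[p - k + t] == w[p + t]
               for t in range(k)
               if p - k + t >= 0 and p + t < n):
            return k
    raise AssertionError("unreachable: k = len(w) always qualifies")


def critical_points(w: str) -> list[int]:
    """Points of w whose local period equals the global period."""
    return [p for p in range(1, len(w)) if local_period(w, p) == period(w)]
```

For the word abaab this gives the single border ab, the global period 3 = 5 - 2, the local periods 2, 3, 1, 3 at the points 1 to 4, and hence the critical points 2 and 4.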

Critical factorizations are also of very practical interest in the well-known two-way string matching algorithm by Crochemore and Perrin (1991). Despite their importance, not much seems to be known about critical factorizations. Therefore, we will investigate them in Chap. 3 of this thesis.

A natural relationship between the concepts of local periodicity and borderedness can be seen if one considers circles rather than sequences of letters. In this case a natural equivalence relation, called conjugacy, arises on the set of all words (over a given alphabet). Two words $u$ and $v$ are conjugates if $u = xy$ and $v = yx$, that is, two words are conjugates of each other if one is obtained from the other by a cyclic shift of letters. It is well known that for every word $w$, either there exists a conjugate of $w$ that is unbordered, or $w$ is not primitive, which means that there exists a word $u$ such that $w = u^k$ for some $k \ge 2$. Unbordered conjugates do not occur arbitrarily in the set of conjugates of a word. For example, critical points and unbordered conjugates are closely related. This and some other properties of unbordered conjugates are discussed in Chap. 4 of this thesis.

When we consider not only the borderedness of a word but also the borderedness of all its factors, i.e., all contiguous subsequences of a word, then interesting issues arise. Let $\mu(w)$ denote the length of the longest unbordered factor of $w$. Then $\mu(w)$ and $p(w)$ denote similar concepts in the sense that every factor $v$ of $w$ that is longer than $\mu(w)$ is bordered, and every factor $v$ of $w$ that is longer than $p(w)$ is bordered with the additional requirement that it has a border of length $|v| - p(w)$. So, $\mu(w)$ has a looser definition than $p(w)$, and we have $\mu(w) \le p(w)$. However, both concepts coincide, that is, $\mu(w) = p(w)$ and all factors $v$ of $w$ longer than $\mu(w)$ have a border of length $|v| - p(w)$, for words of a certain length. For example, Ehrenfeucht and Silberger (1979) showed that $\mu(w) = p(w)$ if $|w| \ge 2p(w)$.
However, it has turned out to be much harder to determine, in terms of $\mu(w)$, the length of $w$ that guarantees $\mu(w) = p(w)$. Duval (1982) proved that $4\mu(w) - 6 \le |w|$ implies $\mu(w) = p(w)$. He also conjectured that, if $|w| \ge 2\mu(w)$ and $w$ has an unbordered prefix of length $\mu(w)$, then $\mu(w) = p(w)$. The truth of this conjecture would imply that we have $\mu(w) = p(w)$ if $|w| \ge 3\mu(w)$. Duval's conjecture has received quite some attention but had remained open so far. Questions concerning the length of unbordered factors of words and, in particular, a solution of Duval's conjecture make up a considerable part of this thesis. Chapter 5 is devoted to this.
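The statements relating $\mu(w)$ and $p(w)$ can be checked exhaustively on short words. The following Python sketch is an illustration under stated assumptions rather than a proof: it uses naive brute force, the helper names `is_unbordered`, `mu`, `period`, and `violations` are hypothetical, and the check covers only words up to a chosen length over a chosen alphabet.

```python
from itertools import product


def is_unbordered(w: str) -> bool:
    """True if no nonempty proper prefix of w is also a suffix of w."""
    return not any(w[:k] == w[-k:] for k in range(1, len(w)))


def mu(w: str) -> int:
    """Length of the longest unbordered factor of w (0 for the empty word)."""
    n = len(w)
    return max(
        (j - i
         for i in range(n)
         for j in range(i + 1, n + 1)
         if is_unbordered(w[i:j])),
        default=0,
    )


def period(w: str) -> int:
    """Minimal global period p(w); returns 0 for the empty word."""
    n = len(w)
    return next(
        (p for p in range(1, n + 1)
         if all(w[i] == w[i + p] for i in range(n - p))),
        0,
    )


def violations(max_len: int, alphabet: str = "ab") -> list[str]:
    """Words up to max_len that contradict one of the statements in the text.

    Checked: mu(w) <= p(w) always, and mu(w) == p(w) whenever
    |w| >= 2 p(w), or |w| >= 4 mu(w) - 6, or |w| >= 3 mu(w).
    """
    bad = []
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            m, p = mu(w), period(w)
            must_coincide = n >= 2 * p or n >= 4 * m - 6 or n >= 3 * m
            if m > p or (must_coincide and m != p):
                bad.append(w)
    return bad
```

Calling `violations(8)` inspects all binary words of length at most 8 and is expected to return an empty list. The search is exponential in the word length, so it is only meant for very short words.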


Chapter 2

Preliminaries

The aim of this chapter is to introduce the basic concepts and notation used throughout this thesis. It provides us with the basic language for reasoning about words. More specific definitions are given in the appropriate places later in this thesis. Our selection of notation is guided by its use in the contemporary literature of combinatorics on words. However, in some cases different notations are in common use, in which case we choose one to our liking. This chapter is meant to be read as an introduction to the field of combinatorics on words as well as a reference which can be consulted whenever needed while reading the later chapters. The exposition is self-contained and limited to the concepts considered in this thesis. Further information about words can be found in the standard reference for combinatorics on words by Lothaire (1983). For more recent expositions we refer to Lothaire (2002) and Berstel and Karhumäki (2003). We have chosen to set the terminology of combinatorics on words into the more general algebraic framework of semigroup theory at some places in this introduction. For a deeper introduction to semigroup theory we refer to the book by Howie (1976), and we refer to Lallement (1979) for a more applied point of view.

2.1 Words

An alphabet $A$ is a finite nonempty set. The elements of $A$ are called letters. A word is a sequence of letters. A word can also be considered as a function from positions to letters. Both points of view serve the same purpose, and we use them interchangeably where convenient. The empty sequence is called the empty word and is denoted by $\varepsilon$. Sequences, and hence words, can be of finite or infinite size. In general, we mean finite words when talking about words unless otherwise stated, since our major concern in this thesis is finite words. Let us first consider finite words in more detail and separately introduce infinite

words after that.

2.1.1 Finite Words

The first thing we do with the set of all finite words is to look at its basic structure and to embed this structure into the more general setting of semigroups. Given two words $u$ and $v$, we naturally obtain a new word $uv$ by simply concatenating them. So, concatenation is an operation on words. It is also clear that, given three words $u$, $v$, and $w$, the word $uvw$ can be obtained either by first concatenating $v$ and $w$ and then concatenating $u$ and $vw$, or by forming $uv$ first and then concatenating $w$ to it. Formally, $u(vw) = (uv)w$.

Let us put this concept into a more general, that is, algebraic, framework. A semigroup $(S, \cdot)$ is a set $S$ equipped with a binary, associative operation $\cdot$, that is, for all elements $x$, $y$, and $z$ in $S$ we have that $x \cdot y \in S$ and $x \cdot (y \cdot z) = (x \cdot y) \cdot z$. The operation $\cdot$ is usually called multiplication. Since $\cdot$ is associative, we omit parentheses wherever possible. However, we may use parentheses if we want to indicate a certain factorization. If the operation is clear from the context, we refer to a semigroup by its set $S$ only and abbreviate $x \cdot y$ by $xy$. We also let $\cdot$ operate on subsets of $S$. Let $M, N \subseteq S$. Then $MN = \{mn \mid m \in M, n \in N\}$. Moreover, the expression $\underbrace{xx \cdots x}_{k \text{ times}}$ of multiplying $x$ $k$ times with itself is, as usual, abbreviated by $x^k$. A monoid is a semigroup $S$ that contains an identity element $1$ such that $1x = x1 = x$ for every $x \in S$. Note that the identity element of a monoid is necessarily unique.

It is clear that the set of all finite words over some alphabet forms a monoid, called a word monoid, with concatenation as its multiplication and the empty word $\varepsilon$ as the identity element. The set of all nonempty, finite words over some alphabet forms a semigroup, called a word semigroup, which is not a monoid.

A word $f$ is called a factor of a word $w$ if $w = ufv$. Moreover, $f$ is called a proper factor if $f$ is not empty and $u$ and $v$ are not both empty, that is, $f \ne \varepsilon$ and $uv \ne \varepsilon$.
A sequence $(x_1, x_2, \ldots, x_k)$ of words is called a factorization of a word $w$ if $w = x_1 x_2 \cdots x_k$. Moreover, if $x_i \in X$ for all $1 \le i \le k$ and some set $X \subseteq A^*$, then $(x_1, x_2, \ldots, x_k)$ is called an $X$-factorization of $w$. An $X$-factorization of a word $w$ is illustrated in the following figure.

[Figure: an $X$-factorization $x_1 x_2 \cdots x_k$ of $w$.]

We often denote a sequence $(x_1, x_2, \ldots, x_k)$ of words by its product $x_1 x_2 \cdots x_k$, and we often denote a singleton set by its element. Let $w_{(1)} w_{(2)} \cdots w_{(n)}$ denote the factorization of $w$ such that $w_{(i)} \in A$. Then $w_{(i)}$ is the $i$-th letter of $w$, and $i$ is called a position of $w$, for all $1 \le i \le n$. An integer $p$ with $1 \le p < |w|$ is called a point in $w$. Intuitively, a position $i$ denotes the place of a letter $w_{(i)}$ in $w$, and a point $p$ denotes the place between $w_{(p)}$ and $w_{(p+1)}$ in $w$. A word $f$ occurs at a position $i$ in a word $w$ if $w = ufv$ with $|u| = i - 1$ and $v \in A^*$, that is, $f = w_{(i)} w_{(i+1)} \cdots w_{(k)}$ where $k = i + |f| - 1$. Consider the following figure.

[Figure: the word $w = w_{(1)} w_{(2)} \cdots w_{(i)} w_{(i+1)} \cdots w_{(k)} \cdots w_{(n)}$ with position $i$ marking the letter $w_{(i)}$ and point $i$ marking the place between $w_{(i)}$ and $w_{(i+1)}$.]

Factors at the beginning and the end of a word are given special names. A word $f$ is called a prefix of a word $w$, denoted by $f \le_p w$, if $w = fu$. Similarly, $f$ is called a suffix of $w$, denoted by $f \le_s w$, if $w = uf$. A prefix or suffix $f$ of $w$ is called proper, denoted by $f <_p w$ and $f <_s w$, respectively, if it is a proper factor. A word $u$ is called prefix-compatible, or just compatible, with a word $w$ if $u \le_p w$ or $w \le_p u$. Similarly, $u$ is called suffix-compatible with $w$ if $u \le_s w$ or $w \le_s u$.

A sequence $(x_1, x_2, \ldots, x_k)$ of words is called an $X$-interpretation of $w$ if $(x_1, x_2, \ldots, x_k)$ is an $X$-factorization of $uwv$ where $u <_p x_1$ and $v <_s x_k$. An $X$-interpretation of a word $w$ is illustrated in the following figure.

[Figure: an $X$-interpretation $x_1 x_2 \cdots x_k = uwv$ of $w$.]

We also say that $w$ is interpreted by $X$. A word $w$ has two different $X$-interpretations if either there exist two different sequences of words that interpret $w$ or if $w$ has two different occurrences in one sequence. Note that an occurrence of $w$ in $u^k$ is a $u$-interpretation of $w$ if $|u^{k-1}| < |w| \le |u^k|$. So, $w$ has different $u$-interpretations if there exist different positions in $u^k$ where $w$ occurs.
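The notions of occurrence, prefix, suffix, and compatibility translate directly into string operations. The following Python sketch is only an illustration of these definitions; the function names are hypothetical, and positions are 1-based as in the text.

```python
def occurrences(f: str, w: str) -> list[int]:
    """1-based positions i at which f occurs in w, i.e. w = u f v with |u| = i - 1."""
    return [i + 1
            for i in range(len(w) - len(f) + 1)
            if w[i:i + len(f)] == f]


def is_prefix(f: str, w: str) -> bool:
    """f is a prefix of w: w = f u for some word u."""
    return w.startswith(f)


def is_suffix(f: str, w: str) -> bool:
    """f is a suffix of w: w = u f for some word u."""
    return w.endswith(f)


def prefix_compatible(u: str, w: str) -> bool:
    """One of u, w is a prefix of the other."""
    return is_prefix(u, w) or is_prefix(w, u)


def suffix_compatible(u: str, w: str) -> bool:
    """One of u, w is a suffix of the other."""
    return is_suffix(u, w) or is_suffix(w, u)
```

For example, aba occurs in ababa at the positions 1 and 3, so ababa occurs at two different positions in a sufficiently long power of ab only if those positions differ by a multiple of the length of ab.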

Given an alphabet $A$, we can easily construct any word over that alphabet by concatenation and likewise check whether some word is over that alphabet or not. This is because any word is uniquely defined by the concatenation of its letters. That fact is also reflected in our algebraic terminology. In general, a semigroup $S$ without identity is called free if there is a subset $B$ of $S$ such that every element of $S$ can be uniquely expressed as a product of elements of $B$. In that case, $B$ is called a free generating set or a base of $S$, and we say that $S$ is freely generated by $B$ or that $S$ is the free semigroup over $B$. A monoid $S$ is called free if $S \setminus \{1\}$ is a free semigroup. Note that the set of all finite, nonempty words over an alphabet $A$ is the free (word) semigroup over $A$, denoted by $A^+$. We define $A^* = A^+ \cup \{\varepsilon\}$. More generally, let $X \subseteq A^*$; then $X^+$ and $X^*$ denote the semigroup and monoid generated by $X$, respectively. For singletons we abbreviate $\{x\}^*$ and $\{x\}^+$ by $x^*$ and $x^+$, respectively. Note that $x^* = \{x^k \mid 0 \le k\}$.

A well-known characterization of free monoids by Levi (1944) gives the following property called equidivisibility.

Proposition 2.1. Let $s, t, u, v \in A^*$ such that $st = uv$. Then there exists a word $f \in A^*$ such that either $s = uf$ and $ft = v$, or $sf = u$ and $t = fv$.

The following figure illustrates the concept above.

[Figure: the two factorizations $st$ and $uv$ of the same word, with the word $f$ between their cut points.]

Let us fix more notation of basic word properties. Let $w = w_{(1)} w_{(2)} \cdots w_{(n)}$ be a nonempty word. The length $n$ of $w$ is denoted by $|w|$. We define $|\varepsilon| = 0$. The number of occurrences of a nonempty word $u$ in $w$ is denoted by $|w|_u$. In particular, $|w|_a$, with $a \in A$, denotes the number of occurrences of the letter $a$ in $w$. The word $w_{(n)} \cdots w_{(2)} w_{(1)}$ is called the reversal of $w$, denoted by $(w)^{\sim}$. For short expressions we simply write $\widetilde{w}$ instead of $(w)^{\sim}$. For example, the fact that the reverse of a word is the reverse concatenation of the reverses of its factors is expressed by $(uv)^{\sim} = \widetilde{v}\,\widetilde{u}$. Palindromes are words that are the same when read from left to right or from right to left.
Formally, a word $w$ is called a palindrome if $w = \widetilde{w}$. Let $u$ and $v$ be in $A^+$. Then we say that $u$ overlaps $v$ from the left or from the right if there is a word $w$ such that $|w| < |u| + |v|$, and $u <_p w$ and $v <_s w$, or $u <_s w$ and $v <_p w$, respectively.

[Figure: a word $w$ in which $u$ and $v$ overlap.]

We say that $u$ overlaps with $v$ if $u$ and $v$ overlap from either left or right. The length of an overlap is defined by $|u| + |v| - |w|$. So, the longest overlap is actually determined by the shortest $w$ such that $u$ and $v$ occur in $w$. We say that $u$ intersects with $v$ if either $u$ and $v$ overlap or one is a factor of the other. Intuitively, two words intersect if we can shift them over one another and find a match.

2.1.2 Infinite Words

So far we have considered only finite sequences of letters. Let us now turn to the infinite case. As usual, $\mathbb{N}$ and $\mathbb{Z}$ denote the natural numbers and integers, respectively. Let $\mathbb{Z}_+ = \{1, 2, 3, \ldots\}$ denote the positive integers and $\mathbb{Z}_- = \mathbb{Z} \setminus \mathbb{Z}_+$ denote the nonpositive integers.

An infinite sequence of letters can be seen as a sequence indexed by a totally ordered infinite set. We use only countably infinite sets since we do not consider transfinite words here. In particular, we use $\mathbb{Z}_+$ and $\mathbb{Z}_-$ as index sets. A right-infinite word $w$ is a function $w : \mathbb{Z}_+ \to A$. A left-infinite word $w$ is a function $w : \mathbb{Z}_- \to A$. These words are called one-sided infinite words. In the literature a right-infinite word is usually defined as a function from $\mathbb{N}$ to $A$. We chose a slightly different notation here which will be more convenient for us later. The same notation for the $i$-th letter of a one-sided infinite word $w$ is used as for finite words, that is, $w_{(i)} = w(i)$. Let $A^\omega$ and ${}^\omega\!A$ denote the sets of all right- and left-infinite words over $A$, respectively.

Given a right- or left-infinite word, the concatenation to the right or left, respectively, is not defined. However, the concatenation of a finite word with a right- or left-infinite word from the left or right, respectively, is defined. Let $w \in A^*$ with $|w| = n$, and let $u \in A^\omega$ and $v \in {}^\omega\!A$. Then $wu \in A^\omega$ and $vw \in {}^\omega\!A$ such that $(wu)_{(i)} = w_{(i)}$ if $0 < i \le n$, and $(wu)_{(i)} = u_{(i-n)}$ if $n < i$, and $(vw)_{(i)} = w_{(n+i)}$ if $-n < i \le 0$, and $(vw)_{(i)} = v_{(i+n)}$ if $i \le -n$. Let $u \in A^\omega$ and $v \in {}^\omega\!A$. Then $u = u_{(1)} u_{(2)} u_{(3)} \cdots$ and $v = \cdots v_{(-2)} v_{(-1)} v_{(0)}$.
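The index conventions for concatenating a finite word with a one-sided infinite word can be made concrete by modelling infinite words as functions. The following Python sketch assumes this functional representation; the type aliases and the names `prepend` and `append` are hypothetical and chosen for illustration.

```python
from typing import Callable

# A right-infinite word maps i >= 1 to a letter; a left-infinite word maps i <= 0 to a letter.
RightInfinite = Callable[[int], str]
LeftInfinite = Callable[[int], str]


def prepend(w: str, u: RightInfinite) -> RightInfinite:
    """The right-infinite word wu for a finite word w with |w| = n."""
    n = len(w)

    def wu(i: int) -> str:
        if i < 1:
            raise IndexError("right-infinite words are indexed by positive integers")
        # (wu)_(i) = w_(i) if 0 < i <= n, and u_(i - n) if n < i
        return w[i - 1] if i <= n else u(i - n)

    return wu


def append(v: LeftInfinite, w: str) -> LeftInfinite:
    """The left-infinite word vw for a finite word w with |w| = n."""
    n = len(w)

    def vw(i: int) -> str:
        if i > 0:
            raise IndexError("left-infinite words are indexed by nonpositive integers")
        # (vw)_(i) = w_(n + i) if -n < i <= 0, and v_(i + n) if i <= -n
        return w[n + i - 1] if i > -n else v(i + n)

    return vw
```

As a usage example, with `u = lambda i: "a" if i % 2 else "b"` (the word ababab...), the first five letters of `prepend("cc", u)` are c, c, a, b, a.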
We also define $A^\infty = A^* \cup A^\omega$ and ${}^\infty\!A = A^* \cup {}^\omega\!A$. Naturally, the concepts of right- and left-infinite words suggest the definition of bi-infinite words, which can be thought of as joining a right- and a left-infinite word together. A bi-infinite word $w$ is a function $w : \mathbb{Z} \to A$. Bi-infinite words are also called two-sided infinite words. Similarly to one-sided infinite

words, we have that the $i$-th letter $w(i)$ of a bi-infinite word $w$ is denoted by $w_{(i)}$. We denote the set of all bi-infinite words over $A$ by ${}^\omega\!A^\omega$. We also define ${}^\infty\!A^\infty = A^* \cup {}^\omega\!A^\omega$. For bi-infinite words we have no way of concatenating words to either side. Yet, we can factorize a bi-infinite word $w = vu$ such that $v \in {}^\omega\!A$ and $u \in A^\omega$, so that we can write

$w = \cdots w_{(-2)} w_{(-1)} w_{(0)} w_{(1)} w_{(2)} w_{(3)} \cdots = \cdots v_{(-2)} v_{(-1)} v_{(0)} u_{(1)} u_{(2)} u_{(3)} \cdots$

Let $X \subseteq A^*$. We write $X^\omega$, ${}^\omega\!X$, and ${}^\omega\!X^\omega$ for the sets of right-, left-, and bi-infinite words that can be factorized into elements of $X$, respectively. Again, we identify singleton sets with their element wherever convenient. For example, $\{xxx\cdots\} = xxx\cdots = x^\omega$ where $x \in X^+$.

Now we can redefine the notion of factor in a more general way than for finite words. Let $f \in A^*$ be a finite word. Then $f$ is called a factor of a word $w \in {}^\infty\!A^\infty$ if $w = ufv$, where $u \in {}^\infty\!A$ and $v \in A^\infty$. Moreover, $f$ is called a proper factor if $f$ is not empty and $u$ and $v$ are not both empty, that is, $f \ne \varepsilon$ and $uv \ne \varepsilon$. A factor $f \in A^*$ is called a prefix of a word $w \in A^\infty$, denoted by $f \le_p w$, if $w = fu$ where $u \in A^\infty$. Similarly, $f$ is called a suffix of a word $w \in {}^\infty\!A$, denoted by $f \le_s w$, if $w = uf$ where $u \in {}^\infty\!A$. A prefix or suffix is called proper if it is a proper factor. A finite word $u \in A^*$ is called prefix-compatible, or just compatible, with a word $w \in A^\infty$ if $u \le_p w$ or, in case $w$ is finite, $w \le_p u$. Similarly, $u$ is called suffix-compatible with $w \in {}^\infty\!A$ if $u \le_s w$ or, in case $w$ is finite, $w \le_s u$.

The concepts of reversal and overlap, as known from the finite case, will be defined for infinite words next. The reversal of $w$, denoted by $(w)^{\sim}$ and abbreviated as $\widetilde{w}$, is defined as follows. If $w \in A^\omega$ then $\widetilde{w} \in {}^\omega\!A$ and $\widetilde{w}_{(1-i)} = w_{(i)}$ for all $0 < i$. Symmetrically, if $w \in {}^\omega\!A$ then $\widetilde{w} \in A^\omega$ and $\widetilde{w}_{(1-i)} = w_{(i)}$ for all $i \le 0$. If $w \in {}^\omega\!A^\omega$ then $\widetilde{w} \in {}^\omega\!A^\omega$ and $\widetilde{w}_{(-i)} = w_{(i)}$.

2.2 Repetitions

The most basic property of words is repetitiveness, that is, the degree to which we find factors of a word occurring repeatedly at different positions.
One could say that repetitiveness measures the degree of uniformity or, in its absence, the complexity of a word. The most primitive way of noting a repetition in a word is to see whether or not it overlaps with itself, which is captured by the notion of borderedness. Surely, considering the overlap of every factor of a word with itself is an obvious next step, which leads us to the value $\mu(w)$ of a word $w$ denoting the maximum length of its unbordered factors. That means that every factor of $w$ longer than $\mu(w)$ is actually bordered, that is, it overlaps with itself.

If the size of the borders of these factors fulfills a certain criterion described further below, we arrive at the concept of the global period $p(w)$ of $w$. Actually, the global period is the most fundamental property of a word that is studied in the field of combinatorics on words. We can say that $p(w)$ is obtained by restricting the definition of $\mu(w)$ by adding a length constraint; however, $p(w)$ can also be directly defined as the difference between the length of $w$ and the length of the longest overlap of $w$ with itself. Little is known about the relation between $\mu(w)$ and $p(w)$, which is rather surprising given that borderedness and periodicity of words are considered to be such fundamental concepts. Their relation will be one of the major subjects of investigation of this thesis; see Chap. 5.

Similarly to finite words, the concept of periodicity can also be defined for infinite words. However, the period of infinite words is not always properly defined since they can have unbordered factors of unbounded length. Nevertheless, an infinite word $w$ can be considered to be almost periodic when it becomes so after removing just a finite part of it, that is, there is an infinite part of $w$ that has a period. In that case $w$ is called ultimately periodic. Questions of periodicity of bi-infinite words will be treated in Sect. 3.3 of this thesis.

So far, we have considered repetitiveness only as a global property of a word. However, it is natural to also ask questions about local repetitions of a word $w$, for example: How long is the shortest factor that has two occurrences, one next to the other, at a certain point $p$ in $w$? This length is denoted by the local period $p(w, p)$ of $w$ at point $p$. Actually, the local and the global period of a word are in a well-known relation, see for example the work of Césari and Vincent (1978), Duval (1979), Mignosi, Restivo, and Salemi (1998), and Lepistö (2002).
This relation is highlighted, in the case of finite words, by the concept of critical factorizations. Critical factorizations have also found applications in, for example, string matching algorithms as demonstrated by Crochemore and Perrin (1991) and Breslauer, Jiang, and Jiang (1997). However, still not much is known about critical factorizations. They will be another major subject of this thesis, see Chap. 3.

2.2.1 Border

A basic concept of repetitiveness following from the notion of overlap is the one of border and borderedness. A nonempty word $g$ is called a border of a word $w$ if $w = gv = ug$ where $u$ and $v$ are not empty. So, a border of a word $w$ is a proper factor that occurs both as a prefix and as a suffix of $w$. We call $w$ bordered if it has a border; otherwise $w$ is called unbordered. So, a word $w$ is bordered if it overlaps with itself. Let $w$ be a word bordered by $f$. Moreover, let $f$ itself be bordered by $g$. Then it is not hard to see that the border $g$ of $f$ is also a border of $w$.

[Figure: a word $w$ with border $f$, where $f$ itself has the border $g$.]

So, every bordered word $w$ has a shortest border $h$ where $h$ itself is an unbordered word, and moreover, $w = hvh$ for some $v \in A^*$.

Example 2.2. Consider $w = abaabaab$, which has a border $abaab$. But $abaab$ itself is bordered by $ab$, which is also a border of $w$. We have that $ab$ is the shortest border of $w$, and $ab$ itself is not bordered.

We denote the maximum length of unbordered factors of a word $w$ by $\mu(w)$. In other words, $\mu(w)$ denotes the smallest length such that any factor $f$ of $w$ longer than $\mu(w)$ overlaps with itself. The following example will be used several times in this thesis. It is taken from Assous and Pouzet (1979).

Example 2.3. Consider $w = a^n (b a^{n+1} b a^n b a^{n+2}) b a^n b a^{n+1} b a^n$ for some $n \ge 0$. Then $|w| = 7n + 10$ and $\mu(w) = 3n + 6$, where we indicated one unbordered factor of length $\mu(w)$ by parentheses.

2.2.2 Global Period

The most essential notion for repetitions throughout a word is the one of its (global) period. An integer $1 \le p \le n$ is called a period of a word $w$ of length $n$ if $w_{(i)} = w_{(i+p)}$ for all $1 \le i \le n - p$. Note that $|w|$ is always a period of $w$. The smallest period of $w$ is called the minimum period or the global period of $w$, denoted by $p(w)$. We define $p(\varepsilon) = 0$. Note that an unbordered word $w$ has only its own length as a period, and hence, is only $u$-interpreted if $|u| \ge |w|$. Indeed, if there is a shorter word $u$ than $w$ such that $w$ occurs in $u^k$, with $k \ge 2$, then $|u|$ is a period of $w$ shorter than $|w|$.

The following proposition highlights a basic property of periods. It was discovered by Fine and Wilf (1965). Alternative proofs can be found; see for example those by Perrin (1983), by Halava, Harju, and Ilie (2000), or by Berstel and Karhumäki (2003). As usual, $\gcd(p, q)$ denotes the greatest common divisor of $p$ and $q$.

Proposition 2.4 (Fine & Wilf). If a word $w$ has two periods $p$ and $q$ such that $|w| \geq p + q - \gcd(p, q)$, then $w$ also has the period $\gcd(p, q)$.

Recall that $\mu(w)$ denotes the smallest length such that any factor $f$ of $w$ longer than $\mu(w)$ overlaps with itself. This differs from the concept of the period $p(w)$ of $w$, which additionally requires that the overlap is at least of length $|f| - p(w)$. Let us point out the rather obvious fact that $\mu(w) \leq p(w)$ for any word $w \in A^*$. Indeed, $p(w)$ is a period of any factor of $w$ longer than $p(w)$, and hence, any such factor is bordered.

Example 2.5. Consider again $w = (a^n b a^{n+1} b a^n b a^{n+2} b)(a^n b a^{n+1} b a^n)$ with $\mu(w) = 3n + 6$ from Example 2.3. We have $\mu(w) < p(w) = 4n + 7$, where we indicated a factorization of $w$ at point $p(w)$ by parentheses.

The next notion gives a numeric value to how much a word repeats itself by relating the period of a word to its length. We define the index of a nonempty word $w$ by
$\mathrm{ind}(w) = \frac{|w|}{p(w)}$
and let $\mathrm{ind}(\varepsilon) = 1$ by convention. Certainly, $\mathrm{ind}(w) \geq 1$ since $|w|$ is a period of any word $w$. If $\mathrm{ind}(w) = 1$, then $w$ is unbordered. A word $w$ overlaps with itself if and only if $\mathrm{ind}(w) > 1$, that is, its period is less than its length.

The following proposition states a basic fact about words.

Proposition 2.6. Let $u, v \in A^*$. Then $uv = vu$ if and only if there exists a word $f \in A^*$ such that $u, v \in f^*$.

Indeed, let us assume that $|u| \leq |v|$ without restriction of generality. Then there is a word $f$ such that $v = fu$ and $v = uf$ by Proposition 2.1. So, $fu = uf$, and the result follows by induction on the length of the words.

[Figure: the words $uv$ and $vu$ aligned, showing $v = fu = uf$.]

A word $f$ is called a root of a word $w$ if $w = f^k$ for some $k \geq 1$. A word $w$ is called primitive if it is its only root, that is, if $w = u^k$ implies that $k = 1$. It is clear that every nonempty word has a primitive root. Moreover, it follows from Proposition 2.6 that the primitive root of a word is unique.
Observe that Proposition 2.6 implies that any word $w$ has at most one $u$-interpretation for every primitive word $u$ with $|u| \leq |w|$.
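The notions introduced so far (borders, the global period $p(w)$, the maximum length $\mu(w)$ of unbordered factors, the index, and primitive roots) can be explored by brute force on small words. The following Python sketch is only an illustration and not part of the original development; all function names are chosen here for illustration, and the code favours clarity over efficiency.

```python
from fractions import Fraction

def borders(w):
    """All borders of w: nonempty proper prefixes that are also suffixes."""
    return [w[:k] for k in range(1, len(w)) if w[:k] == w[-k:]]

def is_unbordered(w):
    return not borders(w)

def period(w):
    """Global period p(w): smallest p with w[i] == w[i+p] for all valid i."""
    n = len(w)
    for p in range(1, n + 1):
        if all(w[i] == w[i + p] for i in range(n - p)):
            return p
    return 0  # p(empty word) = 0 by convention

def mu(w):
    """Maximum length of an unbordered factor of w (brute force)."""
    n = len(w)
    return max((j - i for i in range(n) for j in range(i + 1, n + 1)
                if is_unbordered(w[i:j])), default=0)

def index(w):
    """ind(w) = |w| / p(w), with ind(empty word) = 1."""
    return Fraction(len(w), period(w)) if w else Fraction(1)

def primitive_root(w):
    """Shortest f with w = f^k for some k >= 1 (w nonempty)."""
    n = len(w)
    for d in range(1, n + 1):
        if n % d == 0 and w[:d] * (n // d) == w:
            return w[:d]

def assous_pouzet(n):
    """The word of Example 2.3 for a given n."""
    a = lambda k: "a" * k
    return (a(n) + "b" + a(n + 1) + "b" + a(n) + "b" + a(n + 2) + "b"
            + a(n) + "b" + a(n + 1) + "b" + a(n))

print(borders("abaabaab"))        # ['ab', 'abaab'], as in Example 2.2
w = assous_pouzet(1)
print(len(w), mu(w), period(w))   # 17 9 11, i.e. 7n+10, 3n+6, 4n+7 for n = 1
```

For $n = 1$ the output agrees with Examples 2.3 and 2.5, including the strict inequality $\mu(w) < p(w)$.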

Proposition 2.7. Let $u, v, w \in A^+$. Then $ww = uwv$ if and only if there exists a word $f \in A^+$ such that $u, v, w \in f^+$.

Indeed, the following figure shows that $ww = uwv$ implies $uv = vu$.

[Figure: the word $ww$ aligned with the factorization $uwv$.]

So far, everything in this subsection has been defined for finite words only. However, the concept of periodicity is also a natural one for infinite words. Let $p \in \mathbb{Z}_+$. Then $p$ is a period of an infinite word $w$ if $w_{(i)} = w_{(i+p)}$ for all $i$, if $w \in {}^{\omega}A^{\omega}$, or all $1 \leq i$, if $w \in A^{\omega}$, or all $i \leq -p$, if $w \in {}^{\omega}A$. Again, $p(w)$ denotes the smallest period of $w$ if it exists. Contrary to the finite case, a period does not have to exist for an infinite word $w$. So, $p(w)$ is undefined if $w$ has no period. Note that, if $w$ has a period, then there is a word $u \in A^+$ with $|u| = p(w)$ such that $w = u^{\omega}$, $w = {}^{\omega}u$, or $w = {}^{\omega}u^{\omega}$ for $w \in A^{\omega}$, $w \in {}^{\omega}A$, and $w \in {}^{\omega}A^{\omega}$, respectively. For infinite words the notion of periodicity might be a bit too stringent. For example, a right-infinite word can be periodic except for a finite prefix. We therefore relax the concept a little and define the following. A right-infinite or left-infinite word $w$ is called ultimately periodic if $w = vu^{\omega}$ or $w = {}^{\omega}uv$, respectively, with $v \in A^*$.

Local Period

The local period at a given point of a word denotes the shortest length of a factor that repeats itself at the given point in that word. Let $w$ be any word, possibly infinite. Let us first extend our definition of a point from page 7. An integer $p$, with $1 \leq p < |w|$, if $w$ is finite, or $0 < p$, if $w \in A^{\omega}$, or $p < 0$, if $w \in {}^{\omega}A$, or any integer $p$, if $w \in {}^{\omega}A^{\omega}$, is called a point in $w$. A finite, nonempty word $u$ is called a repetition word at point $p$ of $w$ if $w = xy$ is a factorization of $w$ at point $p$, and $u$ is prefix-compatible with $y$ and suffix-compatible with $x$. For example, let $w$ be a finite word and $z$ be a repetition word at some point $p$ of $w$. Then we have the following possible situations.

[Figure: a repetition word $z$ at point $p$ of $w$, in the internal and the external case.]

[Figure: a repetition word $z$ at point $p$ of $w$, in the left-external, right-internal case and in the left-internal, right-external case.]

For a point $p$ in $w$, let
$p(w, p) = \min\{\, |u| : u \text{ is a repetition word of } w \text{ at point } p \,\}$
denote the local period at point $p$ in $w$. For infinite words, a repetition word at point $p$ might not exist, in which case $p(w, p)$ is undefined. Note that the repetition word of length $p(w, p)$ at point $p$, if it exists, is unbordered since a border of a repetition word gives a shorter repetition word at the point $p$. Moreover, $p(w, p) \leq p(w)$ since there is a repetition word of length $p(w)$ at every point of $w$. Let $w$ be finite in the following. A factorization $w = uv$, with $u$ and $v$ not empty and $|u| = p$, is called critical if $p(w, p) = p(w)$, and, if this holds, then $p$ is called a critical point; otherwise it is called a noncritical point. Let $\eta(w)$ denote the number of critical points in a word $w$. We will often indicate critical points by dots, as in the following example.

Example 2.8. The word $w = ab.aa.b$ has the minimum period $p(w) = 3$ and two critical points, 2 and 4, marked by dots. The shortest repetition words at the critical points are $aab$ and $baa$, respectively. Note that the shortest repetition words at the remaining points 1 and 3 are $ba$ and $a$, respectively.

2.3 Conjugacy

So far we have considered finite words only as sequences with a beginning and an end. However, one could think of a word also as a circular object in the sense that the last letter is followed by the first one and the word has no explicit beginning and end. Given such a circular object that was formed by joining the two ends of a word $w$, we can, of course, recover $w$ by cutting the circle at the point where we joined it, but we could also choose to cut the circle at a different point and we might then read a different word $w'$. Surely, $w$ and $w'$ are of the same length, but they are also very similar in the sense that they contain the same number of each letter and that these letters occur in the same

order except for one point where this order is broken. We say that $w$ and $w'$ are conjugates in that case. The concept of conjugacy plays, like repetitions, an important rôle in the field of combinatorics on words by giving us a nontrivial notion of similarity of words. It will appear in many places throughout this thesis. In particular, we investigate the relation between conjugacy and borderedness more closely in Chap. 4. However, the concept of conjugacy is not just a natural theoretical object, but also appears in applied areas like computer architecture where it is, for example, a fundamental operation in microprocessors.

Let us now define conjugacy more formally. Let a mapping $\sigma : A^* \to A^*$, with $\sigma(\varepsilon) = \varepsilon$ and $\sigma(aw) = wa$ for all $w \in A^*$ and $a \in A$, be called a cyclic shift. We call two words $u$ and $v$ conjugates, denoted by $u \sim v$, if $u = \sigma^k(v)$ for some $k \geq 0$, that is, $u = xy$ and $v = yx$, where $|y| \equiv k \pmod{|u|}$. Clearly, $\sim$ is an equivalence relation. Let $[u] = \{ v \mid u \sim v \}$ denote the conjugacy class of $u$.

Example 2.9. Let $w = aaab$. Consider $[w] = \{aaab, aaba, abaa, baaa\}$. Note that all elements in $[w]$ are primitive, and that the period is not the same for all elements, for instance, $p(aaab) = 4$ and $p(aaba) = 3$.

We also have the following straightforward facts. See also Shyr and Thierrin (1977) for Proposition 2.10 and Proposition 2.11.

Proposition 2.10. If $u \sim v$, then $u$ is primitive if and only if $v$ is primitive.

Indeed, a non-primitive word has no primitive conjugate. Let $w = fg = x^k$, with $k \geq 2$. Then $f = x^i x_0$ and $g = x_1 x^j$, where $x_0 x_1 = x$ and $k = i + j + 1$, and hence, $(x_1 x_0)^k = gf \sim w$.

[Figure: the word $w = fg = x^k$ with $x = x_0 x_1$, and its conjugate $gf = (x_1 x_0)^k$.]

The next statement follows immediately.

Proposition 2.11. If $uv = x^k$ and $x$ is primitive, then there exists a primitive word $y$ such that $vu = y^k$. Moreover, $x \sim y$.

The following proposition shows that two words $u$ and $v$ are conjugates if either $u = v$ or there exists a word $w$ such that $wu = vw$. See also Lyndon and Schützenberger (1962).
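Before moving on, the cyclic shift and the conjugacy class can be made concrete in a few lines. The Python sketch below is an illustration only, with helper names chosen here; it reproduces Example 2.9 and checks Proposition 2.10 on two small conjugacy classes.

```python
def cyclic_shift(w):
    """sigma(aw) = wa: move the first letter to the end."""
    return w[1:] + w[:1]

def conjugacy_class(w):
    """The set [w] of all conjugates of w."""
    out, v = set(), w
    for _ in range(max(len(w), 1)):
        out.add(v)
        v = cyclic_shift(v)
    return out

def period(w):
    """Global period p(w) of a nonempty word w."""
    n = len(w)
    return next(p for p in range(1, n + 1)
                if all(w[i] == w[i + p] for i in range(n - p)))

def is_primitive(w):
    """True if w is nonempty and not a proper power of a shorter word."""
    n = len(w)
    return n > 0 and all(w[:d] * (n // d) != w
                         for d in range(1, n) if n % d == 0)

print(sorted(conjugacy_class("aaab")))   # ['aaab', 'aaba', 'abaa', 'baaa']
print(period("aaab"), period("aaba"))    # 4 3
print(all(is_primitive(v) for v in conjugacy_class("aaab")))   # True
print(any(is_primitive(v) for v in conjugacy_class("abab")))   # False
```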

Proposition 2.12. Let $u, v, w \in A^*$. Then $wu = vw$ if and only if there exist two words $f, g \in A^*$ such that $fg$ is primitive and $w \in (fg)^* f$ and $u \in (gf)^*$ and $v \in (fg)^*$.

Indeed, the case is clear if either $w = \varepsilon$ or $u = v = \varepsilon$. Let us assume that $|w| \leq |u|$. Then there exists a word $h$ such that $v = wh$ and $u = hw$ by Proposition 2.1. Let $fg$ be the primitive root of $v$ such that $w \in (fg)^* f$, and we have that $h \in (gf)^* g$, and consequently, $u \in (gf)^*$ and $v \in (fg)^*$. In case $|u| < |w|$, there exists a word $h$ such that $w = vh = hu$ by Proposition 2.1, and the claim follows by induction on the length of the equation. See the following figure as an illustration.

[Figure: the equation $wu = vw$ with $w$, $u$, and $v$ decomposed into factors $f$ and $g$.]

The following proposition states a basic fact that is often used.

Proposition 2.13. Every primitive word has an unbordered conjugate.

Actually, a primitive word has at least as many unbordered conjugates as there are different letters occurring in it, that is, a primitive word of length larger than one has at least two unbordered conjugates. This is so because every letter of a word is minimal in a different order on the alphabet and because every Lyndon word is unbordered. Orderings and Lyndon words are introduced in the next section.

2.4 Orderings

The concept of ordering of words proves to be very useful when reasoning about words. In particular, we use the lexicographic ordering of words here. The maximal or minimal element, with respect to a lexicographic order, of a set of words often provides witnesses with a particular property, which allows very elegant proof arguments. A particularly interesting example of that is the use of the maximum word, with respect to some lexicographic order, in the set of all suffixes of a word. An application of such words is demonstrated in Chap. 3. Another interesting word is the minimal element of a conjugacy class of words.
Those words were introduced by Lyndon (1954; 1955) and will play a particular rôle in a later chapter. Let $\lhd$ be an ordering of $A = \{a_1, a_2, \ldots, a_n\}$, say $a_1 \lhd a_2 \lhd \cdots \lhd a_n$. Then $\lhd$ induces a lexicographic order on $A^*$ defined by
$u \lhd v \iff u \leq_p v$ or $u = xau'$ and $v = xbv'$ with $a \lhd b$

where $a, b \in A$. A suffix $v$ of $w$ is called maximal w.r.t. $\lhd$ if $v' \lhd v$ for any suffix $v'$ of $w$. A prefix $u$ of $w$ is called maximal w.r.t. $\lhd$ if $\tilde{u}' \lhd \tilde{u}$ for any prefix $u'$ of $w$. We will identify orders on alphabets and their respective induced lexicographic orders throughout this thesis. Let $\lhd^{-1}$ denote the inverse order, say $a_n \lhd^{-1} \cdots \lhd^{-1} a_2 \lhd^{-1} a_1$, of $\lhd$. Let $\tau_{\lhd}(w)$ and $\tau_{\rhd}(w)$ ($= \tau_{\lhd^{-1}}(w)$) denote the maximal suffixes of $w$ with respect to $\lhd$ and $\lhd^{-1}$, respectively, and let $\pi_{\lhd}(w)$ and $\pi_{\rhd}(w)$ ($= \pi_{\lhd^{-1}}(w)$) denote the maximal prefixes of $w$ with respect to $\lhd$ and $\lhd^{-1}$, respectively. If the context is clear, we may write $\tau_{\lhd}$, $\tau_{\rhd}$, $\pi_{\lhd}$, and $\pi_{\rhd}$ for $\tau_{\lhd}(w)$, $\tau_{\rhd}(w)$, $\pi_{\lhd}(w)$, and $\pi_{\rhd}(w)$, respectively. Observe the following facts for any word $w \in A^+$ and order $\lhd$ on $A$:
1. $\tau_{\lhd}(w) \neq \tau_{\rhd}(w)$ where $w$ has at least two different letters,
2. $w \neq \pi_{\lhd}(w)\tau_{\lhd}(w)$ since otherwise $a \leq_s \pi_{\lhd}(w)$ and $a \leq_p \tau_{\lhd}(w)$ for some $a \in A$ and neither $\pi_{\lhd}(w)$ nor $\tau_{\lhd}(w)$ is maximal.

Example 2.14. Consider the following palindrome
$w = abaaabbabbbaabaabbbabbaaaba$
together with the ordering $a \lhd b$. Then
$\tau_{\lhd} = bbbabbaaaba$, $\tau_{\rhd} = aaaba$, $\pi_{\lhd} = abaaabbabbb$, $\pi_{\rhd} = abaaa$.
In general, we have $\tau_{\lhd} = \widetilde{\pi_{\lhd}}$ and $\tau_{\rhd} = \widetilde{\pi_{\rhd}}$ for all palindromes.

Note the following properties of the lexicographic order:
1. $u \lhd v$ if and only if $su \lhd sv$ for all $s \in A^*$,
2. $u \not\leq_p v$ and $u \lhd v$ if and only if $us \lhd vt$ for all $s, t \in A^*$,
3. $u \lhd v \lhd ut$ if and only if $v = us$ and $s \lhd t$,
4. $u \lhd v$ and $u \lhd^{-1} v$ if and only if $u \leq_p v$.

Moreover, note that even though $a \lhd b$ is equivalent to $b \lhd^{-1} a$ for letters, this does not hold for words anymore.

Let $\lhd$ be some lexicographic order on $A^*$. A nonempty word $w \in A^*$ is called a Lyndon word with respect to $\lhd$ if $w$ is primitive and $w \lhd u$ for all $u \in [w]$. Let $L_{\lhd}(A)$ denote the set of all Lyndon words over $A$ with respect to $\lhd$. Moreover, let
$L(A) = \bigcup_{\text{order } \lhd \text{ on } A} L_{\lhd}(A)$.

We simply write $L_{\lhd}$ or $L$ if $A$ is clear from the context. A word in $L$ is simply called a Lyndon word, omitting the order. Note that every Lyndon word is unbordered. Indeed, if $w = uvu$ then either $vuu \lhd w$, if $vu \lhd uv$, or $uuv \lhd w$, if $uv \lhd vu$.

Example 2.15. Consider again $w = aaab$ from Example 2.9. Then $[w] = \{aaab, aaba, abaa, baaa\}$, and $aaab$ and $baaa$ are Lyndon words with respect to an order where $a \lhd b$ or $b \lhd a$, respectively.

Lyndon words can be defined in several ways, as shown by Chen, Fox, and Lyndon (1958).

Proposition 2.16. A word $w \in A^+$ is in $L_{\lhd}$ if and only if it satisfies one of the following conditions: either $w \in A$ or $w = uv$ such that $u, v \in L_{\lhd}$ and $u \lhd v$; $w \lhd v$ for all proper suffixes $v$ of $w$; $w \lhd \sigma^k(w)$ for all $0 < k < |w|$.

Lyndon words have numerous applications in the field of combinatorics on words. For now, let us just note that they provide witnesses for Proposition 2.13 in the previous section, and that every word has a unique factorization into a non-increasing sequence of Lyndon words, as established by Lyndon (1955).

Theorem 2.17. Let $w \in A^+$ and $\lhd$ be an ordering of $A$. Then there exists a factorization $w = u_1 u_2 \cdots u_k$ where $u_k \lhd \cdots \lhd u_2 \lhd u_1$ and $u_i \in L_{\lhd}$ for all $1 \leq i \leq k$. Moreover, this factorization is unique.

Example 2.18. Consider $w = abaababaabaab$ and $a \lhd b$. Then
$w = (ab)(aabab)(aab)(aab) = (a)(baa)(babaabaa)(b)$
where the parentheses indicate the factorization of $w$ for $\lhd$ and $\lhd^{-1}$, respectively, such that $aab \lhd aab \lhd aabab \lhd ab$ and $b \lhd^{-1} babaabaa \lhd^{-1} baa \lhd^{-1} a$, and $\{aab, aabab, ab\} \subseteq L_{\lhd}$ and $\{a, baa, babaabaa, b\} \subseteq L_{\lhd^{-1}}$.
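The factorization of Theorem 2.17 can be computed in linear time. The following Python sketch uses Duval's classical algorithm, which is not discussed in the text and is brought in here only for illustration; the function names and the convention of passing an order as a string of letters from smallest to largest are assumptions of this sketch.

```python
from itertools import product

def lyndon_factorization(w, order):
    """Factor w into a non-increasing sequence of Lyndon words.

    `order` lists the letters from smallest to largest (Duval's algorithm).
    """
    rank = {c: r for r, c in enumerate(order)}
    key = [rank[c] for c in w]
    n, i, factors = len(w), 0, []
    while i < n:
        j, k = i + 1, i
        while j < n and key[k] <= key[j]:
            k = i if key[k] < key[j] else k + 1
            j += 1
        while i <= k:                      # emit copies of the Lyndon factor
            factors.append(w[i:i + j - k])
            i += j - k
    return factors

def is_lyndon(w, order):
    """w is strictly smaller than each of its proper cyclic shifts."""
    rank = {c: r for r, c in enumerate(order)}
    key = lambda s: [rank[c] for c in s]
    return len(w) > 0 and all(key(w) < key(w[k:] + w[:k])
                              for k in range(1, len(w)))

def is_unbordered(w):
    return not any(w[:k] == w[-k:] for k in range(1, len(w)))

def lyndon_words(n, order):
    """All Lyndon words of length n over the letters of `order`."""
    return ["".join(t) for t in product(order, repeat=n)
            if is_lyndon("".join(t), order)]

print(lyndon_factorization("abaababaabaab", "ab"))  # ['ab', 'aabab', 'aab', 'aab']
print(lyndon_factorization("abaababaabaab", "ba"))  # ['a', 'baa', 'babaabaa', 'b']
```

The two outputs are the factorizations of Example 2.18 for $a \lhd b$ and for the inverse order. The helper `lyndon_words` can be used to confirm on small lengths that every Lyndon word is unbordered.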

2.5 Morphisms

So far, we have only looked at sets of words, such as $A^*$, and properties directly defined on words, such as periodicity. However, an important tool to relate words to one another is a structure-preserving mapping, a morphism. Structure-preserving means that the image of a word is the same as the product of the images of its factors. The concept of morphism is a fundamental one in mathematics which relates algebraic structures. It also proves to be useful in combinatorics on words, for instance to define patterns in words, and it gives us a tool for a finite representation of infinite words.

In our general algebraic setting, that is, semigroup theory, a morphism is defined as follows. Let $(S, \cdot)$ and $(T, \circ)$ be semigroups. A mapping $\gamma : S \to T$ is called a morphism if $\gamma(x \cdot y) = \gamma(x) \circ \gamma(y)$ for all $x, y \in S$. However, we will restrict ourselves to word semigroups only. Let $A$ and $B$ be alphabets. A mapping $\varphi : A^* \to B^*$ is called a word morphism, or simply morphism, if $\varphi(uv) = \varphi(u)\varphi(v)$ for all $u, v \in A^*$. It is clear that morphisms are determined by their actions on the letters of the domain alphabet. Therefore, we will mainly describe morphisms by their action on letters in the following. It is also clear that $\varphi(\varepsilon) = \varepsilon$ for any morphism $\varphi$. A morphism $\varphi$ is called nonerasing if $\varphi(a) \neq \varepsilon$ for all $a \in A$. We will assume all morphisms to be nonerasing in the following unless otherwise stated.

Let $\varphi : A^* \to B^*$ be a morphism, and $\lhd_A$ and $\lhd_B$ be orders on $A$ and $B$. We say that $\varphi$ is Lyndon preserving if $\varphi(x) \in L_{\lhd_B}$ for every $x \in L_{\lhd_A}$. Richomme (2002) gave the following characterization of Lyndon preserving morphisms.

Theorem 2.19. A morphism $\varphi : A^* \to B^*$ is Lyndon preserving if and only if $\varphi(a)$ is a Lyndon word for every letter $a \in A$ and $\varphi$ preserves the lexicographic order.

In other words, $\varphi$ is Lyndon preserving if and only if $a_1 \lhd_A a_2$ implies that $\varphi(a_1) \lhd_B \varphi(a_2)$ for every $a_1, a_2 \in A$, and $\varphi(a) \in L_{\lhd_B}$ for every $a \in A$.
Let us extend the notion of morphism to infinite words such that for any $\varphi : A^* \to A^*$ and $w \in A^{\omega}$ we have $\varphi(w) = \varphi(w_{(1)})\varphi(w_{(2)})\varphi(w_{(3)}) \cdots$. A word $w \in A^{\omega}$ is called a fixed point of $\varphi$ if $\varphi(w) = w$.

Iterated Morphisms

Let $\varphi : A^* \to A^*$ be a morphism on the monoid $A^*$. Such a morphism $\varphi$ and a letter $a \in A$ define a sequence $(\varphi^k(a))_{k \geq 0}$ of words by iterating $\varphi$ on $a$. We call $\varphi$ prolongable if there exists an $a \in A$ such that $\varphi(a) = au$ with $u \in A^+$ and $\varphi^k(u) \neq \varepsilon$ for all $k \geq 0$.

In more general terms, let $(w_k)_{k \geq 1}$ be a sequence of finite words. We say that this sequence converges to an infinite word $g$ if every prefix of $g$ is a prefix of all but finitely many words in $(w_k)$. The word $g$ is unique and denoted by $g = \lim_{k \to \infty} w_k$. In particular, $(w_k)$ converges if $w_i$ is a proper prefix of $w_{i+1}$ for all $i \geq 1$.

In fact, we have that a prolongable morphism defines an infinite word. To see this, consider a morphism $\varphi$ on $A^*$ such that $\varphi(a) = au$ where $a \in A$ and $u \in A^+$. Let $w_k = \varphi^k(a)$ and $u_k = \varphi^k(u)$. Then clearly, $w_{k+1} = w_k u_k$; in particular, $w_k$ is a proper prefix of $w_{k+1}$. So, there exists an infinite word $g$ such that $g = \lim_{k \to \infty} \varphi^k(a)$. Let us describe $g$. It is not hard to see that $w_{k+1} = a u_0 u_1 \cdots u_k$. Therefore,
$g = a u \varphi(u) \varphi^2(u) \varphi^3(u) \cdots \varphi^k(u) \cdots$
and moreover, $g$ is a fixed point of $\varphi$.

We give the well-known example of the Fibonacci word next, which will also be used in Chap. 3.

Example 2.20 (Fibonacci Word). Let $A = \{a, b\}$. Consider the morphism $\phi : A^* \to A^*$ with
$a \mapsto ab$, $b \mapsto a$.
Now,
$F = \lim_{k \to \infty} \phi^k(a) = abaababaabaababaababaabaababaabaababaabab \cdots$
and we call $F$ the Fibonacci word and $\phi$ the Fibonacci morphism since the sequence $(|\phi^k(a)|)_{k \geq 0}$ forms the Fibonacci sequence (except for the missing first number of the sequence). The Fibonacci word is common knowledge; see for example (Berstel and Séébold, 2002) for reference.
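Iterating a prolongable morphism is straightforward to simulate. The sketch below is an illustration only: it represents a morphism as a Python dictionary from letters to words (a representation chosen here, not taken from the text) and computes the first words $\phi^k(a)$ of the Fibonacci morphism.

```python
def apply_morphism(phi, w):
    """Image of the finite word w under the morphism phi (dict: letter -> word)."""
    return "".join(phi[c] for c in w)

def iterate(phi, a, k):
    """The word phi^k(a)."""
    w = a
    for _ in range(k):
        w = apply_morphism(phi, w)
    return w

FIB = {"a": "ab", "b": "a"}   # the Fibonacci morphism of Example 2.20

words = [iterate(FIB, "a", k) for k in range(7)]
print([len(w) for w in words])   # [1, 2, 3, 5, 8, 13, 21]
print(words[5])                  # abaababaabaab
```

Each word in the list is a proper prefix of the next one, which is the convergence criterion stated above, and applying the morphism to a computed prefix of $F$ yields a longer prefix of $F$, in line with $F$ being a fixed point.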

Avoidability

A natural question about words is that of avoidability of a given pattern, for example a square, in a word. Let $X$ be an alphabet. We call a word $h$ in $X^*$ a pattern. Let $w$ be a (possibly infinite) word over $A$. We say that a pattern $h$ occurs in $w$ if there exists a (possibly erasing) morphism $\varsigma : X^* \to A^*$ such that $\varsigma(h)$ is a factor of $w$. We say that $w$ avoids the pattern $h$ if $h$ does not occur in $w$. Let $x, y \in X$ and $\varsigma(x) \in A^+$ and $\varsigma(y) \in A^*$. For example, if a word $w$ avoids $x^2$, then we have for every word $f$ in $A^+$ that $f^2$ is not a factor of $w$, that is, no squares occur in $w$. Consequently, $w$ is then called square-free. Similarly, if $w$ avoids $x^3$, $x^k$, or $xyxyx$, then $w$ is called cube-free, $k$-free, and overlap-free, respectively. If every conjugate of $w$ is overlap-free, then $w$ is called cyclically overlap-free. Note that there is an alternative point of view on how cyclic overlap-freeness should be defined. Since $ww$ contains all conjugates of $w$ as factors, one could consider all those words $w$ such that $ww$ has no overlapping factors. We call those words strongly cyclically overlap-free. The two definitions of cyclic overlap-freeness are not identical.

Example 2.21. Consider $v = abaabb$ and $w = vv = abaabbabaabb$. It is easy to check that $w$ is cyclically overlap-free. However, $w$ is not strongly cyclically overlap-free since $ww = vvvv$ contains $vvv$.

Note the following fact shown by Harju (1985).

Proposition 2.22. If $w \in A^*$ is a cyclically overlap-free word, then either $w = v^2$ for some word $v \in A^*$ or $w$ is strongly cyclically overlap-free.

Examples 2.23 and 2.24 introduce words which are square-free and overlap-free, respectively, and which will also be used in Chap. 3.

Example 2.23 (Thue Word). Let $A = \{a, b, c\}$. Consider the morphism $\vartheta : A^* \to A^*$ with
$a \mapsto abc$, $b \mapsto ac$, $c \mapsto b$.
Now,
$T = \lim_{k \to \infty} \vartheta^k(a) = abcacbabcbacabcacbacabcbabcacbabcbacabcb \cdots$
and we call $T$ the Thue word since it was thoroughly investigated by Thue (1912), who showed that $T$ is square-free, that is, no factors of the form $xx$ can be found in $T$. See also the translation of Thue's work by Berstel (1995). We call the morphism $\vartheta$ the Thue morphism.
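Square-freeness of a finite prefix of $T$ can be confirmed by a direct search for factors of the form $xx$. The Python sketch below is an illustration under the same dictionary representation of morphisms as a hypothetical convenience; the naive test is cubic in the length of the word and only meant for short prefixes.

```python
def iterate(phi, a, k):
    """The word phi^k(a) for a morphism phi given as dict: letter -> word."""
    w = a
    for _ in range(k):
        w = "".join(phi[c] for c in w)
    return w

def has_square(w):
    """True if w contains a factor xx with x nonempty (naive search)."""
    n = len(w)
    return any(w[i:i + l] == w[i + l:i + 2 * l]
               for l in range(1, n // 2 + 1)
               for i in range(n - 2 * l + 1))

THUE = {"a": "abc", "b": "ac", "c": "b"}   # the Thue morphism of Example 2.23

print(iterate(THUE, "a", 3))               # abcacbabcbac
print(has_square(iterate(THUE, "a", 7)))   # False
print(has_square("abcabc"))                # True
```

Such a finite check does not replace Thue's proof, of course; it only confirms that no square occurs in the computed prefix.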

Example 2.24 (Thue-Morse Word). Let $A = \{a, b\}$. Consider the morphism $\psi : A^* \to A^*$ with
$a \mapsto ab$, $b \mapsto ba$.
Now,
$M = \lim_{k \to \infty} \psi^k(a) = abbabaabbaababbabaababbaabbabaabbaababba \cdots$
and $M$ is commonly called the Thue-Morse word or also the Prouhet-Thue-Morse word. We stick to the former name and also call $\psi$ the Thue-Morse morphism for the rest of this thesis. Thue (1906) introduced the word $M$ and showed that it is overlap-free, that is, it does not contain a factor of the form $xyxyx$. However, this word has emerged independently in several places. For example, it was rediscovered by Morse (1921), and already mentioned by Prouhet (1851). See the article by Allouche and Shallit (1999) for a survey about this word.

2.6 A Classical Example

Lyndon and Schützenberger (1962) proved that the equation $x^i = y^j z^k$, with $i, j, k \geq 2$, has only cyclic solutions in a free group. This implies for the case of free semigroups that $x$, $y$, and $z$ are powers of the same word if $x^i = y^j z^k$ holds. Lyndon and Schützenberger's result received a lot of attention. In particular, several direct proofs for the special case of words have been proposed, for example by Chu and Town (1978), Choffrut (1983), and Maňuch (2002). We will present the shortest proof of Theorem 2.25 known to the author. This shall serve as an example of an application of some of the notions and facts introduced in this chapter. This section is based on (Harju and Nowotka, 2004a).

Theorem 2.25. Let $x, y, z \in A^*$. If $x^i = y^j z^k$ with $i, j, k \geq 2$, then $x, y, z \in w^*$ for some $w \in A^*$.

Proof. Assume without loss of generality that $x$, $y$ and $z$ are primitive. The case is clear if $y^p \in x^+$ or $z^q \in x^+$ for some $1 \leq p \leq j$ and $1 \leq q \leq k$. Suppose there is no $w \in A^*$ such that $x, y, z \in w^*$. So, let $y^p \notin x^+$ and $z^q \notin x^+$ for any $1 \leq p \leq j$ or $1 \leq q \leq k$. If $|y| > |x|$ or $|z| > |x|$, then $y$ or $z$ has more than one $x$-interpretation, respectively, and hence, is not primitive; a contradiction. So, let $|y| < |x|$ and $|z| < |x|$.
If $i > 2$ then $x^i = x_0 f^{i-1} x_1$ where $f$ is an unbordered conjugate of $x = x_0 x_1$. But $f$ is a factor of $y^j$ or $z^k$, and thus bordered; a contradiction.

If $i = 2$ then we can assume, by symmetry, that $|y^j| > |z^k|$. Assume also that $x$ is of minimal length. Now, $y^j = xu^r$ and $u^r z^k = x$ for some primitive

word $u$. From $u^{2r} z^k = u^r x$ and $y^j = xu^r$ it follows that $u^{2r} z^k = v^j$ for some primitive word $v$. We have that $j = 2$ by the previous paragraph. But now, $|v| = |y| < |x|$ contradicts the minimality of $x$, which proves the claim.

Lyndon and Schützenberger's problem can be generalized in several ways. For example, Lentin (1965) considered the equation $x^i = y^j z^k w^l$, and Appel and Djorup (1968) investigated $x^k = z_1^k z_2^k \cdots z_n^k$. However, those equations do not only permit periodic solutions. Note that the case $i > 2$ of the proof of Theorem 2.25 gives an immediate proof of the following fact for a more general equation.

Proposition 2.26. Let $n \geq 2$ and $x, z_i \in A^*$, for all $1 \leq i \leq n$, be primitive. If $x^k = z_1^{k_1} z_2^{k_2} \cdots z_n^{k_n}$ with $k, k_i \geq 2$, for all $1 \leq i \leq n$, then for every $1 \leq i \leq n$ either $|z_i^{k_i}| < |x| + |z_i|$ or $z_i$ and $x$ are conjugates.

Moreover, Theorem 2.27 below shows that not every $z_i$ in the above proposition can be a conjugate of $x$. We prove that the power of a primitive word cannot be factorized into conjugates of a primitive word.

Theorem 2.27. Let $n \geq 2$ and $x, z_i \in A^*$, for all $1 \leq i \leq n$, be primitive. If $x^k = z_1 z_2 \cdots z_n$ with $k \geq 2$ and $z_i$ is a conjugate of a primitive word $z$, for all $1 \leq i \leq n$, then $n = k$ and $z_i = x$, for all $1 \leq i \leq n$.

Proof. Assume that the claim does not hold, and consider a shortest counterexample $x$ for which $n \neq k$ in the statement of the theorem (over some alphabet $A$). We can assume that $\gcd(k, n) = 1$. Indeed, if $d = \gcd(k, n)$ then $z_1 z_2 \cdots z_{n/d} = x^{k/d}$ and we have an equivalent solution to the original equation.

Now, $|x|/n = |z|/k \in \mathbb{N}$, and thus we can write $x = x_1 x_2 \cdots x_n$, where $|x_i| = |x|/n$ for each $1 \leq i \leq n$, and $z_j \in \{x_1, x_2, \ldots, x_n\}^k$, for all $1 \leq j \leq n$. The minimality assumption on $x$ yields that each factor $x_i$ is a letter; otherwise we have a shorter equation with the words $x, z_1, z_2, \ldots, z_n$ over the alphabet $\{x_1, x_2, \ldots, x_n\}$ that has an equivalent solution to the original equation.
Let then $a \in A$ be a letter that occurs $m$ times in $x$ with $1 \leq m \leq n$, and thus $km$ times in $x^k$. Since $|z_1|_a = |z_l|_a$ whenever $1 \leq l \leq n$, we have that $n$ divides $km$, and hence, $n$ divides $m$, i.e., $n = m$. But now $x = a^n$, which is a contradiction.

Let us remark that the solutions of an equation of the form $x_1 x_2 \cdots x_n = z_1 z_2 \cdots z_m$ where $x_i$ is a conjugate of a primitive word $x$, with $1 \leq i \leq n$, and $z_j$ is a conjugate of a primitive word $z$, with $1 \leq j \leq m$, are not necessarily periodic, as the following example shows.

Example 2.28. Let $n = 2$ and $m = 3$ and
$x_1 = aabbab$, $x_2 = babaab$, $z_1 = aabb$, $z_2 = abba$, $z_3 = baab$.
Then $x_1 x_2 = aabbabbabaab = z_1 z_2 z_3$.
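The claims of this example are quickly confirmed by machine. The short Python sketch below, with helper names chosen here for illustration, checks the equality of the two products, the conjugacy of the factors on each side, and that the common word is primitive, so that the solution is indeed not periodic.

```python
def is_conjugate(u, v):
    """u and v are conjugates iff they have equal length and u occurs in vv."""
    return len(u) == len(v) and u in v + v

def primitive_root(w):
    """Shortest f with w = f^k for some k >= 1 (w nonempty)."""
    n = len(w)
    for d in range(1, n + 1):
        if n % d == 0 and w[:d] * (n // d) == w:
            return w[:d]

x1, x2 = "aabbab", "babaab"
z1, z2, z3 = "aabb", "abba", "baab"

lhs, rhs = x1 + x2, z1 + z2 + z3
print(lhs == rhs)                                   # True
print(is_conjugate(x1, x2))                         # True
print(is_conjugate(z1, z2), is_conjugate(z1, z3))   # True True
print(primitive_root(lhs) == lhs)                   # True: the product is primitive
```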


Chapter 3

Critical Factorizations

A critical factorization denotes a point $p$ of a word $w$ where the shortest repetition word at point $p$ is as long as the period of $w$, that is, $p(w, p) = p(w)$. The critical factorization theorem, by Césari and Vincent (1978) and Duval (1979), states that any word $w$ with at least two different letters and period $d$ has a critical point $p$; moreover, $p < d$. Actually, we have at least one critical point in every $d - 1$ consecutive points in $w$. Consider the following example:
$w = ab.aa.b$
where $w$ has two critical points, 2 and 4, which are marked by dots. The period $d$ of $w$ equals 3 and $w$ is of index less than 2, since $2d > |w|$. The shortest repetition words at the critical points 2 and 4 are $aab$ and $baa$, respectively. Note that the shortest repetition words at the points 1 and 3 are $ba$ and $a$, respectively. The ratio of the number of critical points and the number of all points is called the density of critical points. The density of $w$ in our example is 1/2.

We will first introduce the critical factorization theorem in Sect. 3.1 of this chapter and show some properties of critical factorizations that will be used later in this thesis. The critical factorization theorem constitutes a deep result in the field of combinatorics on words by stating that critical points occur not too far apart from each other in virtually every interesting word. However, nothing more is said about the actual density of critical points in certain words. Our main concern in this chapter, therefore, is the density of critical points; see Sect. 3.2. More precisely, we investigate the ratio between the number of critical points and the number of all points of a word, in infinite sequences of words of index less than two, that is, words whose period is longer than half of their length. On one hand, we consider words with the lowest possible number of critical points, namely one, and show, as an example, that every Fibonacci word, which

can be defined by the use of palindromes, of length longer than 5 has exactly one critical factorization, in contrast to the fact that palindromes themselves have at least two critical points; see Sect. 3.2. This result also immediately implies the well-known fact that the Fibonacci word is not ultimately periodic, which is proven differently in the literature. On the other hand, sequences of words with a high density of critical points are considered. We show how to construct an infinite sequence of words over a 4-letter alphabet where every point in every word is critical; see Sect. 3.2. We construct an infinite sequence of words over a 3-letter alphabet with densities of critical points approaching 1, using square-free words, and an infinite sequence of words over a 2-letter alphabet with densities of critical points approaching 1/2, using Thue-Morse words. It is shown that these bounds are optimal.

The critical factorization theorem enjoys a number of applications. One of them will be presented in Sect. 3.3 by giving a shorter proof of a result by Costa (2003). Other applications will occur in later chapters. This chapter is based on (Harju and Nowotka, 2002a) and (Harju, Lepistö, and Nowotka, 2003).

3.1 The Critical Factorization Theorem

The critical factorization theorem (CFT) is one of the main results about periodicity of words. A weak version of it was first conjectured by Schützenberger (1979) and proved by Césari and Vincent (1978). It was developed into its current form by Duval (1979). A short proof of the CFT is given here.

A Proof of the CFT

Observe that every point of a word $w$ is critical if $w \in a^*$ for some $a \in A$. This case is trivial. So, let us assume that at least two different letters occur in $w$ in the following. For the sake of clarity we furthermore assume that $A$ is such that all letters of $A$ occur in $w$ since all other letters play no rôle when reasoning about $w$.

Theorem 3.1 (CFT).
Every word $w$, with $|w| \geq 2$, has at least one critical factorization $w = uv$, with $u, v \neq \varepsilon$ and $|u| < p(w)$, i.e., $p(w, |u|) = p(w)$.

This theorem is a direct consequence of the following proposition, which describes one critical point in any word and will be technically more useful in the following. The proof of Proposition 3.2 is a technically improved version of the proof of the CFT by Crochemore and Perrin (1991). Note that $\tau_{\lhd}(w) \neq \tau_{\rhd}(w)$ since they start with different letters.
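The statement of Theorem 3.1 can be checked by brute force on short words before turning to the proposition. The Python sketch below is an illustration only; the function names are chosen here, points are numbered as in the text (point $p$ lies between the $p$-th and the $(p+1)$-th letter), and the local period is computed directly from the definition of repetition words.

```python
from itertools import product

def period(w):
    """Global period p(w) of a nonempty word w."""
    n = len(w)
    return next(p for p in range(1, n + 1)
                if all(w[i] == w[i + p] for i in range(n - p)))

def local_period(w, p):
    """Local period p(w, p) at point p, for 1 <= p < len(w).

    q is the length of a repetition word at p iff w[i] == w[i+q] for all
    positions i directly left of the point for which i + q is still inside w.
    """
    n = len(w)
    for q in range(1, n + 1):
        if all(w[i] == w[i + q] for i in range(max(0, p - q), min(p, n - q))):
            return q

def critical_points(w):
    d = period(w)
    return [p for p in range(1, len(w)) if local_period(w, p) == d]

def cft_holds(max_len, alphabet="ab"):
    """Check Theorem 3.1 for all words with at least two different letters."""
    for n in range(2, max_len + 1):
        for t in product(alphabet, repeat=n):
            w = "".join(t)
            if len(set(w)) >= 2 and not any(p < period(w)
                                            for p in critical_points(w)):
                return False
    return True

print([local_period("abaab", p) for p in range(1, 5)])   # [2, 3, 1, 3]
print(critical_points("abaab"))                          # [2, 4]
print(cft_holds(10))                                     # True
```

For $w = abaab$ this reproduces Example 2.8: the critical points are 2 and 4, so the density of critical points is 1/2.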

Proposition 3.2. Let $w$ be a word over $A$ of length $n \geq 2$, let $\lhd$ be a lexicographic order on $A$, and let $\beta$ be the shorter of the two suffixes $\tau_{\lhd}(w)$ and $\tau_{\rhd}(w)$. Then $|w| - |\beta|$ is a critical point.

Proof. Assume $\beta = \tau_{\rhd}(w)$ by symmetry. Let $\alpha = \tau_{\lhd}(w)$, so, $\alpha = u\beta$. Let $z$ be an unbordered repetition word at $|w| - |\beta|$. We show that $|z|$ is a period of $w$, which will prove the claim.

[Figure: the word $w$ with its suffixes $\alpha = \tau_{\lhd} = u\beta$ and $\beta = \tau_{\rhd}$; the critical position lies directly before $\beta$.]

If $w$ is a factor of $z^2$, then obviously $|z|$ is a period of $w$. If $w = w_1 \beta w_2$ for some $w_2 \neq \varepsilon$, then $\beta \lhd^{-1} \beta w_2$ contradicts the choice of $\beta$. If $w = yz\beta$, then, by the above, $z \leq_p \beta$, say $\beta = z\beta'$, but then $z^2\beta' = z\beta \lhd^{-1} \beta = z\beta'$ implies that $\beta = z\beta' \lhd^{-1} \beta'$; a contradiction. Consequently, $\beta = zw'$ and $w = z_1 z w'$ for a suffix $z_1$ of the unbordered word $z$.

[Figure: the word $w = z_1 z w'$ with its suffixes $\alpha = u\beta$ and $\beta = zw'$.]

Therefore $u$ is a suffix of $z$, and hence, $uw'$ is a suffix of $\alpha$. Consequently, $uw' \lhd \alpha = u\beta$, and so $w' \lhd \beta$, which together with $w' \lhd^{-1} \beta$ implies that $w' \leq_p \beta$. Therefore $\beta = zw' = w'z'$, and thus $\beta = z^k z_2$ for some $z_2 \leq_p z$, which shows that $|z|$ is a period of $w$.

The CFT follows since $|w| - |\tau_{\lhd}(w)| < p(w)$ for any order $\lhd$. Note that Proposition 3.2 could equally well be proven using maximum prefixes instead of maximum suffixes by the symmetry of the two concepts. We also remark that Duval, Mignosi, and Restivo (2001) provide yet another proof of the CFT.

Some Properties of Critical Factorizations

In this subsection we give some interesting properties of critical factorizations which will be of particular interest for us in Sect. 3.2 and 5.3. The next theorem justifies why we consider only words of index less than two in the next section, where we investigate the density of critical factorizations of a word. This theorem is due to Duval (1979).
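Proposition 3.2 is constructive: it locates a critical point by comparing two maximal suffixes. The Python sketch below illustrates this with a quadratic-time computation of maximal suffixes (not the linear-time method used in string matching); the function names are chosen here, and the order on the letters occurring in $w$ is assumed to be the alphabetical one.

```python
from itertools import product

def period(w):
    n = len(w)
    return next(p for p in range(1, n + 1)
                if all(w[i] == w[i + p] for i in range(n - p)))

def local_period(w, p):
    """Local period at point p (1 <= p < len(w)), by definition."""
    n = len(w)
    for q in range(1, n + 1):
        if all(w[i] == w[i + q] for i in range(max(0, p - q), min(p, n - q))):
            return q

def maximal_suffix(w, order):
    """Maximal suffix of w; `order` lists the letters from smallest to largest."""
    rank = {c: r for r, c in enumerate(order)}
    return max((w[i:] for i in range(len(w))),
               key=lambda s: [rank[c] for c in s])

def critical_point_by_suffixes(w):
    """The point |w| - |beta| of Proposition 3.2 (w has >= 2 different letters)."""
    letters = sorted(set(w))
    tau = maximal_suffix(w, letters)            # w.r.t. the order
    tau_inv = maximal_suffix(w, letters[::-1])  # w.r.t. the inverse order
    beta = min(tau, tau_inv, key=len)
    return len(w) - len(beta)

def proposition_holds(max_len, alphabet="abc"):
    """Brute-force check on all short words with at least two letters."""
    for n in range(2, max_len + 1):
        for t in product(alphabet, repeat=n):
            w = "".join(t)
            if len(set(w)) < 2:
                continue
            p = critical_point_by_suffixes(w)
            if not (1 <= p < period(w) and local_period(w, p) == period(w)):
                return False
    return True

print(maximal_suffix("abaab", "ab"), maximal_suffix("abaab", "ba"))  # baab aab
print(critical_point_by_suffixes("abaab"))                           # 2
print(proposition_holds(7))                                          # True
```

For $w = abaab$ the two maximal suffixes are $baab$ and $aab$, so $\beta = aab$ and the resulting point 2 is one of the two critical points of $w$.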

Theorem 3.3. Each set of $p(w) - 1$ consecutive points in $w$, where $|w| \geq 2$, has a critical point.

Proof. If $w = u^i u_1$, where $u_1 \leq_p u$ and $p(w) = |u|$, then the maximal suffixes w.r.t. any orders of $A$ are longer than $|u^{i-1} u_1|$. Hence $w$ has a critical point at point $p$, where $p < p(w)$. Let $p$ be any critical point of $w = uv$, where $|u| = p$, and let $z$ be the smallest repetition word at point $p$. So, $|z| = p(w)$. We need to show that if $|v| \geq p(w)$, then there is a critical point at $p + k$ for $1 \leq k < p(w)$. We have $z \leq_p v$ and $p(v) = p(w)$. For, if $p(v) < p(w)$, then $z$ is bordered; a contradiction. Now, $v$ has a critical point $k$ such that we have $k < p(v) = p(w)$. Clearly, this point $p + k$ is critical also for $w$ since the smallest repetition word at point $p + k$ is a conjugate of $z$. Now, $(p + k) - p = k < p(w)$.

Perhaps an even stronger motivation for considering only words of index less than two is the observation that in $w^k$, with $k \geq 3$, the critical points of the first factor $w$ are inherited by the next $k - 2$ factors $w$. That is, if $w^k = w_1 w_2 w^{k-1}$, where $|w_1|$ is a critical point, then $|ww_1|$ is also a critical point of $w^k$.

The following lemmas state some further properties of critical factorizations. These results will mainly be used later in this thesis.

Lemma 3.4. Let $w = uv$ be unbordered and $|u|$ be a critical point of $w$. Then $u$ and $v$ do not intersect.

Proof. Note that $p(w, |u|) = p(w) = |w|$ since $w$ is unbordered. Let $|u| \leq |v|$ without restriction of generality. Assume that $u$ and $v$ intersect. If $u = u's$ and $v = sv'$, then $p(w, |u|) \leq |s| < |w|$. On the other hand, if $u = su'$ and $v = v's$, then $w$ is bordered with $s$. Finally, if $v = sut$ then $p(w, |u|) \leq |su| < |w|$.

The next two results follow directly from Lemma 3.4, where Lemma 3.5 was also shown by Breslauer, Jiang, and Jiang (1997).

Lemma 3.5. Let $w = uv$ be unbordered and $|u|$ be a critical point. Then $vu$ is unbordered.

Indeed, $u$ and $v$ intersect if $vu$ is bordered.

Lemma 3.6. Let $u_0 u_1$ be unbordered and $|u_0|$ be a critical point of $u_0 u_1$.
Then for any word $x$, the word $u_i x u_{i+1}$, where the indices are taken modulo 2, is either unbordered or has a minimum border $g$ such that $|g| \geq |u_0| + |u_1|$.

Indeed, if $u_i x u_{i+1}$ has a border $g$ such that $|g| < |u_0| + |u_1|$, then $u_0$ and $u_1$ intersect, contradicting Lemma 3.4.
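Lemma 3.5 lends itself to an exhaustive check on short words: for every unbordered word $w = uv$ and every critical point $|u|$, the conjugate $vu$ should again be unbordered. The Python sketch below is such a check, offered as an illustration with function names chosen here; it uses that $p(w) = |w|$ for unbordered $w$.

```python
from itertools import product

def is_unbordered(w):
    return not any(w[:k] == w[-k:] for k in range(1, len(w)))

def local_period(w, p):
    """Local period at point p (1 <= p < len(w)), by definition."""
    n = len(w)
    for q in range(1, n + 1):
        if all(w[i] == w[i + q] for i in range(max(0, p - q), min(p, n - q))):
            return q

def check_lemma_3_5(max_len, alphabet="ab"):
    """True if no counterexample to Lemma 3.5 exists up to length max_len."""
    for n in range(2, max_len + 1):
        for t in product(alphabet, repeat=n):
            w = "".join(t)
            if not is_unbordered(w):
                continue
            for p in range(1, n):
                # p is critical iff the local period equals p(w) = |w|
                if local_period(w, p) == n and not is_unbordered(w[p:] + w[:p]):
                    return False
    return True

print(check_lemma_3_5(8))           # True
print(check_lemma_3_5(6, "abc"))    # True
```

For instance, $w = aabab$ is unbordered with the single critical point 2, and the corresponding conjugate $babaa$ is unbordered as well.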

47 3.2 Counting Critical Factorizations Counting Critical Factorizations So far, we have presented results about the existence of critical points in words and factors of words of a certain length. In this section we turn to the question of how many, or how few, critical points a word can have. In particular, we consider the two extremes of words with only one critical point, see Subsect , and words with as many as possible critical points, see Subsect As mentioned before, we will restrict ourselves to words of index less than two in this section Words with Few Critical Factorizations Every word longer than one has at least one critical factorization. We investigate words with only one critical factorization in this section. Trivially, words of length two have no more than one critical point. We do not consider such cases but arbitrarily long words. However, the following lemma limits our investigation to words in two letters. Lemma 3.7. A word w with only one critical factorization is binary, that is, it is over a two-letter alphabet. Proof. Assume a word w contains the different letters a, b, and c and has exactly one critical factorization. Let be an order on A such that a b c. By symmetry, we can assume that τ < τ. Then p = w τ is a critical point of w by Proposition 3.2. Let a c b. Now, either w τ or w τ is a critical point p of w, again by Propposition 3.2. But, p p since τ begins with c and τ and τ begin with a and b, respectively. So, w has at least two critical points; a contradiction. By Lemma 3.7, we will only consider words in a and b in the rest of this section. Let us also fix the order on A such that a b. Note that τ τ and π π for any word since τ and τ start and π and π end with different letters. Proposition 3.2 straightforwardly leads to the following facts. Lemma 3.8. If a word w has exactly one critical point, then either or w = π τ and π p π and τ s τ w = π τ and π p π and τ s τ. 
Note the symmetry of maximum pre- and suffix and consider the following figure for the case where w = π τ.

48 32 Critical Factorizations τ τ w : π π Recall also that w = π τ and w = π τ is impossible (page 18). The converse of Lemma 3.8 does not hold in general. Consider for example w = aa.bb.abab which has two critical points, but we do have π = aa p aabb = π and τ = bbabab s aabbabab = τ and w = π τ. This lemma also implies that palindromes have more than one critical factorization. Proposition 3.9. Every palindrome of length at least 3 has at least two critical factorizations. Proof. Let w be a palindrome. Assume w has exactly one critical point. By symmetry, we can also assume that τ s τ. By the definition of maximal prefix and suffix and since w is a palindrome we have τ (w) = π ( w) = π (w) and τ (w) = π ( w) = π (w) where π (w) and π (w) denote the reversal of π (w) and π (w), respectively. Now, π s π, and hence, π p π, which contradicts Lemma 3.8 since π π in any case. w : π = τ π critical critical τ τ = π Another proof of Proposition 3.9 was suggested by Julien Cassaigne. That proof does not use Lemma 3.8. Proof. Let w be a palindrome. Let p be a critical point of w. Then w p is also a critical point of w by symmetry.

Assume that $w$ has exactly one critical point. Then $|w| = 2p$, and we have $w_{(p)} = w_{(p+1)}$ since $w$ is a palindrome. But now $p(w) = p(w, p) = 1$, and $w$ has exactly one critical point only if $|w| = 2$; a contradiction.

But which words do have exactly one critical point? Let us now consider the critical points of Fibonacci words. We introduced the Fibonacci word and morphism already in Example 2.20 on page 21. The set of Fibonacci words is simply defined by $\{ \varphi^k(w) \mid k \ge 0 \}$. However, we will give an alternative definition of Fibonacci words here which will be more convenient for us later. This definition bears a more apparent resemblance to Fibonacci numbers.

We define Fibonacci numbers by $f_0 = 1$, $f_1 = 1$, $f_{k+2} = f_{k+1} + f_k$. Note that Fibonacci numbers are commonly defined by $f'_0 = 0$ and $f'_1 = 1$ and $f'_{k+2} = f'_{k+1} + f'_k$, which is identical to our definition in the sense that $f_n = f'_{n+1}$, for all $n \ge 0$, and $f'_0 = 0$. We use this non-standard definition since it better fits our needs in the following.

Fibonacci words are defined by $F_1 = a$, $F_2 = ab$, $F_{k+2} = F_{k+1} F_k$. Obviously, $|F_i| = f_i$. Observe that $F_i <_p F_n$, for all $1 \le i < n$.

[Figure: nested factorizations of $F_n$ into the shorter Fibonacci words $F_{n-1}, F_{n-2}, \ldots, F_{n-6}$.]

Note the following facts about Fibonacci words $F_n$ with $n \ge 7$ indicated in the figure above.
1. $F_n = F_{n-2} F_{n-3} F_{n-2}$,
2. $F_{n-2} F_{n-2} \le_p F_n$,
3. $F_{n-3} F_{n-4} F_{n-4} F_{n-4} \le_p F_n$,
4. $F_{n-4} F_{n-5} F_{n-4} F_{n-5} F_{n-4} \le_s F_n$.
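The recursive definition and the first three facts are easy to confirm mechanically. The sketch below is illustrative only (the helper names and the bound `N` are assumptions, not part of the text); it builds $F_1, \ldots, F_N$ and checks the length identity, the prefix property, and Facts 1 to 3 for $7 \le n \le N$.

```python
def fibonacci_words(count):
    """Return [F_1, ..., F_count] with F_1 = a, F_2 = ab, F_{k+2} = F_{k+1} F_k."""
    words = ["a", "ab"]
    while len(words) < count:
        words.append(words[-1] + words[-2])
    return words[:count]

def fibonacci_numbers(count):
    """Return [f_0, ..., f_{count-1}] with f_0 = f_1 = 1 (the convention of the text)."""
    nums = [1, 1]
    while len(nums) < count:
        nums.append(nums[-1] + nums[-2])
    return nums[:count]

N = 15                                   # hypothetical bound for the check
F = [None] + fibonacci_words(N)          # F[k] is F_k; index 0 is unused
f = fibonacci_numbers(N + 1)             # f[k] is f_k

# |F_i| = f_i
lengths_ok = all(len(F[k]) == f[k] for k in range(1, N + 1))
# F_i is a proper prefix of F_n for 1 <= i < n
prefixes_ok = all(F[n].startswith(F[i]) and len(F[i]) < len(F[n])
                  for n in range(2, N + 1) for i in range(1, n))
# Facts 1-3 for n >= 7
fact1 = all(F[n] == F[n - 2] + F[n - 3] + F[n - 2] for n in range(7, N + 1))
fact2 = all(F[n].startswith(F[n - 2] * 2) for n in range(7, N + 1))
fact3 = all(F[n].startswith(F[n - 3] + F[n - 4] * 3) for n in range(7, N + 1))

print(lengths_ok, prefixes_ok, fact1, fact2, fact3)
```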

The period of a Fibonacci word $F_n$ is $f_{n-1}$. The first proof of this fact was published by Cummings, Moore, and Karhumäki (1996) to the best of my knowledge. A shorter proof is given next.

Lemma 3.10. We have $p(F_n) = f_{n-1}$ for all $n \ge 2$.

Proof. Clearly, $p(F_n) \le f_{n-1}$ follows from $F_n = F_{n-2} F_{n-3} F_{n-2}$ (Fact 1 above). We proceed by induction on $n$. The claim holds for $n \in \{2, 3\}$ by inspection. If $p(F_n) = f_{n-2}$, then $F_{n-3} F_{n-2} = F_{n-2} F_{n-3} = F_{n-1}$ and $F_{n-1}$ is not primitive. Hence, $p(F_{n-1}) \le f_{n-1}/2 < f_{n-2}$, contradicting $p(F_{n-1}) = f_{n-2}$. If $p(F_n) \ne f_{n-2}$ and $p(F_n) < f_{n-1}$, then $F_n = F_{n-2} F_{n-3} F_{n-2} = u F_{n-2} v$ with $|v| = p(F_n)$, and $F_{n-2}$ overlaps itself such that $p(F_{n-2}) < f_{n-3}$, contradicting $p(F_{n-2}) = f_{n-3}$.

We need one more observation in order to estimate the number of critical points in Fibonacci words.

Observation. Let $F_n$ with $n > 4$ be a Fibonacci word. Every critical point of $F_n$ is smaller than $f_{n-1}$.

Indeed, the cases with $n \in \{5, 6\}$ are easily checked. Fact 4 above shows that $p(F_n, p) \le f_{n-3}$ for all $n \ge 7$ and $f_{n-1} \le p < f_n$. However, Lemma 3.10 establishes $p(F_n, p) = f_{n-1}$ if $p$ is critical.

Remark 3.12. Fibonacci words have a close connection to palindromes, as the following properties show. Firstly, $F_n = \alpha_n d_n$, where $n \ge 3$ and $\alpha_n$ is a palindrome and $d_n = ab$ if $n$ is even and $d_n = ba$ if $n$ is odd. This result has been credited to Berstel by de Luca (1981). Secondly, $F_n = \beta_n \gamma_n$, where $n \ge 4$ and $\beta_n$ and $\gamma_n$ are palindromes of length $f_{n-1} - 2$ and $f_{n-2} + 2$, respectively, by de Luca (1981). Moreover, de Luca shows that these two properties define the set of Fibonacci words.

Given Remark 3.12 and Proposition 3.9 (every palindrome of length at least 3 has at least two critical factorizations), Theorem 3.14 below is rather surprising.

Example. We have $F_2 = a.b$, $F_3 = a.b.a$, $F_4 = ab.aa.b$. By the following theorem, however, every Fibonacci word $F_n$ with $n > 4$ has exactly one critical point, and that critical point is at point $f_{n-1} - 1$.

Theorem 3.14. A Fibonacci word $F_n$, with $n > 4$, has exactly one critical point $p$. Moreover, $p = f_{n-1} - 1$.
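The statement of the theorem can be observed directly on small Fibonacci words. The sketch below is not from the original text; it assumes the definitions used so far (points numbered from 1, local period as the length of the smallest repetition word, $f_0 = f_1 = 1$), and the function names are illustrative.

```python
def period(w):
    """Smallest k >= 1 with w[i] == w[i + k] wherever both letters exist."""
    n = len(w)
    return next(k for k in range(1, n + 1)
                if all(w[i] == w[i + k] for i in range(n - k)))

def local_period(w, p):
    """Length of the smallest repetition word at point p, 1 <= p < len(w)."""
    n = len(w)
    return next(k for k in range(1, n + 1)
                if all(w[i] == w[i + k]
                       for i in range(max(0, p - k), min(p, n - k))))

def critical_points(w):
    """Points whose local period equals the global period."""
    per = period(w)
    return [p for p in range(1, len(w)) if local_period(w, p) == per]

def fibonacci_word(n):
    """F_1 = a, F_2 = ab, F_{k+2} = F_{k+1} F_k."""
    prev, cur = "a", "ab"
    if n == 1:
        return prev
    for _ in range(n - 2):
        prev, cur = cur, cur + prev
    return cur

def fibonacci_number(n):
    """f_0 = f_1 = 1, f_{k+2} = f_{k+1} + f_k (the convention of the text)."""
    a, b = 1, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# List the critical points of F_2, ..., F_10 next to f_{n-1} - 1.
for n in range(2, 11):
    print(n, critical_points(fibonacci_word(n)), fibonacci_number(n - 1) - 1)
```

For $n \le 4$ the lists contain one or two points, in line with the example above; from $n = 5$ on the single entry coincides with $f_{n-1} - 1$.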

Proof. Let $n \ge 7$, and let $p$ be a critical point of $F_n$. Then $p > f_{n-2}$, because otherwise $F_{n-2} F_{n-2} \le_p F_{n-2} F_{n-3} F_{n-2} = F_n$ implies that $p(F_n, p) \le f_{n-2}$, contradicting Lemma 3.10.

Consider the factorization $F_n = F_{n-2} F_{n-3} F_{n-2} = F_{n-2} F_{n-4} F_{n-5} F_{n-2}$. Then $p > f_{n-2} + f_{n-4}$, since otherwise
$$F_{n-3} F_{n-4} F_{n-4} F_{n-4} = F_{n-2} F_{n-4} F_{n-4} \le_p F_{n-2} F_{n-4} F_{n-5} F_{n-2} = F_n$$
implies $|F_{n-3} F_{n-4}| < p \le |F_{n-3} F_{n-4} F_{n-4}|$, which gives $p(F_n, p) \le f_{n-4}$; a contradiction.

By induction we obtain
$$F_{n-2} F_{n-4} F_{n-6} \cdots F_{n-2i+4}\, F_{n-2i+1} F_{n-2i} F_{n-2i} F_{n-2i} = F_{n-2} F_{n-4} F_{n-6} \cdots F_{n-2i+4} F_{n-2i+2}\, F_{n-2i} F_{n-2i} \le_p F_{n-2} F_{n-4} F_{n-6} \cdots F_{n-2i+4} F_{n-2i+2}\, F_{n-2i} F_{n-2i-1} F_{n-2} = F_n,$$
where $1 \le i \le \lceil n/2 \rceil - 2$, and
$$p > \sum_{i=1}^{\lceil n/2 \rceil - 2} f_{n-2i} = \begin{cases} f_{n-1} - 2 & \text{if } n \text{ is odd,} \\ f_{n-1} - 3 & \text{if } n \text{ is even.} \end{cases}$$
So, we have
$$F_n = F_{n-2} F_{n-4} \cdots F_3 F_2 F_{n-2} \quad \text{or} \quad F_n = F_{n-2} F_{n-4} \cdots F_4 F_3 F_{n-2},$$
where $p > f_{n-1} - 2$ and $p > f_{n-1} - 3$, respectively, and $f_{n-1} > p$ by the preceding Observation. So, $p = f_{n-1} - 1$ or $p = f_{n-1} - 2$, that is, a critical point has to exist in the suffix $F_2 F_{n-2} \le_s F_n$ or $F_3 F_{n-2} \le_s F_n$, where the former case gives the result. The latter case leaves the possibilities $a.b.aF_{n-2} \le_s F_n$. But since $b \le_s F_4$, we have $bab.aF_{n-2} \le_s F_n$ and only the marked point is critical, which proves the claim.

The following well-known fact follows immediately from Theorem 3.14.

Corollary. The Fibonacci word $F$ is not ultimately periodic.

The Fibonacci words are certainly not the only words with exactly one critical factorization.

Example. Let $w = a^i b a^j$, with $i \ne j$ and $i + j > 0$. Then $\eta(w) = 1$. If $i > j$, then $p(w) = |a^i b|$ and $i$ is the only critical point of $w$. Similarly for $i < j$, where $i + 1$ is the only critical point of $w$. See Lemma 3.18 for the case when $i = j$.
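The example on words $a^i b a^j$ can be replayed for small exponents. This is a sketch under the same conventions as before (points numbered from 1); the helper names and the exponent range are illustrative assumptions.

```python
def period(w):
    """Smallest k >= 1 with w[i] == w[i + k] wherever both letters exist."""
    n = len(w)
    return next(k for k in range(1, n + 1)
                if all(w[i] == w[i + k] for i in range(n - k)))

def local_period(w, p):
    """Length of the smallest repetition word at point p, 1 <= p < len(w)."""
    n = len(w)
    return next(k for k in range(1, n + 1)
                if all(w[i] == w[i + k]
                       for i in range(max(0, p - k), min(p, n - k))))

def critical_points(w):
    """Points whose local period equals the global period."""
    per = period(w)
    return [p for p in range(1, len(w)) if local_period(w, p) == per]

def one_b_word(i, j):
    """The word a^i b a^j."""
    return "a" * i + "b" + "a" * j

# For i != j the text predicts the single critical point i (if i > j)
# or i + 1 (if i < j); print the computed points next to the prediction.
for i in range(0, 5):
    for j in range(0, 5):
        if i != j:
            w = one_b_word(i, j)
            predicted = i if i > j else i + 1
            print(w, critical_points(w), predicted)
```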

Words with a High Density of Critical Factorizations

In this subsection we investigate sequences of words with a high density of critical points. The density $\delta(w)$ of a word $w$ is defined by
$$\delta(w) = \frac{\eta(w)}{|w| - 1}.$$
Notice that in the above $|w| - 1$ is the number of all points in $w$, and recall that $\eta(w)$ denotes the number of critical points in $w$. Throughout this section we require all words to be of index less than two; otherwise, for any given alphabet $\{a_1, a_2, \ldots, a_k\}$ and $n > 0$, we have
$$\delta\big((a_1 a_2 \cdots a_k)^n\big) = 1,$$
that is, every point is critical.

The following example shows that there exists a sequence of words of index less than two in the alphabet $A = \{a, b, c\}$ such that the limit of their densities is one.

Example. Consider the Thue morphism $\vartheta$ introduced in Example 2.23 on page 22. Note that $\vartheta^l(a)$ ends in $b$ or in $c$ if $l$ is even or odd, respectively. Let $T_{2k} = a\,\vartheta^{2k}(a)\,b$ and $T_{2k+1} = a\,\vartheta^{2k+1}(a)\,c$, for all $k > 0$. Then
$$\lim_{n \to \infty} \delta(T_n) = 1$$
because every word $T_n$ has a square prefix and suffix and $\vartheta^n(a)$ is square-free, so $\eta(T_n) = |T_n| - 3$ and $\delta(T_n) = 1 - 2/(|T_n| - 1)$.

Of course, any square-free word with suitable borders can be used in this example. It is also clear that with an alphabet with at least four letters, say $a$, $b$, $c$, and $d$, the sequence $(T'_n)_{n \ge 1}$ with $T'_n = d\,\vartheta^n(a)\,d$ consists of words with density one only.

Words in two letters, however, cannot be square-free if they are longer than three. So, the question arises: What is the highest density for words over $A = \{a, b\}$? The following lemma implies that $ab$, $ba$, $aba$, and $bab$ are the only words over $A$ which have density one.

Lemma 3.18. If $w$ is of index one and has two consecutive critical points, then either $w = a^i b a^i$ or $w = b^i a b^i$, with $i \ge 1$.

Proof. Assume that $w$ has two consecutive critical points, located on either side of a letter $b$. Let $i$ and $j$ be maximal integers such that $w = w_1 a^i.b.a^j w_2$. Clearly, $i, j \ge 1$. If $w_1 = \varepsilon = w_2$, then necessarily $i = j$; see the example on words $a^i b a^j$ above. By symmetry, we can assume that $w_1 \ne \varepsilon$, that is, $w = v b a^i b a^j w_2$. If $j \ge i$, then $w$ has the repetition $ba^i$ at the first critical point: $w = v (ba^i)(ba^i) a^{j-i} w_2$, where $p(w) = |ba^i|$. But then $w$ has index greater than one; a contradiction. Therefore, $j < i$, and in this case, $w_2 = \varepsilon$ in order to avoid a repetition inside $w$ at the second critical point. Also, $p(w) = |a^j b|$, which implies that $i = j$; again a contradiction.

By Lemma 3.18 we have
$$\limsup_{n \to \infty} \delta(w_n) \le 1/2$$
for any infinite sequence $w_1, w_2, w_3, \ldots$ of words in a binary alphabet $A$, and this bound is tight by the following example.

Example. Consider the Thue-Morse morphism $\psi$ introduced in Example 2.24 on page 23. For the sake of brevity, let $\psi_n$ denote $\psi^n(a)$ and $\bar\psi_n$ denote $\psi^n(b)$. Let $M_{2k+1} = a^2 \psi_{2k+1} b^2$ and $M_{2k} = a^2 \psi_{2k} a^2$, for all $k \ge 0$. We show in Theorem 3.20 that
$$\eta(M_n) = 2^{n-1} \quad \text{and} \quad \delta(M_n) = \frac{2^{n-1}}{2^n + 3},$$
and hence,
$$\lim_{n \to \infty} \delta(M_n) = \frac{1}{2}.$$
Note that $\bar\psi_n$ equals $\psi_n$ up to exchanging $a$ and $b$. Moreover, $\psi_n$ does not contain overlapping factors, that is, factors of the form $cucuc$ where $c \in A$. Note also that $|\psi_n| = 2^n$.

Theorem 3.20. Every odd point in $M_n$, with $n \ge 1$, except point 1 and $2^n + 3$, is critical.

We consider the following lemma before proving Theorem 3.20.

Lemma 3.21. The repetition words at every noncritical point in $M_n$, for all $n \ge 1$, are of length one or two.

Proof. In $M_n$, for any $n > 2$, the points $1$, $2$, $2^n + 2$, and $2^n + 3$ are noncritical, with repetition words of length one, and the points $3$ and $2^n + 1$ are critical. Clearly, the repetition word at every noncritical point in $M_1 = aaabbb$ and $M_2 = aaabbaaa$ is of length one. Assume that the repetition word at every noncritical point in $M_k$, with $k > 2$, is of length one or two. We proceed by induction and show that the repetition word at every noncritical point in $M_{k+1}$ is of length at most two. We have $M_{k+1} = a^2 \psi_k \bar\psi_k a^2$ and $M_{k+1} = a^2 \psi_k \bar\psi_k b^2$ for odd and even $k$, respectively. Note that, by the induction hypothesis, the repetition word at every noncritical point in $M_k$ is of length at most two. Clearly, the repetition words of length less than or equal to two at points $2$ to $2^k - 2$ in $\psi_k$ are not changed

54 38 Critical Factorizations by preceding and succeeding words (a 2 or b 2 ). The repetition word at point M k+1 /2 = 2 k + 2 in M k+1 is either b or ba since ab s ψ k or ba s ψ k and ba p ψk. It remains to show that the points 2 k + 1 and 2 k + 3 are critical in M k+1. Assume point 2 k + 1 or 2 k + 3 is not critical. If k is even, then the repetition word u at point 2 k + 1 is of the form abavb and u < 2 k + 2, otherwise point 2 k + 1 is critical because then a 3 occurs in u but not in ψ k ψk. The factor uu is followed by b in M k+1, otherwise abavbabavba is an overlapping factor of ψ k+1 ; a contradiction. But, now we have a s v, otherwise b 3 is a factor of ψ k+1, a contradiction, and ababa is a factor of ψ k+1 ; again a contradiction. If k is odd, then the repetition word u at point 2 k + 1 is of the form bbava and u < 2 k 1, otherwise point 2 k +1 is critical. Certainly, the factor uu must be preceded by a in M k+1, otherwise b 3 is a factor of ψ k+1 ; a contradiction. But, now abbavabbava is an overlapping factor of ψ k+1 ; again a contradiction. Point 2 k + 3 is shown to be critical by similar arguments. Proof of Theorem We show that there are no two consecutive noncritical points in M n except 1 and 2, and 2 n + 2 and 2 n + 3. By Lemma 3.21, the words a, b, ab, and ba are the only repetition words at noncritical points in M n. We need to consider only points from 4 to 2 n, since points 3 and 2 n + 1 are certainly critical. Assume a is the repetition word at some point p and point p 1 is noncritical. Now, a must be the repetition word at point p 1 and a 3 is a factor of ψ n ; a contradiction. The same argument holds if b is the repetition word at point p. Assume ab is the repetition word at some point p and point p 1 is noncritical. Now, ba must be the repetition word at point p 1 and babab is a factor of ψ n ; a contradiction. The same argument holds if ba is the repetition word at point p. 
The claim follows now from Lemma Remark Is there a sequence with a higher density of critical points than (M n ) n>0? Certainly, there is no sequence with a limit larger than 1/2 by Lemma Actually, there is no binary word larger than 5 with a density equal to 1/2 by the following Lemma A word M n is basically an overlapfree word with cubic prefix and suffix. In any case, Theorem 3.25 will show that any infinite sequence with a limit 1/2 of densities must include infinitely many words where the first and the last two points are noncritical. However, could we use other words than Thue Morse words to construct {M n }, with n > 0? If we choose a word with an overlapping factor, say w = w 1 au a uaw 2, then w has two consecutive points, marked by, that are not critical. Lemma 3.18 implies that w would not be a good choice. So, what about other overlap-free

55 3.2 Counting Critical Factorizations 39 words? Any infinite set of finite overlap-free binary words would certainly do for (M n ) n>0. However, ψ is the smallest morphism that takes an overlap-free word to a longer overlap-free word. So, (M n ) n>0, is optimal from that point of view. Lemma Every binary word w of index less than two and length greater than five satisfies δ(w) < 1 2. Proof. Assume w > 5 and δ(w) = 1/2. Then there are no two consecutive critical points in w by Lemma Certainly, p(w) > 3 since ind(w) < 2. The first and the last points of w are not critical. Otherwise, let a.b p w and point 1 is critical, then aba and abba are not prefixes of w since the repetition word in point 1 are then ba and bba, respectively; contradicting p(w) > 3. Also, abbbb is not a prefix of w since then δ(w) < 1/2 by Lemma Hence, abbba p w and p(w) = 4 since the smallest repetition word in point 1 is now bbba. So, w equals a.bbb.ab or a.bbb.abb and has just two critical points; a contradiction. The last point is a symmetric case. The claim follows now from Lemma Remark The largest binary words of index less than two and density one half are given by Lemma 3.18: aa.b.aa, bb.a.bb, and by the Fibonacci word F 4 and its reverse F 4 : ab.aa.b, ba.bb.a, b.aa.ba, a.bb.ab. Theorem Let (w n ) n>0, with w n > 5 for all n, be an infinite sequence of binary words such that lim sup δ(w n ) = 1 n 2. Then there is an infinite set I of natural numbers such that the first two points and the last two points of w i, for all i I, are noncritical. Proof. Let w k be such that δ(w k ) > 1/2 ɛ for some positive real number ɛ < 1/4. The first and the last point of w k are not critical by the proof of Lemma We have w k > 1/(2ɛ) since η(w k ) 2η(w k ) + 1 δ(w k) > 1 2 ɛ by the proof of Lemma 3.23 which implies η(w k ) > 1 4ɛ 1 2

56 40 Critical Factorizations and hence, w k 2(η(w k ) + 1) > 1 2ɛ using again the proof of Lemma Assume the second point of w k is critical. Let aa.b p au p w k where u = p(w k ) 1. The factor aa does not appear in u since the second point is critical. Actually, u = ab k 1 ab k2 ab kt, where k j 1 for all 1 j t, and since w k > 1/ɛ and ind(w) < 2, we have u > 1/(4ɛ). Since C = {ab i 1 i k t } is a code, we can consider u to be encoded in an alphabet X of the cardinality of C, let u be the encoded u. However, by the assumption that δ(w k ) > 1/2 ɛ, we must have a factor v in u, with v > 1/(4ɛ), where critical and noncritical points alternate, so, bbb is not a factor of v. Let v be the encoding in X of the smallest factor that contains v. Now, v must be a square-free word in at most two letters, namely the ones that encode ab and abb. But the longest square-free word in two letters is xyx, with x, y X, and hence, v < 9; a contradiction. Let ab.a p w = uv where u = p(w). Then u = aba p(w) 2 ; a contradiction. Similar arguments hold for the last but one point when u ends in aa or ba. 3.3 An Application of Critical Factorizations We will apply a corollary of the CFT, that is, Lemma 3.6, to give a shorter proof for the following result by Costa (2003). Theorem 3.26 (Costa). Let w be a bi-infinite word. Then there exist f, g A such that w = ω fgf ω if and only if there is a factorization w = suv, with u A such that every factor s uv, with s s s and v p v, is bordered. Let w = suv be a bi-infinite word, where u A, such that s uv is bordered for all s s s and v p v. Let ξ(x) denote the length of the shortest border of x, where we define ξ(x) = 0 if x is unbordered. Clearly, for every finite suffix t of s there exists a minimal m t such that ξ(tuv ) m t, for all v p v, since ξ(tuv ) tu, for all v p v, otherwise there is an unbordered prefix w of tuv such that w > tu contradicting our assumption on the shape of w. 
Moreover, there is a maximum integer m t such that m t = ξ(tuv ) for infinitely many v p v. Let χ(t) denote the prefix of length m t of tu. Note that χ(t) is unbordered. Lemma Let w = suv be a bi-infinite word, where u A, such that s uv is bordered for all s s s and v p v. There exists an integer k such that for every suffix t of s longer than k there is a critical point p in χ(t) with p t.

57 3.3 An Application of Critical Factorizations 41 Proof. The case is clear if there are only finitely many suffixes t of s such that p > t for all critical points p in χ(t). Assume there are infinitely many suffixes t of s such that p > t for all critical points p in χ(t). Surely, there is a prefix u of u such that there are infinitely many suffixes t of s with χ(t) = tu and p > t for all critical points p in χ(t). Then tu occurs in v infinitely often. Let v tu p v denote any such occurrence. Now, let t be a suffix of s such that χ(t ) = t u and t uv tu and p > t for all critical points p in χ(t ). Surely, the shortest border z of t uv tu is shorter than uv tu, and hence, shorter than t. However, z is longer than tu since t u is unbordered and tu s t u. So, u occurs in t, and hence, t u has no critical point larger than t ; a contradiction. Lemma Let w = suv be a bi-infinite word, where u A, such that s uv is bordered for all s s s and v p v. Then w = ω xyz ω. Proof. Firstly, we show that there exists a suffix t of s such that χ(t) has a critical factorization t 0 t 1, with t 0 t and t = t 0ˆt, and t 0 t 1 t 0 p tu and χ(ˆt) = χ(t). Consider the integer k from Lemma If χ(t ) is bounded for all t s s, then let t be a suffix of s such that t k and χ(t) is maximal. If χ(t ) is not bounded for all t s s, then let t be the shortest suffix of s such that χ(t) k and χ(t) has a critical point p with p t. So, tu = χ(t)u and there is a critical factorization χ(t) = t 0 t 1 with t 0 < t. We have that v t 0 t 1 p v for infinitely many prefixes v of v. Let t = t 0ˆt. Then ξ(t 1 u v t 0 ) t 0 t 1 for every prefix v t 0 t 1 of v by Lemma 3.6, and hence, χ(ˆt) χ(t). In fact, χ(ˆt) = χ(t), by the choice of t, and tu = t 0 t 1 t 0 û. Since the number of different shortest borders of tuv for all prefixes v of v is bounded by tu, there exists a prefix v 0 of v such that ξ(tuv 0 ) χ(t) for every v 0 p v with v 0 p v 0. Note that χ(t) = t 0t 1 occurs infinitely often in v. 
Let v 0 p v 0 such that v 0 t 0t 1 p v. Consider the shortest border z of t 1 t 0 ûv 0 t 0, where we have that z t 1 t 0 by Lemma 3.6. Since we have χ(ˆt) = χ(t), there are only finitely many prefixes of the kind of v 0 such that z > t 1 t 0. So, let v 1 p v such that v 0 p v 1 and the shortest border of t 1 t 0 ûv 1 t 0 is t 1 t 0 for every v 1 such that v 1 t 0t 1 p v and v 1 p v 1. Now, tuv 1 t 0t 1 = t 0 t 1 t 0 ûv 1 t 1t 0 t 1 for every occurrence of t 0 t 1 in v to the right of v 1. By Lemma 3.6, the shortest border of t 0 t 1 t 0 ûv 1 t 1 is t 0 t 1. Now, we have that every of the inifinitely many occurrences of t 0 t 1 in v to the right of v 1 is immediately preceded by t 0 t 1, and hence, v = v 1 t 1(t 0 t 1 ) ω. The claim w = ω xyz ω follows by symmetry. Lemma If w = ω xyz ω, where xyz A, such that x yz is bordered for all x s x and z p z, then w = ω fgf ω, where fg A.

58 42 Critical Factorizations Proof. Certainly, the word w can be factored into ω fgf ω such that every factor containing g is bordered and f and f are Lyndon words w.r.t. some order where a A is minimal in. Surely, we can assume that a occurs both in f and f and that a p f and a p f. Assume that f f by symmetry. It is easy to see that every factor of w containing fgf has to have a shortest border that is not longer than f. Assume that f f. If f f. Then the shortest border of fgf implies that a prefix f 0 of f is a suffix of f, and f 0 f implies that f is not minimal in ; a contradiction. If f f. Then f = f 0 cf 1 and f = f 0 bf 1 for some b, c A and b c and b c. It is clear that f 0 b does not occur in ff otherwise f is not minimal w.r.t.. Let f 0 b be the longest unbordered suffix of f 0b. Consider the factor f 0 cf 1fgf f 0 b of w with the shortest border s. If s f 0 then f 0 b is bordered; a contradiction. If s = f 0 b then b = c; a contradiction. If f 0 b < s f 0 b then f 0 b is not maximal; a contradiction. If f 0b < s then f 0 b occurs in ff; a contradiction. Therefore, f = f and w = ω fgf ω. Proof of Theorem ( ) Clearly, if w = ω fgf ω then there is a factorization w = suv, with u A such that every factor s uv, with s s s and v p v, is bordered. Take for example u = fgf. ( ) The claim follows from Lemma 3.28 and Comments In this chapter we introduced several properties of critical factorizations, in particular those about the density of critical points in words. We also provided a short proof of the critical factorization theorem and showed an application of it in giving a characterization of eventually periodic bi-infinite words. We would like to stress that the critical factorization theorem is a deep result that will also find its use in the solution of Duval s conjecture in Sect However, many interesting questions are still open with respect to a characterization of the density of critical points in words. 
For example, it would be very interesting to find further necessary and sufficient conditions for words with only one or two critical points. The investigation of the density of critical points in certain subsets of $A^*$, for instance in a given regular language, would also be of interest.
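As a small numerical companion to the density results of this chapter, the words $M_n$ built from the Thue-Morse morphism can be generated and their critical points listed. This is a sketch only: the function names are illustrative, points are numbered from 1, and the range of $n$ is an arbitrary choice.

```python
from fractions import Fraction

def period(w):
    """Smallest k >= 1 with w[i] == w[i + k] wherever both letters exist."""
    n = len(w)
    return next(k for k in range(1, n + 1)
                if all(w[i] == w[i + k] for i in range(n - k)))

def local_period(w, p):
    """Length of the smallest repetition word at point p, 1 <= p < len(w)."""
    n = len(w)
    return next(k for k in range(1, n + 1)
                if all(w[i] == w[i + k]
                       for i in range(max(0, p - k), min(p, n - k))))

def critical_points(w):
    """Points whose local period equals the global period."""
    per = period(w)
    return [p for p in range(1, len(w)) if local_period(w, p) == per]

def thue_morse(n):
    """psi^n(a) for the morphism psi(a) = ab, psi(b) = ba."""
    word = "a"
    for _ in range(n):
        word = "".join("ab" if c == "a" else "ba" for c in word)
    return word

def m_word(n):
    """M_n = a^2 psi^n(a) b^2 for odd n and a^2 psi^n(a) a^2 for even n."""
    return "aa" + thue_morse(n) + ("bb" if n % 2 else "aa")

def density(w):
    """Number of critical points divided by the number of points |w| - 1."""
    return Fraction(len(critical_points(w)), len(w) - 1)

for n in range(1, 6):
    w = m_word(n)
    print(n, critical_points(w), density(w))
```

The printed densities approach 1/2 from below as $n$ grows, which is the behaviour described for the sequence $(M_n)$.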

Chapter 4 Unbordered Conjugates

A word $w$ is called unbordered, or self-uncorrelated (Morita, van Wijngaarden, and Han Vinck, 1996), if the only border of $w$ is the word itself, that is, if $w = uv = v'u$ for a nonempty word $u$, then $u = w$ and, consequently, $v = v' = \varepsilon$. Unbordered words and factors of words find a variety of applications. For instance, they play a significant role in some proofs concerning combinatorial properties of words. The questions involving periodicity of finite and infinite words are naturally related to the border structure of words; see for example (Choffrut and Karhumäki, 1997; Chuan, 1998; Costa, 2003; Duval, 1982; Ehrenfeucht and Silberger, 1979; Lothaire, 2002). Another example is that the existence of borders in words appears in the study of coding properties of sets of words as well as in unavoidability studies of words; see for example (Berstel and Perrin, 1985; Morita, van Wijngaarden, and Han Vinck, 1996).

In this chapter we study the border structure of words with respect to conjugation. We shall consider solely binary words. For the rest of this chapter, we fix our alphabet to be $A = \{a, b\}$.

The border correlation function $\beta : A^* \to A^*$ is defined such that $\beta(w)$ specifies which conjugates of $w$ are unbordered: Let $w \in A^*$ be a word of length $n$. Then $\beta(w) = u = u_{(1)} u_{(2)} \cdots u_{(n)}$, where
$$u_{(i)} = \begin{cases} a & \text{if } \sigma^{i-1}(w) \text{ is unbordered,} \\ b & \text{if } \sigma^{i-1}(w) \text{ is bordered.} \end{cases}$$
For example, let $w = aabab$. Then
$$\sigma^0(w) = w = aabab, \quad \sigma^1(w) = ababa, \quad \sigma^2(w) = babaa, \quad \sigma^3(w) = abaab, \quad \sigma^4(w) = baaba,$$
and hence $\beta(w) = ababb$, since only $\sigma^0(w)$ and $\sigma^2(w)$ are unbordered.

It is rather easy to show (see Lemma 4.1) that the image $\beta(w)$ of a binary word $w$ cannot have two consecutive $a$'s except for some trivial words, that is, except in some trivial cases, not both $\sigma^i(w)$ and $\sigma^{i+1}(w)$ are unbordered for any $i$. In Sect. 4.1 we show that the bound given by this fact is optimal. Indeed, we prove that in every strongly cyclically overlap-free word every other conjugate, that is, either $\sigma^i(w)$ or $\sigma^{i+1}(w)$ for each $i$, is unbordered.

There is a close relationship between unbordered conjugates of a word and its critical points, when the latter are defined modulo cyclic shifts. This relation is elaborated on in Sect. 4.2.

In Sect. 4.3 we investigate the dynamic system given by the border correlation function $\beta$. We prove that, for each word $w$ of length $n$, the sequence $w, \beta(w), \beta^2(w), \ldots$ terminates either in the word $b^n$ or in the cycle of the conjugates of the word $ab^k ab^{k+1}$ for $k = (n - 3)/2$.

The border correlation function provides a similarity function among the strings. Related functions of similarity are the auto-correlation function of Guibas and Odlyzko (1981), and the border-array function of Moore, Smyth, and Miller (1999).

This chapter is based on (Harju and Nowotka, 2003c).

4.1 Optimal Words for Border Correlation

Let $w$ be a nonempty word of length $n$ in $A^*$ for $A = \{a, b\}$. If $w$ is not primitive, then it is immediate that all conjugates of $w$ are nonprimitive, and thus bordered. Therefore, $\beta(w) = b^n$ in this case. It is also clear that $\beta$ is invariant under renaming. That is, if $w'$ is obtained from $w$ by exchanging the letters $a$ and $b$, then $\beta(w') = \beta(w)$. Therefore $\beta$ is not injective, and thus not surjective. Indeed, there are at most $2^{n-1}$ words of length $n$ that are $\beta$-images. In fact, this number is much lower, as we will show later with Corollary 4.7.

The following lemma gives some useful properties of the images $\beta(w)$. By the second case of the lemma, $\beta(w)$ does not contain two adjacent letters $a$ unless $w$ is a conjugate of the special words $ab^{n-1}$ or $ba^{n-1}$. Note that $\beta(ab^{n-1}) = aab^{n-2} = \beta(ba^{n-1})$.

Lemma 4.1. Let $w \in A^*$ of length $n \ge 4$.
(i) If $w$ is primitive, then $|\beta(w)|_a \ge 2$.
(ii) For each $i = 0, 1, \ldots, n$, $\sigma^i(w)$ or $\sigma^{i+1}(w)$ is bordered, or $w \in [ab^{n-1}]$ or $w \in [ba^{n-1}]$.
(iii) The word $w$ can have at most $\lfloor |w|/2 \rfloor$ unbordered conjugates.

Proof. For (i), we notice, as mentioned in Sect. 2.3, that each primitive word $w$ of length larger than one has at least two conjugates which are Lyndon words (see the comment after Proposition 2.13 on page 17). Since Lyndon words are unbordered (see page 19), the claim follows.

For (ii), assume that $w$ is not a conjugate of $ab^{n-1}$ nor of $ba^{n-1}$, and hence, it has at least two occurrences of $a$ and of $b$. Let $w' = \sigma^i(w)$ be any unbordered conjugate of $w$. Without loss of generality, we assume that $w'$ begins with $a$, and, consequently, $w' = ab^k x a b^j$, where $j > k$ and the word $xa$ begins with $a$, since $w'$ is unbordered. We may have $x = \varepsilon$. Now, $\sigma(w') = b^k x a b^j a$ has a border $b^k a$, and hence, $\sigma^{i+1}(w)$ is bordered, as required.

The claim (iii) is clear from (ii).

In particular, if the length of $w$ is an odd number $\ge 5$, then $w$ has two adjacent conjugates that are both bordered.

Example 4.2. Consider $w = abbabaa$. Although the image $\beta(w) = bababab$ does not contain $b^2$ as a factor, it has a conjugate that does so. Indeed, the adjacent conjugates $\sigma^6(w) = aabbaba$ and $\sigma^7(w) = w$ are both bordered.

Lemma 4.1 (iii) states that a word of length 4 or more has at most $\lfloor |w|/2 \rfloor$ unbordered conjugates. The next example shows such words.

Example 4.3. There are words for which the maximum $\lfloor |w|/2 \rfloor$ is attained. Every second conjugate of $w$ is unbordered, for instance, in the following cases: $w = aabb$ and $w = abaabbaababb$. In these examples, $\beta(w) = (ab)^{|w|/2}$. However, there is no word of length 10 that has 5 unbordered conjugates; see Theorem 4.6. For a word of odd length, we also have, for example, for $w = aabbbab$ that $\beta(w) = ababbab$, and hence, $|\beta(w)|_a = 3 = \lfloor |w|/2 \rfloor$ in this case.

There is a close relationship between overlap-free binary words and the maximum number of unbordered conjugates. Theorems 4.5 and 4.6 clarify this relation. Before we prove these theorems, let us recall that the Thue-Morse morphism $\psi : A^* \to A^*$ is defined by $\psi(a) = ab$ and $\psi(b) = ba$; see also Example 2.24 on page 23. The following result is due to Thue (1912); see also (Harju, 1985).

Lemma 4.4. Let $w \in A^*$ be a cyclically overlap-free word.
(i) Then, $\psi(w)$ is cyclically overlap-free.
(ii) Then, $\psi^{-1}(w)$ is cyclically overlap-free if $w \in \{ab, ba\}^*$.
(iii) If $|w| \ge 7$, then either $w$ or $\sigma(w)$ has a factorization in terms of $ab$ and $ba$, that is, either $w \in \{ab, ba\}^*$ or $\sigma(w) \in \{ab, ba\}^*$.
(iv) For some $u \in \{a, b, aab, abb\}$ and $n \ge 0$, $w \in [\psi^n(u)]$. In particular, $|w| = 2^n$ or $|w| = 3 \cdot 2^n$ for some $n \ge 0$.

Note that cyclically overlap-free words longer than 3 are of even length.

Theorem 4.5 shows that cyclically overlap-free binary words have a maximum number of unbordered conjugates. In the theorem, "every other conjugate of $w$ is unbordered" means, by Lemma 4.1(iii), that $\beta(w)$ is $(ab)^{n/2}$ or $(ba)^{n/2}$ for some even $n$.

Theorem 4.5. Let $w \in A^*$ and $|w| \ge 7$. Every other conjugate of $w$ is unbordered if and only if $w$ is a strongly cyclically overlap-free word.

Proof. Note that $w$ is a strongly cyclically overlap-free word if $w$ is cyclically overlap-free and not a square. Let $w$ be a word of length $n$ that contains an overlapping factor, i.e., $w = ucxcxcv$, where $c \in A$ and $u, v, x \in A^*$. Let $i = |ucx|$. Then the conjugates $\sigma^i(w) = cxcvucx$ and $\sigma^{i+1}(w) = xcvucxc$ are both bordered, with borders $cx$ and $xc$, respectively.

In the other direction, suppose that both $\sigma^i(w)$ and $\sigma^{i+1}(w)$, for some $i$, are bordered, where we can assume that $i = 1$. We derive a contradiction which proves the claim. Let $u$ be the shortest border of $\sigma(w)$ and $v$ be the shortest border of $\sigma^2(w)$. Assume that $a \le_p w$. The case $b \le_p w$ is symmetric.

Case: $aa \le_p w$. Then $u = a$, and $\sigma(w) \in \{ab, ba\}^*$ by Lemma 4.4(iii). It follows that $aab \le_p w$, and hence $w = aabw_0b$ where $w_0 \in \{ab, ba\}^*$ and the $\psi$-factorization of $\sigma(w)$ is given by $\sigma(w) = (ab)w_0(ba)$. Now, $\sigma^2(w) = bw_0baa$. Note that $v \ne baa$ for the border $v$ of $\sigma^2(w)$, because $w_0 \in \{ab, ba\}^*$. Consequently, $v = bv'baa$ for some $v' \in A^*$. Since $\sigma^2(w) = vzv$ for some nonempty $z$, and $\sigma(w) \in \{ab, ba\}^*$, $w$ has a conjugate $vvby$ where $z = by$. This is a contradiction, since $v$ begins with $b$ and so $vvb$ is not overlap-free.

Case: $ab \le_p w$. We have now that $bb$ is a suffix of $w$, since $w$ is supposed to be unbordered.
Therefore $\sigma(w) \in \{ab, ba\}^*$, which implies that $u = ba$, and also $aba \le_p w$. We have $w = abaw_0abb$, since $\sigma(w) \in \{ab, ba\}^*$. Actually, we have $w = abaabw_1abb$, since $\psi^{-1}(\sigma(w))$ is cyclically overlap-free by Lemma 4.4(ii) and thus it is also in $\{ab, ba\}^*$. We have the following factorization $\sigma(w) = (ba)(ab)w_1(ab)(ba)$, where $w_1 \in \{ab, ba\}^*$. Now, the shortest border $v$ of $\sigma^2(w)$ is either $v = aabbab$ or $v = aabv'abbab$ for some word $v'$. In both cases, we have that $vvay$ occurs in a conjugate of $w$ since $w$ is not a square, that is, there is no word $x \in A^*$ such that $w = x^2$. This is a contradiction, since $v$ begins with $a$, and thus $vva$ is an overlapping factor. This completes the proof of the theorem.

The next theorem shows that words (of even length) with a maximum number of unbordered conjugates, that is, $|w|/2$ many unbordered conjugates, are strongly cyclically overlap-free with two exceptions.

Theorem 4.6. Let $n \ge 1$. Every word of length $2n$ that has $n$ unbordered conjugates is either strongly cyclically overlap-free or a conjugate of $abbb$ or $aaab$.

Proof. Note that $\beta(abbb) = aabb$ and $\beta(aaab) = abba$. The claim follows now from Lemma 4.1 and Theorem 4.5.

Theorems 4.5 and 4.6 show that every word with a maximum number of unbordered conjugates is strongly cyclically overlap-free, except for the conjugates of $abbb$ and $aaab$. By Lemma 4.4(iv), each such word has length either $2^n$ or $3 \cdot 2^n$ for some $n \ge 1$. Lemma 4.6 and Theorem 4.5 give an upper bound on the number of $\beta$-images. Let $A^n$ denote all words over $A$ of length $n$, and let $B_n$ denote the number of all $\beta$-images of length $n$. Recall that $f_n$ denotes the $n$-th Fibonacci number as defined in Subsect on page 33.

Corollary 4.7. Let $M = \{2i \mid i \ge 0\} \setminus \{2^j, 3 \cdot 2^j \mid j \ge 0\}$. Then for all $n \ge 3$

$$\beta(A^n) \subseteq [aab^{n-2}] \cup \{ w \mid |w|_a \ge 2,\ a^2 \text{ not in } ww \} \setminus \{ (ab)^k, (ba)^k \mid k \in M \} \tag{4.1}$$

and

$$B_n \le f_n + f_{n-2} - m \tag{4.2}$$

where $m = 2$, if $n \in M$, and $m = 0$ otherwise.

Proof. Clearly, (4.1) follows from Lemma 4.6 and Theorem 4.5. We show how (4.2) follows from (4.1). Let $A_n$ denote the set of words of length $n$ that have no factors $a^2$. Now, (i) each $w \in A_{n-1}$ yields an element $wb \in A_n$, and all elements of $A_n$ ending in $b$ can be so obtained; (ii) each $w \in A_{n-1}$ ending with $b$ yields $wa \in A_n$, and all elements of $A_n$ ending in $a$ can be so obtained. By case (i), the number of required words $w$ in case (ii) is equal to $|A_{n-2}|$. Therefore, $|A_n| = |A_{n-1}| + |A_{n-2}|$. Since $|A_1| = 2$, we have that $|A_n| = f_{n+1}$ for all $n \ge 1$. Moreover, for $n \ge 5$, the words $w \in A_n$ that begin and end in $a$ are of the form $w = abvba$, where $v \in A_{n-4}$. Hence the number of these words is $f_{n-3}$.
We conclude that there are $f_{n+1} - f_{n-3} = f_n + f_{n-2}$ words of length $n$, with $n \ge 5$, whose conjugates do not have the factor $a^2$.
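The count just derived can be checked by brute force for small lengths. The following is only a sketch: the helper names are illustrative, and the Fibonacci convention $f_0 = f_1 = 1$ is an assumption inferred from $|A_1| = 2 = f_2$ in the proof (the thesis defines $f_n$ in an earlier section that is not reproduced here).

```python
from itertools import product


def fib(n: int) -> int:
    """Fibonacci numbers under the assumed convention f_0 = f_1 = 1."""
    a, b = 1, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def count_cyclically_aa_free(n: int) -> int:
    """Count words w in {a,b}^n none of whose conjugates contains 'aa'.

    For n >= 2 this is the same as asking that 'aa' is not a factor of ww,
    since the length-2 factors of ww are exactly those of the conjugates.
    """
    count = 0
    for letters in product("ab", repeat=n):
        w = "".join(letters)
        if "aa" not in w + w:
            count += 1
    return count


if __name__ == "__main__":
    for n in range(5, 13):
        print(n, count_cyclically_aa_free(n), fib(n) + fib(n - 2))
```

Under this convention the two printed columns should agree for every $n$ in the tested range, matching the formula $f_n + f_{n-2}$.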

We do not consider the $n$ different words of length $n$ with exactly one $a$. Therefore, $\{ w \mid |w|_a \ge 2,\ a^2 \text{ not in } ww \}$ has $f_n + f_{n-2} - n$ elements. Clearly, $[aab^{n-2}]$ has $n$ elements. The claim then follows for $n \ge 5$ from Lemma 4.1. By inspection, we see that (4.2) holds for $n = 3$ and $4$, and thus the claim follows for all $n \ge 3$.

Remark 4.8. We have calculated $B_n$ for all $n \le 30$ using a computer; see Table 4.1.

Table 4.1: The number $m$ of $\beta$-images for lengths $1 \le n \le 30$.

It is remarkable that the bound (4.2) given in Corollary 4.7 is tight for all $n \le 30$ except if $n = 12$. That is, $B_n = f_n + f_{n-2} - m$ for all $3 \le n \le 30$ except if $n = 12$, where $m = 2$, if $n \in M$, and $m = 0$ otherwise. Actually, there exists no word $w$ such that $\beta(w) \in [abababbababb]$. We have that $B_{12} = f_{12} + f\ \ldots$

4.2 Unbordered Conjugates and Critical Points

In this section we investigate the relation between the border correlation function and critical factorizations. There is no direct relationship between critical points and unbordered conjugates in general, since, for instance, the number of critical points does not commute with cyclic shifts whereas the border correlation function does; see Remark 4.13 in the next section. Moreover, if $w = uv$ such that $vu$ is unbordered, then $|u|$ is not a critical point in general.

Example 4.9. Consider the conjugacy class of $w = ababa$,

$$[w] = \{ababa, babaa, abaab, baaba, aabab\},$$

with 4, 1, 2, 2, and 1 critical points, respectively. However, the word $w$ has exactly two unbordered conjugates $babaa$ and $aabab$.
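Unbordered conjugates such as those in Example 4.9 are easy to enumerate mechanically. The sketch below assumes the following reading of the border correlation function, whose definition lies outside this section: the $i$-th letter of $\beta(w)$ is $a$ if $\sigma^i(w)$ is unbordered and $b$ otherwise, with $\sigma$ the cyclic shift that moves the first letter to the end. This reading agrees with the values $\beta(abbb) = aabb$ and $\beta(aaab) = abba$ quoted in the proof of Theorem 4.6. All function names are illustrative.

```python
def is_bordered(w: str) -> bool:
    """True if some nonempty proper prefix of w is also a suffix of w."""
    return any(w[:k] == w[-k:] for k in range(1, len(w)))


def sigma(w: str, i: int = 1) -> str:
    """Cyclic shift of a nonempty word: move its first i letters to the end."""
    i %= len(w)
    return w[i:] + w[:i]


def beta(w: str) -> str:
    """Border correlation (assumed definition): 'a' marks unbordered conjugates."""
    return "".join(
        "b" if is_bordered(sigma(w, i)) else "a" for i in range(len(w))
    )


def unbordered_conjugates(w: str) -> list[str]:
    """Unbordered conjugates of w, listed in order of increasing shift."""
    shifts = (sigma(w, i) for i in range(len(w)))
    return [v for v in shifts if not is_bordered(v)]


if __name__ == "__main__":
    print(unbordered_conjugates("ababa"))  # the two conjugates of Example 4.9
    print(beta("abbabaab"))                # every other conjugate is unbordered
```

For $w = abbabaab$ the image alternates between $b$ and $a$, which is the "every other conjugate is unbordered" situation of Theorem 4.5.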

In general, it is not the case that there is a word $w'$ in the conjugacy class of some word $w$ such that the critical points of $w'$ mark the unbordered conjugates of $w$ like $babaa$ and $aabab$ in the above example. Marking an unbordered conjugate means here that if $p$ is a critical point of $w'$ then $\sigma^p(w')$ is unbordered.

Example 4.10. Consider the conjugacy class of $w = abbabaab$. We have exactly two critical points for every $w' \in [w]$ but four unbordered conjugates in $[w]$.

However, if critical points are considered modulo cyclic shifts, the situation changes. Let $w$ be a word of length $n$. We call an integer $p$, with $0 \le p < n$, an internal critical point of $w$ if $p + n$ is a critical point of $www$. The following lemma shows that internal critical points are invariant under cyclic shifts.

Lemma 4.11. Let $w$ be a word of length $n$. The point $p$ is internal critical of $w$ if and only if the point $q = p - i \pmod{n}$ is internal critical of $u = \sigma^i(w)$.

Proof. Clearly, $www$ contains all conjugates of $ww$. Moreover, it follows from $\sigma(ww) = \sigma(w)\sigma(w)$ that $uuu$ also contains all conjugates of $ww$. In fact, let $v \in [w]$ such that $v = \sigma^j(w)$, then $vv = \sigma^j(ww)$ and $www = xvvz$ where $|x| = j \pmod{n}$. In particular, $uuu = x'vvz'$, where $|x'| = j - i \pmod{n}$. Surely, the implication directions of the claim are symmetric to each other. Assume $p$ is an internal critical point of $w$. Let $v$ be the shortest repetition word at point $p + n$ in $www$. We have that $v$ is a conjugate of $w$, since $p + n$ is critical. So, $www = xvvz$ where $|x| = p$. Now, $uuu = x'vvz'$ where $|x'| = p - i \pmod{n}$, and hence, the point $q + n$ is critical, and this proves the claim.

Theorem 4.12. Let $w$ be a primitive word of length $n$, and let $0 \le p < n$. Then the following statements are equivalent: (i) $p$ is an internal critical point of $w$; (ii) the conjugate $\sigma^p(w)$ is unbordered.

Proof. Assume $p$ is an internal critical point of $w$. Then $www = xvvz$ where $|x| = p$ and $v$ is an unbordered factor of length $n$ in $ww$. Hence, $\sigma^p(w) = v$.
Assume $v = \sigma^p(w)$ is an unbordered conjugate of $w$. Then $www = xvvz$ with $|x| = p$, and $p + n$ is a critical point of $www$. Hence, $p$ is an internal critical point of $w$.

4.3 Iterations of the Border Correlation Function

In this section we investigate iterations of the border correlation function. We start by considering the $\beta$-graph $G_\beta(n)$ for each $n \ge 1$. It is the directed graph

with the set $A^n = \{ w \mid |w| = n,\ w \in A^* \}$ as vertices, and with edges determined by the border correlation function $\beta$, that is, there is a (directed) edge $u \to v$ if and only if $\beta(u) = v$. In order to avoid trivial exceptions, we assume in this section that $n \ge 3$.

Remark 4.13. It is straightforward to see that $\beta(\sigma(w)) = \sigma(\beta(w))$, that is, the following diagram commutes.

$$\begin{array}{ccc} w & \xrightarrow{\ \sigma\ } & w' \\ {\scriptstyle\beta}\downarrow & & \downarrow{\scriptstyle\beta} \\ u & \xrightarrow{\ \sigma\ } & u' \end{array}$$

The $\beta$-graph $G_\beta(n)$ consists of components where each component contains exactly one cycle. The images of all members of a conjugacy class $[w]$ are mapped to the conjugacy class $[\beta(w)]$. In the following we show that any cycle in the graph $G_\beta(n)$ consists of exactly one conjugacy class. Moreover, we describe all conjugacy classes that form a cycle.

Let $\kappa \colon A^* \to \mathbb{N}$ where $\kappa(w)$ denotes the minimum $k$ such that $ab^ka$ occurs in any conjugate of $w$, or $w$ is a conjugate of $ab^k$, or $w = b^k$. Note that $k = 0$ if and only if $a^2$ occurs in $w$ or $\sigma(w)$. Let $\nu \colon A^* \to \mathbb{N} \times \mathbb{N}$ be defined such that $\nu(w) = (|w|_a, |w| - \kappa(w))$. Note that $\nu(w) = \nu(\sigma(w))$. Let $<$ denote the extension of the ordering of natural numbers to the lexicographic order on $\mathbb{N} \times \mathbb{N}$, in other words $(p, q) < (r, s)$ if $p < r$, or $p = r$ and $q < s$.

Theorem 4.14. Let $w$ be a word not in $b^*$ and not in $[ab^k] \cup [ab^kab^{k+1}]$, for all $k \ge 0$. Then $\nu(\beta(w)) < \nu(w)$.

Proof. Let $w$ be a word of length $n$ that is not in $b^* \cup [ab^{n-1}]$ and not in $[ab^kab^{k+1}]$, for $k = (n-3)/2$. Note that $a$ occurs at least twice in $w$. If $w$ is not primitive, then $\beta(w) = b^n$ and, in this case, it is clear that $\nu(\beta(w)) < \nu(w)$. Assume then that $w$ is primitive. Because $\nu(w) = \nu(\sigma(w))$, we can choose any conjugate of $w$ without changing its $\nu$ image. Therefore, we can assume that $w$ begins with $a$ and that it is unbordered. For example, we may take the Lyndon word in the conjugacy class $[w]$ with respect to the order $a \lhd b$. We have now a unique factorization in the form $w = B_1B_2 \cdots B_r$, where each $B_i = ab^{k_i}$ with $r \ge 2$ and $k_i \ge 0$ for all $1 \le i \le r$. Let $m$ be the minimum of all $k_i$.
Note that $|\beta(w)|_a \le |w|_a$ by Lemma 4.1, since if $w_{(i)} = a$ then not both $\sigma^{i-1}(w)$ and $\sigma^i(w)$ are unbordered. So, every occurrence of letter $a$ in $w$ implies

67 4.3 Iterations of the Border Correlation Function 51 at most one a in β(w), since we can get an unbordered conjugate of w only either before or after that occurrence of a, but not in both cases by Lemma 4.1(ii). If an occurrence of a in w does not imply an a in β(w), we say that this occurrence of a is dropped. The claim follows if β(w) a < w a, and therefore, we can assume that β(w) a = w a, that is, no occurrence of a is dropped: for every i 1, if the i-th letter in w is an a, then either σ i 1 (w) or σ i (w) is unbordered. Since w begins with a and is unbordered, we have that β(w) = B 1 B 2 B r, where B i = abk i and k i 0 for all 1 i r. Note that the a in B i corresponds to the unbordered conjugate of w, if w is factored either before or after the occurrence of a in B i. We show that κ(w) < κ(β(w)) in this case. Let i+1 be modulo r in the following, and let j = B 1 B 2 B i. If k i = k i+1 then the a in B i+1 is dropped, that is, neither σ j (w) nor σ j+1 (w) is bordered; a contradiction. So, assume that k i k i+1. Note that if k i > k i+1 then σ j+1 (w) is bordered and σ j (w) is unbordered by assumption, and if k i < k i+1 then σ j (w) is bordered and σ j+1 (w) is unbordered by assumption. If k i > k i+1 then k i = k i, in case k i 1 > k i, and otherwise k i = k i 1. If k i < k i+1 then k i = k i + 1, in case k i 1 > k i, and otherwise k i = k i. Now, we have that k i k i 1. If k i = m then k i = m + 1. However, we get k i = m if and only if k i 1 = m and k i = m + 1 and k i+1 = m, and r 4, since w [ab k ab k+1 ] and, by assumption, β(w) a = w a. Therefore, we also have k i 2 > m and b m+1 ab m ab m+1 ab m a occurs in a conjugate of w, and both σ j (w) and σ j+1 (w) are bordered; a contradiction. So, k l > m for all 1 l r if β(w) a = w a, and therefore we have ν(β(w)) < ν(w). Lemma Let w [ab k ab k+1 ] with k 0. Then [ab k ab k+1 ] = {β i (w) 0 i < w }. Proof. We have that w = b r ab s ab t, where either r + t = k and s = k + 1, or r + t = k + 1 and s = k. 
Now $\beta(w) = b^{r+1}ab^{s-1}ab^t = \sigma^s(w)$ in the former case and $\beta(w) = b^rab^{s+1}ab^{t-1} = \sigma^{s+1}(w)$ in the latter case. That is, $\beta(w) = \sigma^{k+1}(w)$, and the claim follows, since $2k + 3$ and $k + 1$ are relatively prime.

We are now ready to show that iterations of $\beta$ on any binary word result in a word of a certain shape.

Theorem 4.16. For every word $w$ with $|w| \ge 2$, there exists an $i \ge 0$ such that $\beta^i(w) \in b^*$ or $\beta^i(w) \in [ab^kab^{k+1}]$.

Proof. Let $w$ be a word of length $n$. Note that $\beta(w) = b^n$ if $w$ is not primitive. Assume thus that $w$ is primitive. Moreover, $\beta(w) \notin [ab^{n-1}]$ since $w$ has at least two unbordered conjugates. If $w \in [ab^{n-1}]$ then $\beta(w) \in [aab^{n-2}]$. If $w \in [ab^kab^{k+1}]$ then $\beta(w) \in [ab^kab^{k+1}]$ by Lemma 4.15. Suppose now that $w$ is different from $b^n$ and $w$ is not in $[ab^{n-1}] \cup [ab^kab^{k+1}]$ for $k = (n-3)/2$. Since the values of $\nu$ strictly decrease after an application of $\beta$, by Theorem 4.14, we conclude that there exists an $i \ge 1$ such that $\beta^i(w) = b^n$ or $\beta^i(w) \in [ab^kab^{k+1}]$.

Consider then the graph $G_\beta(n)$, which consists of the conjugacy classes $[w]$, for $|w| = n$, as its vertices and there is an edge $[u] \to [v]$ if $\beta(u) = v$. By the above results, this graph is well defined, and it consists of trees when disregarding reflexive loops $[u] \to [u]$. (See Fig. 4.1 for the graph $G_\beta(7)$.)

Vertices shown in Figure 4.1: [bbbbbbb], [abbabbb], [aaabbbb], [aaaaaaa], [aaabaab], [aaaabab], [aaaabbb], [ababbbb], [aabbabb], [aabbbbb], [aaaaabb], [aababab], [aabaabb], [abbbbbb], [aaaaaab], [abababb], [aababbb], [aaababb], [aabbbab], [aaabbab].

Figure 4.1: The graph $G_\beta(7)$. We have omitted the loops of the vertices $[b^7]$ and $[abbabbb]$.

4.4 Comments

We have investigated the border correlation function $\beta$ of binary words. The shape of $\beta$ images for words with a minimal and maximal number of unbordered conjugates has been clarified. Nevertheless, the set $\beta(A^*)$ has not been

completely described. Corollary 4.7 seems to give a very good estimation. All $\beta$-images up to length 30 have been checked and only words of length 12 are exceptional.

Apart from the border correlation function $\beta$ one could investigate an extension $\beta' \colon A^* \to \mathbb{N}^*$ of that function such that a word $w$ of length $n$ is mapped to $(m_0)(m_1) \cdots (m_{n-1})$ where the set of natural numbers $\mathbb{N}$ forms an alphabet and $m_i$ is the length of the shortest border of $\sigma^i(w)$ for all $0 \le i < n$. We just notice here that $\beta'$ is injective, since, if $u = wau'$ and $v = wbv'$, then clearly the shortest borders of the $|w|$-th conjugates $au'w$ and $bv'w$ are different, because one of them is equal to 1, and the other is not.
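The iteration behaviour described in Section 4.3 can be observed experimentally. The sketch below is illustrative only: it assumes that the $i$-th letter of $\beta(w)$ is $a$ exactly when $\sigma^i(w)$ is unbordered (the definition of $\beta$ is given earlier in the thesis), and the names `in_cycle_class` and `iterate_beta` as well as the step limit are hypothetical choices.

```python
def is_bordered(w: str) -> bool:
    """True if some nonempty proper prefix of w is also a suffix of w."""
    return any(w[:k] == w[-k:] for k in range(1, len(w)))


def sigma(w: str, i: int = 1) -> str:
    """Cyclic shift of a nonempty word: move its first i letters to the end."""
    i %= len(w)
    return w[i:] + w[:i]


def beta(w: str) -> str:
    """Border correlation (assumed definition): 'a' marks unbordered conjugates."""
    return "".join(
        "b" if is_bordered(sigma(w, i)) else "a" for i in range(len(w))
    )


def in_cycle_class(w: str) -> bool:
    """True if w is a conjugate of a b^k a b^(k+1) for some k >= 0."""
    n = len(w)
    if n < 3 or n % 2 == 0:
        return False  # such words have odd length 2k + 3
    k = (n - 3) // 2
    target = "a" + "b" * k + "a" + "b" * (k + 1)
    return any(sigma(w, i) == target for i in range(n))


def iterate_beta(w: str, max_steps: int = 1000) -> tuple[str, int]:
    """Apply beta until the word lies in b* or in [ab^k ab^(k+1)].

    Returns the terminal word and the number of applications of beta.
    """
    steps = 0
    while set(w) != {"b"} and not in_cycle_class(w):
        if steps >= max_steps:
            raise RuntimeError("no terminal word reached within max_steps")
        w = beta(w)
        steps += 1
    return w, steps


if __name__ == "__main__":
    print(iterate_beta("abbabaab"))  # even length: ends in a power of b
    print(iterate_beta("abbabbb"))   # already in the class [abbabbb]
```

For words of length 7 the terminal words are exactly $b^7$ and the members of $[abbabbb]$, in line with the two loops mentioned for Figure 4.1.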


Chapter 5

Unbordered Factors

In this chapter we focus on the relationship between the length of a word and its unbordered factors. This line of research was introduced by Ehrenfeucht and Silberger (1979) and Assous and Pouzet (1979). It was carried further and culminated in a strong conjecture by Duval (1982). We will give a historical overview on this line of research, its main results and conjectures so far, in Sect. 5.1. This will lead to the concept of Duval extensions which are introduced in Sect. 5.2. Section 5.3 is devoted to the proof of Duval's conjecture. Finally, we conclude with some comments in the last section. This chapter is based on work by Duval, Harju, and Nowotka (2002), and Harju and Nowotka (2003a; 2003b; 2003f; 2004b; 2004c).

5.1 On the Maximum Length of Unbordered Factors

When the length of unbordered factors of a word is investigated, it is usually done in terms of the length of the word and its period. Clearly, the maximum length $\mu(w)$ of unbordered factors of $w$ is bounded by the period $p(w)$ of $w$. We have

$$\mu(w) \le p(w).$$

Indeed, we have for every factor $v$ of $w$, with $p(w) < |v|$, that the prefix $v_{(1)}v_{(2)} \cdots v_{(|v| - p(w))}$ of $v$ is also a suffix of $v$ by the definition of period. It is a natural question to ask at what length of $w$ is $\mu(w)$ necessarily maximal, that is, $\mu(w) = p(w)$. Of course, the length of $w$ is considered with respect to either $\mu(w)$ or $p(w)$. Ehrenfeucht and Silberger (1979) as well as Assous and Pouzet (1979) addressed this question first. Ehrenfeucht and Silberger stated the following theorem.

Theorem 5.1. If $2p(w) \le |w|$ then $\mu(w) = p(w)$.

They also established that every primitive word $w$ has at least $|\mathrm{alph}(w)|$ unbordered conjugates, where $\mathrm{alph}(w)$ denotes the set of different letters occurring in $w$ and $|M|$ denotes the cardinality of a set $M$. This leads directly to the following theorem.

Theorem 5.2. If $2p(w) - |\mathrm{alph}(w)| \le |w|$ then $\mu(w) = p(w)$.

However, this result was stated later by Duval (1982). The real challenge, though, turned out to be giving a bound on the length of $w$ with respect to $\mu(w)$. It was conjectured by Ehrenfeucht and Silberger (1979) that $2\mu(w) \le |w|$ implies $\mu(w) = p(w)$. However, Assous and Pouzet (1979) gave the following counterexample; see also Example 2.3 on page 12. Let

$$w = a^nba^{n+1}ba^nba^{n+2}ba^nba^{n+1}ba^n$$

for which $|w| = 7n + 10$ and $\mu(w) = 3n + 6$ and $p(w) = 4n + 7$, contradicting that conjecture. They themselves gave the following conjecture.

Conjecture 5.3. Let $f \colon \mathbb{N} \to \mathbb{N}$ such that

$$f(n) = 1 + \max\{ |w| \mid \mu(w) = n \text{ and } p(w) \ne n \}.$$

Then $f(n) \le 3n$.

Duval (1982) established the following theorem.

Theorem 5.4. If $4\mu(w) - 6 \le |w|$ then $\mu(w) = p(w)$.

He also stated Conjecture 5.6 (see next section) about what was later called Duval extensions that would imply a positive answer to Conjecture 5.3, namely: if $3\mu(w) \le |w|$ then $\mu(w) = p(w)$.

5.2 Duval Extensions

In the previous section we recalled a question initially raised by Ehrenfeucht and Silberger (1979). The problem was to estimate a bound on the length of $w$, depending on $\mu(w)$, such that $\mu(w) = p(w)$. Duval (1982) introduced a restricted version of that problem by assuming that $w$ has an unbordered prefix of length $\mu(w)$. Let $w$ and $u$ be nonempty words where $w$ is also unbordered. We call $wu$ a Duval extension of $w$ if every factor of $wu$ longer than $|w|$ is bordered, that is, $\mu(wu) = |w|$. We call a Duval extension $wu$ of $w$ trivial if $p(wu) = \mu(wu) = |w|$. A nontrivial Duval extension $wu$ of $w$ is called minimal if $u$ is of minimal length, that is, $u = u'a$ and $w = u'bw'$ where $a, b \in A$ and $a \ne b$.

Example 5.5. Let $w = abaabbabaababb$ and $u = aaba$. Then

$$w.u = abaabbabaababb.aaba$$

(for the sake of readability, we use a dot to mark where $w$ ends) is a nontrivial Duval extension of $w$ of length $|wu| = 18$, where $\mu(wu) = |w| = 14$ and $p(wu) = 15$. However, $wu$ is not a minimal Duval extension, whereas

$$w.u' = abaabbabaababb.aa$$

is minimal, with $u' = aa \le_p u$. Note that $wu$ is not the longest nontrivial Duval extension of $w$ since

$$w.v = abaabbabaababb.abaaba$$

is longer, with $v = abaaba$ and $|wv| = 20$ and $p(wv) = 17$. One can check that $wv$ is a nontrivial Duval extension of $w$ of maximum length, and at the same time $wv$ is also a minimal Duval extension of $w$.

We are concerned with nontrivial Duval extensions for the rest of this section. Duval (1982) stated the following conjecture.

Conjecture 5.6. If $wu$ is a nontrivial Duval extension of $w$, then $|u| < |w|$.

It follows directly from this conjecture that for any word $w$ the condition $3\mu(w) \le |w|$ implies $\mu(w) = p(w)$, and, hence, Conjecture 5.3 by Assous and Pouzet (1979) would follow. Duval's conjecture has remained popular throughout the years, see for example Chap. 8 in (Lothaire, 2002). Actually, an even stronger version of Conjecture 5.6, where $|u| < |w| - 1$ for a nontrivial Duval extension $wu$ of $w$, was believed to be true. Indeed, this is true as we will show in Sect. 5.3. The following lemma reduces our focus to Duval extensions of length less than or equal to $2n$ when minimal Duval extensions are considered, as in the subsection on minimal Duval extensions.

Lemma 5.7. If an unbordered word $w$ of length $n$ has a nontrivial Duval extension $wv$ such that $|v| > |w|$, then it has a nontrivial Duval extension $wu$ such that $|u| \le |w|$.

Proof. Take the maximum $k \ge 0$ such that $v = w^kw'$. Let $w_0$ be the maximum common prefix of $w$ and $w'$. So, $w' = w_0v'$. Clearly, $v'$ is not empty, since $wv$ is a nontrivial Duval extension. Now, $wu$ is a nontrivial Duval extension of $w$ for any word $u$ such that $u \le_p w_0v'$ and $|w_0| < |u| \le |w|$.

Duval extensions have also become a subject of interest on their own.
We will investigate them further in the next subsections.
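The quantities in Example 5.5 can be recomputed directly from the definitions. The following brute-force sketch uses illustrative function names (`period`, `mu`, `is_duval_extension`, `is_trivial`) that are not taken from the thesis, and it makes no attempt at efficiency.

```python
def is_bordered(w: str) -> bool:
    """True if some nonempty proper prefix of w is also a suffix of w."""
    return any(w[:k] == w[-k:] for k in range(1, len(w)))


def period(w: str) -> int:
    """Smallest p >= 1 such that w[i] == w[i + p] for all valid i."""
    n = len(w)
    return next(
        p for p in range(1, n + 1)
        if all(w[i] == w[i + p] for i in range(n - p))
    )


def mu(w: str) -> int:
    """Length of a longest unbordered factor of w (0 for the empty word)."""
    n = len(w)
    for length in range(n, 0, -1):
        factors = (w[i:i + length] for i in range(n - length + 1))
        if any(not is_bordered(f) for f in factors):
            return length
    return 0


def is_duval_extension(w: str, u: str) -> bool:
    """wu is a Duval extension of w: w unbordered, u nonempty, mu(wu) = |w|."""
    return bool(w) and bool(u) and not is_bordered(w) and mu(w + u) == len(w)


def is_trivial(w: str, u: str) -> bool:
    """A Duval extension wu of w is trivial if p(wu) = |w|."""
    return period(w + u) == len(w)


if __name__ == "__main__":
    w, u, v = "abaabbabaababb", "aaba", "abaaba"
    print(is_duval_extension(w, u), is_trivial(w, u))
    print(len(w + u), mu(w + u), period(w + u))
    print(len(w + v), period(w + v))
```

Run on the words of Example 5.5, the sketch reports that $wu$ is a Duval extension that is not trivial, together with the lengths and periods quoted in the example.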

74 58 Unbordered Factors Words without Nontrivial Duval Extension In this subsection we investigate the structure of words that do not have a nontrivial Duval extension. Before we come to first result, we need to introduce the concept of Sturmian words. Sturmian words are infinite words of minimal subword complexity which are not eventually periodic, or, in other words, a Sturmian word contains exactly n + 1 different factors of length n for every n 1. Note that the Fibonacci word (see Example 2.20 on page 21) is a Sturmian word. There is a wealth of literature about Sturmian words. Let us just mention (Morse and Hedlund, 1940) and (Lothaire, 2002) for the sake of reference here. Note that Sturmian words are, by this definition, over a binary alphabet. Let us consider finite factors of Sturmian words in the following and let us simply call them Sturmian words. Mignosi and Zamboni (2002) showed the following uniqueness result for Duval extensions. Theorem 5.8. Unbordered Sturmian words only have trivial Duval extension. However, we are able to improve this result to Lyndon words. Recall that Lyndon words are unbordered. Theorem 5.9. Lyndon words only have trivial Duval extensions. Proof. Let w be a Lyndon word with respect to an order. Certainly, w is unbordered since it is a Lyndon word. Assume contrary to the claim that there exists a nonempty word u such that wu is a nontrivial Duval extension of w. Let u be of minimum length such that u p w. So, either u = va and vb p w or u = vb and va p w for some a, b A with a b and a b. We can assume that u w by Lemma 5.7. If v = ε then u = b since the first letter of w is minimal with respect to. Let the shortest border of wb be ayb, we have then that w is bordered with ay; a contradiction. Therefore, v ε in the following. Case: Suppose u = va. Then w = vbz. We have that va is not a factor of w since va is lexicographically smaller than vb. Let v = v v be such that v a is the longest unbordered suffix of va. 
Consider the suffix v bzva of wu. We have that v bzva > w. Let s be the shortest border of v bzva. Now s > v otherwise v a is bordered. Moreover, s v a since a b. If v a < s va then v a is not the longest unbordered suffix of va; a contradiction. But, if s va then va occurs in w and w is not a Lyndon word; a contradiction. Case: Suppose u = vb. Then w = vaz. Since va is a prefix of a Lyndon word and a b, we have that wvb is also a Lyndon word, and hence, unbordered. This contradicts the assumption that wu is a Duval extension.
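Theorem 5.9 lends itself to a small exhaustive experiment. The sketch below is an illustration under stated assumptions, not part of the proof: a Lyndon word is taken to be a word that is strictly smaller than all of its other conjugates with respect to the chosen letter order, extensions are searched over the binary alphabet only, and the helper names and the length bound are hypothetical.

```python
from itertools import product


def is_bordered(w: str) -> bool:
    """True if some nonempty proper prefix of w is also a suffix of w."""
    return any(w[:k] == w[-k:] for k in range(1, len(w)))


def is_lyndon(w: str, order: str = "ab") -> bool:
    """True if w is strictly smaller than each of its other conjugates.

    `order` lists the letters from smallest to largest.
    """
    rank = {c: i for i, c in enumerate(order)}

    def key(x: str) -> list[int]:
        return [rank[c] for c in x]

    return all(key(w) < key(w[i:] + w[:i]) for i in range(1, len(w)))


def period(w: str) -> int:
    """Smallest p >= 1 such that w[i] == w[i + p] for all valid i."""
    n = len(w)
    return next(
        p for p in range(1, n + 1)
        if all(w[i] == w[i + p] for i in range(n - p))
    )


def mu(w: str) -> int:
    """Length of a longest unbordered factor of w (0 for the empty word)."""
    n = len(w)
    for length in range(n, 0, -1):
        if any(not is_bordered(w[i:i + length]) for i in range(n - length + 1)):
            return length
    return 0


def nontrivial_duval_extensions(w: str, max_len: int) -> list[str]:
    """All binary u with 1 <= |u| <= max_len such that wu is a nontrivial
    Duval extension of the unbordered word w."""
    found = []
    for k in range(1, max_len + 1):
        for letters in product("ab", repeat=k):
            u = "".join(letters)
            if mu(w + u) == len(w) and period(w + u) != len(w):
                found.append(u)
    return found


if __name__ == "__main__":
    print(is_lyndon("aabbab"))                       # a Lyndon word for a < b
    print(nontrivial_duval_extensions("aabbab", 5))  # expected to be empty
```

For the Lyndon word $aabbab$ the search over all extensions of length at most 5 should come back empty, as the theorem predicts.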

75 5.2 Duval Extensions 59 However, Theorem 5.9 does not characterize words without a nontrivial Duval extension as the following example shows. Example Consider w = ababbaabb which is not a Lyndon word for any order. It is not hard to check that w has only trivial Duval extensions. We have that Theorem 5.9 improves Theorem 5.8 since Theorem 5.12 states that unbordered Sturmian words are indeed Lyndon words. The following lemma will be used to prove that result. But first, let us recall the definition of a Lyndon preserving morphism; see also page 20. Let φ: A B be a morphism, and A and B be lexicographic orders on A and B, respectively, such that a 0 A a 1 = φ(a 0 ) B φ(a 1 ) (5.1) for every a 0, a 1 A, and φ(a) is a Lyndon word w.r.t. B for every a A. Note that φ is nonerasing. Richomme (2002) shows that a nonerasing morphism preserves Lyndon words if and only if it preserves a lexicographic order and the image of every letter in A is a Lyndon word. Nevertheless, below we give a short proof for the if part of Richomme s theorem. Lemma If w A is a Lyndon word, then φ(w) is a Lyndon word. Proof. Let w = n. Assume φ(w) is not a Lyndon word. Therefore, φ(w) = xy such that yx is minimal w.r.t. B, and x and y are not empty. If x = φ ( w (1) w (2) w (i) ) and y = φ ( w(i+1) w (i+2) w (n) ) with 1 i < n, then we have an immediate contradiction by (5.1). Therefore, there exists an i, where 1 i n, and φ(w (i) ) = v 1 v 2 such that we have x = φ ( w (1) w (2) w (i 1) ) v1 and y = v 2 φ ( w (i+1) w (i+2) w (n) ) and v 1, v 2 ε. That implies v 2 B v 1 v 2, a contradiction since v 1 v 2 is a Lyndon word by assumption. The following theorem shows that Theorem 5.9 implies Theorem 5.8. Theorem Every unbordered Sturmian word is a Lyndon word. Proof. Let u {a, b} be an unbordered Sturmian word. Assume u begins with a and ends with b without restriction of generality. The case is clear if u = ab k or u = a k b for some k 1. Assume a occurs at least twice in u separated by b s. 
Now, u can be factored into ab k and ab k+1 (or a k+1 b and a k b, respectively), for some k 1, since Sturmian words are balanced; c.f. Proposition in Chap. 2 of (Lothaire, 2002). Then either u = ab k vab k+1 or u = a k+1 bva k b since u is an unbordered Sturmian word. Let φ: {a, b} {a, b}

76 60 Unbordered Factors such that φ(a) = ab k and φ(b) = ab k+1 (and φ(a) = a k+1 b and φ(b) = a k b, respectively). Now, let w = φ 1 (u) and we have that w is an unbordered Sturmian word, since the preimage of φ preserves Sturmian words, c.f. (Lothaire, 2002), that begins with a and ends in b. By induction w is a Lyndon word w.r.t. and u is a Lyndon word w.r.t. by Lemma The converse of Theorem 5.12 is certainly not true. Indeed, consider the word aabbab which is a Lyndon word but not a Sturmian word since it contains four factors of length two. It is worth noting that a close relationship between Sturmian words and Lyndon words has been shown by Borel and Laubie (1993), and Berstel and de Luca (1997), and Melançon (1999). In particular, the following holds a P b {a, b} = S L = C where a b and P is the set of all words w having two periods p and q which are coprimes and w = p + q 2, and S denotes the set of finite factors of a Sturmian word over {a, b} beginning with a, and C is the set of Christoffel primitive words, c.f. (Lothaire, 2002) and (Borel and Laubie, 1993). Theorem 5.9 gives the following corollary. Corollary Let wvwu be a nontrivial Duval extension of wv. Then vw is not a Lyndon word. Proof. Assume vw is a Lyndon word. Then vwu is a trivial Duval extension of vw, and hence, u p (vw) k for some k 1. Now, p(wvwu) = wv = µ(wvwu) and wvwu is a trivial Duval extension; a contradiction. Corollary 5.13 implies the following lemma which is an improvement of Lemma 5.7 and will be used later in Sect Lemma If an unbordered word w of length n has a nontrivial Duval extension wv such that v > w, then it has a nontrivial Duval extension wu such that n 2 u n and u p v. Proof. Note that every Duval extension of a single letter is trivial. So, n > 1. Let w 0 be the longest common prefix of w and v, and let w = w 0 w and v = w 0 v. If w 0 < n 2 each prefix u of v with n 2 u n gives that wu is a nontrivial Duval extension of w. Assume that w 0 n 2. Then w 2. 
Since at least two different letters occur in w then there are two Lyndon words which are conjugates of w, at least one of which occurring in w 0 w w 0. It follows from Corollary 5.13 that every Duval extension of w that has w 0 w w 0 as a prefix is trivial. In particular, wv is a trivial Duval extension of w; a contradiction.

77 5.2 Duval Extensions Minimal Duval Extensions The next theorem states a basic fact about minimal Duval extensions. Theorem Let wu be a minimal Duval extension of w. Then au occurs in w where a s w and a A. Proof. Let w = vbw and u = vc, by Lemma 5.14, with b, c A and b c for wu is a minimal Duval extension. Let xc be the longest unbordered suffix of vc. Consider the factor f = xbw vc of wu. We have that f > w, and hence, f is bordered. Let g be the shortest border of f. Note that g is unbordered. If g < xc then xc is bordered; a contradiction. Moreover, if g = xc then b = c which is a contradiction, too. If xc < g vc then xc is not maximal; a contradiction. So, vc < g, and hence, au occurs in w Maximal Duval Extensions Nontrivial Duval extensions of w of length 2 w 2 seem to be of a special shape. We propose the following conjecture. Conjecture Let w = w ab k for some k 1. If wu is a nontrivial Duval extension of w of length 2 w 2, then b k does not occur in w. The following theorem shows that Conjecture 5.16 implies the improved Duval s conjecture. Theorem If for every nontrivial Duval extension wv of w of length 2 w 2, with w = w ab k for some k 1, we have that b k does not occur in w, then every Duval extension wu of w where u w 1 is trivial. Proof. Let w be an unbordered word of length n 2 such that w = w ab k for some k 1. Assume that wu is a nontrivial Duval extension of w such that u n 1. Let p be the leftmost position where w is different from u, that is, u (1) u (2) u (p 1) p w and w (p) u (p). If u > n, we can assume that there exists a nontrivial Duval extension wu with u n and u p u by Lemma So, let s assume that n 1 u n. We can assume that u = n 1 if p n 1 since any prefix u of u such that u p gives a nontrivial Duval extension wu of w. Case: p < n 1. Let u = u (1) u (2) u (n 2). We apply conjecture Then wu is a nontrivial Duval extension of length 2n 2, and hence, b k does not occur in w. 
Neither does b k occur in u, since if u b k p u then wu b k is unbordered; a contradiction. Let u = u 0 ab l for some 0 l < k. If l < k 1 then b k u 0 a is longer than n and unbordered; a contradiction. Assume that l = k 1. Let q be the rightmost position where w ab k 1 is different from u, that is, u (q+1) u (q+2) u (n 1) s w ab k 1 and w (q) u (q).

78 62 Unbordered Factors Let w (q) r be the largest unbordered prefix of w (q) w (q+1) w (n 1). Consider the factor w (q) w (q+1) w (n) u (1) u (2) u (q) r which is longer than n, and let s be its shortest border. We have that s > w (q) r otherwise w (q) r is bordered. Moreover, s > w (q) w (q+1) w (n 1) otherwise w (q) r is not the largest unbordered prefix. Hence, w (q) w (q+1) w (n 1) b p s and b k occurs in u; a contradicition. Case: p n 1. Then w = w w (n 1) w (n) and u = w u, where u ε. Since there are at least two different letters in wu, we have that w w (n 1) w (n) w contains at least one Lyndon word which is a conjugate of w. By Corollary 5.13 wu is a trivial Duval extension; a contradiction. However, we will present a direct proof of Duval s conjecture in the next section. 5.3 Duval s Conjecture This section is devoted to the proof of the improved Duval s conjecture Preliminary Results The results of this subsection are mostly of technical nature and are used later in the proof of Theorem Lemma Let zf = gzh where f, g ε. Let az be the maximum unbordered prefix of az where a A. If az does not occur in zf, then agz is unbordered. Proof. Assume that agz is bordered, and let y be its shortest border. In particular, y is unbordered. If z y then y is a border of az which is a contradiction. If az = y or az < y then az occurs in zf which is again a contradiction. If az < y az then az is not maximum since y is unbordered; a contradiction. The proof of the following lemma is easy. Lemma Let w be an unbordered word and u p w and v s w. Then uw and wv are unbordered. The following Lemmas 5.20, 5.21 and 5.22 and Corollary 5.23 are given in (Duval, 1982). Let a 0, a 1 A, with a 0 a 1, and t 0 A. Let the sequences (a i ), (s i ), (s i ), (s i ), and (t i), for i 1, be defined by a i = a i (mod 2), that is, a i = a 0 or a i = a 1 if i is even or odd, respectively,

79 5.3 Duval s Conjecture 63 s i such that a i s i is the shortest border of a i t i 1, and s i = a i t i 1 if a i t i 1 is unbordered, s i such that a i+1s i is the longest unbordered prefix of a i+1s i, s i such that s i s i = s i, t i such that t i s i = t i 1. For any parameters of the above definition, the following holds. Lemma For any a 0, a 1, and t 0 there exists an m 1 such that s 1 < < s m = t m 1 t 0 and s i p s i+1, for all 1 i < m, and s m = t m 1 and t 0 s m + s m 1. Lemma Let z p t 0 such that a 0 z and a 1 z do not occur in t 0. Let a 0 z 0 and a 1 z 1 be the longest unbordered prefixes of a 0 z and a 1 z, respectively. Let m be the smallest integer such that s m = t m 1. Then 1. if m = 1 then a 1 t 0 is unbordered, 2. if m > 1 is odd, then a 1 s m is unbordered and t 0 s m + z 0, 3. if m > 1 is even, then a 0 s m is unbordered and t 0 s m + z 1. Lemma Let v be an unbordered factor of w of length µ(w). If v occurs twice in w, then µ(w) = p(w). Corollary Let wu be a Duval extension of w. If w occurs twice in wu, then wu is a trivial Duval extension A Solution This section contains a proof of the improved version of Conjecture 5.6 by Duval (1982). Theorem If wu is a nontrivial Duval extension of w, then u < w 1. Proof. Recall that every factor of wu which is longer than w is bordered since wu is a Duval extension of w. Let z be the longest suffix of w that occurs twice in zu. If z = ε then a s w where a A and a does not occur in u. Let w = u bw and u = u cu such that b, c A and b c. Then w = w 0 au cw 1 by Theorem Consider the factor au cw 1 u which is bordered (since otherwise au cw 1 u w and u w 0 < w 1), it has a shortest border g such that au g and g occurs in w. Hence, u < w 1.

So, assume $z \neq \varepsilon$. We have $z \neq w$, since otherwise $wu$ is trivial by the corollary above. Let $a, b \in A$ be such that $w = w'az$ and $u = u'bzr$ and $z$ occurs in $zr$ only once, that is, $bz$ matches the rightmost occurrence of $z$ in $u$. Note that $bz$ does not overlap $az$ from the right, by Lemma 5.19, and therefore $u'$ exists, although it might be empty. Naturally, $a \neq b$ by the maximality of $z$, and $w' \neq \varepsilon$, otherwise $azu'bz \leq_p wu$ has either no border, or $w$ is bordered (if $azu'bz$ has a border not longer than $z$), or $az$ occurs in $zu$ (if $azu'bz$ has a border longer than $z$); a contradiction in any case.

Let $az_0$ and $bz_1$ denote the longest unbordered prefixes of $az$ and $bz$, respectively. Let $a_0 = a$ and $a_1 = b$ and $t_0 = zr$, and let the integer $m$ be defined as in the lemmas above. We then have a word $s_m$, with its properties defined by Lemma 5.21, such that $t_0 = s_m t'$.

Consider $azu'bz_0$. We have that $az$ and $azu'bz_0$ are both prefixes of $a_0zu$, that $bz_0$ is a suffix of $azu'bz_0$, and that $az$ does not occur in $zu'bz_0$. It follows from Lemma 5.18, applied with the word $z_0$, that $azu'bz_0$ is unbordered, and hence

$$|azu'bz_0| \leq |w|. \tag{5.2}$$

[Figure: the words $w = w'az$ and $u = u'bzr$, with the occurrences of $z_0$ and the factorization $t_0 = s_m t'$ marked.]

Case: Suppose that $m$ is even. Then $m \geq 2$ and $as_m$ ($= a_m s_m$) is unbordered and $|t_0| \leq |s_m| + |z_1|$ by the lemma above that distinguishes the parity of $m$.

Suppose $|t_0| = |s_m| + |z_1|$ and $z_1 = z$. Then $|z| \leq |s_{m-1}|$ by the lemmas above. Note that we have an immediate contradiction if $m = 2$, since then $|s_1| < |z|$, which contradicts $|z| \leq |s_{m-1}|$. So, assume $m > 2$. But now $bz$ occurs in $t_0$, since $bs_{m-1}$ is a border of $bt_{m-2}$ and $t_i \leq_p t_0$ for all $0 \leq i < m$, which is a contradiction.

So, assume that $|t_0| < |s_m| + |z_1|$ or $|z_1| < |z|$. Then $|t'| < |z|$.

Suppose now that $|s_m| \leq |z_0|$. Then $|azu'bz_0| \leq |w|$ and

$$|u| = |azu| - |z| - 1 = |azu'bz_0| - |z_0| + |t_0| - |z| - 1 < |azu'bz_0| - |z_0| + |s_m| + |z_1| - |z| - 1 \leq |w| + |z_1| - |z| - 1 \leq |w| - 1$$

if $|t_0| < |s_m| + |z_1|$, or

$$|u| = |azu| - |z| - 1 = |azu'bz_0| - |z_0| + |t_0| - |z| - 1 \leq |azu'bz_0| - |z_0| + |s_m| + |z_1| - |z| - 1 \leq |w| + |z_1| - |z| - 1 < |w| - 1$$

if $|z_1| < |z|$. We have $|u| < |w| - 1$ in both cases.

Let then $|s_m| > |z_0|$. We have that $as_m$ is unbordered, and since $az_0$ is the longest unbordered prefix of $az$, we have that $az$ is a proper prefix of $as_m$, and hence $|z| < |s_m|$. Now, $azu'bs_m$ is unbordered, otherwise its shortest border is longer than $az$, since no prefix of $az$ is a suffix of $as_m$, and $az$ occurs in $u$; a contradiction. So, $|azu'bs_m| \leq |w|$ and $|u| < |w| - 1$ since $|t'| < |z|$.

Case: Suppose that $m$ is odd. Then $bs_m$ ($= a_m s_m$) is an unbordered word and $|t_0| \leq |s_m| + |z_0|$; see the lemma above that distinguishes the parity of $m$. Note that $t_0 = s_m$ and $t' = \varepsilon$ if $m = 1$, by the lemmas above. Surely $s_m \neq \varepsilon$. Note in particular that $|t'| \leq |z_0|$.

If $|s_m| < |z|$, then $|u| < |w| - 1$ since

$$|u| = |azu'bz_0| - |bz_0| + |bt_0| - |az|$$

and $|azu'bz_0| \leq |w|$ by (5.2), and $|t_0| \leq |s_m| + |z_0|$.

Assume thus that $|s_m| \geq |z|$, and hence also $z \leq_p s_m$. Since $s_m \neq \varepsilon$, we have $|bs_m| \geq 2$, and therefore, by the critical factorization theorem, there exists a critical point $p$ in $bs_m$ such that $bs_m = v_0v_1$, where $|v_0| = p$.

[Figure: the words $w = w'az$ and $u = u'bzr$, with $z_0$, $s_m$, $t'$ and the factorization $bs_m = v_0v_1$ marked.]

In particular,

$$bz \leq_p v_0v_1. \tag{5.3}$$

Let $u = u'_0 v_0 v_1 u'_1$ be such that $v_0v_1$ does not occur in $u'_0$. Note that $v_0v_1$ does not overlap with itself since it is unbordered, and $v_0$ and $v_1$ do not intersect by Lemma 3.4.

82 66 Unbordered Factors Consider the prefix wu 0bz of wu which is bordered and has a shortest border g longer than z, and hence, bz s g, otherwise w is bordered since z s w. Moreover, g p w, for otherwise az would occur in u, and hence, bz occurs in w. Let w = w 0 bzw 1 such that bz occurs in w 0 bz only once, that is, we consider the leftmost occurrence of bz in w. Note that w 0 bz g u 0bz (5.4) where the first inequality comes from the definition of w 0 above and the second inequality from the fact that u 0bz < g implies that w is bordered. Let f = bzw 1 u 0v 0 v 1. If f is unbordered, then f w, and hence, u 0 v 0v 1 w 0. Now, we have u 0 < w 0 which contradicts (5.4). Therefore, f is bordered. Let h be its shortest border. w u a z u b z r u 0 b z v 0 v 1 t w 0 b z w 1 u 0 v 0 v 1 u 1 h h f w 0 v 0 v 1 b z Surely, bz < h otherwise v 0 v 1 is bordered by (5.3). So, bz p h. Moreover, v 0 v 1 h otherwise bz occurs in s m contradicting our assumption that bzr marks the rightmost occurrence of bz in u. So, v 0 v 1 s h, and v 0 v 1 occurs in w since w 0 h p w. Let w 0 bzv = w 0 h = w 0v 0 v 1. Note that v 0 v 1 does not occur in w 0 otherwise it occurs in u 0 as a consequence of (5.4) contradicting our assumption on u 0. We have h = bzv s u 0 v 0v 1 (see previous figure). Let u 0 v 0v 1 = u 0 h. Consider which has a shortest border h 0. f 0 = wu 0 bz

83 5.3 Duval s Conjecture 67 w u a z u 0 b z b z r v 0 v 1 v 0 v 1 t w 0 b z w 1 u 0 b z u 1 h 0 h 0 f 0 Surely, bz s h 0 otherwise w is bordered with a suffix of z. Moreover, we have w 0 bz h 0 and h 0 u 0 bz since bz does not occur in w 0 and w is unbordered. From that and w 0 h = w 0 v 0v 1 and u 0 h = u 0 v 0v 1 follows now w 0 u 0 and u 0v 0 v 1 = u 0 bzv and w 0 occurs in u 0. (5.5) Let now w = w 0v 0 v 1 w i v 0 v 1 w 2v 0 v 1 w 1v 0 v 1 w 2 for some word w 2 that does not contain v 0 v 1, and u = u 0v 0 v 1 u j v 0 v 1 u 2v 0 v 1 u 1v 0 v 1 t such that v 0 v 1 does not occur in w k, for all 0 k i, or v l, for all 0 l j. Note that these factorizations of w and u are unique, and, moreover, w 2 ε. (Indeed, if w 2 = ε then v 0 v 1 s w and az s v 0 v 1, and az would occur in u; a contradiction.) We claim that either i = j and w k = u k, for all 1 k i or u < w 1. Assume that k = 1. We show that w 1 = u 1. Consider f 1 = v 1 w 1v 0 v 1 w 2 u 0v 0 v 1 u j v 0 v 1 u 1v 0. If f 1 is unbordered, then u < w 1 since f 1 w and u = f 1 v 1 w 1v 0 v 1 w 2 + v 1 t and t z 0 z < bz v 0 v 1 and w 2 ε. Assume then that f 1 is bordered, and let h 1 be its shortest border. Clearly, h 1 = v 1 g 1 v 0 for some g 1 (possibly g 1 = ε) since v 0 and v 1 do not intersect. We show that h 1 p v 1 w 1 v 0. Indeed, otherwise either 1. az occurs in u, in case v 1 w 1 v 0v 1 w 2 p h 1, a contradiction to our assumption on az, or 2. v 0 and v 1 intersect, in case v 0 z and v 1 w 1v 0 v 1 w 2 az + v 0 < h 1 < v 1 w 1v 0 v 1 w 2 and then v 0 occurs in z, contradicting Lemma 3.4, or

84 68 Unbordered Factors 3. u < w 1, in case v 0 w 3 s w 2 and az v 0 w 3, then v 0 w 3 u v 0 v 1 is unbordered (since otherwise its border is at least as long as v 0 v 1 because v 0 and v 1 do not intersect, but then az occurs in u which is a contradiction) and the result follows from t < v 0 w 3 1, since t < az and also az < v 0 w 3, for v 0 does not begin with a. Moreover, h 1 s v 1 u 1 v 0 since v 0 v 1 does not occur in v 1 w 1 v 0. So, let w 1v 0 = g 1 v 0 w 1 and v 1 u 1 = u 1v 1 g 1. (5.6) w u v 0 v 1 w 1 v 0 v 1 w 2 v 0 v 1 u 1 v 0 v 1 t g 1 v 0 w 1 u 1 v 1 g 1 h 1 h 1 Consider, f 1 f 2 = v 0 w 1v 1 w 2 u 0v 0 v 1 u j v 0 v 1 u 1v 0 v 1. If f 2 is unbordered, then u < w 1 since f 2 w and u = f 2 v 0 w 1v 1 w 2 + t and t z 0 z < bz v 0 v 1 and w 2 ε. Assume then that f 2 is bordered, and let h 2 be its shortest border. Since v 0 and v 1 do not intersect, v 0 v 1 s h 2. Also h 2 p v 0 w 1 v 1 since v 0 v 1 does not occur in w 2 (and v 0 and v 1 do not intersect) and az does not occur in h 2 (and so h 2 does not stretch beyond w). We have v 0 w 1 v 1 p h 2 since v 0 v 1 occurs in v 0 w 1 v 1 only as a suffix. Hence, we have h 2 = v 0 w 1 v 1. Note that h 2 u 1 v 0v 1 since otherwise h 2 v 0 v 1 u 1 v 0v 1 (because v 0 and v 1 do not overlap) and v 0 v 1 occurs twice in h 2, but v 0 v 1 occurs only once in h 2 since it occurs only once in w 1 v 0v 1 w 2 and az does not occur in h 2. Hence, we have w 1v 0 v 1 = g 1 h 2 and h 2 s u 1v 0 v 1. (5.7) w u v 0 v 1 w 1 v 0 v 1 w 2 v 0 v 1 u 1 v 0 v 1 t g 1 v 0 w 1 u 1 v 1 g 1 v 0 w 1 h 2 h 2 f 2

85 5.3 Duval s Conjecture 69 Consider, f 3 = v 0 v 1 w 1v 0 v 1 w 2 u 0v 0 v 1 u j v 0 v 1 u 2v 0 u 1v 1. If f 3 is unbordered, then u < w 1 since f 3 w and u = f 3 v 0 v 1 w 1v 0 v 1 w 2 + g 1 v 0 v 1 t and t z 0 z < bz v 0 v 1 and g 1 w 1 and w 2 ε. Assume that f 3 is bordered. Then f 3 has a shortest border h 3 such that v 0 v 1 p h 3. We have h 3 = v 0 u 1 v 1 by the arguments from the previous paragraph. Moreover, v 0 v 1 u 1 = h 3 g 1 and v 0 v 1 w 1 p h 3. (5.8) w u v 0 v 1 w 1 v 0 v 1 w 2 v 0 v 1 u 1 v 0 v 1 t g 1 v 0 w 1 u 1 v 1 g 1 v 1 u 1 h 3 h 3 f 3 Observe, that (5.7) and (5.8) imply that the number of occurrences of v 0 and v 1, respectively, is the same in w 1 and u 1 since v 0 and v 1 do not intersect. Now, let h 1 = v 1 g 1 v 0 = h 1v 1 h 1v 0 = v 1 h 0v 0 h 0 where v 1 and v 0 occur only once in v 1 h 1 and h 0 v 0, respectively. w u v 0 v 1 w 1 v 0 v 1 w 2 v 0 v 1 u 1 v 0 v 1 t g 1 v 0 w 1 u 1 v 1 g 1 h 0 v 0 h 0 h 1 v 1 h 1 Now, let f 2 = v 0 h 0w 1v 1 w 2 u 0v 0 v 1 u j v 0 v 1 u 1v 0 v 1 and f 3 = v 0 v 1 w 1v 0 v 1 w 2 u 0v 0 v 1 u j v 0 v 1 u 2v 0 u 1h 1v 1 with the respective shortest borders h 2 and h 3 (which are both not empty if u w 1; as in the case of f 2 and f 3 ) and v 0 v 1 s h 2 and v 0v 1 p h 3. We have h 2 p v 0 h 0 w 1 v 1 since v 0 v 1 does not occur in w 2 and az does not occur in h 2 (and so h 2 does not stretch beyond w). We have v 0h 0 w 1 v 1 p h 2 since v 0 v 1 does not occur in w 1. Hence, we have h 2 = v 0h 0 w 1 v 1 and w 1v 0 v 1 = h 0v 0 h 0w 1v 1 = h 0h 2 and h 2 s u 1v 0 v 1.

86 70 Unbordered Factors w u v 0 v 1 w 1 v 0 v 1 w 2 v 0 v 1 u 1 v 0 v 1 t g 1 v 0 w 1 u 1 v 1 g 1 h 0 v 0 h 0 h 0 v 0 h 0 v 0 h 0 w 1 h 2 h 2 We have h 3 = v 0u 1 h 1 v 1 by the arguments from the previous paragraph. Moreover, v 0 v 1 u 1 = v 0 u 1h 1v 1 h 1 = h 3h 1 and h 3 p v 0 v 1 w 1. w u v 0 v 1 w 1 v 0 v 1 w 2 v 0 v 1 u 1 v 0 v 1 t g 1 v 0 w 1 u 1 v 1 g 1 h 1 v 1 h 1 u 1 h 1 v 1 h 1 v 1 h 1 h 3 h 3 f 2 f 3 It is now straightforward to see that w 1 = u 1 = ε for otherwise v 1 and v 0 occur more than once in v 1 h 1 and h 0 v 0, respectively. From (5.6) follows now w 1 = g 1 = u 1. Assume that 1 < k min{i, j} and w l = u l, for all 1 l < k. Let us denote both w l and u l by v l, for all 1 l < k. We show that w k = u k. Consider f 4 = v 1 w k v 0v 1 v k 1 v 0v 1 v 1v 0 v 1 w 2 u 0v 0 v 1 u j v 0 v 1 u k v 0. If f 4 is unbordered, then u < w 1 since f 4 w and u = f 4 v 1 w k v 0v 1 v k 1 v 0v 1 v 1v 0 v 1 w 2 + v 1 v k 1 v 0v 1 v 1v 0 v 1 t and t z 0 z < bz v 0 v 1 and w 2 ε. Assume that f 4 is bordered. Then f 4 has a shortest border h 4 such that v 0 v 1 h 4. Let h 4 = v 1 g 4 v 0.

87 5.3 Duval s Conjecture 71 If v 1 w k v 0 < h 4 then there exists an l < k such that where v l p v l. That implies u h 4 = v 1 w k v 0v 1 v k 1 v 0v 1 v l+1 v 0v 1 v l v 0 k = v l since v 0 v 1 occurs neither in v l nor in u k. Now, consider f 5 = v 1 w k v 0v 1 v k 1 v 0v 1 v 1v 0 v 1 w 2 u 0v 0 v 1 u j v 0 v 1 u k v 0v 1 v k 1 v 0v 1 v l v 0. If f 5 is unbordered, then u < w 1 since f 4 < f 5, see above. Assume that f 5 is bordered. Then f 5 has a shortest border h 5 such that h 4 < h 5 for otherwise h 4 is not the shortest border of f 4, since either h 4 p h 5 or h 5 p h 4, and the latter implies that h 4 is bordered, and hence, not minimal. But now, we have a l < l such that h 5 = v 1 w k v 0v 1 v k 1 v 0v 1 v l +1 v 0v 1 v l v 0 where v l p v l. We have f 4 < f 5 < f 6 where f 6 = v 1 w k v 0v 1 v k 1 v 0v 1 v 1v 0 v 1 w 2 u 0v 0 v 1 u j v 0 v 1 u k v 0v 1 v k 1 v 0v 1 v l v 0, which is either unbordered and u < w 1 since f 4 < f 5, or it is bordered with a shortest border h 6, and we have h 4 < h 5 < h 6 and a factor f 7, such that f 4 < f 5 < f 6 < f 7, and so on, until eventually an unbordered factor is reached proving that u < w 1. Assume then that h 4 p v 1 w k v 0. We also have that h 4 s v 1 u k v 0 since v 0 v 1 does not occur in w k. So, let w k v 0 = g 4 v 0 w k and v 1u k = u k v 1g 4. Consider, f 8 = v 0 w k v 1v k 1 v 0v 1 v 1v 0 v 1 w 2 u 0v 0 v 1 u jv 0 v 1 u k v 0v 1. If f 8 is unbordered, then u < w 1 since f 8 w and u = f 8 v 0 w k v 1v k 1 v 0v 1 v 1v 0 v 1 w 2 + v k 1 v 0v 1 v 1v 0 v 1 t and t z 0 z < bz v 0 v 1 and w 2 ε. Assume that f 8 is bordered. Then f 8 has a shortest border h 8 such that v 0 v 1 s h 8. If h 8 > v 0 w k v 1 then the same argument as in the case v 1 w k v 0 < h 4 above shows that u < w 1. If h 8 < v 0 w k v 1 then v 0 v 1 occurs in w k ; a contradiction. Hence, we have h 8 = v 0 w k v 1 and w k v 0v 1 = g 4 h 8 and h 8 s u k v 0v 1. (5.9)

88 72 Unbordered Factors Consider, f 9 = v 0 v 1 w k v 0v 1 v k 1 v 0v 1 v 1v 0 v 1 w 2 u 0v 0 v 1 u jv 0 v 1 u k+1 v 0u k v 1. If f 9 is unbordered, then u < w 1 since f 9 w and u = f 9 v 0 v 1 w k v 0v 1 v k 1 v 0v 1 v 1v 0 v 1 w 2 + g 4 v 0 v 1 v k 1 v 0v 1 v 1v 0 v 1 t and t z 0 z < bz v 0 v 1 and g 4 w k and w 2 ε. Assume that f 9 is bordered. Then f 9 has a shortest border h 9 such that v 0 v 1 p h 9. We have h 9 = v 0 u k v 1 by the arguments from the previous paragraph. Moreover, v 0 v 1 u k = h 9g 1 and h 9 p v 0 v 1 w k. (5.10) Observe, that (5.9) and (5.10) imply that the number of occurrences of v 1 and v 0, respectively, is the same in w k and u k since v 0 and v 1 do not intersect. Now, let h 4 = v 1 g 4 v 0 = h 1v 1 h 1v 0 = v 1 h 0v 0 h 0 where v 1 and v 0 occur only once in v 1 h 1 and h 0 v 0, respectively. Now, let and f 8 = v 0 h 0w k v 1v k 1 v 0v 1 v 1v 0 v 1 w 2.u 0v 0 v 1 u j v 0 v 1 u k v 0v 1 f 9 = v 0 v 1 w k v 0v 1 v k 1 v 0v 1 v 1v 0 v 1 w 2.u 0v 0 v 1 u j v 0 v 1 u k+1 v 0u k h 1v 1 with the respective shortest borders h 8 and h 9 (which are both not empty if u w 1; as in the case of f 8 and f 9 ). Analogously to the cases of f 8 and f 9, we have w k v 0v 1 = h 0h 8 and v 0 v 1 u k = h 9h 1. It is now straightforward to see that and h 8 = h 9 = v 0 v 1 h 4 = v 0 w k v 1 = v 0 u k v 1 and hence, w k = u k. In this case, we denote both w k and u k by v k. Now, we have where ι = min{i, j}. v = v 0 v 1 w ι v 0 v 1 w 2v 0 v 1 w 1 = v 0 v 1 u ι v 0 v 1 u 2v 0 v 1 u 1

89 5.3 Duval s Conjecture 73 If i < j then since w 0 u 0 by (5.5). Let w 0 < u 0v 0 v 1 u j v 0 v 1 u i+1 (5.11) f 11 = v 1 w 2 u 0v 0 v 1 u j v 0 v 1 u i+1 vv 0. Then w < f 11 by (5.11), and hence, f 11 is bordered. Let h 11 = v 1 g 11 v 0 be the shortest border of f 11. Recall, that w 2 ε and either az s v 1 w 2 or v 1 w 2 s az. If v 1 w 2 < az then v 1 necessarily occurs in z, and hence, it intersects with v 0 (since bz p v 0 v 1 ); a contradiction. So, we have az s v 1 w 2. Surely, h 11 < v 1 w 2 (and so h 11 p v 1 w 2 ) for otherwise az occurs in u which contradicts our assumption that z is of maximum length. Let w 2 = g 11 v 0 w 5. Note that v 0 w 5 az since az and v 0 begin with different letters. We have az < v 0 w 5 since otherwise v 0 occurs in z, and hence, intersects with v 1 which is a contradiction. Consider, f 12 = v 0 w 5 u 0v 0 v 1 u j v 0 v 1 u i+1 vv 0 v 1. If f 12 is unbordered, then u < w 1 since f 12 w and u = f 12 v 0 w 5 + t and az < v 0 w 5 and t z 0 z < bz < v 0 w 5. Assume that f 12 is bordered. Then f 12 has a shortest border h 12 = g 12 v 0 v 1 with az < h 12, for otherwise az occurs in u. Let v 0 w 5 = g 12 v 0 v 1 w 6. But, now w = w 0 vv 0 v 1 g 11 g 12 v 0 v 1 w 6 where v 0 v 1 w 6 s w 2, contradicting our assumption that v 0 v 1 does not occur in w 2. If i > j then w = w 0v 0 v 1 w i v 0 v 1 w j+1 vv 0 v 1 w 2 and u = u 0 vv 0 v 1 t and w u t + v 0 v 1 + w 2. We have u < w 1 since w 2 ε and t z 0 v 0 v 1 1. Assume i = j. Then Consider w = w 0 vv 0 v 1 w 2 and u = u 0 vv 0 v 1 t. f = v 1 w 2 u 0 vv 0. If f is bordered, then it has a shortest border h = v 1 g v 0.

90 74 Unbordered Factors w u a z b z r v 0 v 1 w 2 u 0 v v 0 v 1 t g v 0 v 1 g h h Recall, that w 2 ε and either az s v 1 w 2 or v 1 w 2 s az. If v 1 w 2 < az then v 1 occurs in z, and hence, intersects with v 0 since bz p v 0 v 1 ; a contradiction. So, we have az s v 1 w 2. Surely, h < v 1 w 2 for otherwise az occurs in u which contradicts our assumption. Let w 2 = g v 0 w 4. Note that v 0 w 4 az since az and v 0 begin with different letters. We have az < v 0 w 4 since otherwise v 0 occurs in z, and hence, intersects with v 1 which is a contradiction. Consider now, f = v 0 w 4 u 0 vv 0 v 1. If f is unbordered, then it easily follows that u < w 1 since we have t < az and az < v 0 w 4. f w u a z b z r v 0 v 1 g v 0 w 4 u 0 v v 0 v 1 t h h g v 0 v 1 g v 0 v 1 If f is bordered, then it has a shortest border h = g v 0 v 1 with az < h, for otherwise az occurs in u. Let v 0 w 4 = g v 0 v 1 w 5. But, now f w = w 0 vv 0 v 1 g g v 0 v 1 w 5 which contradicts our assumption that w = w 0 vv 0v 1 w 2 and v 0 v 1 does not occur in w 2. If f is unbordered, then f w, and hence, w 0 u 0. But, we also have w 0 u 0 ; see (5.5). That implies w 0 = u 0. Moreover, the factors w 0 and bzv have both nonintersecting occurrences in u 0 v 0v 1 by (5.5). Therefore, w 0 = u 0. Now, w = xaw 7 and u = xbt where w 0 vv 0v 1 p x and a, b A and a b and w 7 s w 2 and t s t. We have that xb occurs in w by Theorem Since xb is not a prefix of w and v 0 v 1

does not overlap with itself, we have $|xb| + |v_0v_1| \leq |w|$. From $|t'| \leq |z_0| < |v_0v_1|$ and $|t''| < |t'|$ we get $|u| < |w| - 1$ and the claim follows.

5.4 Comments

The bound $|u| < |w| - 1$ on the length of a nontrivial Duval extension $wu$ of $w$ is tight, as the following example shows.

Example. Let $w = a^n b a^{n+m} bb$ and $u = a^{n+m} b a^n$ with $n, m \geq 1$. Then $w.u = a^n b a^{n+m} bb . a^{n+m} b a^n$ is a nontrivial Duval extension of $w$ and $|u| = |w| - 2$.

Duval (1982) also noted that already $|w| \geq 3\mu(w)$ implies $p(w) = \mu(w)$ for any word $w$, provided his conjecture holds. Corollary 5.26 now follows from the theorem proved above.

Corollary 5.26. If $|w| \geq 3\mu(w) - 2$ then $p(w) = \mu(w)$.

However, this bound is unlikely to be tight. The best example for a large bound that we could find is the one by Assous and Pouzet (1979).

Example. Let $w = a^n b a^{n+1} b a^n b a^{n+2} b a^n b a^{n+1} b a^n$. We have $|w| = 7n + 10$ and $\mu(w) = 3n + 6$ and $p(w) = 4n + 7$.

So, the precise bound for the length of a word that implies $p(w) = \mu(w)$ is larger than $\frac{7}{3}\mu(w) - 4$ and smaller than $3\mu(w) - 1$. The characterization of the precise bound of the length of a word as a function of its longest unbordered factor is still an open problem.

Finally, we would like to mention that, some months after our proof was first made public in (Harju and Nowotka, 2003f) and by personal communication, an alternative proof of Conjecture 5.6 (Holub, 2003a) and of Theorem 5.24 (Holub, 2003b) was proposed. That proof uses a different technique relying on lexicographic orders and is shorter than the original one presented here. However, we think that our proof provides a more detailed insight into the structure of a nontrivial Duval extension by examining those words closely, and might therefore be very useful for answering further questions on this subject, like the open problem mentioned above.
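The two examples and Corollary 5.26 in the comments above can be sanity-checked by brute force for small parameters. The following Python sketch is not part of the thesis. All function names are illustrative, and the definitions it encodes are assumptions based on the usual usage of the terms: a word is bordered if a nonempty proper prefix of it is also a suffix, $p(w)$ is the smallest period, $\mu(w)$ is the length of a longest unbordered factor, $wu$ is a Duval extension of $w$ if $w$ is unbordered and no factor of $wu$ longer than $w$ is unbordered, and such an extension is trivial if $p(wu) = |w|$.

```python
from itertools import product


def is_bordered(x: str) -> bool:
    """True if a nonempty proper prefix of x is also a suffix of x."""
    return any(x[:k] == x[-k:] for k in range(1, len(x)))


def period(x: str) -> int:
    """Smallest p >= 1 with x[i] == x[i + p] wherever both positions exist."""
    n = len(x)
    for p in range(1, n + 1):
        if all(x[i] == x[i + p] for i in range(n - p)):
            return p
    return 0  # only reached for the empty word


def mu(x: str) -> int:
    """Length of a longest unbordered factor of x (exhaustive search)."""
    n = len(x)
    for length in range(n, 0, -1):
        if any(not is_bordered(x[s:s + length]) for s in range(n - length + 1)):
            return length
    return 0


def is_duval_extension(w: str, u: str) -> bool:
    """Assumed definition: w is unbordered and mu(wu) equals |w|."""
    return not is_bordered(w) and mu(w + u) == len(w)


def is_trivial(w: str, u: str) -> bool:
    """Assumed definition: the extension wu of w is trivial if p(wu) = |w|."""
    return period(w + u) == len(w)


def tight_example(n: int, m: int):
    """The pair (w, u) with w = a^n b a^(n+m) bb and u = a^(n+m) b a^n."""
    w = "a" * n + "b" + "a" * (n + m) + "bb"
    u = "a" * (n + m) + "b" + "a" * n
    return w, u


def assous_pouzet(n: int) -> str:
    """The word a^n b a^(n+1) b a^n b a^(n+2) b a^n b a^(n+1) b a^n."""
    blocks = [n, n + 1, n, n + 2, n, n + 1, n]
    return "b".join("a" * k for k in blocks)


def corollary_holds(max_len: int, alphabet: str = "ab") -> bool:
    """Check |w| >= 3*mu(w) - 2  =>  p(w) == mu(w) for all words up to max_len."""
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            longest = mu(w)
            if len(w) >= 3 * longest - 2 and period(w) != longest:
                return False
    return True


if __name__ == "__main__":
    w, u = tight_example(1, 1)
    print(w, u, is_duval_extension(w, u), is_trivial(w, u))
    x = assous_pouzet(1)
    print(x, len(x), mu(x), period(x))
    print(corollary_holds(8))
```

For instance, `tight_example(1, 1)` yields $w$ = `abaabb` and $u$ = `aaba`, and `assous_pouzet(1)` is a word of length 17. The search is exponential in the word length, so it is only meant for small cases and proves nothing in general.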


Bibliography

S. I. Adian. The Burnside problem and identities in groups, volume 95 of Ergebnisse der Mathematik und ihrer Grenzgebiete [Results in Mathematics and Related Areas]. Springer-Verlag, Berlin.
J.-P. Allouche and J. Shallit. The ubiquitous Prouhet-Thue-Morse sequence. In Sequences and their applications (Singapore, 1998), Springer Ser. Discrete Math. Theor. Comput. Sci. Springer, London.
K. I. Appel and F. M. Djorup. On the equation $z_1^n z_2^n \cdots z_k^n = y^n$ in a free semigroup. Trans. Amer. Math. Soc., 134.
R. Assous and M. Pouzet. Une caractérisation des mots périodiques. Discrete Math., 25(1):1–5, 1979.
J. Bernoulli. Sur une nouvelle espèce de calcul, volume 1. Berlin.
J. Berstel. Axel Thue's papers on repetitions in words: a translation, volume 20 of Publications du LaCIM. Université du Québec à Montréal.
J. Berstel and A. de Luca. Sturmian words, Lyndon words and trees. Theoret. Comput. Sci., 178(1–2).
J. Berstel and J. Karhumäki. Combinatorics on words: A tutorial. Bull. EATCS, 79.
J. Berstel and D. Perrin. Theory of codes, volume 117 of Pure and Applied Mathematics. Academic Press, Orlando, FL.
J. Berstel and P. Séébold. Sturmian Words, chapter 2 of Lothaire (2002).
J.-P. Borel and F. Laubie. Quelques mots sur la droite projective réelle. J. Théor. Nombres Bordeaux, 5(1):23–51, 1993.

D. Breslauer, T. Jiang, and Z. Jiang. Rotations of periodic strings and short superstrings. J. Algorithms, 24(2).
Y. Césari and M. Vincent. Une caractérisation des mots périodiques. C. R. Acad. Sci. Paris Sér. A, 286.
K.-T. Chen, R. H. Fox, and R. C. Lyndon. Free differential calculus. IV. The quotient groups of the lower central series. Ann. of Math. (2), 68:81–95.
Ch. Choffrut. A Classical Equation: $x^n y^m, z^p$, chapter 9.2 of Lothaire (1983).
Ch. Choffrut and J. Karhumäki. Combinatorics of words. In A. Salomaa and G. Rozenberg, editors, Handbook of Formal Languages, volume 1. Springer-Verlag, Berlin.
D. D. Chu and H. Sh. Town. Another proof on a theorem of Lyndon and Schützenberger in a free monoid. Soochow J. Math., 4.
W.-F. Chuan. Unbordered factors of the characteristic sequences of irrational numbers. Theoret. Comput. Sci., 205(2).
J. C. Costa. Biinfinite words with maximal recurrent unbordered factors. Theoret. Comput. Sci., 290(3).
M. Crochemore and D. Perrin. Two-way string-matching. J. Assoc. Comput. Mach., 38(3).
M. Crochemore and W. Rytter. Jewels of Stringology. World Scientific Publishing, Hong-Kong.
L. J. Cummings, D. Moore, and J. Karhumäki. Borders of Fibonacci strings. J. Combin. Math. Combin. Comput., 20:81–87.
A. de Luca. A combinatorial property of the Fibonacci words. Inform. Process. Lett., 12(4).
V. Diekert. Makanin's Algorithm, chapter 12 of Lothaire (2002).
J.-P. Duval. Périodes et répétitions des mots du monoïde libre. Theoret. Comput. Sci., 9(1):17–26.
J.-P. Duval. Relationship between the period of a finite word and the length of its unbordered segments. Discrete Math., 40(1):31–44.
J.-P. Duval, T. Harju, and D. Nowotka. Unbordered factors and Lyndon words. Submitted, 2002.

J.-P. Duval, F. Mignosi, and A. Restivo. Recurrence and periodicity in infinite words from local periods. Theoret. Comput. Sci., 262(1–2).
A. Ehrenfeucht and D. M. Silberger. Periodicity and unbordered segments of words. Discrete Math., 26(2).
N. J. Fine and H. S. Wilf. Uniqueness theorem for periodic functions. Proc. Amer. Math. Soc., 16.
C. F. Gauß. Nachlaß. Gauß (1900). Zur Geometria Situs.
C. F. Gauß. Nachlaß. Gauß (1900). Zur Geometrie der Lage, für zwei Raumdimensionen.
C. F. Gauß. Werke, volume VIII. B. G. Teubner, Leipzig.
L. J. Guibas and A. Odlyzko. String overlaps, pattern matching, and nontransitive games. J. Combin. Theory, Ser A, 30(2).
D. Gusfield. Algorithms on strings, trees, and sequences. Computer science and computational biology. Cambridge University Press, Cambridge, UK.
V. Halava, T. Harju, and L. Ilie. Periods and binary words. J. Combin. Theory, Ser A, 89(2).
T. Harju. On cyclically overlap-free words in binary alphabets. In G. Rozenberg and A. Salomaa, editors, The Book of L. Springer-Verlag, Berlin.
T. Harju, A. Lepistö, and D. Nowotka. A characterization of periodicity of bi-infinite words. TUCS Tech. Rep. 545, Turku Centre of Computer Science, Finland. Submitted.
T. Harju and D. Nowotka. Density of critical factorizations. Theor. Inform. Appl., 36(3), 2002a.
T. Harju and D. Nowotka. Duval's conjecture and Lyndon words. TUCS Tech. Rep. 479, Turku Centre of Computer Science, Finland, 2002b. Submitted.
T. Harju and D. Nowotka. About Duval extensions. In T. Harju and J. Karhumäki, editors, WORDS 2003 (Turku), volume 27 of TUCS General Publications, Finland, August 2003a. Turku Centre of Computer Science.
T. Harju and D. Nowotka. About Duval's conjecture. In Z. Esik and Z. Fülöp, editors, DLT 2003 (Szeged), volume 2710 of Lecture Notes in Comput. Sci., Berlin, 2003b. Springer-Verlag.

T. Harju and D. Nowotka. Border correlation of binary words. TUCS Tech. Rep. 546, Turku Centre of Computer Science, Finland, 2003c. Submitted.
T. Harju and D. Nowotka. On the independence of equations in three variables. Theoret. Comput. Sci., 307(1), 2003d.
T. Harju and D. Nowotka. Periodicity and unbordered segments of words. Bull. EATCS, 80, 2003e.
T. Harju and D. Nowotka. Periodicity and unbordered words. TUCS Tech. Rep. 523, Turku Centre of Computer Science, Finland, April 2003f. Submitted.
T. Harju and D. Nowotka. The equation $x^i = y^j z^k$ in a free semigroup. Semigroup Forum, 68(3), 2004a.
T. Harju and D. Nowotka. Minimal Duval extensions. Internat. J. Found. Comput. Sci., 2004b. To appear.
T. Harju and D. Nowotka. Periodicity and unbordered words. In STACS 2004 (Montpellier), volume 2996 of Lecture Notes in Comput. Sci., Berlin, 2004c. Springer-Verlag.
Ju. I. Hmelevskiĭ. Equations in a free semigroup. Trudy Mat. Inst. Steklov., 107:286.
S. Holub. A proof of Duval's conjecture. In T. Harju and J. Karhumäki, editors, WORDS 2003 (Turku), volume 27 of TUCS General Publications, Finland, August 2003a. Turku Centre of Computer Science.
S. Holub. Unbordered words and lexicographic orderings. Personal communication, July 2003b.
J. M. Howie. An Introduction to Semigroup Theory. Number 7 in L.M.S. Monographs. Academic Press, London.
G. Lallement. Semigroups and Combinatorial Applications. Pure and Applied Mathematics. John Wiley & Sons, New York.
A. Lentin. Sur l'équation $a^M = b^N c^P d^Q$ dans un monoïde libre. C. R. Acad. Sci. Paris, 260.
A. Lentin. Équations dans les monoïdes libres. Number 16 in Mathématiques et Sciences de l'Homme. Mouton, Gauthier-Villars, Paris.
A. Lentin and M.-P. Schützenberger. A combinatorial problem in the theory of free monoids. In Combinatorial Mathematics and its Applications (Proc. Conf., Univ. North Carolina, Chapel Hill, N.C., 1967). Univ. North Carolina Press, Chapel Hill, N.C., 1969.

A. Lepistö. On the Relations between Local and Global Periodicity. Number 43 of TUCS Dissertations, Turku Centre of Computer Science, Finland.
F. W. Levi. On semigroups. Bull. Calcutta Math. Soc., 36.
M. Lothaire. Combinatorics on Words, volume 12 of Encyclopedia of Mathematics and its Applications. Addison-Wesley, Reading, MA.
M. Lothaire. Algebraic Combinatorics on Words, volume 90 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, Cambridge, UK.
M. Lothaire. Applied Combinatorics on Words. In preparation.
R. C. Lyndon. On Burnside's problem. Trans. Amer. Math. Soc., 77.
R. C. Lyndon. On Burnside's problem II. Trans. Amer. Math. Soc., 78.
R. C. Lyndon and M.-P. Schützenberger. The equation $a^M = b^N c^P$ in a free group. Michigan Math. J., 9.
G. S. Makanin. The problem of the solvability of equations in a free semigroup. Mat. Sb. (N.S.), 103(145)(2), 319.
J. Maňuch. Defect Theorems and Infinite Words. Number 41 of TUCS Dissertations, Turku Centre of Computer Science, Finland.
G. Melançon. Lyndon words and singular factors of Sturmian words. Theoret. Comput. Sci., 218(1):41–59. WORDS, Rouen.
F. Mignosi, A. Restivo, and S. Salemi. Periodicity and the golden ratio. Theoret. Comput. Sci., 204(1–2).
F. Mignosi and L. Q. Zamboni. A note on a conjecture of Duval and Sturmian words. Theor. Inform. Appl., 36(1):1–3.
D. Moore, W. F. Smyth, and D. Miller. Counting distinct strings. Algorithmica, 23(1):1–13.
M. Morita, A. J. van Wijngaarden, and A. J. Han Vinck. On the construction of maximal prefix-synchronized codes. IEEE Trans. Inform. Theory, 42.
M. Morse. Recurrent geodesics on a surface of negative curvature. Trans. Amer. Math. Soc., 22(1):84–100, 1921.

M. Morse and G. A. Hedlund. Symbolic dynamics II: Sturmian trajectories. Amer. J. Math., 61:1–42.
D. Perrin. Words, chapter 1 of Lothaire (1983).
E. Prouhet. Mémoire sur quelques relations entre les puissances des nombres. C. R. Acad. Sci. Paris Sér. I, 33:225.
G. Richomme. Morphismes de Lyndon. In Actes de la 9-ème Conférence Internationale Journées Montoises d'Informatique théorique (Montpellier), Bull. Belg. Math. Soc. Simon Stevin, Brussels.
M.-P. Schützenberger. Une théorie algébrique du codage. C. R. Acad. Sci. Paris, 242.
M.-P. Schützenberger. A property of finitely generated submonoids of free monoids. In Algebraic theory of semigroups (Proc. Sixth Algebraic Conf., Szeged, 1976), volume 20 of Colloq. Math. Soc. János Bolyai, Amsterdam. North-Holland.
H. J. Shyr and G. Thierrin. Disjunctive languages and codes. In Fundamentals of Computation Theory (FCT), Poznań-Kórnik, volume 56 of Lecture Notes in Comput. Sci., Berlin. Springer-Verlag.
A. Thue. Über unendliche Zeichenreihen. Norske Vid. Skrifter I. Mat.-Nat. Kl., Christiania, 7:1–22.
A. Thue. Über die gegenseitige Lage gleicher Teile gewisser Zeichenreihen. Norske Vid. Skrifter I. Mat.-Nat. Kl., Christiania, 1:1–67, 1912.




Turku Centre for Computer Science
Lemminkäisenkatu 14, FIN-20520 Turku, Finland

University of Turku: Department of Information Technology; Department of Mathematics
Åbo Akademi University: Department of Computer Science; Institute for Advanced Management Systems Research
Turku School of Economics and Business Administration: Institute of Information Systems Science

About Duval Extensions

About Duval Extensions Tero Harju Dirk Nowotka Turku Centre for Computer Science, TUCS Department of Mathematics, University of Turku June 2003 Abstract A word v = wu is a (nontrivial) Duval extension