McCreight s Suffix Tree Construction Algorithm. Milko Izamski B.Sc. Informatics Instructor: Barbara König

1. Introution MCreight s Suffix Tree Constrution Algorithm Milko Izamski B.S. Informatis Instrutor: Barbara König The main goal of MCreight s algorithm is to buil a suffix tree in linear time. This is an auxiliary searh tree whih helps us to fin a speifi substring S i within a given main string S. 2. The Algorithm Before explaining the algorithm it is neessary to efine some basi rules an onstraints. At the beginning we hek whether the final harater (n) appears elsewhere in S, if it oes, then S has to be extene to a string that satisfies the rule suh that with no suffix S i of S is a prefix of a ifferent suffix S j of S. The suffix tree T represents all possible suffixes of S. An ege (also known as an ar) of the tree is lele with some non-empty string from the alphet that buils S (T1). Eah internal noe has at least two sons (T2). The sibling eges (sons) begin with a ifferent harater(t3). 2.1. Builing the suffix tree Basially T is reate in n steps, where length (S )= n. Let suf i to be suffix of S. Eah time it is inserte into T i-1. The leftmost harater is at position 1 of the string. Every suffix is built by taking off the first harater of its preeessor. Hene, suf 1 = S an suf n is the last harater. The next lines will explain how exatly the proess work. During this explanation some basis efinitions will be introue. For this task we onsier one example string S =. All possible suffixes of S are efine in the tle below. In Step 1 suf 1 is inserte in an empty tree T 0, so T 1 is proue. In Step 2 we take suf 2 an ompare it to suf 1. We efine hea i as the longest prefix of suf i that is also a prefix of suf j, for some j<i. In our situation, hea 2 is empty (ε). We efine tail i as suf i - hea i suf i = hea i + tail i. It means that tail 2 is b. To insert suf i into T i-1 we shoul fin the extene lous of hea i, if neessary the tree is split with some non-terminal noe ( that is the lous of hea i ), an then tail i is insert. During proessing in Step 3, suf 3 =, hea 3 = ( ompare suf 3 with suf 1 ), hene tail 3 =. We split the ege into two parts ( that is hea 3 ) an, then insert tail 3 as we sai before. The same proeure is repeate until we reah the last possible suffix. 1

S= hea i T i-1 Step i suf i hea i tail i α β γ T 0 root Step 1 ε T 1 Step 2 b ε b ε ε ε T 2 b Step 3 a b ε T 3 b Step 4 b b b ε ε T 4 b Step 5 ε ε ε ε T 5 b The main problem in the algorithm esribe ove is to fin the extene lous in T i-1 of hea i. All this shoul be ahieve in a onstant time. For this task we use a simple Lemma that says: If hea i-1 an be written as xδ for some harater x an some (possibly empty) string δ, then δ is a prefix of hea i. 2

2.2. Suffix links Using this property, auxiliary links (suffix link) an be ae to the tree struture. Eah lous of xδ, where x is a harater an δ is a string, an be linke to the lous of δ. These links enle us to spee up the searhing of the lous of hea i, starting at the lous of hea i-1, whih has been visite in the previous step. But before exploring exatly how that happens we enote every ege from the tree T with a pair of integers, the first element representing the start position of the ege s lel in the main string an the seon, the length of the ege s lel. Fousing on the following example we will explain how the suffix link is ae. Let S be b 4 3 a 2 b 4. Now we assume that T 10 (T i-1 ) alreay exists an suf 11 (suf i ) is to be ae - using the property of the suffix links. As we an easily see hea 10 (hea i-1 ) is bb. Following the efinition ove we an represent it as χαβ, where χ = x = a an αβ = δ = bbb. If the ontrate lous of hea 10 (hea i-1 ) in T 9 (T i-2 ) is the root then α is empty an we start resanning from the root, otherwise it is the lous of χα. The lous must have existe somewhere in T 9. In our speifi ase α = b, hene, β must be bb.the algorithm then follows the alreay existing suffix link ( between χα an α) an starts resanning. Resanning is a proess whih revisits a sequene of eges whih spell out β. We an be sure that the prefix of hea 11 (hea i ) will be αβ = bbb an some other (possibly empty) string γ. In our speifi ase γ = b. We alreay know the lous of α.so following the logi in the last paragraph β has to be within the tree. Thus we start resanning the tree for β ownwar from the lous of α. In this proess only the first harater ( see onstraint T3) an the length of the hil ege ρ are ompare (enoting of the eges helps us). If ρ is shorter than β then the proess starts reursively from the lous of ρ with (β-ρ), otherwise β is a prefix of ρ an the resan is omplete. A new nonterminal noe is onstrute, whih is the lous of αβ, if one oes not alreay exist. The onstrution of suh a noe is only possible if γ is empty. The only thing we on t know yet is, where the γ is loate within the tree. We alreay have information out the length of β beause of the hea i-1. In this situation we must to travel ownwar into the tree an ompare the haraters one by one from left to right until we fin γ an then reate a new non-terminal lous of αβγ = hea i, if it oes not alreay exist. The proess of fining γ in the tree is alle Sanning. At the en tail i is attahe to the lous of hea i an i-step is finishe. Remember that resanning of β was possible only beause in the previous step sanning on β has been performe. 3

3. Time omplexity analysis Let us efine a suffix of S res i = βγ+tail i, where resanning an sanning has been mae to β respetively γ. During the resanning of β there will always be a non empty string (ρ) that is in res i but not in res i+1. Hene, length(res i+1 ) is at most length(res i ) minus number of the noes we ha enountere by the resanning(int i ). We alreay know that length(res n ) = 0 an length(res 0 ) = n, hene by using reursive substitution we see than Σ n i=0int i is at most n. The number of omparisons by the san operation is (length(hea i ) - length(hea i-1 ) + 1) an in total algorithm Σ n i=0(length(hea i ) - length(hea i-1 ) + 1) = length(hea n ) - length(hea 0 ) + n = n. The algorithm nees n step for reating the suffix tree. For every step a onstant time is neee. That means the algorithm runs in linear time epening on the length of the given string. 4. Upating the suffix tree We assume that a substring of the main string S may nee to be replae by S. Thus, suffix tree T has to be hange too. Let S be efine as αβγ, for some strings α, β an γ, an it is to be hange to αδγ. Aopting a string element position numbering sheme, we onsier only those paths that are affete by the replaement of β with δ (or αβγ αδγ). Define α * to be the longest suffix of α that ours in at least two plaes in S. β-splitters are strings in the form εγ, where ε is a non empty suffix of α * β. Hene, β- splitters properly ontain the suffix γ. Equivalently, δ-splitters are in form ωγ, where ω is a non empty suffix of α * δ. Main goal of the algorithm is to fin the β-splitters an replae them by the δ-splitters. It is ahieve in three stages. 1. Disover α * βγ, the longest β-splitter. 2. Delete all paths εγ from the tree, ε = suf(α * β). 3. Insert all paths ωγ into the tree, ω = suf(α * δ). Disovering the longest β-splitter is the first stage. Let us assume our string S is xb, whih is to be hange to a. Hene, =α, xb =β, =γ, a =δ. The longest β-splitter is then bxb beause of b =α *. Note that xb is also a β- splitter but not the longest. In the seon stage we shoul are out eleting paths, whih are suffixes of the longest β-splitter. Deletions are one in sequene from the longest string to the shortest. Let us assume that all suffixes longer than εγ have been elete. We split up εγ into three (possibly empty) substrings u, v an w. Suppose we have the situation shown on the figure below. 4

xuv a root potential suffix link k h f u g v w If noe g has more that two sibling eges then we just elete w. If g has exatly two offspring eges ( it is not possible for g to have only one ege) then w-ege an g-noe are remove, k an v are joine together. The only problem is that there oul be a suffix link to g whih will be elete after merging of v an k into vk-ege. The last stage ares out inserting all paths in form ωγ, where ω = suf(α * δ). Now we onsier the remainer of the tree as pre-initialize suffix three, whih surely ontains all suffixes of γ. b xb Stage 2 Stage 3 a ba xb xb xb b ba a Figure. Tree mutation in a ifferent stages. 5. Conlusion In general MCreight s algorithm for builing suffix trees oes not iffer from the Weiner s or Ukonnen s one. They all reate a tree in at most linear time omparative to the length of the input string. A major avantage is MCreight a algorithm uses approximately 25 per ent less ata spae than similarly oe versions of the other algorithms. Of ourse in real omputers the ata movement (loa, store, opy ) takes up linear time. Hene, saving ata spae oul also mean saving time. 5