McCreight's suffix tree construction algorithm

Size: px

Start display at page:

Download "McCreight's suffix tree construction algorithm"

Bridget Daniels
6 years ago
Views:

1 McCreight's suffix tree construction algorithm b 2 $baa $ 5 $ $ba 6 3 a b 4 $ aab$ 1

2 Motivation Recall: the suffix tree is an extremely useful data structure with space usage and construction time in O(n). Today we see algorithms for constructing a suffix tree in time O(n).

3 Suffix trees A suffix tree of a sequence, x, is a compressed trie of all suffixes of the sequence x$. x=abaab 1: abaab$ 2: baab$ 3: aab$ 4: ab$ 5: b$ 6: $ 2 $baa $ 5 b $ $ba 6 3 a b 4 $ aab$ 1

4 McCreight's algorithm Iteratively, for i=1,...,n+1 build tries, T i, where... is a trie of sequences x$[1..n+1], x$[2..n+1],..., x$[i..n+1] b a b aab$ 1 2$baa a b aab$ 1 2 $baa b $ba a b 3 aab$ 1 T 1 T 2 T 3

5 McCreight's algorithm Iteratively, for i=1,...,n+1 build tries, T i, where... is a trie of sequences x$[1..n+1], x$[2..n+1],..., x$[i..n+1] For i=n+1, T i is the suffix tree for x. b 2 $baa $ 5 $ $ba 6 3 a b 4 $ aab$ 1

6 McCreight's algorithm Iteratively, for i=1,...,n+1 build tries, T i, where... is a trie of sequences x$[1..n+1], x$[2..n+1],..., x$[i..n+1] For i=n+1, T i is the suffix tree for x. The essential trick is being clever in how we insert x[i..n]$ into T i so we don't spend O(n 2 ) all in all.

7 Terminology A common prefix of x and y is a string, p, that is a prefix of both: x: y: p: The longest common prefix, p=lcp(x,y), is a prefix such that: x[ p +1] y[ p +1] x: y: p: a b a b

8 LCP and LCA x[i..n]: x[j..n]: LCP: For suffixes of x, x[i..n], x[j..n], their longest common prefix is their lowest common ancestor in the suffix tree: a b a b i a b LCP(x[i..n],x[j..n]) j LCA(i,j)

9 Head and tail Let head(i) denote the longest LCP of x[i..n]$ and x[j..n]$ for all j < i Let tail(i) be the string such that x[i..n]$=head(i)tail(i) Iteration i in McCreight s algorithm consist of finding (or inserting) the node for head(i), and appending tail(i) head(i) tail(i) i

10 The Trick The trick in McCreight's algorithm is a clever way of finding head(i)

11 Lemma Let head(i) = x[i..i+h]. Then x[i+1..i+h] is a prefix of head(i+1) i i+h head(i) tail(i) head(i+1) tail(i+1) i+1 0

12 Proof Trivial for h=0 (head(i) empty), so assume h>0: Let head(i) = ay: a By def. exists j<i such that LCP(i,j)=ay Thus suffix j+1 and i+1 share prefix y Thus y is a prefix of LCP(i+1,j+1) Thus y is a prefix of head(i+1) i+1 x[i..n]: x[j..n]: a a j+1

13 Suffix link Define s(u) = if u=, and v if u=av As a pointer from x[i..k] to x[i+1..k]: x[i..k] x[i+1..k] s i i+1 (ex 5.2.3: if u is a node, so is s(u))

14 Corollary of Lemma s(head(i)) is a prefix of head(i+1) Thus: s(head(i)) is an ancestor of head(i+1) head(i) s s(head(i)) i head(i+1) i+1 s(head(i)) can be used as a shortcut!

15 Slowscan and fastscan Slowscan: if we do not know if string y is in T i, we must search character by character Fastscan: if we do know that y is in x, we can jump directly from node to node At node u at (path-)depth d, follow the edge with label starting with y[d] Continue until we reach the end of y y[1] On a node (if y is in T i ) Or on an edge (if y is a prefix of a string in T i ) y[d 2 ] y[d 1 ] y[d 3 ]

16 Sketch of McCreight's algorithm Begin with the tree T 1 : For i=1,...,n, build tree T i+1 satisfying: T i+1 is a compressed trie for x[j..n]$, j i+1 All non-terminal nodes (with the possible exception of head(i)) have a suffix link s(-) Each iteration must: Add node i+1 Potentially add head(i+1) Add tail(i+1) Add suffix link head(i) s(head(i)) 1

17 At the beginning of iteration i... parent(head(i)) u v s s(u) s(parent(head(i))) s(head(i))=w head(i) Let head(i)=uv and parent(head(i))=u and w=s(u)v=s(head(i)) By the invariant, s(parent(head(i))) and the suffix link exists; by the lemma, w is an ancestor of head(i+1)

18 Steps in iteration i... parent(head(i)) u v s s(u) s(parent(head(i))) s(head(i))=w head(i) head(i+1) Move quickly to w, then search for head(i+1) starting there

19 Observe: w is in T i head(i) is a prefix of x[j..n] for some j < i Thus w is a prefix of x[j+1..n] for some j < i i.e w is a prefix of some suffix j i i.e. w is in T i Consequently: we can search for w from s(u) using fastscan! u s(u) v s s(head(i))=w

20 If w is a node Update s(head(i)) := w Then search for head(i+1) using slowscan

21 If w is on an edge If w is not a node, then all suffix j<i with prefix w agree on the next letter By definition of head(i) there is j<i such that suffix x[i..n] and x[j..n] differs after head(i) x[i+1..n] must also disagree at that character Thus head(i+1) must be w Add node w, update head(i):=w and set head(i+1)=w

22 McCreight's algorithm Construct tree for x[1..n] for i = 1 to n do if head(i)= then ((( slowscan(,s(tail(i head(i+1) = add i+1 and head(i+1) as node if necessary continue (( label(u,head(i u = parent(head(i)) ; v = ( fastscan(s(u),v if u then w = ([ fastscan(,v[2.. v else w = if w is an edge then add a node for w head(i+1) = w else if w is a node then (( slowscan(w,tail(i head(i+1) = add head(i+1) as node if necessary s(head(i)) = w add leaf i+1 and edge between head(i+1) and i+1

23 Example: x = abaab Construct tree for x[1..n] for i = 1 to n do if head(i)= then ((( slowscan(,s(tail(i head(i+1) = add i+1 and head(i+1) as node if necessary continue (( label(u,head(i u = parent(head(i)) ; v = ( fastscan(s(u),v if u then w = ([ fastscan(,v[2.. v else w = if w is an edge then add a node for w head(i+1) = w else if w is a node then (( slowscan(w,tail(i head(i+1) = add head(i+1) as node if necessary s(head(i)) = w add leaf i+1 and edge between head(i+1) and i+1 abaab$ 1

24 Example: x = abaab Construct tree for x[1..n] for i = 1 to n do if head(i)= then ((( slowscan(,s(tail(i head(i+1) = add i+1 and head(i+1) as node if necessary continue (( label(u,head(i u = parent(head(i)) ; v = ( fastscan(s(u),v if u then w = ([ fastscan(,v[2.. v else w = if w is an edge then add a node for w head(i+1) = w else if w is a node then (( slowscan(w,tail(i head(i+1) = add head(i+1) as node if necessary s(head(i)) = w add leaf i+1 and edge between head(i+1) and i+1 i=1 head(1)= tail(1)=abaab$ abaab$ 1

25 Example: x = abaab Construct tree for x[1..n] for i = 1 to n do if head(i)= then ((( slowscan(,s(tail(i head(i+1) = add i+1 and head(i+1) as node if necessary continue (( label(u,head(i u = parent(head(i)) ; v = ( fastscan(s(u),v if u then w = ([ fastscan(,v[2.. v else w = if w is an edge then add a node for w head(i+1) = w else if w is a node then (( slowscan(w,tail(i head(i+1) = add head(i+1) as node if necessary s(head(i)) = w add leaf i+1 and edge between head(i+1) and i+1 i=1 2 head(2)= tail(2)=baab$ $baab abaab$ 1

26 Example: x = abaab Construct tree for x[1..n] for i = 1 to n do if head(i)= then ((( slowscan(,s(tail(i head(i+1) = add i+1 and head(i+1) as node if necessary continue (( label(u,head(i u = parent(head(i)) ; v = ( fastscan(s(u),v if u then w = ([ fastscan(,v[2.. v else w = if w is an edge then add a node for w head(i+1) = w else if w is a node then (( slowscan(w,tail(i head(i+1) = add head(i+1) as node if necessary s(head(i)) = w add leaf i+1 and edge between head(i+1) and i+1 i=2 2 head(2)= tail(2)=baab$ $baab 3 head(3)=a tail(3)=ab$ $ba a baab$ 1

27 Example: x = abaab Construct tree for x[1..n] for i = 1 to n do if head(i)= then ((( slowscan(,s(tail(i head(i+1) = add i+1 and head(i+1) as node if necessary continue (( label(u,head(i u = parent(head(i)) ; v = ( fastscan(s(u),v if u then w = ([ fastscan(,v[2.. v else w = if w is an edge then add a node for w head(i+1) = w else if w is a node then (( slowscan(w,tail(i head(i+1) = add head(i+1) as node if necessary s(head(i)) = w add leaf i+1 and edge between head(i+1) and i+1 i=3 2 head(3)=a tail(3)=ab$ $baab 3 $ba a baab$ 1

28 Example: x = abaab Construct tree for x[1..n] for i = 1 to n do if head(i)= then ((( slowscan(,s(tail(i head(i+1) = add i+1 and head(i+1) as node if necessary continue (( label(u,head(i u = parent(head(i)) ; v = ( fastscan(s(u),v if u then w = ([ fastscan(,v[2.. v else w = if w is an edge then add a node for w head(i+1) = w else if w is a node then (( slowscan(w,tail(i head(i+1) = add head(i+1) as node if necessary s(head(i)) = w add leaf i+1 and edge between head(i+1) and i+1 i=3 2 head(3)=a tail(3)=ab$ $baab 3 $ba a head(i) u v = a baab$ 1

29 Example: x = abaab Construct tree for x[1..n] for i = 1 to n do if head(i)= then ((( slowscan(,s(tail(i head(i+1) = add i+1 and head(i+1) as node if necessary continue (( label(u,head(i u = parent(head(i)) ; v = ( fastscan(s(u),v if u then w = ([ fastscan(,v[2.. v else w = if w is an edge then add a node for w head(i+1) = w else if w is a node then (( slowscan(w,tail(i head(i+1) = add head(i+1) as node if necessary s(head(i)) = w add leaf i+1 and edge between head(i+1) and i+1 v[2.. v ]= i=3 w 2 head(3)=a tail(3)=ab$ $baab 3 $ba a head(i) u v = a baab$ 1

30 Example: x = abaab Construct tree for x[1..n] for i = 1 to n do if head(i)= then ((( slowscan(,s(tail(i head(i+1) = add i+1 and head(i+1) as node if necessary continue (( label(u,head(i u = parent(head(i)) ; v = ( fastscan(s(u),v if u then w = ([ fastscan(,v[2.. v else w = if w is an edge then add a node for w head(i+1) = w else if w is a node then (( slowscan(w,tail(i head(i+1) = add head(i+1) as node if necessary s(head(i)) = w add leaf i+1 and edge between head(i+1) and i+1 i=3 w 2 head(3)=a tail(3)=ab$ $baab 3 $ba a b aab$ head(i) head(i+1) 1

31 Example: x = abaab Construct tree for x[1..n] for i = 1 to n do if head(i)= then ((( slowscan(,s(tail(i head(i+1) = add i+1 and head(i+1) as node if necessary continue (( label(u,head(i u = parent(head(i)) ; v = ( fastscan(s(u),v if u then w = ([ fastscan(,v[2.. v else w = if w is an edge then add a node for w head(i+1) = w else if w is a node then (( slowscan(w,tail(i head(i+1) = add head(i+1) as node if necessary s(head(i)) = w add leaf i+1 and edge between head(i+1) and i+1 i=3 w 2 head(3)=a tail(3)=ab$ $baab 3 $ba a b aab$ head(i) head(i+1) 1

32 Example: x = abaab Construct tree for x[1..n] for i = 1 to n do if head(i)= then ((( slowscan(,s(tail(i head(i+1) = add i+1 and head(i+1) as node if necessary continue (( label(u,head(i u = parent(head(i)) ; v = ( fastscan(s(u),v if u then w = ([ fastscan(,v[2.. v else w = if w is an edge then add a node for w head(i+1) = w else if w is a node then (( slowscan(w,tail(i head(i+1) = add head(i+1) as node if necessary s(head(i)) = w add leaf i+1 and edge between head(i+1) and i+1 i=3 w 2 head(3)=a tail(3)=ab$ $baab 3 $ba a b aab$ head(i) head(i+1) 1

33 Example: x = abaab Construct tree for x[1..n] for i = 1 to n do if head(i)= then ((( slowscan(,s(tail(i head(i+1) = add i+1 and head(i+1) as node if necessary continue (( label(u,head(i u = parent(head(i)) ; v = ( fastscan(s(u),v if u then w = ([ fastscan(,v[2.. v else w = if w is an edge then add a node for w head(i+1) = w else if w is a node then (( slowscan(w,tail(i head(i+1) = add head(i+1) as node if necessary s(head(i)) = w add leaf i+1 and edge between head(i+1) and i+1 i=3 head(3)=a tail(3)=ab$ $baab $ba a 2 3 $ head(i+1) 4 b aab$ 1

34 Example: x = abaab Construct tree for x[1..n] for i = 1 to n do if head(i)= then ((( slowscan(,s(tail(i head(i+1) = add i+1 and head(i+1) as node if necessary continue (( label(u,head(i u = parent(head(i)) ; v = ( fastscan(s(u),v if u then w = ([ fastscan(,v[2.. v else w = if w is an edge then add a node for w head(i+1) = w else if w is a node then (( slowscan(w,tail(i head(i+1) = add head(i+1) as node if necessary s(head(i)) = w add leaf i+1 and edge between head(i+1) and i+1 i=4 2 head(4)=ab tail(4)=$ $baab 3 head(i) a a $ba 4 b aab$ $ u v=b 1

35 Example: x = abaab Construct tree for x[1..n] for i = 1 to n do if head(i)= then ((( slowscan(,s(tail(i head(i+1) = add i+1 and head(i+1) as node if necessary continue (( label(u,head(i u = parent(head(i)) ; v = ( fastscan(s(u),v if u then w = ([ fastscan(,v[2.. v else w = if w is an edge then add a node for w head(i+1) = w else if w is a node then (( slowscan(w,tail(i head(i+1) = add head(i+1) as node if necessary s(head(i)) = w add leaf i+1 and edge between head(i+1) and i+1 i=4 2 w head(4)=ab tail(4)=$ s(u) $baa b 3 head(i) a a $ba 4 b aab$ $ u v=b 1

36 Example: x = abaab Construct tree for x[1..n] for i = 1 to n do if head(i)= then ((( slowscan(,s(tail(i head(i+1) = add i+1 and head(i+1) as node if necessary continue (( label(u,head(i u = parent(head(i)) ; v = ( fastscan(s(u),v if u then w = ([ fastscan(,v[2.. v else w = if w is an edge then add a node for w head(i+1) = w else if w is a node then (( slowscan(w,tail(i head(i+1) = add head(i+1) as node if necessary s(head(i)) = w add leaf i+1 and edge between head(i+1) and i+1 i=4 2 w head(i+1) head(4)=ab tail(4)=$ $baa b 3 head(i) a a $ba 4 b aab$ $ 1

37 Example: x = abaab Construct tree for x[1..n] for i = 1 to n do if head(i)= then ((( slowscan(,s(tail(i head(i+1) = add i+1 and head(i+1) as node if necessary continue (( label(u,head(i u = parent(head(i)) ; v = ( fastscan(s(u),v if u then w = ([ fastscan(,v[2.. v else w = if w is an edge then add a node for w head(i+1) = w else if w is a node then (( slowscan(w,tail(i head(i+1) = add head(i+1) as node if necessary s(head(i)) = w add leaf i+1 and edge between head(i+1) and i+1 i=4 2 w head(i+1) head(4)=ab tail(4)=$ $baa b 3 head(i) a a $ba 4 b aab$ $ 1

38 Example: x = abaab Construct tree for x[1..n] for i = 1 to n do if head(i)= then ((( slowscan(,s(tail(i head(i+1) = add i+1 and head(i+1) as node if necessary continue (( label(u,head(i u = parent(head(i)) ; v = ( fastscan(s(u),v if u then w = ([ fastscan(,v[2.. v else w = if w is an edge then add a node for w head(i+1) = w else if w is a node then (( slowscan(w,tail(i head(i+1) = add head(i+1) as node if necessary s(head(i)) = w add leaf i+1 and edge between head(i+1) and i+1 i=4 head(i+1) 2 head(4)=ab tail(4)=$ $baa b 5 3 head(i) a a $ba 4 b aab$ $ 1

39 Done Nothing new for i=5, inserting $... 2 $baa 5 b $ $ $ba 6 3 a b aab$ $ 4 1

40 Correctness Correctness follows from the invariant: At iteration i we have a trie of all suffixes j < i. After the final iteration we have a trie of all suffixes of x$, i.e. we have the suffix tree of x.

41 Time and space usage Everything but searching is constant time per suffix, so the running time is O(n + slowscan + fastscan ). We are not using more space than time, so the space usage is the same.

42 Slowscan time usage We use slowscan to find head(i+1) from w=s(head(i)), i.e. time head(i+1) - head(i) +1 for iteration i head(i) tail(i) head(i+1) tail(i+1) A telescoping sum Sum = head(n+1) - head(1) +n Equal to n since head(n+1)=head(1)= = head(i+1) - head(i) +1

43 Fastscan time usage Fastscan uses time proportional to the number of nodes it process Define d(v) as the (node-)depth of node v Fastscan increases the node depth Following parent and suffix pointers decreases the node depth Time usage of fastscan is bounded by the total depth-increase (amortized analysis)

44 Proposition d(v) d(s(v))+1 Proof: For any ancestor u of v, s(u) is an ancestor of s(v) Except for the empty prefix and the single letter prefix of v, the s(u) s are different u v s(u) s(v)

45 Corollary In each step, before calling fastscan, we decrease the depth by at most 2: d(u) = d(head(i))-1; d(w) d(u)-1 u w head(i) The total decrease is thus 2n

46 Time usage of fastscan The time usage of fastscan is bounded by n plus the total decrease of depth, i.e. the time usage of fastscan is O(n)

47 Summary We iteratively build tries of suffixes of x. Using suffix links and fastscan we can quickly find where to insert the next suffix in our current trie. By amortized analysis, the total running time becomes linear.

48 Ukkonen's suffix tree construction algorithm aba$ $ababa$ $ab 3 a ba$ $ $ $ab a ba$ $

49 Motivation Yet another suffix tree construction algorithm... Why?

50 Motivation Yet another suffix tree construction algorithm... Why? An online algorithm, i.e. one we can update with new sequences

51 Sketch of algorithm Recall McCreight s approach: For i = 1.. n+1, build compressed trie of {x[j..n]$ j i} Ukkonen s approach: For i = 1.. n+1, build compressed trie of {x$[j..i] j i} Compressed trie of all suffixes of prefix x$[1..i] of x$ A suffix tree except for leaf property

52 McCreight's algorithm x=aba T 1 : {x$[j..n] j 1 } T 2 : {x$[j..n] j 2 } 2 aba$ $ab 1 aba$ 1 T 3 : {x$[j..n] j 3 } $ab a ba$ $ T 4 : {x$[j..n] j 4 } 2 $ab $ 4 a ba$ $ 1 3

53 Ukkonen's algorithm x=aba T 1 : {x$[j..1]} a 1 T 2 : {x$[j..2]} b ab 2 1 T 3 : {x$[j..3]} ab a ba 2 1 Note: no node for x[3..3] = a T 4 : {x$[j..n]} 2 $ab $ 4 a ba$ $ 1 3

54 Tasks in iteration i In iteration i we must Update each to x[j..i+1] Add string x[i+1] (special case of above) Leaf: Inner node: Edge: j x[i+1] x[i+1] x[i+1] j x[i+1] x[i+1] j x[i+1] x[i+1] j

55 First attempt... Obvious algorithm: For i=1,...,n+1: for j=1,...,i: find append x[i+1]

56 First attempt... Obvious algorithm: For i=1,...,n+1: for j=1,...,i: find append x[i+1] Running time O(n 3 ) Need lots of tricks to get O(n)!

57 Updating leaves for free If we label leaves with (k, ) denoting k to the current i, updating a leaf is automatic Leaf: Inner node: Edge: x[i+1] x[i+1] j x[i+1] j x[i+1] x[i+1] j x[i+1] x[i+1] j

58 Updating existing strings is free If x[j..i+1] is already in the tree, the update is automatic Leaf: Inner node: Edge: x[i+1] x[i+1] j x[i+1] j x[i+1] x[i+1] j x[i+1] x[i+1] j

59 Real operations If we can recognize the free operations, we need only deal with the remaining Leaf: Inner node: Edge: x[i+1] x[i+1] j x[i+1] j x[i+1] x[i+1] j x[i+1] x[i+1] j

60 Lemma Let j denote suffix of x[1..i] a)if j>1 is a leaf node in T i, then so is j-1 b)if, from j<i, there is a path in T i that begins with a, then there is a path in T i from j+1 beginning with a

61 Lemma Let j denote suffix of x[1..i] a)if j>1 is a leaf node in T i, then so is j-1 b)if, from j<i, there is a path in T i that begins with a, then Once there an edge, is a path always in T an i from edge j+1 beginning with a

62 Lemma Let j denote suffix of x[1..i] a)if j>1 is a leaf node in T i, then so is j-1 b)if, Once from a j<i, leaf, there always is a path been in a T leaf i that begins with a, then there is a path in T i from j+1 beginning with a

63 Lemma Let j denote suffix of x[1..i] a)if j>1 is a leaf node in T i, then so is j-1 b)if, from j<i, there is a path in T i that begins with a, then there is a path in T i from j+1 beginning with a

64 Proof of lemma a) If j>1 is a leaf node in T i, then so is j-1 Assume j-1 is not a leaf. Then there exists k<j-1 such that: x[k..i] x[j-1..i] x[k..i] x[j-1..i] j-1 k Then j j-1 k k+1 thus: x[k+1..i] x[k+1..i]

65 Lemma Let j denote suffix of x[1..i] a)if j>1 is a leaf node in T i, then so is j-1 b)if, from j<i, there is a path in T i that begins with a, then there is a path in T i from j+1 beginning with a

66 Proof of lemma b) If, from j<i, there is a path in T i that begins with a, then there is a path in T i from j+1 beginning with a Assume j is followed by a, then there exists k<j such that: j k a thus: j+1 k+1 a Hence j+1 is followed by a.

67 Corollary In iteration i, there exist indices j L and j R such that: All suffixes j j L are leaves All suffixes j j R are already in the trie I II III j L j R

68 Consequence of corollary I and III are free operations x[i+1] x[i+1] j x[i+1] j x[i+1] x[i+1] j x[i+1] x[i+1] j I II III j L j R

69 Updated algorithm Implicitly handling free operations: For i=1,...,n+1: for j=j L,...,j R : find append x[i+1]

70 Updated algorithm Implicitly handling free operations: For i=1,...,n+1: for j=j L,...,j R : find append x[i+1] How do we find j L and j R?

71 Updated algorithm Implicitly handling free operations: For i=1,...,n+1: for j=j L,...,j R : find append x[i+1] j L in iteration i is the last leaf inserted in iteration i-1 (all smaller indices are already leaves) j R in iteration i is the first index where x[j..i+1] is already in the trie (all larger indices are already in the trie)

72 Suffixes in II are made into leaves Whenever j L < j < j R, j is made a leaf: x[i+1] j x[i+1] j

73 Suffixes in II are made into leaves Whenever j L < j < j R, j is made a leaf: x[i+1] j x[i+1] j Once j is a leaf, it will be in I and never in II again

74 Time to go from II to I We handle j in II or implicitly in III time 2n: 2 2 Index j in II n+1 Path length is 2n Iterations n+1

75 Updated running time Only 2n of these: For i=1,...,n+1: for j=j L,...,j R : find append x[i+1] Constant time Running time is 2n*T(find )

76 Updated running time Only 2n of these: For i=1,...,n+1: for j=j L,...,j R : find append x[i+1] Constant time? Constant time Running time is 2n*T(find ) ( O(1 We just have to deal with T(find ) in

77 What do we know about? Is it already in the tree? Will that help us?

78 Using fastscan and s(-) When searching for, it is already in the trie We can use fastscan for the search T(find ) in O(d) where d is the (node-)depth of If we keep suffix links, s(-), in the tree we can use these as shortcuts

79 Suffix links Invariant: All inner nodes have suffix links

80 Suffix links Invariant: All inner nodes have suffix links Ensuring the invariant? For i=1,...,n+1: for j=j L,...,j R : find append x[i+1]

81 Suffix links Invariant: All inner nodes have suffix links Ensuring the invariant: We only insert inner nodes when adding leaves j Whenever we insert a new node, for some j<i, we also find or insert x[j+1..i], and can update s() := x[j+1..i] If we insert x[i..i], then s(x[i..i]) := ε

82 Finding x[j+1..i] from parent(j) j s s(parent(j)) x[j+1..i] Starting from here (initial j is j L and we can keep a pointer to that node between iterations) Using fastscan here

83 Bound on fastscan Time usage by fastscan is bounded by n for the maximal (node-)depth in the trie plus total decrease of (node-)depth Decrease in depth: Moving to parent(j): 1 Moving to s(parent(j)): max 1 Restarting at j L :?

84 Bound on fastscan Time usage by fastscan is bounded by n for the maximal (node-)depth in the trie plus total decrease of (node-)depth Decrease in depth: Moving to parent(j): 1 Moving to s(parent(j)): max 1 Restarting at j L :? parent(j) j s s(parent(j)) x[j+1..i] Using fastscan here

85 Hacking the suffix link When searching for x[j L +1..i], update s(x[j L..i]) to point to the nearest ancestor of x[j L +1..i] parent(j L ) s s(parent(j L )) j L x[j L +1..i] Nearest ancestor of x[j L +1..i]

86 Hacking the suffix link When searching for x[j L +1..i], update s(x[j L..i]) to point to the nearest ancestor of x[j L +1..i] parent(j L ) s s(parent(j L )) j L x[j L +1..i] Restart in constant time Nearest ancestor of x[j L +1..i]

87 Iterations Searching time Vertical steps are paid for by the previous horizontal step (free restarting) Horizontal steps are total fastscan bounded by O(n) Runtime O(n) Index j in II 2 2 n+1 n+1

88 Summary We have seen a new suffix tree construction algorithm Ukkonen s algorithm is an online algorithm: As long as no suffix is a prefix of another, the intermediate trees are suffix trees Generalized suffix trees can be built one string at a time

Ukkonen's suffix tree construction algorithm

Ukkonen's suffix tree construction algorithm aba$ $ab aba$ 2 2 1 1 $ab a ba $ 3 $ $ab a ba $ $ $ 1 2 4 1 String Algorithms; Nov 15 2007 Motivation Yet another suffix tree construction algorithm... Why?