Tries and suffix trees
Alon Efrat
Computer Science Department, University of Arizona

Trie: a data structure for a set of words

All words are over the alphabet Σ = {a, b, ..., z}. In the slides, let us say that the alphabet is only {a, b, c, d}.
S is a set of words over this alphabet.
We need to support the operations:
  insert(w): add a new word w into S.
  delete(w): delete the word w from S.
  find(w): is w in S?
Future operation: given a text (many words), where is w in the text?
The time for each operation should be O(k), where k is the number of letters in w.
Usually each word is associated with additional info, not discussed here.

Trie (Tree + Retrieve) for S

A tree where each node is a struct consisting of an array of 4 child pointers (one cell per letter a, b, c, d) and a flag:
  struct node {
      struct node *ar[4];
      char flag;   /* 1 if a word ends at this node, otherwise 0 */
  };

A trie: example

Rule: each node corresponds to a word w, and w is in S iff the flag of that node is 1.
If p points to a node, the cell for the letter b is p->ar['b' - 'a'].
Note: the label of an edge is the label of the cell from which this edge exits.
In the figure, a node whose word is not in S has flag = 0, and a node whose word is in S has flag = 1.
Finding if a word w is in the tree

  p = root; i = 0;
  while (1) {
      if w[i] == '\0'            // we scanned all letters of w
          then return the flag of p;   // TRUE/FALSE
      if the entry of p corresponding to w[i] is NULL
          return FALSE;
      set p to be the node pointed to by this entry, and set i++;
  }

Inserting a word w

Recall: we need to modify the tree so that find(w) would return TRUE.
Try to perform find(w). If it runs into NULL pointers, create new node(s) along the path. The flag fields of all new nodes are 0.
Set the flag of the last node to 1.
Example (figure): one more word is added to the example trie by exactly these steps.

Deleting a word w

Find the node p corresponding to w (using the find operation).
Set the flag field of p to 0.
If p is dead (i.e. flag == 0 and all pointers are NULL), then free(p), set p = parent(p) and repeat this check. The cell of the parent that pointed to the freed node must be set to NULL, and the root itself is never freed.
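The three operations above can be sketched in C roughly as follows. This is a minimal sketch under stated assumptions, not the course's reference code: the names `new_node`, `trie_find`, `trie_insert`, `trie_delete` and `is_dead` are made up for illustration, every word is assumed to contain only the letters a to d, and each node carries an extra `parent` pointer (not in the slides' struct) so that delete can walk back up.

```c
#include <stdlib.h>

#define SIGMA 4                      /* alphabet {a, b, c, d} */

struct node {
    struct node *ar[SIGMA];          /* child pointers, one cell per letter */
    struct node *parent;             /* assumption: parent link, used by delete */
    char flag;                       /* 1 if a word ends at this node, else 0 */
};

/* Allocate a node with all cells NULL and flag 0. */
struct node *new_node(struct node *parent) {
    struct node *p = malloc(sizeof *p);
    if (p == NULL) return NULL;
    for (int c = 0; c < SIGMA; c++) p->ar[c] = NULL;
    p->parent = parent;
    p->flag = 0;
    return p;
}

/* Returns 1 if w is in the set, 0 otherwise. O(k) for k letters. */
int trie_find(const struct node *root, const char *w) {
    const struct node *p = root;
    for (int i = 0; w[i] != '\0'; i++) {
        p = p->ar[w[i] - 'a'];
        if (p == NULL) return 0;     /* ran into a NULL pointer */
    }
    return p->flag;
}

/* Walk as find does, creating missing nodes. Returns 0 if out of memory. */
int trie_insert(struct node *root, const char *w) {
    struct node *p = root;
    for (int i = 0; w[i] != '\0'; i++) {
        int c = w[i] - 'a';
        if (p->ar[c] == NULL) {
            p->ar[c] = new_node(p);
            if (p->ar[c] == NULL) return 0;
        }
        p = p->ar[c];
    }
    p->flag = 1;
    return 1;
}

/* A node is dead if its flag is 0 and all its pointers are NULL. */
int is_dead(const struct node *p) {
    if (p->flag) return 0;
    for (int c = 0; c < SIGMA; c++)
        if (p->ar[c] != NULL) return 0;
    return 1;
}

/* Clear the flag of w's node, then free dead nodes bottom-up. */
void trie_delete(struct node *root, const char *w) {
    struct node *p = root;
    for (int i = 0; w[i] != '\0'; i++) {
        p = p->ar[w[i] - 'a'];
        if (p == NULL) return;       /* w has no node, nothing to delete */
    }
    p->flag = 0;
    while (p != root && is_dead(p)) {
        struct node *parent = p->parent;
        for (int c = 0; c < SIGMA; c++)
            if (parent->ar[c] == p) parent->ar[c] = NULL;
        free(p);
        p = parent;
    }
}
```

The parent pointer costs one extra word per node. An alternative that keeps the slides' struct unchanged would be to record the path from the root during the downward walk and unwind it afterwards.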
Space requirements

Let m be the sum of the numbers of characters of all words in S.
The space required might be Θ(|Σ| m): for each letter of each word of S, we might need an array of size |Σ|.
(This might be an issue by itself, and might also slow down performance.)
Layout of a node: a | b | ... | z | flag

Heuristics for space saving

To save some space when Σ is large, there are a few heuristics we can use. Assume Σ = {a, ..., z}.
We use two types of nodes.
Type A is used when the number of children of a node is more than 3.
  Layout: type | a | b | ... | z | flag
  Note that the letters are not stored explicitly.
Type B is used if there are 3 or fewer children. The letter of each child is also stored.
  Layout: type | letter, pointer | letter, pointer | letter, pointer | flag
  The rule of the flag is the same as in type A nodes.
  We store only the 3 pointers, but we need to know which letters they correspond to.

Another heuristic: path compression

Replace a long sequence of nodes that happen to have only a single child with a single node (of type "pointer to string") that keeps a pointer to the next node and a pointer to a string.
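The two node types could be laid out in C along the following lines. This is only an illustrative sketch: the names `node_a`, `node_b`, `node_type` and `child_of` are hypothetical, children are held as `void *` because a child may be of either type, and an unused slot of a type B node is marked here by a NULL pointer (the slides do not say how unused slots are marked).

```c
#include <stddef.h>

#define SIGMA 26                       /* alphabet {a, ..., z} */

enum node_type { TYPE_A, TYPE_B };

/* Type A: more than 3 children; the letter is implied by the cell index. */
struct node_a {
    enum node_type type;               /* always TYPE_A */
    void *child[SIGMA];
    char flag;
};

/* Type B: at most 3 children; each pointer is stored with its letter. */
struct node_b {
    enum node_type type;               /* always TYPE_B */
    char letter[3];
    void *child[3];                    /* NULL marks an unused slot (assumption) */
    char flag;
};

/* Return the child of node p reached by letter c, or NULL if there is none.
   Both structs begin with the type field, so it can be read through p. */
void *child_of(void *p, char c) {
    if (*(enum node_type *)p == TYPE_A) {
        struct node_a *a = p;
        return a->child[c - 'a'];      /* direct indexing, O(1) */
    }
    struct node_b *b = p;
    for (int i = 0; i < 3; i++)        /* scan at most 3 stored letters */
        if (b->child[i] != NULL && b->letter[i] == c)
            return b->child[i];
    return NULL;
}
```

A type B node holds 3 pointers instead of 26, at the price of a short scan per step. Converting a node from type B to type A when it gains a fourth child is not shown.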
Suffix tree

Assume B (for book) is a long text.
We want to preprocess B so that, when a word w is given, we can quickly find whether it is in B (incremental search), as well as its locations, how many times it appears, etc.
We can find it in O(|w|).
Idea: consider B as one long string. Create a trie T of all suffixes of B.
In addition to the flag (specifying if a word ends at a node), we also store the index in B where this word begins.
To know where a word appears in B, we store with each node the index of the beginning of the suffix in B. (We can store only the first appearance of the word in the text.)
Example (figure): a short text B, the set S of all its suffixes, and the trie built from S. The first index (in the book) where each suffix starts is shown in red.

Size of suffix tree

Assume n = |B|.
The total length of all suffixes is Θ(n²).
The size of a node is |Σ|.
So the size of the tree can be Θ(n² |Σ|) in the worst case.
The time to construct the tree is Θ(n²).
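The construction described above (insert every suffix of B into a trie and remember where it starts) might look as follows in C. This is a sketch under assumptions, not the lecture's code: the names `snode`, `snode_new`, `suffix_trie_build` and `suffix_trie_find` are hypothetical, B and w are assumed to use only the letters a to d, and the first-occurrence index is stored in every node (not only in flagged ones) so that any word, not just a whole suffix, can be located.

```c
#include <stdlib.h>
#include <string.h>

#define SIGMA 4                         /* alphabet {a, b, c, d} */

struct snode {
    struct snode *ar[SIGMA];
    int first;                          /* start index in B of the first suffix
                                           whose path passes through this node */
    char flag;                          /* 1 if a suffix of B ends here */
};

struct snode *snode_new(int first) {
    struct snode *p = malloc(sizeof *p);
    if (p == NULL) return NULL;
    for (int c = 0; c < SIGMA; c++) p->ar[c] = NULL;
    p->first = first;
    p->flag = 0;
    return p;
}

/* Insert all n suffixes of B, shortest start index first: Theta(n^2) time.
   Returns NULL if an allocation fails (nothing is freed in this sketch). */
struct snode *suffix_trie_build(const char *B) {
    int n = (int)strlen(B);
    struct snode *root = snode_new(0);
    if (root == NULL) return NULL;
    for (int s = 0; s < n; s++) {       /* insert the suffix B[s..n-1] */
        struct snode *p = root;
        for (int i = s; i < n; i++) {
            int c = B[i] - 'a';
            if (p->ar[c] == NULL) {
                /* Created while inserting suffix s, and suffixes are inserted
                   in increasing s, so s is the first occurrence. */
                p->ar[c] = snode_new(s);
                if (p->ar[c] == NULL) return NULL;
            }
            p = p->ar[c];
        }
        p->flag = 1;
    }
    return root;
}

/* Index of the first occurrence of w in B, or -1 if w does not occur.
   Takes O(|w|) steps, independent of the length of B. */
int suffix_trie_find(const struct snode *root, const char *w) {
    const struct snode *p = root;
    for (int i = 0; w[i] != '\0'; i++) {
        p = p->ar[w[i] - 'a'];
        if (p == NULL) return -1;
    }
    return p->first;
}
```

For instance, with the made-up text B = "abcab", looking up "cab" follows three pointers from the root and reports index 2, while "d" fails at the first step.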
Suffix tries on diet

Def: a shred is a path from a node u to a node v in the trie, consisting of nodes of outdegree 1 (except maybe the last one) and flag = 0.
Obs: there is a contiguous part of B identical to the string the shred represents. We call this part the shred-string.
We store B itself as an array.
We use a new type of nodes, called shred-nodes, that maintain only the indexes of the first (id1) and last (id2) letters of the shred-string in B.
  Layout of a shred-node: type | id1 | id2
Example (figure): a shred of the trie is replaced by one shred-node whose two indexes point into B.

Suffix tries on diet, cont.

Algorithm for constructing a thin trie:
  Given B, create an empty trie T and insert all n suffixes of B into T, generating a trie of size Θ(n²).
  Traverse the trie, and each time a shred is seen, replace all nodes of the shred with a single shred-node.

Suffix tries on diet, cont.

Clearly the use of shred-nodes saves some space, but can we prove something?
Observations:
  The number of leaves of T is at most n (every leaf is the end of one suffix).
  In addition, there are nodes that have a single child but whose flag = 1 (a suffix has ended there). We call them special nodes.
  There are at most n special nodes.

Thanks for your patience. See you at the review.
Suffix tries on diet, cont.

Lemma: let T be a tree where each internal node has outdegree 2 or more, and which has m leaves. Then T has at most m internal nodes.
Back to thin suffix tries: T does not have exactly this property, but it is very close (no long shreds), so a massaged lemma still works, so
  #internal_nodes ≤ #leaf_nodes + #special_nodes.
But
  #leaf_nodes + #special_nodes ≤ #suffixes_of_B = n.
So the size of the trie is only a constant factor more than the size of the book.

Proof of lemma (just FYI)

Lemma: let T be a tree where each internal node has outdegree 2 or more, with m leaves and k internal nodes. Then k ≤ m.
Proof (by induction on m): assume the claim is true for all trees with strictly fewer than m leaves, and assume T has m leaves.
Find a leaf u whose distance from the root is maximum.
Assume it has exactly one sibling v. Note that v is a leaf (why?). Let w be their common parent.
Remove both u and v from T. Let T' be the resulting tree, and let k', m' denote the numbers of internal nodes and leaves in T'.
  1. Now in T', w is a leaf.
  2. m' = m - 2 + 1 = m - 1.
  3. k' = k - 1.
  4. The outdegree of every internal node of T' is still 2 or more.
From induction, k' ≤ m'. Hence k = k' + 1 ≤ m' + 1 = m.
(If u has more than one sibling, remove only u: the parent keeps outdegree 2 or more, so m' = m - 1 and k' = k, and induction gives k ≤ m - 1.)