
Longest Common Prefixes

The standard ordering for strings is the lexicographical order. It is induced by an order over the alphabet. We will use the same symbols (≤, <, ≥, ≰, etc.) for both the alphabet order and the induced lexicographical order.

We can define the lexicographical order using the concept of the longest common prefix.

Definition 1.4: The length of the longest common prefix of two strings A[0..m) and B[0..n), denoted by lcp(A, B), is the largest integer ℓ ≤ min{m, n} such that A[0..ℓ) = B[0..ℓ).

Definition 1.5: Let A and B be two strings over an alphabet with a total order ≤, and let ℓ = lcp(A, B). Then A is lexicographically smaller than or equal to B, denoted by A ≤ B, if and only if
1. either |A| = ℓ
2. or |A| > ℓ, |B| > ℓ and A[ℓ] < B[ℓ].
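The two definitions translate almost directly into code. The following is a minimal sketch, assuming Python strings as the strings and Python's character order as the alphabet order; the function names lcp and lex_leq are illustrative and not part of the notes.

```python
def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b (Definition 1.4)."""
    l = 0
    while l < min(len(a), len(b)) and a[l] == b[l]:
        l += 1
    return l


def lex_leq(a: str, b: str) -> bool:
    """A <= B in the sense of Definition 1.5."""
    l = lcp(a, b)
    if len(a) == l:                      # case 1: A is a prefix of B
        return True
    return len(b) > l and a[l] < b[l]    # case 2: the first mismatch decides


print(lcp("potato", "pottery"))      # 3
print(lex_leq("pot", "potato"))      # True
print(lex_leq("pottery", "potato"))  # False
```

Note that lex_leq inspects only lcp(A, B) + 1 symbols at most, which is why the cost of a string comparison is governed by the longest common prefix.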

The concept of longest common prefixes can be generalized for sets:

Definition 1.6: For a string S and a string set R, define

lcp(R) = max{ℓ | A[0..ℓ) = B[0..ℓ) for all A, B ∈ R}
lcp(S, R) = max{lcp(S, T) | T ∈ R}
Σlcp(R) = Σ_{T ∈ R} lcp(T, R \ {T})

The concept of distinguishing prefix is closely related and often used in place of the longest common prefix for sets. The distinguishing prefix of a string is the shortest prefix that separates it from the other strings in the set. For a prefix free set R, the sum of the lengths of the distinguishing prefixes is Σdp(R) = Σlcp(R) + |R|. For a non-prefix free set, the distinguishing prefixes are not always fully defined.

However, even more interesting is a third measure of longest common prefixes in a set, defined next. It is slightly different from both Σlcp(R) and Σdp(R).
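As a sanity check, the set measures can be computed by brute force. This is a sketch under stated assumptions: the function names are illustrative, lcp(S, R) is taken to be 0 when R is empty, and sum_dp is only meaningful for prefix free sets such as the $-terminated strings used here.

```python
def lcp(a, b):
    """Length of the longest common prefix of two strings."""
    l = 0
    while l < min(len(a), len(b)) and a[l] == b[l]:
        l += 1
    return l


def lcp_set(R):
    """lcp(R): length of the longest prefix shared by all strings in R (R non-empty)."""
    R = list(R)
    return min(lcp(R[0], t) for t in R)


def lcp_string_set(s, R):
    """lcp(S, R): maximum of lcp(S, T) over T in R; 0 for an empty R (assumption)."""
    return max((lcp(s, t) for t in R), default=0)


def sum_lcp(R):
    """Sigma-lcp(R): sum over T in R of lcp(T, R minus {T})."""
    R = set(R)
    return sum(lcp_string_set(t, R - {t}) for t in R)


def sum_dp(R):
    """Sigma-dp(R) for a prefix free set R."""
    return sum_lcp(R) + len(set(R))


R = {"pot$", "potato$", "pottery$", "tattoo$", "tempo$"}
print(lcp_set(R), sum_lcp(R), sum_dp(R))  # 0 11 16
```

The quadratic running time is irrelevant here; the point is only to make the definitions concrete.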

Definition 1.7: Let R = {S_1, S_2, ..., S_n} be a set of strings and assume S_1 < S_2 < ··· < S_n. Then the LCP array LCP_R[1..n] is defined by

LCP_R[i] = lcp(S_i, {S_1, ..., S_{i−1}}).

Furthermore, the LCP array sum is

ΣLCP(R) = Σ_{i ∈ [1..n]} LCP_R[i].

Example 1.8: Let R = {pot$, potato$, pottery$, tattoo$, tempo$}. Then Σlcp(R) = 11, Σdp(R) = 16, ΣLCP(R) = 7 and the LCP array is:

LCP_R
0  pot$
3  potato$
3  pottery$
0  tattoo$
1  tempo$
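The LCP array of the example can be reproduced with a few lines of code. This sketch follows the definition literally (each string against all of its predecessors), uses 0-based indexing instead of [1..n], and assumes that '$' sorts before the letters, which holds for Python's default string order. The name lcp_array is illustrative.

```python
def lcp(a, b):
    """Length of the longest common prefix of two strings."""
    l = 0
    while l < min(len(a), len(b)) and a[l] == b[l]:
        l += 1
    return l


def lcp_array(R):
    """Return the sorted strings and their LCP array (0-indexed)."""
    S = sorted(R)
    # lcp against the empty set of predecessors is taken to be 0 (assumption).
    LCP = [max((lcp(S[i], S[j]) for j in range(i)), default=0)
           for i in range(len(S))]
    return S, LCP


S, LCP = lcp_array({"pot$", "potato$", "pottery$", "tattoo$", "tempo$"})
for value, string in zip(LCP, S):
    print(value, string)
print("sum:", sum(LCP))  # sum: 7
```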

Theorem 1.9: The number of nodes in trie(R) is exactly ||R|| − ΣLCP(R) + 1, where ||R|| is the total length of the strings in R.

Proof. Consider the construction of trie(R) by inserting the strings one by one in the lexicographical order using Algorithm 1.2. Initially, the trie has just one node, the root. When inserting a string S_i, the algorithm executes exactly |S_i| rounds of the two while loops, because each round moves one step forward in S_i. The first loop follows existing edges as long as possible and thus the number of rounds is LCP_R[i] = lcp(S_i, {S_1, ..., S_{i−1}}). This leaves |S_i| − LCP_R[i] rounds for the second loop, each of which adds one new node to the trie. Thus the total number of nodes in the trie at the end is:

1 + Σ_{i ∈ [1..n]} (|S_i| − LCP_R[i]) = ||R|| − ΣLCP(R) + 1.

The proof reveals a close connection between LCP_R and the structure of the trie. We will later see that LCP_R is useful as an actual data structure in its own right.
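The node count can be checked experimentally. The sketch below does not reproduce Algorithm 1.2; it assumes a simple trie made of nested Python dictionaries and counts a node each time an insertion has to create a new child. All names are illustrative.

```python
def lcp(a, b):
    """Length of the longest common prefix of two strings."""
    l = 0
    while l < min(len(a), len(b)) and a[l] == b[l]:
        l += 1
    return l


def sum_LCP(R):
    """LCP array sum, computed from the definition on the sorted strings."""
    S = sorted(R)
    return sum(max((lcp(S[i], S[j]) for j in range(i)), default=0)
               for i in range(len(S)))


def trie_node_count(R):
    """Insert every string into a dict-based trie and count the nodes."""
    root = {}
    nodes = 1                      # the root
    for s in R:
        node = root
        for ch in s:
            if ch not in node:     # no edge to follow: add a new node
                node[ch] = {}
                nodes += 1
            node = node[ch]
    return nodes


R = ["pot$", "potato$", "pottery$", "tattoo$", "tempo$"]
total_length = sum(len(s) for s in R)
print(trie_node_count(R), total_length - sum_LCP(R) + 1)  # 26 26
```

The count does not depend on the insertion order, which is consistent with the permutation property of the LCP array stated in these notes.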

The LCP array LCP_R and its sum have other interesting properties:

• ΣLCP(R) ≤ Σlcp(R) ≤ 2 · ΣLCP(R).
• For i ∈ [2..n], LCP_R[i] = lcp(S_i, S_{i−1}).
• Let π : [1..n] → [1..n] be an arbitrary permutation. Define

  LCP_{R,π}[i] = lcp(S_{π(i)}, {S_{π(1)}, ..., S_{π(i−1)}})
  ΣLCP_π(R) = Σ_{i ∈ [1..n]} LCP_{R,π}[i].

  In other words, LCP_{R,π} and ΣLCP_π(R) are the same as LCP_R and ΣLCP(R) except that the order of the strings is different. Then ΣLCP_π(R) = ΣLCP(R) and LCP_{R,π} is a permutation of LCP_R.

The proofs are left as exercises.
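These properties are easy to test on a small set before attempting the proofs. The following sketch checks them exhaustively for the five example strings; it is an experiment on one input, not a proof, and the function names are illustrative.

```python
from itertools import permutations


def lcp(a, b):
    """Length of the longest common prefix of two strings."""
    l = 0
    while l < min(len(a), len(b)) and a[l] == b[l]:
        l += 1
    return l


def lcp_array_in_order(strings):
    """lcp of each string against all strings that precede it in the given order."""
    return [max((lcp(s, t) for t in strings[:i]), default=0)
            for i, s in enumerate(strings)]


S = sorted(["pot$", "potato$", "pottery$", "tattoo$", "tempo$"])
LCP = lcp_array_in_order(S)

# Property 2: in sorted order, the previous string alone determines LCP[i].
adjacent = [0] + [lcp(S[i], S[i - 1]) for i in range(1, len(S))]

# Property 1: compare the LCP array sum with Sigma-lcp(R).
sum_lcp = sum(max(lcp(s, t) for t in S if t != s) for s in S)

# Property 3: every insertion order yields a permutation of the LCP array.
all_orders_agree = all(sorted(lcp_array_in_order(list(p))) == sorted(LCP)
                       for p in permutations(S))

print(LCP == adjacent, sum(LCP) <= sum_lcp <= 2 * sum(LCP), all_orders_agree)
```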

Compact Trie

Tries suffer from a large number of nodes, close to ||R|| in the worst case. The space requirement can be problematic, since typically each node needs much more space than a single symbol.

Path compacted tries reduce the number of nodes by replacing branchless path segments with a single edge.

• Leaf path compaction applies this to path segments leading to a leaf. The number of nodes is now |R| + Σlcp(R) − ΣLCP(R) + 1 (exercise).
• Full path compaction applies this to all path segments. Then every internal node (except possibly the root) has at least two children. In such a tree, there are always at least as many leaves as internal nodes. Thus the number of nodes is at most 2|R|.

The full path compacted trie is called a compact trie.
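The effect of the two compaction schemes on the node count can be measured without building the compacted tries explicitly. The sketch below assumes a prefix free set (every string ends in a leaf) stored in a trie of nested dictionaries. It counts which nodes would survive: every leaf and the root always do, an inner node survives leaf path compaction if its subtree contains at least two leaves, and it survives full path compaction if it has at least two children. The names are illustrative.

```python
def build_trie(R):
    """Plain trie as nested dicts; a leaf is an empty dict."""
    root = {}
    for s in R:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def count_nodes(node, is_root=True):
    """Return (leaves, plain, leaf_compacted, compact) node counts of a subtree."""
    if not node:                        # a leaf is kept in every variant
        return 1, 1, 1, 1
    leaves = plain = leaf_c = full_c = 0
    for child in node.values():
        l, p, lc, fc = count_nodes(child, False)
        leaves, plain, leaf_c, full_c = leaves + l, plain + p, leaf_c + lc, full_c + fc
    plain += 1                          # the plain trie keeps every node
    if is_root or leaves >= 2:          # not on a branchless path to a leaf
        leaf_c += 1
    if is_root or len(node) >= 2:       # a branching node
        full_c += 1
    return leaves, plain, leaf_c, full_c


R = ["pot$", "potato$", "pottery$", "tattoo$", "tempo$"]
print(count_nodes(build_trie(R)))  # (5, 26, 10, 8)
```

For this set the leaf path compacted count 10 agrees with |R| + Σlcp(R) − ΣLCP(R) + 1 = 5 + 11 − 7 + 1, and the compact trie has 8 nodes, within the bound 2|R| = 10.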

Example 1.10: Path compacted tries for R = {pot$, potato$, pottery$, tattoo$, tempo$}.

[Figure: two path compacted tries for R. With leaf path compaction the edge labels are p, o, t, $, ato$, tery$, t, attoo$, empo$; with full path compaction they are pot, $, ato$, tery$, t, attoo$, empo$.]

The edge labels are factors of the input strings. If the input strings are stored separately, the edge labels can be represented in constant space using pointers to the strings.

The time complexity of the basic operations on the compact trie is the same as for the trie (and depends on the implementation of the child operation in the same way), but prefix and range queries are faster on the compact trie (exercise).

Ternary Trie

Tries can be implemented for ordered alphabets, but a bit awkwardly, using a comparison-based child function. The ternary trie is a simpler data structure based on symbol comparisons.

A ternary trie is like a binary search tree except:

• Each internal node has three children: smaller, equal and larger.
• The branching is based on a single symbol at a given position as in a trie. The position is zero (first symbol) at the root and increases along the middle branches but not along side branches.

The ternary trie has variants similar to the standard (σ-ary) trie:

• A basic ternary trie, which is a full representation of the strings.
• Compact ternary tries, which reduce space by compacting branchless path segments.
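The branching rule can be illustrated with a small implementation. This is a sketch of a basic, unbalanced ternary trie in the style of a ternary search tree, not necessarily the exact representation used in these notes: it assumes non-empty strings, one symbol per node, and an end flag marking where a stored string ends. The names Node, insert and contains are illustrative.

```python
class Node:
    __slots__ = ("ch", "lo", "eq", "hi", "end")

    def __init__(self, ch):
        self.ch = ch                       # symbol used for branching at this node
        self.lo = self.eq = self.hi = None # smaller, equal and larger children
        self.end = False                   # True if a stored string ends here


def insert(node, s, i=0):
    """Insert the non-empty string s; return the (possibly new) subtree root."""
    if node is None:
        node = Node(s[i])
    if s[i] < node.ch:
        node.lo = insert(node.lo, s, i)       # side branch: position stays
    elif s[i] > node.ch:
        node.hi = insert(node.hi, s, i)       # side branch: position stays
    elif i + 1 < len(s):
        node.eq = insert(node.eq, s, i + 1)   # middle branch: position advances
    else:
        node.end = True
    return node


def contains(node, s):
    """Look up the non-empty string s."""
    i = 0
    while node is not None:
        if s[i] < node.ch:
            node = node.lo
        elif s[i] > node.ch:
            node = node.hi
        elif i + 1 < len(s):
            node, i = node.eq, i + 1
        else:
            return node.end
    return False


root = None
for word in ["pot$", "potato$", "pottery$", "tattoo$", "tempo$"]:
    root = insert(root, word)
print(contains(root, "pottery$"), contains(root, "pot"))  # True False
```

Only the middle branch advances the position, so a lookup for S follows at most |S| middle edges; the number of side edges depends on how well the trie is balanced.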

Example 1.11: Ternary tries for R = {pot$, potato$, pottery$, tattoo$, tempo$}.

[Figure: ternary tries for R: the basic ternary trie and its path compacted variants.]

Ternary tries have the same asymptotic size as the corresponding tries.

A ternary trie is balanced if each left and right subtree contains at most half of the strings in its parent tree.

The balance can be maintained by rotations similarly to binary search trees.

[Figure: a rotation exchanging the nodes b and d and reattaching their subtrees A, B, C, D, E.]

We can also get reasonably close to a balance by inserting the strings in the tree in a random order.

In a balanced ternary trie each step down either

• moves the position forward (middle branch), or
• halves the number of strings remaining in the subtree (side branch).

Thus, in a balanced ternary trie storing n strings, any downward traversal following a string S passes at most |S| middle edges and at most log n side edges.

Thus the time complexity of insertion, deletion, lookup and lcp query is O(|S| + log n).

In comparison based tries, where the child function is implemented using binary search trees, the time complexities could be O(|S| log σ), a multiplicative factor O(log σ) instead of an additive factor O(log n).

Prefix and range queries behave similarly (exercise).

String Sorting

Ω(n log n) is a well known lower bound for the number of comparisons needed for sorting a set of n objects by any comparison based algorithm. This lower bound holds both in the worst case and in the average case.

There are many algorithms that match the lower bound, i.e., sort using O(n log n) comparisons (worst or average case). Examples include quicksort, heapsort and mergesort.

If we use one of these algorithms for sorting a set of n strings, it is clear that the number of symbol comparisons can be more than O(n log n) in the worst case. Determining the order of A and B needs at least lcp(A, B) symbol comparisons and lcp(A, B) can be arbitrarily large in general.

On the other hand, the average number of symbol comparisons for two random strings is O(1). Does this mean that we can sort a set of random strings in O(n log n) time using a standard sorting algorithm?
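The gap between string comparisons and symbol comparisons is easy to observe. The sketch below wraps a symbol-by-symbol comparison in a counter and passes it to Python's built-in sort through functools.cmp_to_key. The helper make_counting_cmp and the test input (five strings sharing a prefix of 50 symbols) are illustrative assumptions.

```python
from functools import cmp_to_key


def make_counting_cmp():
    """Return a three-way string comparison and the counter it updates."""
    counter = {"strings": 0, "symbols": 0}

    def cmp(a, b):
        counter["strings"] += 1
        for x, y in zip(a, b):
            counter["symbols"] += 1          # one symbol comparison
            if x != y:
                return -1 if x < y else 1
        # One string is a prefix of the other: the shorter one is smaller.
        return (len(a) > len(b)) - (len(a) < len(b))

    return cmp, counter


prefix = "a" * 50
R = [prefix + suffix for suffix in ("d", "b", "e", "a", "c")]
cmp, counter = make_counting_cmp()
result = sorted(R, key=cmp_to_key(cmp))
print(counter["symbols"] // counter["strings"])  # 51
```

Every string comparison here costs lcp + 1 = 51 symbol comparisons, so the total is 51 times the number of string comparisons, regardless of how clever the sorting algorithm is about the latter.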

The following theorem shows that we cannot achieve O(n log n) symbol comparisons for any set of strings (when σ = n^{o(1)}).

Theorem 1.12: Let A be an algorithm that sorts a set of objects using only comparisons between the objects. Let R = {S_1, S_2, ..., S_n} be a set of n strings over an ordered alphabet Σ of size σ. Sorting R using A requires Ω(n log n log_σ n) symbol comparisons on average, where the average is taken over the initial orders of R.

• If σ is considered to be a constant, the lower bound is Ω(n (log n)^2).
• Note that the theorem holds for any comparison based sorting algorithm A and any string set R. In other words, we can choose A and R to minimize the number of comparisons and still not get below the bound.
• Only the initial order is random rather than freely chosen. Otherwise, we could pick the correct order and use an algorithm that first checks if the order is correct, needing only O(n + ΣLCP(R)) symbol comparisons.

An intuitive explanation for this result is that the comparisons made by a sorting algorithm are not random. In the later stages, the algorithm tends to compare strings that are close to each other in lexicographical order and thus are likely to have long common prefixes.

Proof of Theorem 1.12. Let k = ⌊(log_σ n)/2⌋. For any string α ∈ Σ^k, let R_α be the set of strings in R having α as a prefix. Let n_α = |R_α|.

Let us analyze the number of symbol comparisons when comparing strings in R_α against each other.

• Each string comparison needs at least k symbol comparisons.
• No comparison between a string in R_α and a string outside R_α gives any information about the relative order of the strings in R_α.
• Thus A needs to do Ω(n_α log n_α) string comparisons and Ω(k n_α log n_α) symbol comparisons to determine the relative order of the strings in R_α.

Thus the total number of symbol comparisons is Ω(Σ_{α ∈ Σ^k} k n_α log n_α) and

Σ_{α ∈ Σ^k} k n_α log n_α ≥ k(n − √n) log((n − √n)/σ^k)
                          ≥ k(n − √n) log(√n − 1)
                          = Ω(kn log n) = Ω(n log n log_σ n).

Here we have used the facts that σ^k ≤ √n, that Σ_{α ∈ Σ^k} n_α > n − σ^k ≥ n − √n, and that Σ_{α ∈ Σ^k} n_α log n_α > (n − √n) log((n − √n)/σ^k) (see exercises).

The preceding lower bound does not hold for algorithms specialized for sorting strings.

Theorem 1.13: Let R = {S_1, S_2, ..., S_n} be a set of n strings. Sorting R into the lexicographical order by any algorithm based on symbol comparisons requires Ω(ΣLCP(R) + n log n) symbol comparisons.

Proof. If we are given the strings in the correct order and the job is to verify that this is indeed so, we need at least ΣLCP(R) symbol comparisons. No sorting algorithm could possibly do its job with fewer symbol comparisons. This gives a lower bound Ω(ΣLCP(R)).

On the other hand, the general sorting lower bound Ω(n log n) must hold here too.

The result follows from combining the two lower bounds.

• Note that the expected value of ΣLCP(R) for a random set of n strings is O(n log_σ n). The lower bound then becomes Ω(n log n).

We will next see that there are algorithms that match this lower bound. Such algorithms can sort a random set of strings in O(n log n) time.

String Quicksort (Multikey Quicksort)

Quicksort is one of the fastest general purpose sorting algorithms in practice.

Here is a variant of quicksort that partitions the input into three parts instead of the usual two parts.

Algorithm 1.14: TernaryQuicksort(R)
Input: (Multi)set R in arbitrary order.
Output: R in ascending order.
(1) if |R| ≤ 1 then return R
(2) select a pivot x ∈ R
(3) R_< ← {s ∈ R | s < x}
(4) R_= ← {s ∈ R | s = x}
(5) R_> ← {s ∈ R | s > x}
(6) R_< ← TernaryQuicksort(R_<)
(7) R_> ← TernaryQuicksort(R_>)
(8) return R_< · R_= · R_>
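The pseudocode maps onto Python lists almost line by line. The following is a minimal sketch, assuming a uniformly random pivot (the algorithm leaves the pivot selection open) and building new lists instead of partitioning in place; the name ternary_quicksort is illustrative.

```python
import random


def ternary_quicksort(R):
    """Sort a list of comparable items with three-way partitioning."""
    R = list(R)
    if len(R) <= 1:
        return R
    x = random.choice(R)                   # assumption: random pivot selection
    less = [s for s in R if s < x]
    equal = [s for s in R if s == x]       # duplicates of the pivot are finished
    greater = [s for s in R if s > x]
    return ternary_quicksort(less) + equal + ternary_quicksort(greater)


print(ternary_quicksort([3, 1, 2, 3, 1, 2, 3]))  # [1, 1, 2, 2, 3, 3, 3]
```

Because all copies of the pivot land in the middle part and are never touched again, inputs with many duplicate keys lead to shallow recursion.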

In the normal, binary quicksort, we would have two subsets R_≤ and R_≥, both of which may contain elements that are equal to the pivot.

• Binary quicksort is slightly faster in practice for sorting sets.
• Ternary quicksort can be faster for sorting multisets with many duplicate keys. Sorting a multiset of size n with σ distinct elements takes O(n log σ) comparisons (exercise).

The time complexity of both the binary and the ternary quicksort depends on the selection of the pivot (exercise).

In the following, we assume an optimal pivot selection giving O(n log n) worst case time complexity.

String quicksort is similar to ternary quicksort, but it partitions using a single character position. String quicksort is also known as multikey quicksort.

Algorithm 1.15: StringQuicksort(R, ℓ)
Input: (Multi)set R of strings and the length ℓ of their common prefix.
Output: R in ascending lexicographical order.
(1) if |R| ≤ 1 then return R
(2) R_⊥ ← {S ∈ R | |S| = ℓ}; R ← R \ R_⊥
(3) select pivot X ∈ R
(4) R_< ← {S ∈ R | S[ℓ] < X[ℓ]}
(5) R_= ← {S ∈ R | S[ℓ] = X[ℓ]}
(6) R_> ← {S ∈ R | S[ℓ] > X[ℓ]}
(7) R_< ← StringQuicksort(R_<, ℓ)
(8) R_= ← StringQuicksort(R_=, ℓ + 1)
(9) R_> ← StringQuicksort(R_>, ℓ)
(10) return R_⊥ · R_< · R_= · R_>

In the initial call, ℓ = 0.
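A direct transcription of the algorithm might look as follows. This is a sketch under stated assumptions: the pivot is chosen at random, new lists are built instead of partitioning in place, and an extra guard (not present in the pseudocode) returns early when every remaining string has length exactly l, which can happen for multisets of equal strings. The name string_quicksort is illustrative.

```python
import random


def string_quicksort(R, l=0):
    """Sort strings that all share a common prefix of length l."""
    R = list(R)
    if len(R) <= 1:
        return R
    finished = [s for s in R if len(s) == l]   # strings that end at position l
    rest = [s for s in R if len(s) > l]
    if not rest:                               # added guard: no pivot available
        return finished
    x = random.choice(rest)[l]                 # pivot symbol (random pivot assumed)
    less = [s for s in rest if s[l] < x]
    equal = [s for s in rest if s[l] == x]
    greater = [s for s in rest if s[l] > x]
    return (finished
            + string_quicksort(less, l)
            + string_quicksort(equal, l + 1)   # common prefix grows by one symbol
            + string_quicksort(greater, l))


words = ["alphabet", "alignment", "allocate", "algorithm",
         "alternative", "alias", "alternate", "all"]
print(string_quicksort(words))
```

Only the recursive call on the middle part advances the position, so a symbol that has been found equal to the pivot symbol is never inspected again.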

Example 1.16: A possible partitioning, when ℓ = 2 (the pivot symbol is l).

Before:
  al p habet, al i gnment, al l ocate, al g orithm, al t ernative, al i as, al t ernate, al l

After:
  R_<:  al i gnment, al g orithm, al i as
  R_=:  al l ocate, al l
  R_>:  al p habet, al t ernative, al t ernate

Theorem 1.17: String quicksort sorts a set R of n strings in O(ΣLCP(R) + n log n) time.

• Thus string quicksort is an optimal symbol comparison based algorithm.
• String quicksort is also fast in practice.

Proof of Theorem 1.17. The time complexity is dominated by the symbol comparisons on lines (4)–(6). We charge the cost of each comparison either on a single symbol or on a string depending on the result of the comparison:

S[ℓ] = X[ℓ]: Charge the comparison on the symbol S[ℓ].
• Now the string S is placed in the set R_=. The recursive call on R_= increases the common prefix length to ℓ + 1. Thus S[ℓ] cannot be involved in any future comparison and the total charge on S[ℓ] is 1.
• Only lcp(S, R \ {S}) symbols in S can be involved in these comparisons. Thus the total number of symbol comparisons resulting in equality is at most Σlcp(R) = Θ(ΣLCP(R)). (Exercise: Show that the number is exactly ΣLCP(R).)

S[ℓ] ≠ X[ℓ]: Charge the comparison on the string S.
• Now the string S is placed in the set R_< or R_>. The size of either set is at most |R|/2 assuming an optimal choice of the pivot X.
• Every comparison charged on S halves the size of the set containing S, and hence the total charge accumulated by S is at most log n. Thus the total number of symbol comparisons resulting in inequality is O(n log n).
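The charging argument can be observed by instrumenting the algorithm. The sketch below makes several assumptions that go beyond the pseudocode: each string is compared with the pivot symbol once in a three-way fashion, the pivot string is not compared with itself, and the "optimal" pivot is obtained by taking the median of the symbols at the current position (found here by sorting, which is for illustration only and not efficient). The function name and the counts dictionary are illustrative.

```python
import math


def string_quicksort_counted(R, l, counts):
    """String quicksort that tallies comparisons by outcome in counts."""
    R = list(R)
    if len(R) <= 1:
        return R
    finished = [s for s in R if len(s) == l]
    rest = [s for s in R if len(s) > l]
    if not rest:                            # guard for multisets of equal strings
        return finished
    # Median pivot: index of the string whose symbol at position l is the median.
    order = sorted(range(len(rest)), key=lambda i: rest[i][l])
    p = order[len(rest) // 2]
    x = rest[p][l]
    less, equal, greater = [], [], []
    for i, s in enumerate(rest):
        if i == p:                          # the pivot itself is not compared
            equal.append(s)
        elif s[l] == x:
            counts["equal"] += 1            # charged on the symbol s[l]
            equal.append(s)
        else:
            counts["unequal"] += 1          # charged on the string s
            (less if s[l] < x else greater).append(s)
    return (finished
            + string_quicksort_counted(less, l, counts)
            + string_quicksort_counted(equal, l + 1, counts)
            + string_quicksort_counted(greater, l, counts))


R = ["pot$", "potato$", "pottery$", "tattoo$", "tempo$"]
counts = {"equal": 0, "unequal": 0}
result = string_quicksort_counted(R, 0, counts)
print(counts["equal"], counts["unequal"] <= len(R) * math.log2(len(R)))  # 7 True
```

For the example set the number of comparisons resulting in equality is 7, which matches ΣLCP(R), and the number resulting in inequality stays below n log n.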