Two Birds With One Stone: An Efficient Hierarchical Framework for Top-k and Threshold-based String Similarity Search

Size: px

Start display at page:

Download "Two Birds With One Stone: An Efficient Hierarchical Framework for Top-k and Threshold-based String Similarity Search"

Felix Sherman
6 years ago
Views:

1 Two Birds With One Stone: An Efficient Hierarchica Framework for Top-k and Threshod-based String Simiarity Search Jin Wang Guoiang Li Dong Deng Yong Zhang Jianhua Feng Department of Computer Science and Technoogy, TNList, Tsinghua University, Beijing 84, China. Abstract String simiarity search is a fundamenta operation in data ceaning and integration. It has two variants, threshodbased string simiarity search and top-k string simiarity search. Existing agorithms are efficient either for the former or the atter; most of them can t support both two variants. To address this imitation, we propose a unified framework. We first recursivey partition strings into disjoint segments and buid a hierarchica segment tree index (HS-Tree) on top of the segments. Then we utiize the HS-Tree to support simiarity search. For threshod-based search, we identify appropriate tree nodes based on the threshod to answer the query and devise an efficient agorithm (HS-Search). For top-k search, we identify promising strings with arge possibiity to be simiar to the query, utiize these strings to estimate an upper bound which is used to prune dissimiar strings, and propose an agorithm (HS-Topk). We aso deveop effective pruning techniques to further improve the performance. Experimenta resuts on rea-word datasets show our method achieves high performance on the two probems and significanty outperforms state-of-the-art agorithms. I. INTRODUCTION As an important operation in data ceaning and integration, string simiarity search has attracted significant attention from the database community. It has a widespread rea appications such as web search, spe checking, and DNA sequence discovery in bio-informatics [], []. Given a set of strings and a query, string simiarity search aims to find a the strings from the string set that are simiar to the query. There are many metrics to quantify the simiarity between strings, such as Jaccard, Cosine and edit distance. Among them edit distance is one of the most widey-used metrics to toerate typographica errors [5], [6], [], [4]. In this paper we focus on using edit distance to quantify string simiarity. There are two variants of the string simiarity search probem. The first is threshod-based string simiarity search, which requires users to input a threshod and identifies a the strings from the string set whose edit distances to the query is within the given threshod. The second is top-k string simiarity search, which finds k strings from the string set that have the smaest edit distances to the query. Existing studies on the threshod-based simiarity search probem [2], [], [2], [6] empoy a fiter-verification framework. In the fiter step, they utiize the threshod to devise effective fiters in order to prune arge numbers of dissimiar strings and generate a set of candidates. In the verification step, they verify the candidates by computing their rea edit distances to the query. These threshod-based agorithms are rather expensive for top-k search, because to support top-k search, they have to enumerate the threshod incrementay, execute the search operation for each threshod, and thus invove penty of unnecessary computations. On the other hand, existing studies on top-k simiarity search [26], [27], [5] are inefficient for threshod-based search because they cannot make fu use of the given threshod to do pruning. To summarize, existing methods are efficient either for threshodbased search or for top-k search and most of them can t efficienty support both of the two probems. Thus it cas for a unified framework to efficienty support the two variants of string simiarity search. To address this probem, we propose a unified framework which can efficienty support these two variants. We first recursivey partition data strings in the string set into disjoint segments and buid a hierarchica segment tree index (HS-Tree) on top of the segments. Then we utiize the HS-Tree to answer a threshod-based query or top-k query. For threshod-based simiarity search, based on the pigeonhoe principe, if a data string is simiar to the query, the data string must have enough segments matching some substrings of the query. We can utiize the HS-Tree to identify such strings and devise an efficient agorithm for threshod-based simiarity search. For top-k simiarity search, we first access the promising data strings that have arge possibiity to be simiar to the query (e.g., the strings sharing argest number of common segments with the query). Based on the promising strings, we can accuratey estimate an upper bound of the edit distances of top-k answers to the query and utiize the bound to prune dissimiar strings. We utiize the HS-Tree to identify the promising data strings and devise an efficient agorithm for top-k simiarity search. Experimenta resuts on rea datasets show our method achieves high performance on both of the two probems and significanty outperforms stateof-the-art agorithms. In this paper, we make the foowing contributions. We propose a hierarchica segment index HS-Tree which can be utiized to support both threshod-based simiarity search and top-k simiarity search. We devise the HS-Search agorithm based on the HS-Tree index to faciitate threshod-based simiarity search. We propose the HS-Topk agorithm based on HS-Tree index to support top-k simiarity search. We deveop batch-based and greedy-match-based pruning strategies to prune dissimiar strings. We have conducted an extensive set of experiments. Experimenta resuts show our method achieves high performance on both threshod-based simiarity search and top-k simiarity search and significanty outperforms state-of-the-art agorithms. The rest of the paper is organized as foows. We formuate our probem and review reated works in Section II. We intro /5/$3. 25 IEEE 59 ICDE Conference 25

2 duce our hierarchica index in Section III. The HS-Search agorithm is proposed to support threshod-based simiarity search in Section IV and the HS-Topk agorithm is presented to support top-k simiarity search in Section V.Experimenta resuts are reported in Section VI. We concude in Section VII. II. PRELIMINARIES In this section, we first formuate the probem in Section II-A and then review reated works in Section II-B. A. Probem Definition Given two strings r and s, the edit distance between r and s, denoted as ED(r, s), is the minimum number of edit operations (incuding substitution, insertion, and deetion) needed to transform r to s. There are two variants in string simiarity search. The first identifies the strings from a string set whose edit distances to the query are not arger than a given threshod. The second finds top-k strings with the smaest edit distances to the query. Next we formuate the two probems. Definition (Threshod-based Simiarity Search): Given a string set S, a query q, and a threshod τ, threshod-based simiarity search finds a strings s S such that ED(s, q) τ. Definition 2 (Top-k Simiarity Search): Given a string set S, a query q, and an integer k, top-k simiarity search finds a subset R S, such that R = k and for any r R and s S R, ED(r, q) ED(s, q). Exampe : Consider the string set in Tabe I. Suppose the query q= brothor, τ = and k = 2. The threshodbased simiarity search returns { brother } since the edit distance between brother and q= brothor is and the edit distances between other strings and q are arger than. The top-2 simiarity search returns R={ brother, brothe }, because the edit distances between the two strings and q are respectivey and 2, and the edit distances for other strings to the query are not smaer than 2. TABLE I. A STRING COLLECTION ID String Length s brother 7 s 2 brothe 7 s 3 broathe 7 s 4 breathes 8 s 5 swingabe 9 s 6 deduction 9 s 7 abna evina s 8 christopher swenson 9 B. Reated Works ) Threshod-based Simiarity Search: There are many simiarity search agorithms [2], [2], [6], [27], [4]. Most of existing studies empoy a fiter-verification framework to address the string simiarity search probem, and many effective fiters have been devised to prune dissimiar strings. Sarawagi et a. [7] proposed count fiter based on n-grams. Li et a. [] extended the count fiter and deveoped severa ist-merge agorithms to improve the performance. Wang et a. [2] proposed an adaptive prefix fitering framework to improve the performance. Zhang et a. [27] proposed B ed - tree which utiized B + -tree structure to index strings. Li et a. [2] used variabe ength grams as signatures to support string simiarity search. Qin et a. [6] devised an asymmetry signature. Hadjieeftheriou et a. [7] proposed a hash-based method to estimate the number of resuts. These methods have two imitations. First, the n-grambased signature has ower pruning than our segment-based signature, because to avoid missing resuts, n cannot be arge and a sma n has imited pruning power. Second, athough we can extend such methods to support top-k simiarity search by enumerating different threshods, they are rather expensive as they require to perform the search agorithms many times. 2) Top-k Simiarity Search: There have aready been some studies on top-k simiarity search. Yang et a. [26] proposed an n-gram based method to support top-k simiarity search. It dynamicay tuned the ength of grams according to different threshods. However, it needed to buid dupicate indexes for each n and ed to ow efficiency. Deng et a. [5] proposed a range-based agorithm by grouping specific entries to avoid dupicated computations in the dynamic-programming matrix when the edit distance is computed. This agorithm used a trie index to share the common prefixes of strings. However for ong strings, there are fewer common prefix and the pruning power wi be imited. B ed -tree [27] can aso be used to support top-k simiarity search. However this method utiized n-grams to group simiar strings together, which needed to enumerate many different threshods and thus ed to ow efficiency. Wang et a. [22] designed a nove fiter-and-refine pipeine approach that utiized approximate n-gram matchings to compute top-k resuts. Compared with these agorithms, our method has two advantages. First, we can utiize the HS-Tree to identify promising strings so as to estimate a tighter upper bound and use the bound to eiminate dissimiar strings. Thus our method achieves higher performance on top-k simiarity search. Second, our method can aso efficienty support the threshodbased search, because our method can seect appropriate nodes in the HS-Tree based on the given threshod and utiize these nodes to efficienty identify the answer. 3) Simiarity Join: There are many studies on string simiarity join. Given two string sets, simiarity join finds a simiar string pairs [], [4], [9], [24], [25], [2], [6]. An experimenta survey is made in [9]. Bayardo et a. [] proposed the prefix fiter based method for simiarity join. Xiao et a. proposed the position fiter [25] and mismatch fiter [24] to improve the prefix fiter. Wang et a. [9] proposed a trie-based method to support simiarity join. Li et a. [4] proposed the segment fiter with efficient substring seection and verification methods to perform simiarity join. Athough Adapt [2] and QChunk [6] extended join techniques to support search, they perform much worse than our method (see Section VI). In this paper, we extend the segment-based fiter in [4] to support simiarity search. Different from PassJoin which requires to first specify a threshod and then buid the index based on the threshod, we can buid an HS-Tree index in an offine step and utiize the index to answer simiarity search queries with arbitrary threshods. For top-k simiarity search, since it is rather hard to predefine an appropriate threshod, PassJoin has to enumerate the threshod and thus achieves ow performance. Our HS-Tree addresses the imitations of PassJoin and can efficienty support top-k and threshodbased simiarity search. 4) Other Reated Works: Some previous studies focus on approximate entity extraction, which gives a dictionary of entities and a document, finds a substrings in the document that are simiar to some entities. Wang et a. [2] proposed 52

3 S min i=2,j= S i=2,j= L S i=,j= S -τ i=,j= L Group By Length i=2,j=2 L S S i=2,j=2 S i=2,j=3 i=2,j=3 L i=,j=2 L S +τ i=,j=2 S i=2,j=4 L S S max i=2,j=4 Agorithm : HS-Tree Construction (S) Input: S: The string set Output: The HS-Tree index begin 2 Group strings in S by ength; 3 for = min to max do 4 Cacuate the maximum eve L = og 2 ; 5 Generate sets S,, S,2, nodes n,, n,2, indexes L,, L,2 ; 6 for i = 2 to n do 7 8 for j = to 2 i do Generate sets S i,2j, n i,2j n i,2j, S i,2j, nodes, L i,2j, indexes L i,2j ; i=n,j= L i=n,j=2 n L 9 end Fig.. The HS-Tree Index. b,2,3 bro ther the athe,2,3 2 3 ro,2,3 Fig. 2. An HS-Tree exampe for S 7. th,2 at 3 er e he 2 3 a neighborhood deetion based method to sove this probem. Li et a. [3] proposed a unified framework to support approximate entity extraction under various simiarity functions. Recenty Kim et a. [] proposed efficient agorithms to sove the probem of substring top-k simiarity search, which finds the top-k approximate substrings matching with a given query, which is different from our top-k simiarity search probem. Another reated topic is query autocompetion. Ji et a. [8] proposed a trie-based structure to support fuzzy prefix search and Li et a. [5] improved it by removing unnecessary nodes. Chaudhuri et a. [3] proposed a simiar soution. Xiao et a. [23] extended the neighborhood deetion method to improve the performance of query autocompetion. III. THE HIERARCHICAL SEGMENT INDEX In this section, we introduce a hierarchica segment tree (HS-Tree) to index the data strings. We first group the strings by ength and et S denote the group of strings with ength. For each group S, we buid a compete binary tree, where the root is a dummy node. We partition each string s S into two disjoint segments where the first segment is the prefix of s with ength s 2 and the second segment is the suffix of s with ength s 2, where s is the ength of s. Let S, and S,2 respectivey denote the set of the first segments and that of the second segments for the strings in S. Based on these two sets, we generate two chidren of the root, i.e., two nodes in the first eve, n i=,j= and n i=,j=2, where i denotes the eve and j denotes the sibing. (The eve of the root is.) For each tree node n i,j, we buid an inverted index, where entries are segments in S i,j and each segment with segment id j is associated with an inverted ist L i,j which is a ist of strings that contain the segment. Next we recursivey construct the tree. For each node n i,j, we partition each segment in S i,j into two segments (using the same partition method above) and et S i+,2j and S i+,2j respectivey denote the set of the first segments and that of the second segments. Then node n i,j has two chidren n i+,2j and n i+,2j with respect to S i+,2j and S i+,2j. We aso buid the inverted indexes L i+,2j and L i+,2j. The procedure is terminated if there exists a segment in the eve with ength of. In other words, the maximum eve is og 2. Figure iustrates the index structure. Agorithm shows the agorithm to buid the HS-Tree index. It first groups strings by ength (ine 2). For each group S, it buids a hierarchica index with L eves, where L = og 2 (ine 4). In eve i, it iterativey partitions the segments in eve i into 2 segments and constructs invert indexes L i,j for the j th segment in eve i (ines 6-8). Exampe 2: Consider the string set in Tabe I. We first group them by ength and then iterativey buid HS-Tree for each ength. Take group S 7 in Figure 2 as an exampe. The group ength is 7. The maxima eve is L = og 2 7 = 2. Consider string s = brother. In eve, s is partitioned into two segments { bro, ther }. Then in eve 2, these two segments are iterativey partitioned into 2 segments { b, ro }, and { th, er }. Simiary, we can iterativey partition strings s 2 and s 3 to buid the index. Space Compexity: We anayze the space compexity of the HS-Tree. Suppose min is the minimum string ength and max is the maximum string ength. For each group S, the HS-Tree index incudes segments and the inverted index. S contains og eves. In the i th eve, there are 2 i segments. Thus each string is partitioned into at most og j= 2j = O() segments. Obviousy each string is contained in at most O() inverted ists, and thus the space compexity of the HS-Tree is O( = max = min S ), which means the tota number of characters in a input strings. Discussion: For ease of presentation, in the HS-Tree, we assume each node has two chidren and our method can be easiy extended to support the case that each node has more than two chidren, ike B-tree. In addition, we focus on inmemory setting and our method can be extended to support disk-based setting (ike B-tree). Due to space constraints, we eave it as a future work. 52

4 IV. THRESHOLD-BASED SIMILARITY SEARCH In this section, we devise an efficient agorithm HS- Search to efficienty answer threshod-based simiarity search queries using the HS-Tree index (see Section IV-A). We first introduce a fiter-verification framework and then deveop nove techniques to improve both the fiter step (see Section IV-B) and the verification step (see Section IV-C). A. The HS-Search Agorithm Consider a query q with a threshod τ. Based on the ength fiter: two strings cannot be simiar if their ength difference is arger than τ, we ony need to access the HS-Tree with engths between q τ and q + τ. Consider the HS-Tree with ength [ q τ, q + τ]. In the i th eve, the strings are partitioned into 2 i segments. For 2 i τ +, any string s in S cannot be simiar to q if q has no substring matching a segment of s based on the pigeonhoe principe. Moreover, if s is simiar to q, q must contain at east 2 i τ segments of s, as formaized in Lemma. On the contrary, if 2 i < τ +, any string in S may be simiar to q even if q has no substring matching a segment of the string. Lemma : Given a string s with 2 i segments, a query q and a threshod τ, if s is simiar to q, q must contain at east 2 i τ segments of s. Exampe 3: Consider a query q={ swaingbe }, the data string s 5 = {swingabe}, s 6 = {deduction}, and τ = 2. In eve 2, the 4 segments of s 5 are { sw, in, ga, be }, and q has = 2 common substrings sw and in with s 5, thus s 5 is a candidate for query q and τ = 2. On the contrary, the 4 segments of s 6 = {deduction} are { de, du, ct, ion }. And s 6 has no matched segments with q. So we can safey prune s 6. Generay, consider the eve i og 2 (τ +). If a string has ess than 2 i τ segments that match substrings of the query, we can prune it. In other words, we ony need to check the candidate strings which share at east 2 i τ common segments with q. To faciitate the checking, we can utiize the inverted index on each node to identify the candidates, which wi be discussed in detai ater. As there are many eve i such that i og 2 (τ + ), we can use any of such eves to identify candidates. Obviousy the deeper the eve is, the shorter the segment is, and thus the ower pruning power is. Thus we use the minima eve og 2 (τ + ) with ongest segments. Formay, consider a query q and group S. We first ocate the i = og 2 (τ + ) th eve. For each node n i,j ( j 2 i ), we compute the ength of segments in this node Len i,j. (It is worth noting that the segments in each node have the same ength.) To check whether q has a substring matching a segment in node n i,j, we ony need to enumerate the set of substrings of q with ength Len i,j, denote as W(q, L i,j ). We wi discuss how to reduce the size of W(q, L i,j ) in Section IV-B. Next we find the strings which have at east 2 i τ segments matching the query. To this end, for each substring in W(q, L i,j ), we identify the substring from the inverted index of the node and retrieve the inverted ist of the substring. Next we compute the strings that appear more than 2 i τ times on the invited ists. Such strings wi be regarded as candidates and then we wi verify them and get the resuts. Agorithm 2: HS-Search (S, q, τ) Input: S: The string set q: The query string τ: The given edit-distance threshod Output: R = {(s S) ED(s, q) τ} begin 2 Cacuate the maximum eve i = og 2 (τ + ) ; 3 for = q τ to q + τ do 4 for j = to 2 i do 5 Generate substrings set W(q, L i,j ) ; 6 for w W(q, L i,j ) do 7 for s L i,j [w] do 8 N i (s, q) = N i (s, q) + ; 9 if N i (s, q) 2 i τ then if ED(s, q) τ then R = R {s}; 2 end Based on Lemma, we devise the HS-Search agorithm to support threshod-based string simiarity search and the pseudo-code is shown in Agorithm 2. HS-Search first cacuates the eve i = og 2 (τ + ). Then HS-Search utiizes ength fiter to reduce the number of visited HS-Tree: for a query string q and threshod τ, ony groups S ( q τ q + τ) are visited. Next HS-Search generates the set of substrings of q, W(q, L i,j ), for j = to 2 i (ine 5). According to Lemma, if a string s is simiar to q, s shoud appear at east 2 i τ times in the inverted ists L i,j [w] for j 2 i and w W(q, L i,j ). To this end, HS-Search checks whether w W(q, L i,j ) is in L i,j. If so, for any string s in the inverted ist L i,j [w], s and q shares a common segment w and we increase N i (s, q) by, where N i (s, q) denotes the number of matched segments between s and q in eve i(ine 8). If N i (s, q) exceeds 2 i τ, s is a candidate and HS-Search verifies candidate s (ine ). To improve the performance of HS-Search, we design efficient techniques to reduce the substring-set size W(q, L i,j ) in Section IV-B and improve the verification cost in Section IV-C. Exampe 4: Consider the string set in Tabe I. Suppose we have buit the HS-Tree index as shown in Figure 2. Assume the query string is q = brethor and the threshod is τ = 2. First, i = og 2 (τ + ) = 2. In eve 2 there are 4 segments. If ED(s, q) 2, s and q must have at east = 2 common segments. As q = 7 and min = 7, we ony need to visit the strings with ength between 7 and 9 for τ = 2. First we check L 2, 7, which contains segments { b }. String q has a matched substring b in L 2, 7. As strings s, s 2 and s 3 in the inverted ist of b share a common segment with q, we increase N 2 (s, q), N 2 (s 2, q), N 2 (s 3, q) by. Then we check L 2,2 7, which contains segments { ro }, string q has no matched substring. Simiary we check L 2,3 7 and L 2,4 7. As q has a matched substring th in L 2,3 7 with invited ist of {s, s 2 }, we increase N 2 (s, q) and N 2 (s 2, q) by. Now strings s and s 2 have two matched segments with q, so we verify them and get ED(q, s ) = 2 and ED(q, s ) = 3. We put s into the resut set. As N 2 (s 3, q) = < 2, we can safey prune s 3. Then we perform simiar procedure on L i,j 8 and L i,j 9, no more answers are found and the agorithm terminates. 522

5 Compexity: We anayze the time compexity of Agorithm 2. It consists of two parts: the fitering time and the verification time. Before performing Agorithm 2, we need to group strings by ength, which is O( max = min S ). This can be regarded as offine time and not incuded in the search time. The time to generate set W(q, L i,j ) is O(τ) as discussed in Section IV-B. The tota time of seecting substrings for [ q τ, q + τ and j [, 2 i ] is O(τ 2 ). The fitering cost is to visit the inverted ists which is [ q τ, q +τ] Li,j [w W(q, L i,j )]. The verification cost of q and s for threshod τ is O(τ min( q, s )) [4], and we wi further improve the verification cost in Section IV-C. B. Improving The Fitering Step To find the candidate of a given query q, we need to first generate the set of substrings W(q, L i,j ) and then count how many common segments between strings in S and q. To reach such goa efficienty, we reduce the fitering cost through two directions: () reduce the size of W(q, L i,j ) for j 2 i ; and (2) remove invaid segment matchings. Reduce W(q, L i,j ). It is obvious smaer W(q, L n,j ) wi ead to higher performance. Based on the position fiter, a segment in s cannot be matched with the substrings of q with arge position difference. For exampe, for s 5 = swingabe, query q= bending and threshod τ = 2. The segments of s in the second eve is correspondingy { sw, in, ga, be }. The substring be cannot be matched with the fourth segment because their position difference is arger than τ. The position fiter coud be strengthened by considering the ength difference of s and q, denoted as. Thus suppose the start position of segment w L i,j is Pos i,j and its ength is Len i,j. We can easiy get the ower bound of start positions of substrings in q, denoted as LB = max(, Pos i,j τ ), and the upper bound, denoted as UB = min( q Len i,j +, Pos i,j + τ + ). We ony need to check the substrings starting within the range [LB, UB]. Moreover, by ooking both from the eft-side and right-side perspective [4], we can further reduce the vaue of LB and UB to LB = max(, Pos i,j (j ), Pos i,j + (τ + j)) and UB = min( s Len i,j +, Pos i,j + (j ), Pos i,j + + (τ + j)). Thus W(q, L i,j ) = {q[pos i,j, Len i,j ]} where q[pos i,j, Len i,j ] is the substring of q with start position Pos i,j [LB, UB] and ength Len i,j. The correctness is stated in Lemma 2. Lemma 2: Given a query string q and a threshod τ, using the set W(q, L i,j ) = {q[pos i,j [LB, UB], Len i,j ]} to find matching candidates, our method wi not miss any resut. To identify string s with N i (s, q) 2 i τ, we use the ist-merge agorithm [] to improve the performance which utiizes a heap to efficienty identify the candidates without accessing every strings on the inverted ists. Remove Invaid Matchings. The above method identifies the candidates by simpy counting the number of matched common segments. However it is worth noting that the substrings of q matching with different segments may confict with each other, where confict means that the two matched substrings overap. This is because the segments of the data strings are disjoint and the matched substrings of q shoud aso be Agorithm 3: SEGCOUNT() Input: M i (s, q): Matched segments of s, q in eve i. Output: Number of matched segments without confict. begin 2 D[] = ; 3 for j = 2 to N i (s, q) do 4 D[j] = max <t j {γ(j, t) D[t]} + ; 5 6 return D[N i (s, q)]; end disjoint. For exampe, if q= acompany, s= accompish and threshod τ = 2. If the substring ac of q matches the first segment ac, the substring com cannot match the second segment, because ac and com are overapped in q but they are disjoint in s. If we do not eiminate such confict matching, it wi invove fase positives. For exampe, we get N 2 (s, q) = 2 in the segment count step, but string s has ony one common segment with string q. This fase positive wi resut in arger candidate size and extra verification cost. To sove this probem, we design a dynamic-programming agorithm to cacuate the maximum number of matched segments whie eiminating the confict between matched segments (removing overapped matching). Let D[j] denote the maximum number of matched segments without confict among the first j segments. To cacuate D[j], we need to find the ast matched segment t without confict for t < j, and compute the number of matched segments without confict using this segment. Then we consider whether the current matched segment conficts with the ast matched segment. We use a function γ(j, t) to judge whether two matches j and t confict: γ(j, t) = if there is no confict; γ(j, t) = otherwise. Then we can get the foowing recursion formua: {, j = D[j] = () max <t j {γ(j, t) D[t]} +, otherwise Then we devise a dynamic-programming agorithm based on the above formua as shown in Agorithm 3. The agorithm takes as input the set of matched segments between s and q in eve i, denoted as M i (s, q), which can be easiy gotten when computing N i (s, q). The agorithm outputs the number of matched segments without confict based on Equation. The time compexity of this agorithm is O(x 2 ), where x = N i (s, q). The maximum vaue of x is 2 i in eve i, but in practica cases the number of matched segments is far smaer than 2 i with the hep of efficient substring seection methods. Thus the cost of this agorithm is negigibe. We can integrate this observation into Agorithm 2 (repace ine 8 of Agorithm 2 with Agorithm 3) to enhance the pruning power. Exampe 5: Suppose string s = are accommodate to, q = were acomofortabe and τ = 5. In the segment matching step, we search in eve 3 and get M 3 (s, q) = { ac, com, mo } and N 3 (s, q) = However, when we perform Agorithm 3 on M 3 (s, q), we get D[] =, D[2] = and D[3] = 2. So there are 2 rather than 3 matched segments. As 2 < = 3, we can safey prune s. C. Improving The Verification Step In our HS-Search agorithm, after generating the candidate strings, we verify whether their rea edit distances to the 523

6 Fig. 3. The MutiExtension Verification (y =4) query are within the threshod τ. In this section we devise nove techniques to improve the verification step. To compute the edit distance between two strings q and s, a naive method is to use the dynamic-programming agorithm. Given two strings q and s, it utiizes a matrix C with q + rows and s + coumns where C[i][j] is the edit distance between the substring q[,i] and s[,j] [8]. Actuay, if we ony want to check whether the edit distance between two strings is within a given threshod τ, we can further reduce the compexity by ony computing the C[i][j] vaues for i j τ, with the cost of (2τ +) min( q, s ). A ength-aware method has been proposed to improve the time compexity to τ min( q, s ) [4]. These agorithms can aso do eary termination when the vaues in a row are a arger than τ to further improve the time compexity. For our HS-Search agorithm, we can utiize the set of matched segments in M i (s, q) to avoid unnecessary computations. As we can see from Section IV-A, given a candidate string s, y = N i (s, q). We aign q and s based on these matched segments and partition them into 2 y + parts incuding y matched segments and y + unmatched segments. We ony need to verify whether ED(q, s) τ in this aignment. Suppose the set of matched segments is M i (s, q) = {m,m 2,,m y }, and the set of unmatched segments of query q (string s) is Q = {q,q 2,,q y+ } (S = {s,s 2,,s y+ }), where q j (s j ) is the substring between m i and m i + for j [,y] and q y+ (s y+ ) is the substring after p y (s y ). If two matched segments m j and m j+ are consecutive, s j (q j )=. We denote the tota edit distance as TED = y+ i= ED(s i,q i ).IfTED is arger than τ, s is not simiar to q in this aignment and we can discard s; otherwise we add s into the resut set. We ca this method as SingeThreshod. We can further improve the performance of SingeThreshod by assigning each ED(q j,s j ) with a tighter threshod bound. As we can see, the part s j is between the segments m j and m j+. Let p j denote the order of m j among the 2 i segments in s. For string s, s j consists of p j+ p j segments (for j =, the vaue is p and for j = y +, the vaue is 2 i p y ). For a given threshod τ, ify is exacty 2 i τ, the τ edit operations must be distributed in each segment according to the pigeon hoe principe. In this case, if we find Agorithm 4: MutiExtension (s, q, τ, M i (s, q)) Input: s, q: the strings to be verified τ: the given threshod M i (s, q): matched segments of s, q in eve i. Output: The verification resut begin 2 Generate sets S and Q based on M i (s, q); 3 Generate threshods based on M i (s, q); 4 for j =to N i (s, q)+ do 5 Cacuate ED(q j,s j ) with ength-aware method; 6 if ED(q j,s j ) >τ j then return; 7 Add s into the resut set; 8 end two consecutive errors in one segment (or in other words, more than τ j errors appear in part j), we can safey terminate the verification step. The threshod τ j of each part j is cacuated as foows: τ j = p, j = 2 i p y, j = y + p j+ p j, otherwise We propose an eary termination technique (Lemma 3). Lemma 3: Consider two strings s and q with exacty y = 2 i τ common segments. If ED(q j,s j ) >τ j, ED(q, s) >τ. Based on Lemma 3, we can devise the verification agorithm MutiExtension as shown in Agorithm 4. In this agorithm, we can use the ength-aware method to efficienty verify ED(q j,s j ). Next we wak through the two verification agorithms SingeThreshod and MutiExtension using the foowing exampe as shown in Figure 3. Exampe 6: Consider two strings s 7 = abna evina in Tabe II and q= ovner oevi, and the threshod is 5. We perform HS-Search on eve 3 with 8 segments. The segments of s on eve 3 are respectivey { a, b, n, a,, ev, i, na }. As we can see from Figure 3, s and q have matched segments m = n,m 2 =,m 3 = ev. p = 3, p 2 = 5, p 3 = 6. Then we can generate the set of S={s = ab,s 2 = a,s 3 =,s 4 = ina } and Q={q = ov,q 2 = er,q 3 = o,q 4 = i }. ForMutiExtension, the threshods of each part are respectivey 2,,,2. Then we cacuate ED(s,q )=2 2, ED(s 2,q 2 )=2>, then MutiExtension terminates and discards s. But SingeThreshod wi continue to cacuate ED(s 3,q 3 )=, ED(s 4,q 4 )=2 and TED = = 7 > 6, then it discards s. Thus MutiExtension outperforms SingeThreshod. V. TOP-K SIMILARITY SEARCH In this section, we study the top-k simiarity search probem. Different from the threshod-based simiarity search, the top-k simiarity search has no fixed threshod. Athough we can extend the threshod-based method to find top-k answers by enumerating threshods incrementay unti k resuts found, this agorithm is rather expensive because it executes mutipe (unnecessary) search operations for each threshod and invoves many dupicated computations. To address this issue, we devise an efficient agorithm HS-Topk to support top-k simiarity search using our HS-Tree index. The basic idea is to first access the promising strings with arge possibiity (2) 524

7 to be simiar to the query so as to prune arge numbers of dissimiar strings effectivey. To this end, we first propose a batch-pruning-based method in Section V-A and then present an effective pruning strategy in Section V-B. Finay we devise our HS-Topk agorithm in Section V-C. A. The Batch-Pruning-Based Method We maintain a priority queue Q to keep the current k promising resuts. Let UB Q denote the argest edit distance between the strings in Q to the query, i.e., UB Q = max{ed(s Q, q)}. Obviousy UB Q is an upper bound of the edit distances of top-k resuts to the query. In other words, we can prune a string if its edit distance to the query is arger than UB Q. Next we discuss how to utiize the queue to find top-k answers. Given a query q, we sti access the HS-Tree in a top-down manner. Consider the i th eve of the HS-Tree with ength. For each node n i,j, we generate the substrings W(q, L i,j ) for j 2 i and identify the corresponding inverted ists L i,j. We group the strings in the inverted ists based on the number of substrings they contain in W(q, L i,j ). Let B x denote the group of strings containing x substrings. As each string contains at most 2 i segments, there are at most 2 i groups. For strings in B x, they share x common segments with the query. If 2 i < UB Q, for x [, 2 i ] a strings in B x can be regarded as candidates. Otherwise, there are 2 i x mismatch segments for strings in B x. As a mismatch segment eads to at east edit error, and thus the ower bound of the edit distances of strings in B x to query q is LB Bx = 2 i x. Obviousy if LB Bx UB Q, we can prune the strings in B x based on Lemma 4. In other words, we ony need to visit the groups such that x 2 i UB Q. On the other hand, the arger x is, the strings in B x have arger possibiity in the top-k answers. Thus, we want to first access the strings in groups with arger x. These two observations motivate us devise a batch-pruning-based method. Lemma 4: If LB Bx UB Q, strings in B x can be pruned. For eve i, if 2 i UB Q +, we can terminate because we have found a top-k answers within threshod UB Q using the nodes in the first i eves; otherwise, we retrieve the inverted ists of substrings in W(q, L i,j ) from the i th eve and identify the substrings with N i (s, q) 2 i UB Q from these ists. Next we group the strings into B x (x [2 i UB Q, 2 i ]) based on the number of matched segments. Then, we visit the groups based on the number x in descending order. For each string s B x, we compute the rea edit distance between s and q. If ED(s, q) < UB Q, we update the priority queue Q and UB Q using s. Iterativey, we can correcty find the top-k answers. Obviousy this batch-pruning-based method not ony reduces the fitering cost, because we ony need to do segment counting once for a eve i whie we need to perform 2 i times for each threshod using the threshod-based method, but aso the verification time, because we can use a tighter bound UB Q to do verification. Exampe 7: Consider a top-2 query q = breahers on the dataset in Tabe I. Suppose string s 4 = breathes is aready in Q as it has a common segment brea in eve with q. ED(q, s 4 ) = 2. UB Q =. Then, consider Leve i Fig. 4. HS-Tree # matched segments i argerthan2-ub Q The Batch-Pruning-Based Method Sorted B 2 i... B 2 i-ub Q Groups s ed(s,q) < UB Q UB Q Q update the HS-Tree for strings with ength 7 (S 7 ) in Figure 2. We start from eve. As there is no matched segment, we move to eve 2. There are two matched segments br, er. Their inverted ists are {s } and {s, s 2, s 3 } where s = brother, s 2 = brothe, s 3 = breathe. we have N 2 (s, q) = 2, N 2 (s 2, q) = and N 2 (s 3, q) =. We put s into B 2 and verify B 2. As ED(q, s )=3 < UB Q, we add s into Q and update UB Q = 3. As 2 2 UB Q +, the agorithm is terminated and strings in B = {s 2, s 3 } are pruned. B. The Greedy-Match Strategy The batch-pruning-based method can effectivey prune strings without enough common segments. If each mismatch segment ony contains one edit error, this method is very effective as it can effectivey estimate the ower bound. However, if one mismatch segment invoves more than one consecutive errors, the estimation is not accurate, and this method fais to fiter such candidates. For exampe, consider query q = broader on the dataset in Tabe I. For UB Q =, string s = brother and s 2 = brothe can pass the segment fiter as they share a common segment bro, but it is obvious that their edit distances to q are arger than as the second segment contains 3 errors. To address this issue, we devise a greedy-match strategy to prune strings with consecutive errors by utiizing our hierarchica index structure. Consider a string s in eve i with N i (s, q) 2 i UB Q. In this case, we cannot prune s. Instead of directy verifying string s, we go to the next eve i + and estimate a tighter bound by counting the number of matched segments in eve i + (i.e., N i+ (s, q)). If the number is smaer than 2 i+ UB Q, we can prune string s based on Lemma 4. If the string is not pruned in eve i+. We check the eve i+2. Iterativey, if the string is sti not pruned in the eaf eve, we wi compute the rea edit distance based on the method in Section IV-C. It is worth noting that the arger the eve is, the shorter a segment is, and the higher probabiity that those dissimiar strings with consecutive errors can be pruned. Next we discuss how to efficienty compute the number of matched segments between s and q. A naive method enumerates each segment of s and checks whether it appears as a substring of q. This method shoud enumerate many segments. Aternativey, we propose an effective method. Based on the characteristics of the HS-Tree, if the j th segment in eve i matches a substring of q, the 2 j th and 2 j th segments must match two substrings of q in eve i +. Thus we do not need to check them again. Thus, we ony need to check the mismatch segments in eve i. 525

8 Agorithm 5: GREEDYMATCH (s, q, i, τ) Input: s, q: the strings to be verified i: eve; τ: the current threshod M i (s, q): matched segments of s, q in eve i. Output: True or Fase begin 2 for r = i + to n do 3 for segments w M r (s, q) do 4 put w s two subsegments into M r (s, q); for j = to 2 r do if segment j / M r (s, q) then check the segment j; if find a matched substring then put segment j into M r (s, q); if N r (s, q) < 2 r τ then return Fase; return True; end The pseudo-code of the greedy-match strategy is shown in Agorithm 5. It gets the set of matched segments M i (s, q) in current eve i and then use M i (s, q) to generate M i+ (s, q). This iterativey-matching procedure for different eves wi not invove heavy fitering cost, because segment j in eve i corresponds to segments 2 j and 2 j in eve i+ and such matched segments can be passed down to ower eves. Besides, as we have generated the substrings W(q, L r,j s ) for each eve r ( r n), when we ook for a matched substring for segment j, we just check W(q, L r,j s ) and do not need to scan inverted ists any more. Exampe 8: Consider s 8 = christopher swenson and query q= atrmstophbwcmrense. The ength of s 8 is 9, so there are 4 eves. Suppose the current threshod UB Q is 3. It requires = matched segments in eve 2, = 5 segments in eve 3 and = 3 segments in eve 4. We find a matched segments stoph in eve 2. Instead of verification, here we continue to ook for another 5 2 = 3 matched segment in eve 3. We find a matched segment en in eve 3. As there are totay 3 matched segments in eve 3, which is smaer than the required number of matched segment 5, s 8 wi be pruned and we do not need to compute its rea edit distance to q. C. The HS-Topk Agorithm We combine the batch-pruning-based method and greedymatch strategy together and devise a top-k simiarity search agorithm HS-Topk. The pseudo-code is shown in Agorithm 6. It first initiaizes the queue Q and sets the threshod UB Q = (ine 2). Then it searches the HS-Tree from the root (ine 3). If the vaue of UB Q is no arger than the minimum threshod supported by current eve (2 i UB Q + ), the agorithm terminates and we can safey prune remainder strings (ine 4). For each eve, we ony visit HS-Tree with engths between q UB Q and q + UB Q based on ength fitering (ine 5). For each HS-Tree, it identifies the matched segments, groups strings with different numbers of matched segments into different groups, and visits the group sorted by the number Agorithm 6: HS-Topk (S, q, k) Input: q: the query string; k: the size of resut set S: The string set Output: R: the top-k answer begin 2 Initiaize queue Q and UB Q ; 3 for i = to L do 4 if 2 i UB Q + then return; 5 for [ q UB Q, q + UB Q ] do 6 Identify strings with occurrence number arger than 2 i UB Q and group them to B x ; 7 for x = 2 i to 2 i UB Q do 8 for each string s B x do 9 if GREEDYMATCH(s, q, i, 2 i UB Q ) then Verify ED (s,q); if ED(s, q) < UB Q then Update Q and UB Q ; 2 3 Fig. 5. end christoph erswenson LV chri stoph er sw enson LV 2 Match Greedy q=atrmstophbwcmrense q=atrmstophbwcmrense ch ri st oph er sw en son LV 3 An Exampe for Greedy-Match Strategy in ascending order (ine 6). For each string in the current group B x, we perform the greedy-match strategy (ine 9). If the string passes the fiter, we verify the candidate using threshod UB Q based on the techniques in Section IV-C (ine ). If ED(s, q) < UB Q, we use s to update Q and UB Q (ine 2). VI. EXPERIMENTAL STUDY In this section, we conduct an extensive set of experiments and our experimenta goa is to evauate the efficiency of our agorithms and compare with state-of-the-art methods. A. Experiment Setup We used three pubicy avaiabe rea datasets in our experiments: DBLP Author, DBLP pubication records and Query Log 2, which are widey used in previous studies [9]. The detais of data sets are shown in Tabe II. Author contains short strings, Query Log contains medium-ength strings, and DBLP contains ong strings. We compared our agorithms with state-of-the-art methods. For threshod-based simiarity search, we compared our HS- ey/db/

9 TABLE II. DATASETS Datasets # Avg Len Max Len Min Len Author 62, Query Log 464, DBLP,385, Search agorithm with Adapt [2], QChunk[6] and B ed - tree[27]. Athough there are some other simiarity search agorithms, e.g., Famingo[] and VChunk [2], it has been shown in the previous studies that both Adapt and QChunk outperformed them [2], [6]. So we ony compared HS- Search with Adapt and QChunk. For top-k simiarity search, we compared our HS-Topk agorithm with Range [5], B ed - tree and AQ[26]. We obtained the source code of Adapt, Range, B ed -tree from the authors and impemented QChunk and AQ by ourseves [9]. A the agorithms were impemented in C++ and compied using GCC with -O3 fag. A the experiments were run on a Ubuntu server machine with two Inte Xeon E542 CPUs (8 cores, 2.5GHz) and 32GB memory. B. Evauation on Threshod-based Search ) Evauating Different Verification Agorithms: We first evauated the verification methods. We impemented three methods Length-aware, SingeThreshod and MutiExtension. Length-aware is the ength-aware method [4]; SingeThreshod is the method that has ony one threshod τ; MutiExtension is our method that has separate threshods for each matched part based on Lemma 3. A the three methods were impemented with eary-termination techniques. Figure 6 shows the resuts by varying edit-distance threshods on the three datasets. We can observe that SingeThreshod invoves ess verification time than Length-aware because it can avoid dupicated computations on aready matched segments and divided the two strings into different parts. For each part, MutiExtension has a different threshod and wi terminate as soon as the edit distance of one part is arger than the given threshod of that part. Thus comparing with SingeThreshod, MutiExtension wi terminate earier.for exampe, on the Query Log dataset, for τ = 5, Length-aware took 55 miiseconds on average, and SingeThreshod decreased the time to 39 miiseconds,whie MutiExtension further reduced it to 24 miiseconds. 2) Fiter Time vs Verification Time: Next we evauated the cost of the fiter step and the verification step and the resut is show in Figure 7. We can see that our segment fiter has great fitering power for sma threshods, and a arge number of dissimiar strings can be pruned in the fiter step, so the verification time is further reduced. For arge threshods e.g., τ = 5 and τ = 2 in Figure 7(c), the verification time is dominant in the overa time. This is because when the threshod becomes arge, we need to do segment matching in ower eves and the segments woud be much shorter. It is obvious shorter segments have more chance to be matched, so the number of candidates is arger than that of sma threshods. 3) Comparison with state-of-the-art methods: We compared our HS-Search agorithm with state-of-the-art agorithms Adapt,QChunk and B ed -tree by varying different editdistance threshods on the three datasets Author, Query Log and DBLP. Figure 8 shows the resuts. We can see that HS- Search achieved the best performance on a the datasets and outperformed existing agorithms by 3 to 2 times. For exampe, on the Query Log dataset for τ =, HS-Search took 6 miiseconds. And the average search time for Adapt, TABLE III. THRESHOLD-BASED SIMILARITY SEARCH: INDEX Dataset Method Index Size(MB) Index Construction Time(s) HS-Search Author Adapt QChunk B ed -tree HS-Search Query Adapt Log QChunk 72.7 B ed -tree HS-Search DBLP Adapt QChunk B ed -tree QChunk and B ed -tree were 32, 5 and 4 miiseconds respectivey. Among a the baseines, the overa performance of B ed -tree was the worst because it had poor fitering power to prune dissimiar strings. Adapt had better performance than QChunk because Adapt took advantage of the adaptive prefix ength to reduce a arge number of candidates. Our method achieved the best performance for the foowing reasons. First, existing agorithms are based on n- grams, and our method is based on segments which have much stronger fitering power than gram-based methods (as segments are onger than n-grams). Moreover, the segments are seected across the string and not restricted to the prefix. Thus, our method aways generates the east number of candidates. Second, comparing with other simiarity search agorithms we aso design efficient verification mechanism. In this way, we can take advantage of the resuts of fiter step and avoid redundant computations. Third, since previous agorithms are based on n-grams, they need to tune the parameter n for different datasets even for different threshods on the same dataset to achieve the best performance. Our HS-Search agorithm does not need any parameter tuning. So the utiity of HS-Search is much better than the existing agorithms. In addition, we compared the index construction time and index size and the resut is shown in Tabe III. We have the foowing observations. First, HS-Search had the east index time because it can iterativey divide the segments in different eves and do not need to buid a arge inverted index ike n-gram based methods. Second, the index size of HS-Tree is neary the same with state-of-the-art n-grambased methods, because HS-Tree partitions each string with ength into disjoint segments in each eve with totay og = O() segments; and n-gram based methods generate n + grams. Thus they generate simiar number of n-grams/segments and thus the index sizes are aso simiar. In addition, in the inverted ists, for a segment, HS-Tree ony needs to maintain the string containing it but n-gram-based methods aso need to store the position of n-gram, so our index size can be smaer than state-of-the-art methods. Third, B ed -tree organizes simiar strings into a B-tree node based on specific orders and does not need to maintain inverted ists, thus its index size is smaer. 4) Scaabiity: We evauated the scaabiity of HS-Search. We varied the number of strings in each dataset and tested the average search time. Figure 9 shows the resut on the three datasets. We can see that as the size of a dataset increased, our method scaed very we for different edit-distance threshods and achieved near inear scaabiity. For exampe, on the DBLP data set, when the threshod was 2, the average search time for 4, strings, 5, strings and 6, strings were respectivey 2, 27, and 33 miiseconds. 527

MassJoin: A MapReduce-based Method for Scalable String Similarity Joins

MassJoin: A MapReduce-based Method for Scaabe String Simiarity Joins Dong Deng, Guoiang Li, Shuang Hao, Jiannan Wang, Jianhua Feng Department of Computer Science, Tsinghua University, Beijing, China {dd11,