MassJoin: A MapReduce-based Method for Scalable String Similarity Joins

Size: px

Start display at page:

Download "MassJoin: A MapReduce-based Method for Scalable String Similarity Joins"

Edgar McKenzie
6 years ago
Views:

1 MassJoin: A MapReduce-based Method for Scaabe String Simiarity Joins Dong Deng, Guoiang Li, Shuang Hao, Jiannan Wang, Jianhua Feng Department of Computer Science, Tsinghua University, Beijing, China {dd11, hs13, wjn8}@mais.tsinghua.edu.cn, {iguoiang,fengjh}@tsinghua.edu.cn Abstract String simiarity join is an essentia operation in data integration. The era of big data cas for scaabe agorithms to support arge-scae string simiarity joins. In this paper, we study scaabe string simiarity joins using MapReduce. We propose a MapReduce-based framework, caed MASSJOIN, which supports both set-based simiarity functions and characterbased simiarity functions. We extend the existing partition-based signature scheme to support set-based simiarity functions. We utiize the signatures to generate key-vaue pairs. To reduce the transmission cost, we merge key-vaue pairs to significanty reduce the number of key-vaue pairs, from cubic to inear compexity, whie not sacrificing the pruning power. To improve the performance, we incorporate ight-weight fiter units into the key-vaue pairs which can be utiized to prune arge number of dissimiar pairs without significanty increasing the transmission cost. Experimenta resuts on rea-word datasets show that our method significanty outperformed state-of-the-art approaches. I. INTRODUCTION Data integration has received significant attention in the ast three decades, because it can combine heterogenous data from different sources and provide a unified view of these data. The string simiarity join, which, given two coections of strings, finds a simiar string pairs, is an essentia operation in data integration. The simiarity between two strings is usuay quantified by simiarity functions. There are two main types of simiarity functions (see Section II): set-based simiarity functions (e.g., Jaccard) and character-based simiarity functions (e.g., Edit distance). String simiarity joins have many reaword appications, e.g., entity resoution, copy detection, and document custering. Most of existing simiarity-join methods used in-memory agorithms. The era of big data poses new chaenges for arge-scae string simiarity joins and cas for new scaabe agorithms. MapReduce provides a programming mode for processing arge-scae data, and in this paper we study scaabe string simiarity joins using MapReduce. A naive method is to enumerate a string pairs from the two coections and use MapReduce to process these string pairs. However this method is rather expensive for arge string sets. To address this probem, Vernica et a. [14] proposed a prefix fitering based method, which used a fiter-and-verification framework. In the fiter step, they seected some tokens from each string and generated a set of candidate pairs who share a common token. In the verification step, they verified the candidate pairs to generate the fina answers. One big imitation of this method is ow pruning power. As a singe token is very short and usuay has ow seectivity, many dissimiar pairs wi share a same token and cannot be pruned. To address this imitation, we propose a MapReduce-based framework, caed MASSJOIN, for scaabe string simiarity joins, which supports both set-based simiarity functions and character-based simiarity functions. We aso adopt a fiterand-verification framework. In the fiter step, we generate the signatures for each string and prove that two strings are simiar ony if they share a common signature. We utiize this property to generate the candidate pairs. In the verification step, we verify the candidate pairs to generate the fina resuts. One big chaenge is to generate high-quaity signatures. PASSJOIN [11] proposed a high-quaity partitionbased signature scheme for the edit distance function. We extend the partition-based signature scheme to support setbased simiarity functions. It is worth noting that the extension is nontrivia (see Section III), because the partition number is fixed for edit distance whie the partition numbers for set-based simiarity functions are not. To use MapReduce, we take the signatures as keys and the strings as vaues to generate key-vaue pairs. Then we use the key-vaue pairs to compute the candidate pairs that share a same key (see Section IV). However this method may generate arge numbers of key-vaue pairs and eads to huge transmission cost. For exampe, considering the Jaccard function, this method generates O( 3 ) key-vaue pairs for a string with ength. To address this issue, we merge key-vaue pairs to significanty reduce the number of key-vaue pairs but without sacrificing the pruning power (see Section V). For exampe, we can reduce the number from O( 3 ) to O(). To improve the performance, we incorporate ight-wight fiter units into the key-vaue pairs which can be utiized to prune arge numbers of dissimiar pairs without significanty increasing the transmission cost (see Section VI). In summary, we make the foowing contributions. We extend the partition-based signature scheme to support set-based simiarity functions. We propose a scaabe MapReduce-based framework to support both set-based and character-based simiarity functions. We devise a merge-based agorithm to significanty reduce the number of key-vaue pairs without sacrificing the pruning power. We deveop a ight-weight fiter unit based method to prune arge numbers of dissimiar pairs whie not significanty increasing the transmission cost.

2 We have impemented our method and experimenta resuts on rea-word datasets show that our method significanty outperforms state-of-the-art approaches. The structure of the rest paper is as foows. We formuate our probem and review reated works in Section II. We extend existing partition-based signature scheme to support set-based simiarity functions in Section III. Section IV presents our MASSJOIN framework. We devise a merge-based method to reduce the number of key-vaue pairs in Section V and deveop a ight-weight fiter unit based method to reduce the number of candidate pairs in Section VI. We show the experimenta resuts in Section VII and give the concusion in Section VIII. A. Probem Definition II. PRELIMINARY The simiarity join probem finds a simiar string pairs from two given string coections, where the simiarity between two strings is usuay quantified by simiarity functions. Given a simiarity function SIM and a simiarity threshod δ, two strings r and s are simiar if and ony if SIM(r, s) δ. There are two main types of simiarity functions, set-based simiarity functions and character-based simiarity functions. Set-based simiarity functions first tokenize each string into a set of tokens, where a token can be either a word or a n-gram. In the paper we suppose each token is a tokenized word. For simpicity, strings and token sets are interchangeaby used if there is no ambiguous. There are three we-known set-based simiarity functions, Jaccard Simiarity, Dice Simiarity, and Cosine Simiarity, defined as beow. JAC(r, s)= r s r s, r s COS(r, s)=, r s DICE(r, s)= 2 r s r + s, where denotes the size of a set. For exampe, suppose r = Barack H Obama II and s = Barack Obama II. Then we have r = 4, s = 3, r s = 3, and r s = 4. Thus, JAC(r, s) = 3 3 4, COS(r, s) = 12 and DICE(r, s) = 6 7. Character-based simiarity functions are defined based on the number of character operations to transform one string to another. Edit Distance (ED) is a we-known character-based simiarity function. The ED of two strings is the minimum number of edit operations needed to transform one string to another, where the permitted edit operations incude insertion, deetion and substitution. For exampe, given r = michea and s = michae, we have ED(r, s) = 2. Next we formuate the simiarity-join probem. Definition 1 (Simiarity Joins): Given two string coections R and S, a simiarity function SIM (or a distance function DIS) and a simiarity threshod δ (or a distance threshod τ), a simiarity join finds a string pairs r R, s S such that SIM (r, s) δ (or DIS (r, s) τ). For exampe, consider the two string sets in Tabe I. Suppose the Jaccard threshod is δ =.8. The simiarity join finds a simiar pair r 2, s 2 with JAC(r 2, s 2 ) =.8. R S TABLE I TWO STRING COLLECTIONS: R AND S id string size r 1 conference on service information management 5 r 2 poicy on service information management 5 r 3 the poicy on service 4 B. Reated Work s 1 the conference on information management 5 s 2 service information management poicy 4 s 3 conference on information management poicy 5 Partition-based Method: PASSJOIN [11], [1] is the state-ofthe-art simiarity-join agorithm for edit distance. It won the champion in the recent simiarity-join competition organized by EDBT PASSJOIN empoys a fiter-and-verification framework. In the fiter step, it generates signatures for strings. If two strings are simiar, they must share a common signature. It prunes arge numbers of dissimiar pairs based on this idea and the survived pairs are caed candidate pairs. In the verification step, it verifies the candidate pairs to generate the fina answers. One big chaenge in PASSJOIN is to generate high-quaity signatures for two strings. Suppose the edit-distance threshod is τ. Given two strings r and s, it partitions r into τ + 1 disjoint segments, and proves that if s is simiar to r, s must contain a substring that matches one of r s segments, based on the pigeon-hoe theory. Thus for string r, it generates τ +1 signatures. For string s, it seects some substrings of s as its signatures. However PASSJOIN cannot support set-based simiarity functions, because the number of segments (e.g., τ + 1) is not fixed for these functions. We extend PASSJOIN to support set-based simiarity functions and propose a scaabe framework MASSJOIN for arge-scae string simiarity join using MapReduce. Moreover, to reduce the transmission cost, we merge the seected substrings and decrease the number of signatures from O( 3 ) to O() for set-based simiarity functions, where is the string ength, and from O(τ 3 ) to O(min(τ 2, )) for character-based simiarity functions. Simiarity Join Using Map-Reduce: Vernica et a. [14] proposed a simiarity join method using MapReduce which utiized the prefix fitering to support set-based simiarity functions. They seected a subset of tokens as signatures and proved that two strings are simiar ony if their signatures share common tokens. For each string, they used each token in its prefix as a key and the string as a vaue to generate the key-vaue pairs. As a singe token is very short and usuay has ow seectivity, these methods have two disadvantages. First, many dissimiar pairs may share the same token, and they wi generate many fase positives and thus ead to poor pruning power. Second, a singe token may ead to a skewed probem and there may be arge numbers of strings sharing the same key. Thus the reducer with such keys wi have a heavy workoad. Metway et a. [12] proposed a 2-stage agorithm V- SMART-Join for simiarity joins on sets, mutisets and vectors. They aso used a singe token as a key and had the same 1 wandet/searchjoincompetition213 2

3 issues as [14]. Afrati et a. [1] proposed mutipe agorithms to perform simiarity joins in a singe MapReduce stage. They anayzed the map, reduce and communication cost. However for ong strings, it is rather expensive to transfer the strings using a singe MapReduce stage. Kim et a. [9] addressed the top-k simiarity join probem using MapReduce, which is different from our probem. In-Memory Simiarity Joins: There are many studies on in-memory simiarity join agorithms [19], [4], [15], [18], [2], [8], [7], [6], [17], [16], [2], [21], [3], [13], [1], [2]. Bayardo et a. [3] first proposed the prefix-fiter framework for simiarity joins. Xiao et a. [21] improved the prefix-fiter framework by introducing position fitering. ED-Join [19] extended prefix fitering to support edit distance and used ocation-based mismatch techniques to shorten prefixes and content-based mismatch to fiter dissimiar pairs. Wang et a. [17] proposed an adaptive prefix-fiter framework to dynamicay seect prefixes with different engths. Wang et a. [15] proposed Trie-Join which utiizes a trie structure to recursivey perform prefix fiter. Arasu et a. [2] proposed PartEnum which partitions strings into pieces and enumerates deetion neighborhoods on the pieces as signature to do simiarity joins. Wang et a. [18] proposed VChunkJoin to use qchunks with different engths as signatures and empoy the prefix-fiter framework to do simiarity joins. Wang et a. [16] devised a new kind of simiarity functions and proposed efficient agorithm on the new simiarity functions. Qin et a. [13] proposed an asymmetric signature for both simiarity join and search probems. Chaudhuri et a. [4] proposed a primitive operator for simiarity joins. Gravano et a. [7] proposed to perform simiarity joins inside RDBMS. Xiao et a. [2] studied top-k simiarity joins using adaptive prefix fitering. Jacox et a. [8] studied simiarity joins under metric space. Different from these studies, in this paper we focuses on how to support arge-scae simiarity joins using MapReduce. MapReduce: MapReduce [5] is a famous framework proposed by Googe to faciitate processing arge-scae data in parae. The MapReduce program runs on a arge custer with mutipe nodes. The input data fies are divided into sma fie spits and distributed in a distributed fie system (DFS). MapReduce incudes a map phase, a shuffe phase and a reduce phase to process the data. Each map node reads the fie spits on the node, processes the input key, vaue pairs, and emits intermediate key, vaue pairs. The intermediate key, vaue pairs are shuffed based on the keys and transferred to the reduce nodes. A the intermediate key, vaue pairs with same key must be shuffed to the same reduce node. Each reduce node receives a key-vaue pair key, ist(vaue), where ist(vaue) is a ist of vaues sharing the same key, processes the pair, and writes its output to DFS. III. SIGNATURES FOR SET-BASED SIMILARITY FUNCTIONS To avoid enumerating a string pairs from the two given string sets, we adopt a fiter-and-verification framework. We extend the partition-based signature scheme designed for edit distance [11] to support set-based simiarity functions in this section. Notice that the extension is non-trivia, because PASSJOIN reies on a given edit-distance threshod and generates a fixed number of signatures. However for set-based simiarity functions, the number depends on the string engths. We need to deduce the number of signatures and devise new agorithms to generate signatures. To address this probem, we first discuss how to generate signatures for two strings in Section III-A and then extend the method to support two string sets in Section III-B. Finay, we give the signature compexity in Section III-C. A. Signatures for Two Strings r and s Given two token sets r and s, and a jaccard threshod δ, we sort each token set based on a universa order, e.g., aphabetica order, and obtain two sorted token ists. For simpicity, r and s are referred to their corresponding sorted token ists if there is no ambiguous. If r and s are simiar w.r.t Jaccard threshod r s r + s r s δ, i.e., JAC(r, s) = r s r s = δ, we can deduce r s δ ( r + s ). The tota number of different tokens between r and s is r s + s r. Since r s and s r are two integers, r s r r s r δ s and s r s r s s δ r. Let U = r δ s + s δ r which shoud be an upper bound of r s + s r. We spit r into U + 1 disjoint segments. Based on the pigeon-hoe theory, if s is simiar to r, s must contain a substring which matches a segment of r. We use the segments of r as r s signatures and seect some substrings of s as s s signatures. Then if r and s are simiar, they must share a common signature. Next we formay discuss how to generate the signatures for r and s. Signatures for r: There are mutipe ways to partition r into U + 1 segments. Here we use an even partition scheme as an exampe, where a U + 1 segments have neary the same ength. Formay, given a string r with ength = r, et k = U+1 (U +1). The ast k segments of string r have a ength of U+1 and the first U + 1 k segments have a ength of U+1. Let (U + 1, i, ), abbreviated as i, denote the ength of the i-th segment of r. Obviousy i = U+1 if i U +1 k; otherwise U+1. Let p(u +1, i, ), abbreviated as p i, denote the start position of the i-th segment, i.e., p 1 = 1 and p i = j<i j=1 j +1. The i-th segment of r is the substring of r starting from p i and with ength i, denoted by r[p i, i ]. Thus the signatures of r are tripes r[p i, i ], i, for 1 i U +1. Signatures for s: If s is simiar to r, it requires to have a signature matching one of signatures of r, e.g., r[p i, i ], i,. Thus the signature of s shoud be in the form of s[x i, i ], i, such that r[p i, i ], i, = s[x i, i ], i,. Intuitivey, the start position of the i-th substring of s, e.g., x i, can be any integer in [1, s i + 1]. However this method wi generate arge numbers of signatures. To address this issue, we propose two signature generation methods, a position-aware method and a muti-match-aware method. Position-aware method. Consider the i-th signature of r with start position p i, and a signature of s that matches the i-th 3

4 signature of r with start position x i. If x i < p i, we can deduce that p i x i r s, because r and s are sorted by a universa order, and if r[p i, i ] = s[x i, i ], among the first p i 1 tokens of r and the first x i 1 tokens of s, there are at east p i x i tokens that appear in r and do not appear in s. Since there are totay r s such tokens, p i x r δ s i r s. Let U r s = r δ s, we have p i x i U r s and x i p i U r s. Simiary, if x i p i, we have x i p s δ r i s r. Let U s r = s δ r, we have x i p i + U s r. Thus the position-aware method sets x i [p i U r s, p i +U s r] which wi not invove fase negatives as stated in Lemma 1. Lemma 1: The position-aware signature generation method wi not invove fase negatives. 2 Proof Sketch: We prove it by contradiction. Suppose r and s are simiar and s has a substring with start position x i matching the i-th segment of r and x i < p i U r s. Obviousy r s p i x i > U r s which contradicts with that U r s r s. Simiary we can prove x i p i + U s r. Muti-match-aware method. Sti consider the i-th signature of r with start position p i, and a signature of s that matches the i-th signature of r with start position x i. We have x i p i (i 1), because if x i < p i (i 1), r[1, p i 1] s[1, x i 1] > i 1 and there are at east i different tokens in r and s before the i-th segment based on the ength difference. In other words, if they are simiar, after the i-th segment, there are at most U + 1 i 1 mismatch tokens. Since there are U + 1 i segments, there must be a matching signature after the i-th segment based on the pigeon-hoe theory. Obviousy we can skip this one and use the atter one. Simiary, if we consider the reversed strings of r and s, we have s x i r p i (U + 1 i). We can deduce that x i p i + s r + (U + 1 i). Thus the muti-match-aware method sets x i [p i (i 1), p i + s + (U + 1 i)] where = r, and this method wi not invove any fase negative as stated in Lemma 3. Lemma 2: The muti-match-aware signature generation method wi not invove fase negatives. Proof Sketch: Consider any string r with ength. The start position of its i-th segment is p i. If the first i 1 segments of r contain more than i 1 different tokens from s, regardess of the i-th segment, they must share another common segment in the ast U + 1 i segments and we can drop the i-th segment. Meanwhie, if r and s are simiar and s contains a substring with start position x i matching the i- th segment of r, the number of different tokens in the first i 1 segments of r cannot be smaer p i x i. Thus we have p i x i i 1, i.e., x i p i (i 1). Symmetricay, we can get x i p i + s + (U + 1 i). Interestingy, we can use the two methods simutaneousy and propose a hybrid method which sets x i [max(p i (i 1), p i U r s), min(p i + s + (U + 1 i), p i + U s r)]. 2 For space constraints, we omit detaied proofs of emmas in this paper This hybrid method wi not invove fase negatives as stated in Lemma 3, because the two methods are independent. Lemma 3: The hybrid signature generation method wi not invove fase negatives. Proof Sketch: Based on Lemma 1, we have for any two strings r and s, if s has a substring with start position x i matching the i-th segment of r and x i [p i U r s, p i +U s r], r s > U r s or s r > U s r. That is to say r and s cannot be simiar. Thus for any two simiar strings, a there matching signatures must be within the range generated by the position-aware method. Simiary this caim is aso true for the muti-match-aware method. Thus the hybrid signature generation method wi not invove fase negatives. Thus the signatures of s with respect to r are s[x i, i ], i, for 1 i U + 1 and i x i i, where i = max(p i (i 1), p i U r s ), i = min(p i + s + (U + 1 i), p i + U s r ). Exampe 1: Consider two strings r 1 and s 1 in Tabe I. Suppose the JAC threshod is δ =.8. To generate the signatures for r 1, we compute = r = 5, s = 5, U =, p 1 = 1 and 1 = 5, and obtain the signature r 1 [1, 5], 1, 5 for r 1. To generate the signatures for s 1, we compute U r s =, U s r =, 1 = 1, and 1 = 1, and obtain the signature s 1 [1, 5], 1, 5 for s 1. B. Signatures for Two String Sets When we join two datasets R and S, many strings in R may be simiar to many strings in S, and we need to generaize our method to support two string sets. Based on the ength pruning, if a string s is simiar to r, r s i.e., r s δ, we can deduce s r s r s δ r δ. Simiary we can deduce s r s δ r s δ r. Thus the upper bound of engths of the possibe simiar strings to r is r /δ and the ower bound is δ r. Obviousy for string r we need to consider the strings with engths in the range and generate vaid signatures of r for any possibe simiar string s with ength in [ δ r, r /δ ]. To this end, we derive an upper bound of the number of different tokens to r for a possibe simiar strings that are simiar to r, denoted by U. Obviousy U = max{ r s + s r s is simiar to r}. As r s + s r r + s 2 r s 1 δ ( r + s ) 1 δ r 1 δ ( r + δ ) = δ r, we can deduce a bound U = 1 δ δ r. Thus if a string s is simiar to r, we have r s + s r U. Notice that we can deduce a much tighter bound of U for Jaccard as stated in Lemma 4. Lemma 4: We can deduce a tighter upper bound U = r δ r /δ + r /δ δ r. Based on the upper bound U, we can devise a partition scheme for two string sets as beow. Signatures for r R: We partition r into = U + 1 segments as discussed in Section III-A. The signatures of r are 4

5 TABLE II BOUNDS FOR A SIMILAR STRING PAIR r, s WITHIN THRESHOLD δ JAC COS DICE ED U r s r δ s r δ r s r δ( r + s ) - 2 U s r s δ r s δ r s s δ( r + s ) - 2 U U r s + U s r - U 1 δ 1 δ2 r δ δ 2 r 2(1 δ) r δ τ 1 δ 1 δ2 r + 1 δ δ 2 r + 1 2(1 δ) r + 1 δ τ+1 L o δ s δ 2 s δ 2 δ s s τ L u s s 2 δ s s +τ g δ δ 1 δ δ δ 2 1 δ 2 δ 2 ( f 3 )(1 δ) 3 ( 6 )(1 δ 2 ) 3 δ δ 3 δ 6 δ 2(1 δ) δ - 16(1 δ) 3 (3δ 2 6δ+4) (2 δ) 3 δ 3 - tripes r[p i, i ], i, for 1 i. Simiary, we can deduce the bounds and signatures for other functions as shown in Tabes II-IV. Signatures for s S: s may be simiar to many strings in R. Based on the ength pruning, the upper bound of engths of the possibe simiar strings to s is L u = s /δ ] and the ower bound is L o = δ s. Thus for s, we need to consider strings with ength [L o, L u ] and generate signatures s[x i, i ] for each L o L u and 1 i. In concusion, the signatures generated are shown in Tabe IV. Note that to make the formua consistent for different simiarity functions, we set U r s = s + ( i) and U s r = i 1 for ED. Exampe 2: Consider the two string coections R and S in Tabe I. Suppose the Jaccard Simiarity threshod is δ =.8. For r 1 R, we can deduce = 5, U = 1, = 2, 1 = 2, 2 = 3, p 1 = 1 and p 2 = 3. Thus, we need to generate two signatures r 1 [1, 2], 1, 5 and r 1 [3, 3], 2, 5 for r 1. For string s 1 S, we can deduce L o = 4 and L u = 6, i.e., 4 6. For = 4, we can deduce that = 2, 1 = 2, 2 = 2, U r s =, U s r = 1, 1 = 1, 1 = 2, 2 = 3 and 2 = 4, thus it generates four signatures s 1 [1, 2], 1, 4, s 1 [2, 2], 1, 4, s 1 [3, 2], 2, 4 and s 1 [4, 2], 2, 4. Simiary, for = 5, we generate two signatures s 1 [1, 2], 1, 5 and s 1 [3, 3], 2, 5. For = 6, we generate two signatures s 1 [1, 3], 1, 6 and s 1 [3, 3], 2, 6. In tota, we need to generate eight signatures for s 1. C. Signature Compexity For string r S, as we generate one signature for 1 i, it totay has signatures. For any string s, the tota number of its generated signatures is O((L u s ) 3 + ( s L o ) 3 ) as shown in Lemma 5. Lemma 5: The tota number of signatures for any string s is O((L u s ) 3 + ( s L o ) 3 ). For exampe, suppose we use JAC simiarity and the simiarity threshod is δ, we have the tota number of generated signatures for string s is O( (3 )(1 δ) 3 δ 3 s 3 ). IV. THE MASSJOIN FRAMEWORK In this section we present a scaabe MapReduce-based string simiarity join agorithm, caed MASSJOIN, which can support both set-based simiarity functions and character-based i p i sig sig TABLE III SIGNATURES GENERATED FOR r R JAC COS DICE ED r for i ( ) for i > ( ) p 1 = 1 and p i = j<i j=1 j + 1 r[p i, i ], i, for 1 i TABLE IV SIGNATURES GENERATED FOR s S (FOR ED, U r s = s + ( i) AND U s r = i 1). JAC COS DICE ED L o L u i for i ( ) for i > ( ) i max(p i (i 1), p i U r s) i min(p i + s + ( i), p i + U s r) sig s[x i, i ], i, for i x i i, 1 i, Lo Lu sig f δ s 3 τ 3 simiarity functions. For simpicity, we use Jaccard as an exampe in this section. MapReduce contains two main stages: the fiter stage and the verification stage. We wi introduce the two stages respectivey in Section IV-A and Section IV-B. Then we give the compexity in Section IV-C. Finay we discuss how to support sef joins in Section IV-D. A. Fiter Stage In this fiter phase, we generate candidate pairs using the fiter techniques in Section III: if two strings r and s are simiar, r and s must share a signature. In the map phase, we use the signatures as keys and the strings as vaues. Thus for r R, the key-vaue pairs are r[p i, i ], i,, r for 1 i. For s S, the key-vaue pairs are s[x i, i ], i,, s for i x i i, 1 i and L o L u. As two simiar strings must share a same key, they must be shuffed to the same reduce task. To reduce the transmission cost, in the key-vaue pairs, we use string ids to repace strings (an id is much smaer than its corresponding string.). The pseudocode of our MASSJOIN agorithm is shown in Agorithm 1. It first generates key-vaue pairs for strings in R (ines 2-4) and then for strings in S (ines 5-9). In the reduce phase, each reduce node takes a key-vaue pair sig, ist(sid/rid) as input, where sig is a signature and ist(sid/rid) is the ist of strings that contain the signature. It first spits the ist into two groups: ist(sid) and ist(rid) (ine 11). Notice that for any pair sid, rid from the two ists, it is a candidate pair. To reduce the transmission cost, for each sid ist(sid), we generate a key-vaue pair sid, ist(rid) (ines 12-13). The map and reduce functions are as foows. map: sid/rid, string signature, sid/rid reduce: signature, ist(sid/rid) sid, ist(rid) Comparing with existing works [14], [12] that used a singe token as a key, our method utiizes signatures with mutipe tokes as keys and thus has higher pruning power. Moreover, the number of pairs that share mutipe tokens is smaer than that of a singe token, thus our method can address the skewed token distribution probem as discussed in Section II-B. 5

6 Agorithm 1: MASSJOIN Agorithm // the fiter stage 1 Map( rid, r / sid, s ) 2 for r R do 3 for 1 i do 4 emit( r[p i, i], i, = r, rid ); 5 for s S do 6 for L o L u do 7 for 1 i do 8 for i x i i do 9 emit( s[x i, i], i,, sid ); 1 Reduce ( sig, ist(id) ) 11 spit ist(id) into two groups ist(sid) and ist(rid) ; 12 foreach sid ist(sid) do 13 output( sid, ist(rid) ); // first phase of verification stage 14 Map( sid, ist(rid) / sid, s ) 15 emit( sid, ist(rid) ); emit( sid, s ) ; 16 Reduce ( sid, ist(ist(rid)/s) ) 17 Identify s and ist(ist(rid)) ; 18 Compute distinct ist(rid) from ist(ist(rid)) ; 19 output s, ist(rid) ; // second phase of verification stage 2 Map( s, ist(rid) / rid, r ) 21 emit( rid, r ) ; 22 for each rid in ist(rid) do 23 emit( rid, s ) ; 24 Reduce ( rid, ist(s/r) ) 25 Identify r and ist(s) from ist(s/r); 26 foreach s ist(s) do 27 if SIM(r, s) δ then output( r, s, SIM(r, s) ); Exampe 3: Consider the two string coections in Tabe I. Suppose we use JAC function and the threshod is δ =.8. Figure 1 shows the running exampe. The map functions in the fiter stage generate 2 key-vaue pairs for each r 1, r 2 and r 3 in R and 8, 4, and 8 key-vaue pairs for s 1, s 2 and s 3 in S as shown in the figure. The shuffe phase groups the intermediate key, vaue pairs and sends them to the reduce tasks. In the reduce phase, we get 18 key-vaue pairs. However ony two of them, [ conference information, 1, 5], {r 1, s 1, s 3 } and [ information management, 1, 5], {r 2, s 2 } can generate outputs, which are s 1, {r 1 }, s 2, {r 2 }, and s 3, {r 1 }. B. Verification Stage In the verification stage, we verify the candidate pairs generated from the fiter stage. As two strings may share mutipe signatures, there may be many dupicate candidate pairs, and we want to remove the dupicate ones. In addition, we need to repace the id in candidate pairs with its rea string to verify the candidate pairs. To achieve these two goas, we propose a two-phase method. In the first phase, the map function takes dataset S and sid, ist(rid) as input and emits two types of key-vaue pairs (ine 15). The first one is sid, s from the dataset S which is used to repace sid with string s. The second one is sid, rid which is generated from input sid, ist(rid) gotten from the fiter stage. The reduce function gathers the ist ist(rid) and the string s for the key sid. It first identifies the string s, and then removes the dupicates from ist(rid) to generate a ist of distinct rids, and finay emits key-vaue pair s, ist(rid) (ines 17-19). The map and reduce functions are as foows. map: sid, ist(rid) sid, ist(rid) ; sid, s sid, s reduce: sid, ist(ist(rid)/s) s, ist(rid) In the second phase of verification, we need to repace the rid with string r and verify the candidate pairs. The map function takes dataset R and s, ist(rid) as input and emits two types of key-vaue pairs: rid, r and rid, s. The first one is generated from the dataset R (ine 21). The second one is generated from s, ist(rid) that is the output of the first phase. For each key s, it generates a key-vaue pair rid, s for each rid in ist(rid) (ine 23). In the reduce phase, we have a string r and a ist of string s such that r, s is a candidate pair. We first identify the string r and ist(s), then verify the candidate pairs, and finay output the fina resuts (ines 25-27). The map and reduce functions are as foows. map: s, ist(rid) rid, s, rid, r rid, r ; reduce: rid, ist(r/s) (r, s), SIM(r, s) Exampe 4: Reca the running exampe in the ast subsection and here we discuss the verification stage. In the fiter step, we get three candidate pairs. In the first phase, we remove dupicates from the three pairs, repace sid with its string and output the key-vaue pairs, e.g., service information management poicy, {r 2 }. In the second phase, we repace rid with its string and verify the candidate pairs. Finay, we output r 2, s 2 as a resut. The detais are shown in Figure 1. C. Compexity In this section, we anayze the compexity of our approach in each stage, incuding the space/time compexity and IO cost of the map and reduce phase, and the transmission compexity of the shuffe phase. For each string with ength in R, we need to generate O(U) = O(g δ ) signatures, where g δ is the factor of in the formua of U as iustrated in Tabe II. For each string with ength in S, we generate O((L u ) 3 + ( L o ) 3 ) = O(f δ 3 ) signatures where f δ is the factor of 3 in (L u ) 3 + ( L o ) 3 as iustrated in Tabe II. Let m (n ) denote the number of the strings with ength in R (S). Based on the discussion in Section III, the number of generated signatures for strings in R (S) is O( R max g = R δ m ) (O( S max f = min S δ 3 n )), where min R min (S min ) and S max ( S max) are the minimum and maximum string engths in R (S) respectivey. For ease of presentation, et N = R max g = R δ m and M = S max f = min S δ 3 n. min Fiter Stage: The map function oads the dataset from DFS and the IO cost is O(R + S). The space/time compexity 6

7 dm dm dm dm dm em em UZZZZZ VZZZZZ UZZZZZ VZZZZ U ZZZZ V ZZZZZ & D & D & D >ZZ@U!>ZZZ@U! >ZZ@V!>ZZ@V! >ZZ@V!>ZZ@V! >ZZ@V!>ZZZ@V! >ZZZ@V!>ZZZ@V! >ZZ@U!>ZZZ@U! >ZZ@V!>ZZ@V! >ZZ@V!>ZZZ@V! >ZZ@U!>ZZ@U! >ZZ@V!>ZZ@V! >ZZ@V!>ZZ@V! >ZZ@V!>ZZZ@V! >ZZZ@V!>ZZZ@V! & ^ ' < >ZZ@^UVV`! >ZZZ@^U`! >ZZ@^VV`! >ZZ@^VVV`! >ZZ@^VV`! >ZZ@^V`! >ZZZ@^V`! >ZZZ@^VV`! >ZZZ@^V`! >ZZ@^UV`! >ZZZ@^U`! >ZZ@^V`! >ZZZ@^V`! >ZZ@^U`! >ZZ@^U`! >ZZ@^V`! >ZZZ@^V`! >ZZZ@^V`! & Z & Z V^U`! V^U`! V^U`! U>VZZZZZ@! U>VZZZZZ@! U>VZZZZ@! 'DWDVHW6 VZZZZZ VZZZZ VZZZZZ U U U 'DWDVHW5 ZZZZZ ZZZZZ ZZZZ U^ZZZZZ>VZZZZZ@>VZZZZZ@`! U^ZZZZZ>VZZZZ@`! dz dz >UZZZZZ@>VZZZZ@! dz Fig. 1. Running exampe of generating one signature is O(1), thus the space/time compexity of the first map phase is O(N + M). As we emit one intermediate key, vaue pair for each signature and the space of each signature is O(1), the transmission compexity is aso O(N + M). The reduce step needs to scan the ists of rids and sids in the vaue fieds of the key-vaue pairs, thus the time/space compexity of the reduce phase is O(N M). The reduce step writes the candidate pairs to disk, thus the IO cost is aso O(N M). Verification Stage: The map function of the fiter phase needs to read the dataset S and the candidate pairs, thus the IO cost is O(N M + S). Emitting a key, vaue takes O(1) time, and the space/time compexity is O(N M + S ). The shuffe transmission compexity is aso O(N M + S). The reduce function scans the input vaue ist once and the time compexity is O ( N M + S ). As we at most output each origina string in S once, the IO cost is O(N M + S). Suppose there are c distinct candidate pairs generated from the fiter reduce of the verification stage, and the average size of the strings in S is S avg. As each candidate pair contains a string in S, the IO compexity for this map phase is O(cS avg + R). The time compexity of the third map is O(c + R ). As each candidate pair contains a string of S, the shuffe transmission compexity is O(cS avg + R). Suppose the average verification time for each candidate pair is v. The time compexity of the third reduce is O(cv). The IO cost in this stage is O ( c(r avg + S avg ) ). We show a the compexity anaysis in Tabe V. D. Supporting Sef-Join The MASSJOIN agorithm can inherenty support the sefjoin scenario where two datasets are the same, i.e., R = S. In this case, for each string r, we first take its segments as segment signatures and then seect its substrings for strings with ength between L o and r as substring signatures (taking it as s). We can use a specia mark to differentiate segment signatures and substring signatures. In this case, we can support sef joins. V. MERGING KEY-VALUE PAIRS Based on the compexity anaysis in Section IV-C, the basic framework generates arge numbers of signature-based keyvaue pairs for string s S in the fiter stage. In this section, we discuss how to reduce the number of key-vaue pairs without sacrificing the pruning power. For exampe, we can decrease the number from O( s 3 ) to O( s ) for the Jaccard function. To achieve this goa, we first introduce how to merge key-vaue pairs in Section V-B, and then devise an efficient merging agorithm in Section V-B, and finay discuss how to incorporate the method into our framework in Section V-C. A. Merging Key-Vaue Pairs For simpicity, the key-vaue pairs in this section refer to those generated from the map phrase in the fiter stage. Basic Idea. The basic framework wi generate arge numbers of key-vaue pairs for s S: s[x i, i ], i,, s for i x i i, 1 i and L o L u. We aim to merge them to generate a sma number of key-vaue pairs. Obviousy if two signatures match, e.g., r[p i, i ], i, = s[x i, i ], i,, the segment r[p i, i ] and the substring s[x i, i ] must match, i.e., r[p i, i ] = s[x i, i ]. One simpe method is to take s[x i, i ], s as key-vaue pairs for s and r[p i, i ], r as key-vaue pairs for r. Obviousy, if r and s are simiar, they must share a same key and the method wi not miss a simiar pair. However, this method may reduce the pruning power. This is because a substring and a segment may match with different start positions, or with different engths. Both cases wi generate more fase positives. To achieve the same pruning power as the basic framework, we need to check whether the start position x i of the substring s[x i, i ] is within [ i, i ] as we as whether the ength of r is within [L o, L u ]. These bounds are different for various i and, and it is expensive to add them into the key-vaue pairs. Fortunatey, these bounds can be efficienty computed just based on the vaues of i and in O(1) time. For each key-vaue pair s[x i, i ], s of s, we add x i and s into its vaue fied and the new key-vaue pair is s[x i, i ], s, x i, s. For each key-vaue pair r[x i, i ], r of r, we add i and = r in its vaue fied and the key-vaue pair is r[p i, i ], i,, r. In the map phase, we generate these new key-vaue pairs. In the reduce phase, for two matching keys r[p i, i ] = s[x i, i ] with vaues i,, r and s, x i, s, we first cacuate i, i, L o and L u based on, i and s in the vaue fieds and δ from the configuration fie. 7

8 TABLE V COMPLEXITY ANALYSIS OF MASSJOIN (R/S ARE THE DATASET SIZES AND R / S ARE THE NUMBERS OF STRINGS IN R/S). Stage Fiter stage First phase of the verification stage Second phase of the verification stage Time/Space O(N + M) O(N M + S ) O(c + R ) map IO O(R + S) O(N M + S) O(cS avg + R) shuffe Trans O(N + M) O(N M + S) O(cS avg + R) Time/Space O(N M) O(N M + S ) O(cv) reduce ( ) IO O(N M) O(N M + S) O c(r avg + S avg) Then we check if L o L u and i x i i. If yes, we output this pair as a candidate pair. It is easy to prove that the method using r[p i, i ], i,, r and s[x i, i ], s, x i, s as keys generates the same candidate pairs as that using r[p i, i ], i,, r and s[x i, i ], i,, s as keys. It is worth noting that this method can significanty reduce the number of key-vaue pairs. Given any string s S, for setbased simiarity functions, we can reduce the number of keyvaue pairs from O(f δ s 3 ) to O( s ) as stated in Lemma 6. Lemma 6: Given a string s and a set-based simiarity function threshod δ, the number of key-vaue pairs in the mergebased method is O( s ). Proof Sketch: Here we use JAC as an exampe. For any string s and any JAC simiarity threshod δ, is monotonicay increasing with the increasing of, thus the minimum and maximum vaue of i Lo Lu Lu is and respectivey. Lo + 1 Lu Lo Thus there are at most 4 different vaues of i. Since x i must be in [1, s ], there are at most O( s ) keys. Moreover, each key s[x i, p i ] can soey determine the vaue s, x i, sid. Thus the number of keyvaue pairs is O( s ). Simiary, we can prove this emma for COS and DICE functions. We can aso prove that for ED, the number of key-vaue pairs in this merge-based method is O ( min( s, τ 2 ) ). This is formaized in Lemma 7. Lemma 7: Given a string s and a ED threshod τ, the number of key-vaue pairs in the merge-based method is O ( min( s, τ 2 ) ). Proof Sketch: Based on the simiar idea in Lemma 6, we can prove that the number of key-vaue pairs generated by s is bounded by O( s ). Next we prove the number of keyvaue pairs is aso bounded by O(τ 2 ). For any string s and threshod τ, is monotonicay increasing, thus the range of i is [ Lo Lu, ]. As Lu Lo + 1 2τ+1 τ , the size of the range of i is at most 4. We aso find that the size of the range of x i is at most 2τ + 4 for a L o L u and a fixed i [1, ]. Thus the tota number of possibe x i is O(τ 2 ). Thus the number of key-vaue pairs is aso bounded by O(τ 2 ). B. Merge Agorithm For sting r, it is easy to generate its key-vaue pairs r[p i, i ], i, r, r for i [1, ]. For string s, a straightforward method enumerates a signatures for s as discussed in Section III-C and then merges them with the same s[x i, i ]. However, the time compexity of this method is O(f δ s 3 ). Here we study how to efficienty merge the origina key-vaue pairs to obtain the new key-vaue pairs for string s. As proved in Lemma 6, there are ony four different vaues for i. As x i is a position in string s, x i [1, s ]. Then we can devise an efficient agorithm to directy generate the new key-vaue pairs as foows. For x i [1, s ] and for each i, we generate a key-vaue pair s[x i, i ], s, x i, s. Obviousy the time compexity is O( s ). 3 Compexity Improvement. The merge-based method can reduce the number of key-vaue pairs for s from O( s 3 ) to O( s ). The tota number of key-vaue pairs for a strings in S is from S max = S min f δ 3 n to S max n = S. The time/space min compexity to generate the key-vaue pairs for string s is aso from O( s 3 ) to O( s ). C. Changes on MASSJOIN Agorithm This section discusses the changes we need to make for incorporating the merge-based method into our MASSJOIN framework. The pseudo-code shown in Agorithm 2 is the same as Agorithm 1 except for the fiter stage. To generate the new key-vaue pairs, we repace Line 4 and Lines 6-9 in Agorithm 1 with Line 2 and Lines 3-5 in Agorithm 2 respectivey. For each string s S, we ony need to scan the string once to generate a key-vaue pairs (Lines 3-5). In the reduce phrase, we sti spit the input vaue ists into two ists and output those pairs satisfying L o L u and i x i i (Lines 7-12). Exampe 5: Consider the strings r 3 and s 3 in Tabe I. Using the merge-based agorithm, we generate t- wo key-vaue pairs for r 3, i.e., r3 [1, 2], 1, 4, r 3 and r3 [3, 2], 2, 4, r 3, and six key-vaue pairs for s 3, i.e., s 3 [1, 2], 5, 1, s 3, s 3 [2, 2], 5, 2, s 3, s 3 [3, 2], 5, 3, s 3, s3 [4, 2], 5, 4, s 3, s3 [1, 3], 5, 1, s 3, and s3 [3, 3], 5, 3, s 3. Note that for the basic method, we generate eight key-vaue pairs for s 3, thus the merge-based agorithm reduces the transmission cost of two key-vaue pairs. When using the key-vaue pairs to generate the candidate pairs, athough r 3 [1, 2] = s 3 [4, 2], we do not output r 3, s 3 since 4 [ 4 1 = 1, 4 1 = 2]. VI. USING LIGHT-WEIGHT FILTER UNIT TO REDUCE CANDIDATE PAIRS As the transmission and computation cost in the verification stage heaviy reies on the number of candidate pairs, in this section, we aim to decrease the number of candidate pairs. One simpe method is to attach the origina strings to the 3 Athough this method may generate more key-vaue pairs than the straightforward method, the compexity is sti O( s ). In the reduce step, we can remove such pairs based on the checking method in Section V-B. 8

9 Agorithm 2: Merge-based Agorithm // the fitering stage 1 Map( rid, r / sid, s ) // Repace Line 4 in Ago 1 with 2 emit( [p i, i ], i, r,rid ) ; // Repace Lines 6 9 in Ago 1 with 3 for 1 x i s do 4 for Lo i Lu do 5 emit( s[x i, i], s, x i, sid ); 6 Reduce ( sig, ist( i,, rid / s, x i, sid ) ) 7 spit ist( i,, rid / s, x i, sid ) into two groups ist( i,, rid ) and ist( s, x i, sid ); 8 foreach s, x i, sid ist( s, x i, sid ) do 9 foreach i,, rid ist( i,, rid ) do 1 if L o L u & i x i then 11 ist(rid) rid; 12 output( sid, ist(rid) ); vaue fied of each key, vaue pair in the map phase of fiter stage. In the reduce phase, for each candidate pair, we cacuate their rea simiarity and remove those pairs whose simiarity is smaer than the threshod δ. Athough this method can reduce the transmission cost of candidate pairs, it increases the transmitting cost of origina strings dramaticay. To aeviate this probem, we incorporate an ight-weight fiter unit to repace the origina strings. On one hand, fiter units can be utiized to prune many dissimiar pairs. On the other hand, fiter units shoud be ight-weight in order not to significanty increase the transmission cost. To this end, we propose an effective fiter unit in Section VI-A, and then discuss how to integrate this technique into our MASSJOIN framework in Section VI-B. A. Light-weight Fiter Unit Basic Idea. We first consider the set-based simiarity functions. Given a string, we can use an integer to repace each token in the string and take the set of integers as the fiter unit of the string. Obviousy two strings are simiar ony if their corresponding integer sets are simiar. However when the number of tokens is arge, this method sti invoves arge transmission cost. To reduce the size, we can group the integers into n buckets and keep a ist of n integers as a fiter unit. Formay, we first partition a tokens in the two string sets into n groups, G 1, G 2,, G n. Then for each string r(s), its fiter unit is g1, r g2, r, gn ( g r 1, s g2, s, gn ), s where gi r(gs i ) is the number of tokens in r(s) that are in the i-th group, i.e., gi r = G i r (gi s = G i s ). Two strings r and s are simiar ony if 1 i n gr i gs i U as stated in Lemma 8, where U is an upper bound of r s + s r (Section III-A). Lemma 8: Given two strings r and s with fiter unit g r 1, g r 2,, g r n and g s 1, g s 2,, g s n, if 1 i n gr i gs i > U, r and s cannot be simiar. Proof Sketch: We divide r and s into n disjoint groups Agorithm 3: Light-weight Fitering // the token count stage 1 Map( id, string ) 2 For each token in the string, emit( token, 1 ); 3 Reduce ( token, ist(1) ) 4 Output token frequency tf; // the fitering stage 5 MapSetup 6 Partition tokens into n groups ; 7 Map( rid, r / sid, s ) // Add into Ago 1 8 Compute fiter unit g r /g s for strings r/s; // Repace Line 4 in Ago 1 with 9 emit( r[p i, i], i, = r, rid, g r ); // Repace Line 9 in Ago 1 with 1 emit( s[x i, i], i,, sid, g s ); 11 Reduce ( sig, ist( id, g ) ) 12 ist( id, g ) ist( sid, g s ) and ist( rid, g r ); 13 foreach sid, g s ist( sid, g s ) do 14 foreach rid, g r ist( rid, g r ) do 15 if 1 i n gr i gi s U then 16 ist(rid) rid; 17 output( sid, ist(rid) ); using a universa grouping method. The tota number of different tokens between r and s is the sum of the number of different tokens in the n groups. For the i-th group, the number of different tokens in it is G i r G i s + G i s G i r = G i r + G i s 2 (G i r) (G i s) gi r +gs i 2 max(gr i, gs i ) = gi r gs i. Thus the tota number of different tokens between r and s cannot be smaer than 1 i n gr i gs i. Meanwhie, the number of different tokens in two simiar strings cannot exceed U(see Section III-A). Thus if 1 i n gr i gs i > U, r and s are not simiar. To utiize the fiter unit, we add the fiter unit into the vaue part in the key-vaue pair of the map function in the fiter stage. As discussed in Section III-A, the upper bound U ony depends on r, s and δ. Obviousy r = 1 i n gr i and s = 1 i n gs i, thus we can compute U in the reduce step. Thus we can utiize the fiter condition to prune dissimiar pairs. Notice that the fiter unit based method is orthogona to the merge-based method and they can be used together. This method can be appied to the character-based simiarity functions by taking each character as a token and U = 2τ. Finding the Optima Grouping Strategy. We find that different grouping methods may have different pruning power. Next we study how to evauate a grouping strategy and how to seect the optima grouping so as to generate high-quaity fiter unit. Intuitivey, for two strings r and s, the arger 1 i n gr i gi s, the two strings have higher probabiity to be pruned. Thus we want to maximize the vaue. Simiary, for a strings in R and S, we want to find an optima grouping strategy to 9

10 maximize r R s S 1 i n g r i g s i. (1) We can prove that the optima grouping probem is NP-hard by a reduction from the 3-SAT probem. Theorem 1: The optima grouping probem is NP-hard. Proof Sketch: We can prove the probem by a reduction from the 3-SAT probem. We propose a heuristic approach to sove this probem. The approach is based on two observations. First, it is not good to put two tokens with arge token frequencies into the same group, where the token frequency is the number of strings containing the token. This is because if we put them into the same group, many string pairs wi fasey consider them as the same token and the vaue gi r gs i for such string pairs wi not increase; on the contrary, if they are assigned into two different groups, the vaue wi be significanty increased. Second, the sum of token frequencies is a constant. To make the overa vaue as arge as possibe, we want to make the sum of token frequency in each group neary equa. Based on these two observations, we can devise a greedy agorithm as foows. We first sort a the tokens by their frequencies in decreasing order. Then we access each token in order and add it into the group with the minimum sum of token frequencies. We repeat this step unti a the tokens have been accessed. B. Changes on the MASSJOIN Agorithm To incorporate fiter units into our method, we need make some changes on the MASSJOIN agorithm. The method can be utiized to both the basic method and the merge-based method. Here we take the basic method as an exampe. The pseudo-code is shown in Agorithm 3. We need to add another MapReduce phase to count token frequencies (Lines 1-4). We aso modify the fiter stage (Lines 5-17) as foows. We add a setup phase to read the token frequency and an approximation agorithm to divide a the tokens to n groups (Line 6). In the map phase, we oad the token frequency, identify the tokens for each string, partition them into different groups based on token frequencies, and generate fiter unit (Line 8). Then we emit the fiter unit aong with the key, vaue pairs (Lines 9-1). In the reduce step, we ony output those pairs passing the grouping fiter (Lines 12-17). Exampe 6: Consider the two datasets in Tabe I. We use JAC and the threshod is δ =.8. For simpicity, we use the ids of strings as shown in Figure 1. In the token count stage we get the token frequencies, w 2, 5, w 3, 5, w 4, 5, w 5, 4, w 6, 4, w 1, 3 and w 7, 2. Suppose the group number is 4. We can divide the tokens into 4 groups G 1 = {w 2, w 7 }, G 2 = {w 3, w 1 }, G 3 = {w 4 }, and G 4 = {w 5, w 6 }. The fiter unit for r 1 is 1, 2, 1, 1 and that for s 1 is 2, 2, 1,. We fiter pair s 1, r 1 in the reduce phase as 4 i=1 gr1 i g s1 i = 2 > U =. VII. EXPERIMENT We have impemented our MASSJOIN method and conducted experiments on four rea datasets: Enron emai 4, PubMed paper abstract, PubMed paper tite 5 and NCBI DNA sequence 6. The detais of the datasets are shown in Tabe VI. TABLE VI DATASETS Datasets Size(MB) Cardinaity Avg Size/Len Σ Enron Emai (Set) , PubMed Abstract (Set) ,347, PubMed Tite (String) ,394, DNA Sequence (String) ,299, For Enron emai and PubMed paper abstract datasets, we used set-based simiarity functions, where strings are tokenized by non-aphanumeric characters. For PubMed paper tite and DNA datasets, we used the character-based distance functions. In the foowing experiments, we spit each dataset into two datasets with equa size to conduct R-S join. Due to space constraints, we focus on JAC and ED in the experiments and use them as defaut functions. We wi show the resuts on COS and DICE in Section VII-D. We compared with state-of-the-art method PrefixFiter [14]. We got their source code from their home page (asterix.ics.uci.edu/fuzzyjoin). A agorithms were impemented on Hadoop and run on a 1-node De custer. Each node had two Inte(R) Xeon(R) E GHZ processors with 8 cores, 16GB RAM, and 1TB disk. Each node is instaed 64-bit Ubuntu Server 1.4, Java 1.6, and Hadoop We set the bock size of the distributed fie system to 16MB and aocate 2GB virtua memory to each task Fig. 2. Greedy Random Number of Groups (a) PubMed Abstract Evauating ight-weight fiter units. A. Evauating Our Proposed Techniques We first evauated our fiter unit based method by varying different group numbers from 1 to 15. We impemented two grouping methods: Random and Greedy. Random computed the hash code of each token and randomy assigned it into a group. Greedy used our greedy agorithm. Figure 2 shows the resuts. We can see that with the increase of groups, the performance of Greedy improved first and then depraved. This is because a sma group number wi reduce the transmission cost but with ower pruning power, and a arge group number wi improve the pruning power but with arge transmission cost. 4 enron/

MassJoin: A MapReduce-based Method for Scalable String Similarity Joins

MassJoin: A MapReduce-based Method for Scaabe String Simiarity Joins Dong Deng, Guoiang Li, Shuang Hao, Jiannan Wang, Jianhua Feng, Wen-Syan Li Department of Computer Science, Tsinghua University, Beijing,