MassJoin: A MapReduce-based Method for Scalable String Similarity Joins

Size: px
Start display at page:

Download "MassJoin: A MapReduce-based Method for Scalable String Similarity Joins"

Transcription

1 MassJoin: A MapReduce-based Method for Scaabe String Simiarity Joins Dong Deng, Guoiang Li, Shuang Hao, Jiannan Wang, Jianhua Feng Department of Computer Science, Tsinghua University, Beijing, China {dd11, hs13, wjn8}@mais.tsinghua.edu.cn, {iguoiang,fengjh}@tsinghua.edu.cn Abstract String simiarity join is an essentia operation in data integration. The era of big data cas for scaabe agorithms to support arge-scae string simiarity joins. In this paper, we study scaabe string simiarity joins using MapReduce. We propose a MapReduce-based framework, caed MASSJOIN, which supports both set-based simiarity functions and characterbased simiarity functions. We extend the existing partition-based signature scheme to support set-based simiarity functions. We utiize the signatures to generate key-vaue pairs. To reduce the transmission cost, we merge key-vaue pairs to significanty reduce the number of key-vaue pairs, from cubic to inear compexity, whie not sacrificing the pruning power. To improve the performance, we incorporate ight-weight fiter units into the key-vaue pairs which can be utiized to prune arge number of dissimiar pairs without significanty increasing the transmission cost. Experimenta resuts on rea-word datasets show that our method significanty outperformed state-of-the-art approaches. I. INTRODUCTION Data integration has received significant attention in the ast three decades, because it can combine heterogenous data from different sources and provide a unified view of these data. The string simiarity join, which, given two coections of strings, finds a simiar string pairs, is an essentia operation in data integration. The simiarity between two strings is usuay quantified by simiarity functions. There are two main types of simiarity functions (see Section II): set-based simiarity functions (e.g., Jaccard) and character-based simiarity functions (e.g., Edit distance). String simiarity joins have many reaword appications, e.g., entity resoution, copy detection, and document custering. Most of existing simiarity-join methods used in-memory agorithms. The era of big data poses new chaenges for arge-scae string simiarity joins and cas for new scaabe agorithms. MapReduce provides a programming mode for processing arge-scae data, and in this paper we study scaabe string simiarity joins using MapReduce. A naive method is to enumerate a string pairs from the two coections and use MapReduce to process these string pairs. However this method is rather expensive for arge string sets. To address this probem, Vernica et a. [14] proposed a prefix fitering based method, which used a fiter-and-verification framework. In the fiter step, they seected some tokens from each string and generated a set of candidate pairs who share a common token. In the verification step, they verified the candidate pairs to generate the fina answers. One big imitation of this method is ow pruning power. As a singe token is very short and usuay has ow seectivity, many dissimiar pairs wi share a same token and cannot be pruned. To address this imitation, we propose a MapReduce-based framework, caed MASSJOIN, for scaabe string simiarity joins, which supports both set-based simiarity functions and character-based simiarity functions. We aso adopt a fiterand-verification framework. In the fiter step, we generate the signatures for each string and prove that two strings are simiar ony if they share a common signature. We utiize this property to generate the candidate pairs. In the verification step, we verify the candidate pairs to generate the fina resuts. One big chaenge is to generate high-quaity signatures. PASSJOIN [11] proposed a high-quaity partitionbased signature scheme for the edit distance function. We extend the partition-based signature scheme to support setbased simiarity functions. It is worth noting that the extension is nontrivia (see Section III), because the partition number is fixed for edit distance whie the partition numbers for set-based simiarity functions are not. To use MapReduce, we take the signatures as keys and the strings as vaues to generate key-vaue pairs. Then we use the key-vaue pairs to compute the candidate pairs that share a same key (see Section IV). However this method may generate arge numbers of key-vaue pairs and eads to huge transmission cost. For exampe, considering the Jaccard function, this method generates O( 3 ) key-vaue pairs for a string with ength. To address this issue, we merge key-vaue pairs to significanty reduce the number of key-vaue pairs but without sacrificing the pruning power (see Section V). For exampe, we can reduce the number from O( 3 ) to O(). To improve the performance, we incorporate ight-wight fiter units into the key-vaue pairs which can be utiized to prune arge numbers of dissimiar pairs without significanty increasing the transmission cost (see Section VI). In summary, we make the foowing contributions. We extend the partition-based signature scheme to support set-based simiarity functions. We propose a scaabe MapReduce-based framework to support both set-based and character-based simiarity functions. We devise a merge-based agorithm to significanty reduce the number of key-vaue pairs without sacrificing the pruning power. We deveop a ight-weight fiter unit based method to prune arge numbers of dissimiar pairs whie not significanty increasing the transmission cost.

2 We have impemented our method and experimenta resuts on rea-word datasets show that our method significanty outperforms state-of-the-art approaches. The structure of the rest paper is as foows. We formuate our probem and review reated works in Section II. We extend existing partition-based signature scheme to support set-based simiarity functions in Section III. Section IV presents our MASSJOIN framework. We devise a merge-based method to reduce the number of key-vaue pairs in Section V and deveop a ight-weight fiter unit based method to reduce the number of candidate pairs in Section VI. We show the experimenta resuts in Section VII and give the concusion in Section VIII. A. Probem Definition II. PRELIMINARY The simiarity join probem finds a simiar string pairs from two given string coections, where the simiarity between two strings is usuay quantified by simiarity functions. Given a simiarity function SIM and a simiarity threshod δ, two strings r and s are simiar if and ony if SIM(r, s) δ. There are two main types of simiarity functions, set-based simiarity functions and character-based simiarity functions. Set-based simiarity functions first tokenize each string into a set of tokens, where a token can be either a word or a n-gram. In the paper we suppose each token is a tokenized word. For simpicity, strings and token sets are interchangeaby used if there is no ambiguous. There are three we-known set-based simiarity functions, Jaccard Simiarity, Dice Simiarity, and Cosine Simiarity, defined as beow. JAC(r, s)= r s r s, r s COS(r, s)=, r s DICE(r, s)= 2 r s r + s, where denotes the size of a set. For exampe, suppose r = Barack H Obama II and s = Barack Obama II. Then we have r = 4, s = 3, r s = 3, and r s = 4. Thus, JAC(r, s) = 3 3 4, COS(r, s) = 12 and DICE(r, s) = 6 7. Character-based simiarity functions are defined based on the number of character operations to transform one string to another. Edit Distance (ED) is a we-known character-based simiarity function. The ED of two strings is the minimum number of edit operations needed to transform one string to another, where the permitted edit operations incude insertion, deetion and substitution. For exampe, given r = michea and s = michae, we have ED(r, s) = 2. Next we formuate the simiarity-join probem. Definition 1 (Simiarity Joins): Given two string coections R and S, a simiarity function SIM (or a distance function DIS) and a simiarity threshod δ (or a distance threshod τ), a simiarity join finds a string pairs r R, s S such that SIM (r, s) δ (or DIS (r, s) τ). For exampe, consider the two string sets in Tabe I. Suppose the Jaccard threshod is δ =.8. The simiarity join finds a simiar pair r 2, s 2 with JAC(r 2, s 2 ) =.8. R S TABLE I TWO STRING COLLECTIONS: R AND S id string size r 1 conference on service information management 5 r 2 poicy on service information management 5 r 3 the poicy on service 4 B. Reated Work s 1 the conference on information management 5 s 2 service information management poicy 4 s 3 conference on information management poicy 5 Partition-based Method: PASSJOIN [11], [1] is the state-ofthe-art simiarity-join agorithm for edit distance. It won the champion in the recent simiarity-join competition organized by EDBT PASSJOIN empoys a fiter-and-verification framework. In the fiter step, it generates signatures for strings. If two strings are simiar, they must share a common signature. It prunes arge numbers of dissimiar pairs based on this idea and the survived pairs are caed candidate pairs. In the verification step, it verifies the candidate pairs to generate the fina answers. One big chaenge in PASSJOIN is to generate high-quaity signatures for two strings. Suppose the edit-distance threshod is τ. Given two strings r and s, it partitions r into τ + 1 disjoint segments, and proves that if s is simiar to r, s must contain a substring that matches one of r s segments, based on the pigeon-hoe theory. Thus for string r, it generates τ +1 signatures. For string s, it seects some substrings of s as its signatures. However PASSJOIN cannot support set-based simiarity functions, because the number of segments (e.g., τ + 1) is not fixed for these functions. We extend PASSJOIN to support set-based simiarity functions and propose a scaabe framework MASSJOIN for arge-scae string simiarity join using MapReduce. Moreover, to reduce the transmission cost, we merge the seected substrings and decrease the number of signatures from O( 3 ) to O() for set-based simiarity functions, where is the string ength, and from O(τ 3 ) to O(min(τ 2, )) for character-based simiarity functions. Simiarity Join Using Map-Reduce: Vernica et a. [14] proposed a simiarity join method using MapReduce which utiized the prefix fitering to support set-based simiarity functions. They seected a subset of tokens as signatures and proved that two strings are simiar ony if their signatures share common tokens. For each string, they used each token in its prefix as a key and the string as a vaue to generate the key-vaue pairs. As a singe token is very short and usuay has ow seectivity, these methods have two disadvantages. First, many dissimiar pairs may share the same token, and they wi generate many fase positives and thus ead to poor pruning power. Second, a singe token may ead to a skewed probem and there may be arge numbers of strings sharing the same key. Thus the reducer with such keys wi have a heavy workoad. Metway et a. [12] proposed a 2-stage agorithm V- SMART-Join for simiarity joins on sets, mutisets and vectors. They aso used a singe token as a key and had the same 1 wandet/searchjoincompetition213 2

3 issues as [14]. Afrati et a. [1] proposed mutipe agorithms to perform simiarity joins in a singe MapReduce stage. They anayzed the map, reduce and communication cost. However for ong strings, it is rather expensive to transfer the strings using a singe MapReduce stage. Kim et a. [9] addressed the top-k simiarity join probem using MapReduce, which is different from our probem. In-Memory Simiarity Joins: There are many studies on in-memory simiarity join agorithms [19], [4], [15], [18], [2], [8], [7], [6], [17], [16], [2], [21], [3], [13], [1], [2]. Bayardo et a. [3] first proposed the prefix-fiter framework for simiarity joins. Xiao et a. [21] improved the prefix-fiter framework by introducing position fitering. ED-Join [19] extended prefix fitering to support edit distance and used ocation-based mismatch techniques to shorten prefixes and content-based mismatch to fiter dissimiar pairs. Wang et a. [17] proposed an adaptive prefix-fiter framework to dynamicay seect prefixes with different engths. Wang et a. [15] proposed Trie-Join which utiizes a trie structure to recursivey perform prefix fiter. Arasu et a. [2] proposed PartEnum which partitions strings into pieces and enumerates deetion neighborhoods on the pieces as signature to do simiarity joins. Wang et a. [18] proposed VChunkJoin to use qchunks with different engths as signatures and empoy the prefix-fiter framework to do simiarity joins. Wang et a. [16] devised a new kind of simiarity functions and proposed efficient agorithm on the new simiarity functions. Qin et a. [13] proposed an asymmetric signature for both simiarity join and search probems. Chaudhuri et a. [4] proposed a primitive operator for simiarity joins. Gravano et a. [7] proposed to perform simiarity joins inside RDBMS. Xiao et a. [2] studied top-k simiarity joins using adaptive prefix fitering. Jacox et a. [8] studied simiarity joins under metric space. Different from these studies, in this paper we focuses on how to support arge-scae simiarity joins using MapReduce. MapReduce: MapReduce [5] is a famous framework proposed by Googe to faciitate processing arge-scae data in parae. The MapReduce program runs on a arge custer with mutipe nodes. The input data fies are divided into sma fie spits and distributed in a distributed fie system (DFS). MapReduce incudes a map phase, a shuffe phase and a reduce phase to process the data. Each map node reads the fie spits on the node, processes the input key, vaue pairs, and emits intermediate key, vaue pairs. The intermediate key, vaue pairs are shuffed based on the keys and transferred to the reduce nodes. A the intermediate key, vaue pairs with same key must be shuffed to the same reduce node. Each reduce node receives a key-vaue pair key, ist(vaue), where ist(vaue) is a ist of vaues sharing the same key, processes the pair, and writes its output to DFS. III. SIGNATURES FOR SET-BASED SIMILARITY FUNCTIONS To avoid enumerating a string pairs from the two given string sets, we adopt a fiter-and-verification framework. We extend the partition-based signature scheme designed for edit distance [11] to support set-based simiarity functions in this section. Notice that the extension is non-trivia, because PASSJOIN reies on a given edit-distance threshod and generates a fixed number of signatures. However for set-based simiarity functions, the number depends on the string engths. We need to deduce the number of signatures and devise new agorithms to generate signatures. To address this probem, we first discuss how to generate signatures for two strings in Section III-A and then extend the method to support two string sets in Section III-B. Finay, we give the signature compexity in Section III-C. A. Signatures for Two Strings r and s Given two token sets r and s, and a jaccard threshod δ, we sort each token set based on a universa order, e.g., aphabetica order, and obtain two sorted token ists. For simpicity, r and s are referred to their corresponding sorted token ists if there is no ambiguous. If r and s are simiar w.r.t Jaccard threshod r s r + s r s δ, i.e., JAC(r, s) = r s r s = δ, we can deduce r s δ ( r + s ). The tota number of different tokens between r and s is r s + s r. Since r s and s r are two integers, r s r r s r δ s and s r s r s s δ r. Let U = r δ s + s δ r which shoud be an upper bound of r s + s r. We spit r into U + 1 disjoint segments. Based on the pigeon-hoe theory, if s is simiar to r, s must contain a substring which matches a segment of r. We use the segments of r as r s signatures and seect some substrings of s as s s signatures. Then if r and s are simiar, they must share a common signature. Next we formay discuss how to generate the signatures for r and s. Signatures for r: There are mutipe ways to partition r into U + 1 segments. Here we use an even partition scheme as an exampe, where a U + 1 segments have neary the same ength. Formay, given a string r with ength = r, et k = U+1 (U +1). The ast k segments of string r have a ength of U+1 and the first U + 1 k segments have a ength of U+1. Let (U + 1, i, ), abbreviated as i, denote the ength of the i-th segment of r. Obviousy i = U+1 if i U +1 k; otherwise U+1. Let p(u +1, i, ), abbreviated as p i, denote the start position of the i-th segment, i.e., p 1 = 1 and p i = j<i j=1 j +1. The i-th segment of r is the substring of r starting from p i and with ength i, denoted by r[p i, i ]. Thus the signatures of r are tripes r[p i, i ], i, for 1 i U +1. Signatures for s: If s is simiar to r, it requires to have a signature matching one of signatures of r, e.g., r[p i, i ], i,. Thus the signature of s shoud be in the form of s[x i, i ], i, such that r[p i, i ], i, = s[x i, i ], i,. Intuitivey, the start position of the i-th substring of s, e.g., x i, can be any integer in [1, s i + 1]. However this method wi generate arge numbers of signatures. To address this issue, we propose two signature generation methods, a position-aware method and a muti-match-aware method. Position-aware method. Consider the i-th signature of r with start position p i, and a signature of s that matches the i-th 3

4 signature of r with start position x i. If x i < p i, we can deduce that p i x i r s, because r and s are sorted by a universa order, and if r[p i, i ] = s[x i, i ], among the first p i 1 tokens of r and the first x i 1 tokens of s, there are at east p i x i tokens that appear in r and do not appear in s. Since there are totay r s such tokens, p i x r δ s i r s. Let U r s = r δ s, we have p i x i U r s and x i p i U r s. Simiary, if x i p i, we have x i p s δ r i s r. Let U s r = s δ r, we have x i p i + U s r. Thus the position-aware method sets x i [p i U r s, p i +U s r] which wi not invove fase negatives as stated in Lemma 1. Lemma 1: The position-aware signature generation method wi not invove fase negatives. 2 Proof Sketch: We prove it by contradiction. Suppose r and s are simiar and s has a substring with start position x i matching the i-th segment of r and x i < p i U r s. Obviousy r s p i x i > U r s which contradicts with that U r s r s. Simiary we can prove x i p i + U s r. Muti-match-aware method. Sti consider the i-th signature of r with start position p i, and a signature of s that matches the i-th signature of r with start position x i. We have x i p i (i 1), because if x i < p i (i 1), r[1, p i 1] s[1, x i 1] > i 1 and there are at east i different tokens in r and s before the i-th segment based on the ength difference. In other words, if they are simiar, after the i-th segment, there are at most U + 1 i 1 mismatch tokens. Since there are U + 1 i segments, there must be a matching signature after the i-th segment based on the pigeon-hoe theory. Obviousy we can skip this one and use the atter one. Simiary, if we consider the reversed strings of r and s, we have s x i r p i (U + 1 i). We can deduce that x i p i + s r + (U + 1 i). Thus the muti-match-aware method sets x i [p i (i 1), p i + s + (U + 1 i)] where = r, and this method wi not invove any fase negative as stated in Lemma 3. Lemma 2: The muti-match-aware signature generation method wi not invove fase negatives. Proof Sketch: Consider any string r with ength. The start position of its i-th segment is p i. If the first i 1 segments of r contain more than i 1 different tokens from s, regardess of the i-th segment, they must share another common segment in the ast U + 1 i segments and we can drop the i-th segment. Meanwhie, if r and s are simiar and s contains a substring with start position x i matching the i- th segment of r, the number of different tokens in the first i 1 segments of r cannot be smaer p i x i. Thus we have p i x i i 1, i.e., x i p i (i 1). Symmetricay, we can get x i p i + s + (U + 1 i). Interestingy, we can use the two methods simutaneousy and propose a hybrid method which sets x i [max(p i (i 1), p i U r s), min(p i + s + (U + 1 i), p i + U s r)]. 2 For space constraints, we omit detaied proofs of emmas in this paper This hybrid method wi not invove fase negatives as stated in Lemma 3, because the two methods are independent. Lemma 3: The hybrid signature generation method wi not invove fase negatives. Proof Sketch: Based on Lemma 1, we have for any two strings r and s, if s has a substring with start position x i matching the i-th segment of r and x i [p i U r s, p i +U s r], r s > U r s or s r > U s r. That is to say r and s cannot be simiar. Thus for any two simiar strings, a there matching signatures must be within the range generated by the position-aware method. Simiary this caim is aso true for the muti-match-aware method. Thus the hybrid signature generation method wi not invove fase negatives. Thus the signatures of s with respect to r are s[x i, i ], i, for 1 i U + 1 and i x i i, where i = max(p i (i 1), p i U r s ), i = min(p i + s + (U + 1 i), p i + U s r ). Exampe 1: Consider two strings r 1 and s 1 in Tabe I. Suppose the JAC threshod is δ =.8. To generate the signatures for r 1, we compute = r = 5, s = 5, U =, p 1 = 1 and 1 = 5, and obtain the signature r 1 [1, 5], 1, 5 for r 1. To generate the signatures for s 1, we compute U r s =, U s r =, 1 = 1, and 1 = 1, and obtain the signature s 1 [1, 5], 1, 5 for s 1. B. Signatures for Two String Sets When we join two datasets R and S, many strings in R may be simiar to many strings in S, and we need to generaize our method to support two string sets. Based on the ength pruning, if a string s is simiar to r, r s i.e., r s δ, we can deduce s r s r s δ r δ. Simiary we can deduce s r s δ r s δ r. Thus the upper bound of engths of the possibe simiar strings to r is r /δ and the ower bound is δ r. Obviousy for string r we need to consider the strings with engths in the range and generate vaid signatures of r for any possibe simiar string s with ength in [ δ r, r /δ ]. To this end, we derive an upper bound of the number of different tokens to r for a possibe simiar strings that are simiar to r, denoted by U. Obviousy U = max{ r s + s r s is simiar to r}. As r s + s r r + s 2 r s 1 δ ( r + s ) 1 δ r 1 δ ( r + δ ) = δ r, we can deduce a bound U = 1 δ δ r. Thus if a string s is simiar to r, we have r s + s r U. Notice that we can deduce a much tighter bound of U for Jaccard as stated in Lemma 4. Lemma 4: We can deduce a tighter upper bound U = r δ r /δ + r /δ δ r. Based on the upper bound U, we can devise a partition scheme for two string sets as beow. Signatures for r R: We partition r into = U + 1 segments as discussed in Section III-A. The signatures of r are 4

5 TABLE II BOUNDS FOR A SIMILAR STRING PAIR r, s WITHIN THRESHOLD δ JAC COS DICE ED U r s r δ s r δ r s r δ( r + s ) - 2 U s r s δ r s δ r s s δ( r + s ) - 2 U U r s + U s r - U 1 δ 1 δ2 r δ δ 2 r 2(1 δ) r δ τ 1 δ 1 δ2 r + 1 δ δ 2 r + 1 2(1 δ) r + 1 δ τ+1 L o δ s δ 2 s δ 2 δ s s τ L u s s 2 δ s s +τ g δ δ 1 δ δ δ 2 1 δ 2 δ 2 ( f 3 )(1 δ) 3 ( 6 )(1 δ 2 ) 3 δ δ 3 δ 6 δ 2(1 δ) δ - 16(1 δ) 3 (3δ 2 6δ+4) (2 δ) 3 δ 3 - tripes r[p i, i ], i, for 1 i. Simiary, we can deduce the bounds and signatures for other functions as shown in Tabes II-IV. Signatures for s S: s may be simiar to many strings in R. Based on the ength pruning, the upper bound of engths of the possibe simiar strings to s is L u = s /δ ] and the ower bound is L o = δ s. Thus for s, we need to consider strings with ength [L o, L u ] and generate signatures s[x i, i ] for each L o L u and 1 i. In concusion, the signatures generated are shown in Tabe IV. Note that to make the formua consistent for different simiarity functions, we set U r s = s + ( i) and U s r = i 1 for ED. Exampe 2: Consider the two string coections R and S in Tabe I. Suppose the Jaccard Simiarity threshod is δ =.8. For r 1 R, we can deduce = 5, U = 1, = 2, 1 = 2, 2 = 3, p 1 = 1 and p 2 = 3. Thus, we need to generate two signatures r 1 [1, 2], 1, 5 and r 1 [3, 3], 2, 5 for r 1. For string s 1 S, we can deduce L o = 4 and L u = 6, i.e., 4 6. For = 4, we can deduce that = 2, 1 = 2, 2 = 2, U r s =, U s r = 1, 1 = 1, 1 = 2, 2 = 3 and 2 = 4, thus it generates four signatures s 1 [1, 2], 1, 4, s 1 [2, 2], 1, 4, s 1 [3, 2], 2, 4 and s 1 [4, 2], 2, 4. Simiary, for = 5, we generate two signatures s 1 [1, 2], 1, 5 and s 1 [3, 3], 2, 5. For = 6, we generate two signatures s 1 [1, 3], 1, 6 and s 1 [3, 3], 2, 6. In tota, we need to generate eight signatures for s 1. C. Signature Compexity For string r S, as we generate one signature for 1 i, it totay has signatures. For any string s, the tota number of its generated signatures is O((L u s ) 3 + ( s L o ) 3 ) as shown in Lemma 5. Lemma 5: The tota number of signatures for any string s is O((L u s ) 3 + ( s L o ) 3 ). For exampe, suppose we use JAC simiarity and the simiarity threshod is δ, we have the tota number of generated signatures for string s is O( (3 )(1 δ) 3 δ 3 s 3 ). IV. THE MASSJOIN FRAMEWORK In this section we present a scaabe MapReduce-based string simiarity join agorithm, caed MASSJOIN, which can support both set-based simiarity functions and character-based i p i sig sig TABLE III SIGNATURES GENERATED FOR r R JAC COS DICE ED r for i ( ) for i > ( ) p 1 = 1 and p i = j<i j=1 j + 1 r[p i, i ], i, for 1 i TABLE IV SIGNATURES GENERATED FOR s S (FOR ED, U r s = s + ( i) AND U s r = i 1). JAC COS DICE ED L o L u i for i ( ) for i > ( ) i max(p i (i 1), p i U r s) i min(p i + s + ( i), p i + U s r) sig s[x i, i ], i, for i x i i, 1 i, Lo Lu sig f δ s 3 τ 3 simiarity functions. For simpicity, we use Jaccard as an exampe in this section. MapReduce contains two main stages: the fiter stage and the verification stage. We wi introduce the two stages respectivey in Section IV-A and Section IV-B. Then we give the compexity in Section IV-C. Finay we discuss how to support sef joins in Section IV-D. A. Fiter Stage In this fiter phase, we generate candidate pairs using the fiter techniques in Section III: if two strings r and s are simiar, r and s must share a signature. In the map phase, we use the signatures as keys and the strings as vaues. Thus for r R, the key-vaue pairs are r[p i, i ], i,, r for 1 i. For s S, the key-vaue pairs are s[x i, i ], i,, s for i x i i, 1 i and L o L u. As two simiar strings must share a same key, they must be shuffed to the same reduce task. To reduce the transmission cost, in the key-vaue pairs, we use string ids to repace strings (an id is much smaer than its corresponding string.). The pseudocode of our MASSJOIN agorithm is shown in Agorithm 1. It first generates key-vaue pairs for strings in R (ines 2-4) and then for strings in S (ines 5-9). In the reduce phase, each reduce node takes a key-vaue pair sig, ist(sid/rid) as input, where sig is a signature and ist(sid/rid) is the ist of strings that contain the signature. It first spits the ist into two groups: ist(sid) and ist(rid) (ine 11). Notice that for any pair sid, rid from the two ists, it is a candidate pair. To reduce the transmission cost, for each sid ist(sid), we generate a key-vaue pair sid, ist(rid) (ines 12-13). The map and reduce functions are as foows. map: sid/rid, string signature, sid/rid reduce: signature, ist(sid/rid) sid, ist(rid) Comparing with existing works [14], [12] that used a singe token as a key, our method utiizes signatures with mutipe tokes as keys and thus has higher pruning power. Moreover, the number of pairs that share mutipe tokens is smaer than that of a singe token, thus our method can address the skewed token distribution probem as discussed in Section II-B. 5

6 Agorithm 1: MASSJOIN Agorithm // the fiter stage 1 Map( rid, r / sid, s ) 2 for r R do 3 for 1 i do 4 emit( r[p i, i], i, = r, rid ); 5 for s S do 6 for L o L u do 7 for 1 i do 8 for i x i i do 9 emit( s[x i, i], i,, sid ); 1 Reduce ( sig, ist(id) ) 11 spit ist(id) into two groups ist(sid) and ist(rid) ; 12 foreach sid ist(sid) do 13 output( sid, ist(rid) ); // first phase of verification stage 14 Map( sid, ist(rid) / sid, s ) 15 emit( sid, ist(rid) ); emit( sid, s ) ; 16 Reduce ( sid, ist(ist(rid)/s) ) 17 Identify s and ist(ist(rid)) ; 18 Compute distinct ist(rid) from ist(ist(rid)) ; 19 output s, ist(rid) ; // second phase of verification stage 2 Map( s, ist(rid) / rid, r ) 21 emit( rid, r ) ; 22 for each rid in ist(rid) do 23 emit( rid, s ) ; 24 Reduce ( rid, ist(s/r) ) 25 Identify r and ist(s) from ist(s/r); 26 foreach s ist(s) do 27 if SIM(r, s) δ then output( r, s, SIM(r, s) ); Exampe 3: Consider the two string coections in Tabe I. Suppose we use JAC function and the threshod is δ =.8. Figure 1 shows the running exampe. The map functions in the fiter stage generate 2 key-vaue pairs for each r 1, r 2 and r 3 in R and 8, 4, and 8 key-vaue pairs for s 1, s 2 and s 3 in S as shown in the figure. The shuffe phase groups the intermediate key, vaue pairs and sends them to the reduce tasks. In the reduce phase, we get 18 key-vaue pairs. However ony two of them, [ conference information, 1, 5], {r 1, s 1, s 3 } and [ information management, 1, 5], {r 2, s 2 } can generate outputs, which are s 1, {r 1 }, s 2, {r 2 }, and s 3, {r 1 }. B. Verification Stage In the verification stage, we verify the candidate pairs generated from the fiter stage. As two strings may share mutipe signatures, there may be many dupicate candidate pairs, and we want to remove the dupicate ones. In addition, we need to repace the id in candidate pairs with its rea string to verify the candidate pairs. To achieve these two goas, we propose a two-phase method. In the first phase, the map function takes dataset S and sid, ist(rid) as input and emits two types of key-vaue pairs (ine 15). The first one is sid, s from the dataset S which is used to repace sid with string s. The second one is sid, rid which is generated from input sid, ist(rid) gotten from the fiter stage. The reduce function gathers the ist ist(rid) and the string s for the key sid. It first identifies the string s, and then removes the dupicates from ist(rid) to generate a ist of distinct rids, and finay emits key-vaue pair s, ist(rid) (ines 17-19). The map and reduce functions are as foows. map: sid, ist(rid) sid, ist(rid) ; sid, s sid, s reduce: sid, ist(ist(rid)/s) s, ist(rid) In the second phase of verification, we need to repace the rid with string r and verify the candidate pairs. The map function takes dataset R and s, ist(rid) as input and emits two types of key-vaue pairs: rid, r and rid, s. The first one is generated from the dataset R (ine 21). The second one is generated from s, ist(rid) that is the output of the first phase. For each key s, it generates a key-vaue pair rid, s for each rid in ist(rid) (ine 23). In the reduce phase, we have a string r and a ist of string s such that r, s is a candidate pair. We first identify the string r and ist(s), then verify the candidate pairs, and finay output the fina resuts (ines 25-27). The map and reduce functions are as foows. map: s, ist(rid) rid, s, rid, r rid, r ; reduce: rid, ist(r/s) (r, s), SIM(r, s) Exampe 4: Reca the running exampe in the ast subsection and here we discuss the verification stage. In the fiter step, we get three candidate pairs. In the first phase, we remove dupicates from the three pairs, repace sid with its string and output the key-vaue pairs, e.g., service information management poicy, {r 2 }. In the second phase, we repace rid with its string and verify the candidate pairs. Finay, we output r 2, s 2 as a resut. The detais are shown in Figure 1. C. Compexity In this section, we anayze the compexity of our approach in each stage, incuding the space/time compexity and IO cost of the map and reduce phase, and the transmission compexity of the shuffe phase. For each string with ength in R, we need to generate O(U) = O(g δ ) signatures, where g δ is the factor of in the formua of U as iustrated in Tabe II. For each string with ength in S, we generate O((L u ) 3 + ( L o ) 3 ) = O(f δ 3 ) signatures where f δ is the factor of 3 in (L u ) 3 + ( L o ) 3 as iustrated in Tabe II. Let m (n ) denote the number of the strings with ength in R (S). Based on the discussion in Section III, the number of generated signatures for strings in R (S) is O( R max g = R δ m ) (O( S max f = min S δ 3 n )), where min R min (S min ) and S max ( S max) are the minimum and maximum string engths in R (S) respectivey. For ease of presentation, et N = R max g = R δ m and M = S max f = min S δ 3 n. min Fiter Stage: The map function oads the dataset from DFS and the IO cost is O(R + S). The space/time compexity 6

7 dm dm dm dm dm em em UZZZZZ VZZZZZ UZZZZZ VZZZZ U ZZZZ V ZZZZZ & D & D & D >ZZ@U!>ZZZ@U! >ZZ@V!>ZZ@V! >ZZ@V!>ZZ@V! >ZZ@V!>ZZZ@V! >ZZZ@V!>ZZZ@V! >ZZ@U!>ZZZ@U! >ZZ@V!>ZZ@V! >ZZ@V!>ZZZ@V! >ZZ@U!>ZZ@U! >ZZ@V!>ZZ@V! >ZZ@V!>ZZ@V! >ZZ@V!>ZZZ@V! >ZZZ@V!>ZZZ@V! & ^ ' < >ZZ@^UVV`! >ZZZ@^U`! >ZZ@^VV`! >ZZ@^VVV`! >ZZ@^VV`! >ZZ@^V`! >ZZZ@^V`! >ZZZ@^VV`! >ZZZ@^V`! >ZZ@^UV`! >ZZZ@^U`! >ZZ@^V`! >ZZZ@^V`! >ZZ@^U`! >ZZ@^U`! >ZZ@^V`! >ZZZ@^V`! >ZZZ@^V`! & Z & Z V^U`! V^U`! V^U`! U>VZZZZZ@! U>VZZZZZ@! U>VZZZZ@! 'DWDVHW6 VZZZZZ VZZZZ VZZZZZ U U U 'DWDVHW5 ZZZZZ ZZZZZ ZZZZ U^ZZZZZ>VZZZZZ@>VZZZZZ@`! U^ZZZZZ>VZZZZ@`! dz dz >UZZZZZ@>VZZZZ@! dz Fig. 1. Running exampe of generating one signature is O(1), thus the space/time compexity of the first map phase is O(N + M). As we emit one intermediate key, vaue pair for each signature and the space of each signature is O(1), the transmission compexity is aso O(N + M). The reduce step needs to scan the ists of rids and sids in the vaue fieds of the key-vaue pairs, thus the time/space compexity of the reduce phase is O(N M). The reduce step writes the candidate pairs to disk, thus the IO cost is aso O(N M). Verification Stage: The map function of the fiter phase needs to read the dataset S and the candidate pairs, thus the IO cost is O(N M + S). Emitting a key, vaue takes O(1) time, and the space/time compexity is O(N M + S ). The shuffe transmission compexity is aso O(N M + S). The reduce function scans the input vaue ist once and the time compexity is O ( N M + S ). As we at most output each origina string in S once, the IO cost is O(N M + S). Suppose there are c distinct candidate pairs generated from the fiter reduce of the verification stage, and the average size of the strings in S is S avg. As each candidate pair contains a string in S, the IO compexity for this map phase is O(cS avg + R). The time compexity of the third map is O(c + R ). As each candidate pair contains a string of S, the shuffe transmission compexity is O(cS avg + R). Suppose the average verification time for each candidate pair is v. The time compexity of the third reduce is O(cv). The IO cost in this stage is O ( c(r avg + S avg ) ). We show a the compexity anaysis in Tabe V. D. Supporting Sef-Join The MASSJOIN agorithm can inherenty support the sefjoin scenario where two datasets are the same, i.e., R = S. In this case, for each string r, we first take its segments as segment signatures and then seect its substrings for strings with ength between L o and r as substring signatures (taking it as s). We can use a specia mark to differentiate segment signatures and substring signatures. In this case, we can support sef joins. V. MERGING KEY-VALUE PAIRS Based on the compexity anaysis in Section IV-C, the basic framework generates arge numbers of signature-based keyvaue pairs for string s S in the fiter stage. In this section, we discuss how to reduce the number of key-vaue pairs without sacrificing the pruning power. For exampe, we can decrease the number from O( s 3 ) to O( s ) for the Jaccard function. To achieve this goa, we first introduce how to merge key-vaue pairs in Section V-B, and then devise an efficient merging agorithm in Section V-B, and finay discuss how to incorporate the method into our framework in Section V-C. A. Merging Key-Vaue Pairs For simpicity, the key-vaue pairs in this section refer to those generated from the map phrase in the fiter stage. Basic Idea. The basic framework wi generate arge numbers of key-vaue pairs for s S: s[x i, i ], i,, s for i x i i, 1 i and L o L u. We aim to merge them to generate a sma number of key-vaue pairs. Obviousy if two signatures match, e.g., r[p i, i ], i, = s[x i, i ], i,, the segment r[p i, i ] and the substring s[x i, i ] must match, i.e., r[p i, i ] = s[x i, i ]. One simpe method is to take s[x i, i ], s as key-vaue pairs for s and r[p i, i ], r as key-vaue pairs for r. Obviousy, if r and s are simiar, they must share a same key and the method wi not miss a simiar pair. However, this method may reduce the pruning power. This is because a substring and a segment may match with different start positions, or with different engths. Both cases wi generate more fase positives. To achieve the same pruning power as the basic framework, we need to check whether the start position x i of the substring s[x i, i ] is within [ i, i ] as we as whether the ength of r is within [L o, L u ]. These bounds are different for various i and, and it is expensive to add them into the key-vaue pairs. Fortunatey, these bounds can be efficienty computed just based on the vaues of i and in O(1) time. For each key-vaue pair s[x i, i ], s of s, we add x i and s into its vaue fied and the new key-vaue pair is s[x i, i ], s, x i, s. For each key-vaue pair r[x i, i ], r of r, we add i and = r in its vaue fied and the key-vaue pair is r[p i, i ], i,, r. In the map phase, we generate these new key-vaue pairs. In the reduce phase, for two matching keys r[p i, i ] = s[x i, i ] with vaues i,, r and s, x i, s, we first cacuate i, i, L o and L u based on, i and s in the vaue fieds and δ from the configuration fie. 7

8 TABLE V COMPLEXITY ANALYSIS OF MASSJOIN (R/S ARE THE DATASET SIZES AND R / S ARE THE NUMBERS OF STRINGS IN R/S). Stage Fiter stage First phase of the verification stage Second phase of the verification stage Time/Space O(N + M) O(N M + S ) O(c + R ) map IO O(R + S) O(N M + S) O(cS avg + R) shuffe Trans O(N + M) O(N M + S) O(cS avg + R) Time/Space O(N M) O(N M + S ) O(cv) reduce ( ) IO O(N M) O(N M + S) O c(r avg + S avg) Then we check if L o L u and i x i i. If yes, we output this pair as a candidate pair. It is easy to prove that the method using r[p i, i ], i,, r and s[x i, i ], s, x i, s as keys generates the same candidate pairs as that using r[p i, i ], i,, r and s[x i, i ], i,, s as keys. It is worth noting that this method can significanty reduce the number of key-vaue pairs. Given any string s S, for setbased simiarity functions, we can reduce the number of keyvaue pairs from O(f δ s 3 ) to O( s ) as stated in Lemma 6. Lemma 6: Given a string s and a set-based simiarity function threshod δ, the number of key-vaue pairs in the mergebased method is O( s ). Proof Sketch: Here we use JAC as an exampe. For any string s and any JAC simiarity threshod δ, is monotonicay increasing with the increasing of, thus the minimum and maximum vaue of i Lo Lu Lu is and respectivey. Lo + 1 Lu Lo Thus there are at most 4 different vaues of i. Since x i must be in [1, s ], there are at most O( s ) keys. Moreover, each key s[x i, p i ] can soey determine the vaue s, x i, sid. Thus the number of keyvaue pairs is O( s ). Simiary, we can prove this emma for COS and DICE functions. We can aso prove that for ED, the number of key-vaue pairs in this merge-based method is O ( min( s, τ 2 ) ). This is formaized in Lemma 7. Lemma 7: Given a string s and a ED threshod τ, the number of key-vaue pairs in the merge-based method is O ( min( s, τ 2 ) ). Proof Sketch: Based on the simiar idea in Lemma 6, we can prove that the number of key-vaue pairs generated by s is bounded by O( s ). Next we prove the number of keyvaue pairs is aso bounded by O(τ 2 ). For any string s and threshod τ, is monotonicay increasing, thus the range of i is [ Lo Lu, ]. As Lu Lo + 1 2τ+1 τ , the size of the range of i is at most 4. We aso find that the size of the range of x i is at most 2τ + 4 for a L o L u and a fixed i [1, ]. Thus the tota number of possibe x i is O(τ 2 ). Thus the number of key-vaue pairs is aso bounded by O(τ 2 ). B. Merge Agorithm For sting r, it is easy to generate its key-vaue pairs r[p i, i ], i, r, r for i [1, ]. For string s, a straightforward method enumerates a signatures for s as discussed in Section III-C and then merges them with the same s[x i, i ]. However, the time compexity of this method is O(f δ s 3 ). Here we study how to efficienty merge the origina key-vaue pairs to obtain the new key-vaue pairs for string s. As proved in Lemma 6, there are ony four different vaues for i. As x i is a position in string s, x i [1, s ]. Then we can devise an efficient agorithm to directy generate the new key-vaue pairs as foows. For x i [1, s ] and for each i, we generate a key-vaue pair s[x i, i ], s, x i, s. Obviousy the time compexity is O( s ). 3 Compexity Improvement. The merge-based method can reduce the number of key-vaue pairs for s from O( s 3 ) to O( s ). The tota number of key-vaue pairs for a strings in S is from S max = S min f δ 3 n to S max n = S. The time/space min compexity to generate the key-vaue pairs for string s is aso from O( s 3 ) to O( s ). C. Changes on MASSJOIN Agorithm This section discusses the changes we need to make for incorporating the merge-based method into our MASSJOIN framework. The pseudo-code shown in Agorithm 2 is the same as Agorithm 1 except for the fiter stage. To generate the new key-vaue pairs, we repace Line 4 and Lines 6-9 in Agorithm 1 with Line 2 and Lines 3-5 in Agorithm 2 respectivey. For each string s S, we ony need to scan the string once to generate a key-vaue pairs (Lines 3-5). In the reduce phrase, we sti spit the input vaue ists into two ists and output those pairs satisfying L o L u and i x i i (Lines 7-12). Exampe 5: Consider the strings r 3 and s 3 in Tabe I. Using the merge-based agorithm, we generate t- wo key-vaue pairs for r 3, i.e., r3 [1, 2], 1, 4, r 3 and r3 [3, 2], 2, 4, r 3, and six key-vaue pairs for s 3, i.e., s 3 [1, 2], 5, 1, s 3, s 3 [2, 2], 5, 2, s 3, s 3 [3, 2], 5, 3, s 3, s3 [4, 2], 5, 4, s 3, s3 [1, 3], 5, 1, s 3, and s3 [3, 3], 5, 3, s 3. Note that for the basic method, we generate eight key-vaue pairs for s 3, thus the merge-based agorithm reduces the transmission cost of two key-vaue pairs. When using the key-vaue pairs to generate the candidate pairs, athough r 3 [1, 2] = s 3 [4, 2], we do not output r 3, s 3 since 4 [ 4 1 = 1, 4 1 = 2]. VI. USING LIGHT-WEIGHT FILTER UNIT TO REDUCE CANDIDATE PAIRS As the transmission and computation cost in the verification stage heaviy reies on the number of candidate pairs, in this section, we aim to decrease the number of candidate pairs. One simpe method is to attach the origina strings to the 3 Athough this method may generate more key-vaue pairs than the straightforward method, the compexity is sti O( s ). In the reduce step, we can remove such pairs based on the checking method in Section V-B. 8

9 Agorithm 2: Merge-based Agorithm // the fitering stage 1 Map( rid, r / sid, s ) // Repace Line 4 in Ago 1 with 2 emit( [p i, i ], i, r,rid ) ; // Repace Lines 6 9 in Ago 1 with 3 for 1 x i s do 4 for Lo i Lu do 5 emit( s[x i, i], s, x i, sid ); 6 Reduce ( sig, ist( i,, rid / s, x i, sid ) ) 7 spit ist( i,, rid / s, x i, sid ) into two groups ist( i,, rid ) and ist( s, x i, sid ); 8 foreach s, x i, sid ist( s, x i, sid ) do 9 foreach i,, rid ist( i,, rid ) do 1 if L o L u & i x i then 11 ist(rid) rid; 12 output( sid, ist(rid) ); vaue fied of each key, vaue pair in the map phase of fiter stage. In the reduce phase, for each candidate pair, we cacuate their rea simiarity and remove those pairs whose simiarity is smaer than the threshod δ. Athough this method can reduce the transmission cost of candidate pairs, it increases the transmitting cost of origina strings dramaticay. To aeviate this probem, we incorporate an ight-weight fiter unit to repace the origina strings. On one hand, fiter units can be utiized to prune many dissimiar pairs. On the other hand, fiter units shoud be ight-weight in order not to significanty increase the transmission cost. To this end, we propose an effective fiter unit in Section VI-A, and then discuss how to integrate this technique into our MASSJOIN framework in Section VI-B. A. Light-weight Fiter Unit Basic Idea. We first consider the set-based simiarity functions. Given a string, we can use an integer to repace each token in the string and take the set of integers as the fiter unit of the string. Obviousy two strings are simiar ony if their corresponding integer sets are simiar. However when the number of tokens is arge, this method sti invoves arge transmission cost. To reduce the size, we can group the integers into n buckets and keep a ist of n integers as a fiter unit. Formay, we first partition a tokens in the two string sets into n groups, G 1, G 2,, G n. Then for each string r(s), its fiter unit is g1, r g2, r, gn ( g r 1, s g2, s, gn ), s where gi r(gs i ) is the number of tokens in r(s) that are in the i-th group, i.e., gi r = G i r (gi s = G i s ). Two strings r and s are simiar ony if 1 i n gr i gs i U as stated in Lemma 8, where U is an upper bound of r s + s r (Section III-A). Lemma 8: Given two strings r and s with fiter unit g r 1, g r 2,, g r n and g s 1, g s 2,, g s n, if 1 i n gr i gs i > U, r and s cannot be simiar. Proof Sketch: We divide r and s into n disjoint groups Agorithm 3: Light-weight Fitering // the token count stage 1 Map( id, string ) 2 For each token in the string, emit( token, 1 ); 3 Reduce ( token, ist(1) ) 4 Output token frequency tf; // the fitering stage 5 MapSetup 6 Partition tokens into n groups ; 7 Map( rid, r / sid, s ) // Add into Ago 1 8 Compute fiter unit g r /g s for strings r/s; // Repace Line 4 in Ago 1 with 9 emit( r[p i, i], i, = r, rid, g r ); // Repace Line 9 in Ago 1 with 1 emit( s[x i, i], i,, sid, g s ); 11 Reduce ( sig, ist( id, g ) ) 12 ist( id, g ) ist( sid, g s ) and ist( rid, g r ); 13 foreach sid, g s ist( sid, g s ) do 14 foreach rid, g r ist( rid, g r ) do 15 if 1 i n gr i gi s U then 16 ist(rid) rid; 17 output( sid, ist(rid) ); using a universa grouping method. The tota number of different tokens between r and s is the sum of the number of different tokens in the n groups. For the i-th group, the number of different tokens in it is G i r G i s + G i s G i r = G i r + G i s 2 (G i r) (G i s) gi r +gs i 2 max(gr i, gs i ) = gi r gs i. Thus the tota number of different tokens between r and s cannot be smaer than 1 i n gr i gs i. Meanwhie, the number of different tokens in two simiar strings cannot exceed U(see Section III-A). Thus if 1 i n gr i gs i > U, r and s are not simiar. To utiize the fiter unit, we add the fiter unit into the vaue part in the key-vaue pair of the map function in the fiter stage. As discussed in Section III-A, the upper bound U ony depends on r, s and δ. Obviousy r = 1 i n gr i and s = 1 i n gs i, thus we can compute U in the reduce step. Thus we can utiize the fiter condition to prune dissimiar pairs. Notice that the fiter unit based method is orthogona to the merge-based method and they can be used together. This method can be appied to the character-based simiarity functions by taking each character as a token and U = 2τ. Finding the Optima Grouping Strategy. We find that different grouping methods may have different pruning power. Next we study how to evauate a grouping strategy and how to seect the optima grouping so as to generate high-quaity fiter unit. Intuitivey, for two strings r and s, the arger 1 i n gr i gi s, the two strings have higher probabiity to be pruned. Thus we want to maximize the vaue. Simiary, for a strings in R and S, we want to find an optima grouping strategy to 9

10 maximize r R s S 1 i n g r i g s i. (1) We can prove that the optima grouping probem is NP-hard by a reduction from the 3-SAT probem. Theorem 1: The optima grouping probem is NP-hard. Proof Sketch: We can prove the probem by a reduction from the 3-SAT probem. We propose a heuristic approach to sove this probem. The approach is based on two observations. First, it is not good to put two tokens with arge token frequencies into the same group, where the token frequency is the number of strings containing the token. This is because if we put them into the same group, many string pairs wi fasey consider them as the same token and the vaue gi r gs i for such string pairs wi not increase; on the contrary, if they are assigned into two different groups, the vaue wi be significanty increased. Second, the sum of token frequencies is a constant. To make the overa vaue as arge as possibe, we want to make the sum of token frequency in each group neary equa. Based on these two observations, we can devise a greedy agorithm as foows. We first sort a the tokens by their frequencies in decreasing order. Then we access each token in order and add it into the group with the minimum sum of token frequencies. We repeat this step unti a the tokens have been accessed. B. Changes on the MASSJOIN Agorithm To incorporate fiter units into our method, we need make some changes on the MASSJOIN agorithm. The method can be utiized to both the basic method and the merge-based method. Here we take the basic method as an exampe. The pseudo-code is shown in Agorithm 3. We need to add another MapReduce phase to count token frequencies (Lines 1-4). We aso modify the fiter stage (Lines 5-17) as foows. We add a setup phase to read the token frequency and an approximation agorithm to divide a the tokens to n groups (Line 6). In the map phase, we oad the token frequency, identify the tokens for each string, partition them into different groups based on token frequencies, and generate fiter unit (Line 8). Then we emit the fiter unit aong with the key, vaue pairs (Lines 9-1). In the reduce step, we ony output those pairs passing the grouping fiter (Lines 12-17). Exampe 6: Consider the two datasets in Tabe I. We use JAC and the threshod is δ =.8. For simpicity, we use the ids of strings as shown in Figure 1. In the token count stage we get the token frequencies, w 2, 5, w 3, 5, w 4, 5, w 5, 4, w 6, 4, w 1, 3 and w 7, 2. Suppose the group number is 4. We can divide the tokens into 4 groups G 1 = {w 2, w 7 }, G 2 = {w 3, w 1 }, G 3 = {w 4 }, and G 4 = {w 5, w 6 }. The fiter unit for r 1 is 1, 2, 1, 1 and that for s 1 is 2, 2, 1,. We fiter pair s 1, r 1 in the reduce phase as 4 i=1 gr1 i g s1 i = 2 > U =. VII. EXPERIMENT We have impemented our MASSJOIN method and conducted experiments on four rea datasets: Enron emai 4, PubMed paper abstract, PubMed paper tite 5 and NCBI DNA sequence 6. The detais of the datasets are shown in Tabe VI. TABLE VI DATASETS Datasets Size(MB) Cardinaity Avg Size/Len Σ Enron Emai (Set) , PubMed Abstract (Set) ,347, PubMed Tite (String) ,394, DNA Sequence (String) ,299, For Enron emai and PubMed paper abstract datasets, we used set-based simiarity functions, where strings are tokenized by non-aphanumeric characters. For PubMed paper tite and DNA datasets, we used the character-based distance functions. In the foowing experiments, we spit each dataset into two datasets with equa size to conduct R-S join. Due to space constraints, we focus on JAC and ED in the experiments and use them as defaut functions. We wi show the resuts on COS and DICE in Section VII-D. We compared with state-of-the-art method PrefixFiter [14]. We got their source code from their home page (asterix.ics.uci.edu/fuzzyjoin). A agorithms were impemented on Hadoop and run on a 1-node De custer. Each node had two Inte(R) Xeon(R) E GHZ processors with 8 cores, 16GB RAM, and 1TB disk. Each node is instaed 64-bit Ubuntu Server 1.4, Java 1.6, and Hadoop We set the bock size of the distributed fie system to 16MB and aocate 2GB virtua memory to each task Fig. 2. Greedy Random Number of Groups (a) PubMed Abstract Evauating ight-weight fiter units. A. Evauating Our Proposed Techniques We first evauated our fiter unit based method by varying different group numbers from 1 to 15. We impemented two grouping methods: Random and Greedy. Random computed the hash code of each token and randomy assigned it into a group. Greedy used our greedy agorithm. Figure 2 shows the resuts. We can see that with the increase of groups, the performance of Greedy improved first and then depraved. This is because a sma group number wi reduce the transmission cost but with ower pruning power, and a arge group number wi improve the pruning power but with arge transmission cost. 4 enron/

MassJoin: A MapReduce-based Method for Scalable String Similarity Joins

MassJoin: A MapReduce-based Method for Scalable String Similarity Joins MassJoin: A MapReduce-based Method for Scaabe String Simiarity Joins Dong Deng, Guoiang Li, Shuang Hao, Jiannan Wang, Jianhua Feng, Wen-Syan Li Department of Computer Science, Tsinghua University, Beijing,

More information

Two Birds With One Stone: An Efficient Hierarchical Framework for Top-k and Threshold-based String Similarity Search

Two Birds With One Stone: An Efficient Hierarchical Framework for Top-k and Threshold-based String Similarity Search Two Birds With One Stone: An Efficient Hierarchica Framework for Top-k and Threshod-based String Simiarity Search Jin Wang Guoiang Li Dong Deng Yong Zhang Jianhua Feng Department of Computer Science and

More information

Asynchronous Control for Coupled Markov Decision Systems

Asynchronous Control for Coupled Markov Decision Systems INFORMATION THEORY WORKSHOP (ITW) 22 Asynchronous Contro for Couped Marov Decision Systems Michae J. Neey University of Southern Caifornia Abstract This paper considers optima contro for a coection of

More information

Efficiently Generating Random Bits from Finite State Markov Chains

Efficiently Generating Random Bits from Finite State Markov Chains 1 Efficienty Generating Random Bits from Finite State Markov Chains Hongchao Zhou and Jehoshua Bruck, Feow, IEEE Abstract The probem of random number generation from an uncorreated random source (of unknown

More information

Stochastic Complement Analysis of Multi-Server Threshold Queues. with Hysteresis. Abstract

Stochastic Complement Analysis of Multi-Server Threshold Queues. with Hysteresis. Abstract Stochastic Compement Anaysis of Muti-Server Threshod Queues with Hysteresis John C.S. Lui The Dept. of Computer Science & Engineering The Chinese University of Hong Kong Leana Goubchik Dept. of Computer

More information

Cryptanalysis of PKP: A New Approach

Cryptanalysis of PKP: A New Approach Cryptanaysis of PKP: A New Approach Éiane Jaumes and Antoine Joux DCSSI 18, rue du Dr. Zamenhoff F-92131 Issy-es-Mx Cedex France eiane.jaumes@wanadoo.fr Antoine.Joux@ens.fr Abstract. Quite recenty, in

More information

An Algorithm for Pruning Redundant Modules in Min-Max Modular Network

An Algorithm for Pruning Redundant Modules in Min-Max Modular Network An Agorithm for Pruning Redundant Modues in Min-Max Moduar Network Hui-Cheng Lian and Bao-Liang Lu Department of Computer Science and Engineering, Shanghai Jiao Tong University 1954 Hua Shan Rd., Shanghai

More information

Efficient Generation of Random Bits from Finite State Markov Chains

Efficient Generation of Random Bits from Finite State Markov Chains Efficient Generation of Random Bits from Finite State Markov Chains Hongchao Zhou and Jehoshua Bruck, Feow, IEEE Abstract The probem of random number generation from an uncorreated random source (of unknown

More information

Minimizing Total Weighted Completion Time on Uniform Machines with Unbounded Batch

Minimizing Total Weighted Completion Time on Uniform Machines with Unbounded Batch The Eighth Internationa Symposium on Operations Research and Its Appications (ISORA 09) Zhangiaie, China, September 20 22, 2009 Copyright 2009 ORSC & APORC, pp. 402 408 Minimizing Tota Weighted Competion

More information

Schedulability Analysis of Deferrable Scheduling Algorithms for Maintaining Real-Time Data Freshness

Schedulability Analysis of Deferrable Scheduling Algorithms for Maintaining Real-Time Data Freshness 1 Scheduabiity Anaysis of Deferrabe Scheduing Agorithms for Maintaining Rea-Time Data Freshness Song Han, Deji Chen, Ming Xiong, Kam-yiu Lam, Aoysius K. Mok, Krithi Ramamritham UT Austin, Emerson Process

More information

T.C. Banwell, S. Galli. {bct, Telcordia Technologies, Inc., 445 South Street, Morristown, NJ 07960, USA

T.C. Banwell, S. Galli. {bct, Telcordia Technologies, Inc., 445 South Street, Morristown, NJ 07960, USA ON THE SYMMETRY OF THE POWER INE CHANNE T.C. Banwe, S. Gai {bct, sgai}@research.tecordia.com Tecordia Technoogies, Inc., 445 South Street, Morristown, NJ 07960, USA Abstract The indoor power ine network

More information

Schedulability Analysis of Deferrable Scheduling Algorithms for Maintaining Real-Time Data Freshness

Schedulability Analysis of Deferrable Scheduling Algorithms for Maintaining Real-Time Data Freshness 1 Scheduabiity Anaysis of Deferrabe Scheduing Agorithms for Maintaining Rea- Data Freshness Song Han, Deji Chen, Ming Xiong, Kam-yiu Lam, Aoysius K. Mok, Krithi Ramamritham UT Austin, Emerson Process Management,

More information

Centralized Coded Caching of Correlated Contents

Centralized Coded Caching of Correlated Contents Centraized Coded Caching of Correated Contents Qianqian Yang and Deniz Gündüz Information Processing and Communications Lab Department of Eectrica and Eectronic Engineering Imperia Coege London arxiv:1711.03798v1

More information

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel Sequentia Decoding of Poar Codes with Arbitrary Binary Kerne Vera Miosavskaya, Peter Trifonov Saint-Petersburg State Poytechnic University Emai: veram,petert}@dcn.icc.spbstu.ru Abstract The probem of efficient

More information

THE THREE POINT STEINER PROBLEM ON THE FLAT TORUS: THE MINIMAL LUNE CASE

THE THREE POINT STEINER PROBLEM ON THE FLAT TORUS: THE MINIMAL LUNE CASE THE THREE POINT STEINER PROBLEM ON THE FLAT TORUS: THE MINIMAL LUNE CASE KATIE L. MAY AND MELISSA A. MITCHELL Abstract. We show how to identify the minima path network connecting three fixed points on

More information

MARKOV CHAINS AND MARKOV DECISION THEORY. Contents

MARKOV CHAINS AND MARKOV DECISION THEORY. Contents MARKOV CHAINS AND MARKOV DECISION THEORY ARINDRIMA DATTA Abstract. In this paper, we begin with a forma introduction to probabiity and expain the concept of random variabes and stochastic processes. After

More information

A. Distribution of the test statistic

A. Distribution of the test statistic A. Distribution of the test statistic In the sequentia test, we first compute the test statistic from a mini-batch of size m. If a decision cannot be made with this statistic, we keep increasing the mini-batch

More information

Efficient Similarity Search across Top-k Lists under the Kendall s Tau Distance

Efficient Similarity Search across Top-k Lists under the Kendall s Tau Distance Efficient Simiarity Search across Top-k Lists under the Kenda s Tau Distance Koninika Pa TU Kaisersautern Kaisersautern, Germany pa@cs.uni-k.de Sebastian Miche TU Kaisersautern Kaisersautern, Germany smiche@cs.uni-k.de

More information

A Fundamental Storage-Communication Tradeoff in Distributed Computing with Straggling Nodes

A Fundamental Storage-Communication Tradeoff in Distributed Computing with Straggling Nodes A Fundamenta Storage-Communication Tradeoff in Distributed Computing with Stragging odes ifa Yan, Michèe Wigger LTCI, Téécom ParisTech 75013 Paris, France Emai: {qifa.yan, michee.wigger} @teecom-paristech.fr

More information

XSAT of linear CNF formulas

XSAT of linear CNF formulas XSAT of inear CN formuas Bernd R. Schuh Dr. Bernd Schuh, D-50968 Kön, Germany; bernd.schuh@netcoogne.de eywords: compexity, XSAT, exact inear formua, -reguarity, -uniformity, NPcompeteness Abstract. Open

More information

An Efficient Partition Based Method for Exact Set Similarity Joins

An Efficient Partition Based Method for Exact Set Similarity Joins An Efficient Partition Based Method for Exact Set Similarity Joins Dong Deng Guoliang Li He Wen Jianhua Feng Department of Computer Science, Tsinghua University, Beijing, China. {dd11,wenhe1}@mails.tsinghua.edu.cn;{liguoliang,fengjh}@tsinghua.edu.cn

More information

BDD-Based Analysis of Gapped q-gram Filters

BDD-Based Analysis of Gapped q-gram Filters BDD-Based Anaysis of Gapped q-gram Fiters Marc Fontaine, Stefan Burkhardt 2 and Juha Kärkkäinen 2 Max-Panck-Institut für Informatik Stuhsatzenhausweg 85, 6623 Saarbrücken, Germany e-mai: stburk@mpi-sb.mpg.de

More information

Uniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete

Uniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete Uniprocessor Feasibiity of Sporadic Tasks with Constrained Deadines is Strongy conp-compete Pontus Ekberg and Wang Yi Uppsaa University, Sweden Emai: {pontus.ekberg yi}@it.uu.se Abstract Deciding the feasibiity

More information

Target Location Estimation in Wireless Sensor Networks Using Binary Data

Target Location Estimation in Wireless Sensor Networks Using Binary Data Target Location stimation in Wireess Sensor Networks Using Binary Data Ruixin Niu and Pramod K. Varshney Department of ectrica ngineering and Computer Science Link Ha Syracuse University Syracuse, NY 344

More information

A Brief Introduction to Markov Chains and Hidden Markov Models

A Brief Introduction to Markov Chains and Hidden Markov Models A Brief Introduction to Markov Chains and Hidden Markov Modes Aen B MacKenzie Notes for December 1, 3, &8, 2015 Discrete-Time Markov Chains You may reca that when we first introduced random processes,

More information

Problem set 6 The Perron Frobenius theorem.

Problem set 6 The Perron Frobenius theorem. Probem set 6 The Perron Frobenius theorem. Math 22a4 Oct 2 204, Due Oct.28 In a future probem set I want to discuss some criteria which aow us to concude that that the ground state of a sef-adjoint operator

More information

arxiv:quant-ph/ v3 6 Jan 1995

arxiv:quant-ph/ v3 6 Jan 1995 arxiv:quant-ph/9501001v3 6 Jan 1995 Critique of proposed imit to space time measurement, based on Wigner s cocks and mirrors L. Diósi and B. Lukács KFKI Research Institute for Partice and Nucear Physics

More information

<C 2 2. λ 2 l. λ 1 l 1 < C 1

<C 2 2. λ 2 l. λ 1 l 1 < C 1 Teecommunication Network Contro and Management (EE E694) Prof. A. A. Lazar Notes for the ecture of 7/Feb/95 by Huayan Wang (this document was ast LaT E X-ed on May 9,995) Queueing Primer for Muticass Optima

More information

Algorithms to solve massively under-defined systems of multivariate quadratic equations

Algorithms to solve massively under-defined systems of multivariate quadratic equations Agorithms to sove massivey under-defined systems of mutivariate quadratic equations Yasufumi Hashimoto Abstract It is we known that the probem to sove a set of randomy chosen mutivariate quadratic equations

More information

Efficient Generation of Random Bits from Finite State Markov Chains

Efficient Generation of Random Bits from Finite State Markov Chains Efficient Generation of Random Bits from Finite State Markov Chains Hongchao Zhou and Jehoshua Bruck, Feow, IEEE arxiv:0.5339v [cs.it] 4 Dec 00 Abstract The probem of random number generation from an uncorreated

More information

Approximated MLC shape matrix decomposition with interleaf collision constraint

Approximated MLC shape matrix decomposition with interleaf collision constraint Approximated MLC shape matrix decomposition with intereaf coision constraint Thomas Kainowski Antje Kiese Abstract Shape matrix decomposition is a subprobem in radiation therapy panning. A given fuence

More information

Efficient Parallel Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints

Efficient Parallel Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints Efficient Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints Yu Jiang,, Jiannan Wang, Guoliang Li, and Jianhua Feng Tsinghua University Similarity Search&Join Competition

More information

Research of Data Fusion Method of Multi-Sensor Based on Correlation Coefficient of Confidence Distance

Research of Data Fusion Method of Multi-Sensor Based on Correlation Coefficient of Confidence Distance Send Orders for Reprints to reprints@benthamscience.ae 340 The Open Cybernetics & Systemics Journa, 015, 9, 340-344 Open Access Research of Data Fusion Method of Muti-Sensor Based on Correation Coefficient

More information

BP neural network-based sports performance prediction model applied research

BP neural network-based sports performance prediction model applied research Avaiabe onine www.jocpr.com Journa of Chemica and Pharmaceutica Research, 204, 6(7:93-936 Research Artice ISSN : 0975-7384 CODEN(USA : JCPRC5 BP neura networ-based sports performance prediction mode appied

More information

In-plane shear stiffness of bare steel deck through shell finite element models. G. Bian, B.W. Schafer. June 2017

In-plane shear stiffness of bare steel deck through shell finite element models. G. Bian, B.W. Schafer. June 2017 In-pane shear stiffness of bare stee deck through she finite eement modes G. Bian, B.W. Schafer June 7 COLD-FORMED STEEL RESEARCH CONSORTIUM REPORT SERIES CFSRC R-7- SDII Stee Diaphragm Innovation Initiative

More information

Approximated MLC shape matrix decomposition with interleaf collision constraint

Approximated MLC shape matrix decomposition with interleaf collision constraint Agorithmic Operations Research Vo.4 (29) 49 57 Approximated MLC shape matrix decomposition with intereaf coision constraint Antje Kiese and Thomas Kainowski Institut für Mathematik, Universität Rostock,

More information

Explicit overall risk minimization transductive bound

Explicit overall risk minimization transductive bound 1 Expicit overa risk minimization transductive bound Sergio Decherchi, Paoo Gastado, Sandro Ridea, Rodofo Zunino Dept. of Biophysica and Eectronic Engineering (DIBE), Genoa University Via Opera Pia 11a,

More information

arxiv: v1 [cs.lg] 31 Oct 2017

arxiv: v1 [cs.lg] 31 Oct 2017 ACCELERATED SPARSE SUBSPACE CLUSTERING Abofaz Hashemi and Haris Vikao Department of Eectrica and Computer Engineering, University of Texas at Austin, Austin, TX, USA arxiv:7.26v [cs.lg] 3 Oct 27 ABSTRACT

More information

CONSISTENT LABELING OF ROTATING MAPS

CONSISTENT LABELING OF ROTATING MAPS CONSISTENT LABELING OF ROTATING MAPS Andreas Gemsa, Martin Nöenburg, Ignaz Rutter Abstract. Dynamic maps that aow continuous map rotations, for exampe, on mobie devices, encounter new geometric abeing

More information

Expectation-Maximization for Estimating Parameters for a Mixture of Poissons

Expectation-Maximization for Estimating Parameters for a Mixture of Poissons Expectation-Maximization for Estimating Parameters for a Mixture of Poissons Brandon Maone Department of Computer Science University of Hesini February 18, 2014 Abstract This document derives, in excrutiating

More information

Gauss Law. 2. Gauss s Law: connects charge and field 3. Applications of Gauss s Law

Gauss Law. 2. Gauss s Law: connects charge and field 3. Applications of Gauss s Law Gauss Law 1. Review on 1) Couomb s Law (charge and force) 2) Eectric Fied (fied and force) 2. Gauss s Law: connects charge and fied 3. Appications of Gauss s Law Couomb s Law and Eectric Fied Couomb s

More information

CS229 Lecture notes. Andrew Ng

CS229 Lecture notes. Andrew Ng CS229 Lecture notes Andrew Ng Part IX The EM agorithm In the previous set of notes, we taked about the EM agorithm as appied to fitting a mixture of Gaussians. In this set of notes, we give a broader view

More information

A Robust Voice Activity Detection based on Noise Eigenspace Projection

A Robust Voice Activity Detection based on Noise Eigenspace Projection A Robust Voice Activity Detection based on Noise Eigenspace Projection Dongwen Ying 1, Yu Shi 2, Frank Soong 2, Jianwu Dang 1, and Xugang Lu 1 1 Japan Advanced Institute of Science and Technoogy, Nomi

More information

Approximate Bandwidth Allocation for Fixed-Priority-Scheduled Periodic Resources (WSU-CS Technical Report Version)

Approximate Bandwidth Allocation for Fixed-Priority-Scheduled Periodic Resources (WSU-CS Technical Report Version) Approximate Bandwidth Aocation for Fixed-Priority-Schedued Periodic Resources WSU-CS Technica Report Version) Farhana Dewan Nathan Fisher Abstract Recent research in compositiona rea-time systems has focused

More information

IN MODERN embedded systems, the ever-increasing complexity

IN MODERN embedded systems, the ever-increasing complexity IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 37, NO., JANUARY 208 2 Muticore Mixed-Criticaity Systems: Partitioned Scheduing and Utiization Bound Jian-Jun Han, Member,

More information

Turbo Codes. Coding and Communication Laboratory. Dept. of Electrical Engineering, National Chung Hsing University

Turbo Codes. Coding and Communication Laboratory. Dept. of Electrical Engineering, National Chung Hsing University Turbo Codes Coding and Communication Laboratory Dept. of Eectrica Engineering, Nationa Chung Hsing University Turbo codes 1 Chapter 12: Turbo Codes 1. Introduction 2. Turbo code encoder 3. Design of intereaver

More information

MONOCHROMATIC LOOSE PATHS IN MULTICOLORED k-uniform CLIQUES

MONOCHROMATIC LOOSE PATHS IN MULTICOLORED k-uniform CLIQUES MONOCHROMATIC LOOSE PATHS IN MULTICOLORED k-uniform CLIQUES ANDRZEJ DUDEK AND ANDRZEJ RUCIŃSKI Abstract. For positive integers k and, a k-uniform hypergraph is caed a oose path of ength, and denoted by

More information

Iterative Decoding Performance Bounds for LDPC Codes on Noisy Channels

Iterative Decoding Performance Bounds for LDPC Codes on Noisy Channels Iterative Decoding Performance Bounds for LDPC Codes on Noisy Channes arxiv:cs/060700v1 [cs.it] 6 Ju 006 Chun-Hao Hsu and Achieas Anastasopouos Eectrica Engineering and Computer Science Department University

More information

Data Mining Technology for Failure Prognostic of Avionics

Data Mining Technology for Failure Prognostic of Avionics IEEE Transactions on Aerospace and Eectronic Systems. Voume 38, #, pp.388-403, 00. Data Mining Technoogy for Faiure Prognostic of Avionics V.A. Skormin, Binghamton University, Binghamton, NY, 1390, USA

More information

arxiv: v1 [cs.db] 1 Aug 2012

arxiv: v1 [cs.db] 1 Aug 2012 Functiona Mechanism: Regression Anaysis under Differentia Privacy arxiv:208.029v [cs.db] Aug 202 Jun Zhang Zhenjie Zhang 2 Xiaokui Xiao Yin Yang 2 Marianne Winsett 2,3 ABSTRACT Schoo of Computer Engineering

More information

Related Topics Maxwell s equations, electrical eddy field, magnetic field of coils, coil, magnetic flux, induced voltage

Related Topics Maxwell s equations, electrical eddy field, magnetic field of coils, coil, magnetic flux, induced voltage Magnetic induction TEP Reated Topics Maxwe s equations, eectrica eddy fied, magnetic fied of cois, coi, magnetic fux, induced votage Principe A magnetic fied of variabe frequency and varying strength is

More information

Unconditional security of differential phase shift quantum key distribution

Unconditional security of differential phase shift quantum key distribution Unconditiona security of differentia phase shift quantum key distribution Kai Wen, Yoshihisa Yamamoto Ginzton Lab and Dept of Eectrica Engineering Stanford University Basic idea of DPS-QKD Protoco. Aice

More information

Optimality of Inference in Hierarchical Coding for Distributed Object-Based Representations

Optimality of Inference in Hierarchical Coding for Distributed Object-Based Representations Optimaity of Inference in Hierarchica Coding for Distributed Object-Based Representations Simon Brodeur, Jean Rouat NECOTIS, Département génie éectrique et génie informatique, Université de Sherbrooke,

More information

Quantum Multicollision-Finding Algorithm

Quantum Multicollision-Finding Algorithm Quantum Muticoision-Finding Agorithm Akinori Hosoyamada 1, Yu Sasaki 1, and Keita Xagawa 1 NTT Secure Patform Laboratories, 3-9-11, Midori-cho Musashino-shi, Tokyo 180-8585, Japan. {hosoyamada.akinori,sasaki.yu,xagawa.keita}@ab.ntt.co.jp

More information

Fast Blind Recognition of Channel Codes

Fast Blind Recognition of Channel Codes Fast Bind Recognition of Channe Codes Reza Moosavi and Erik G. Larsson Linköping University Post Print N.B.: When citing this work, cite the origina artice. 213 IEEE. Persona use of this materia is permitted.

More information

MATH 172: MOTIVATION FOR FOURIER SERIES: SEPARATION OF VARIABLES

MATH 172: MOTIVATION FOR FOURIER SERIES: SEPARATION OF VARIABLES MATH 172: MOTIVATION FOR FOURIER SERIES: SEPARATION OF VARIABLES Separation of variabes is a method to sove certain PDEs which have a warped product structure. First, on R n, a inear PDE of order m is

More information

First-Order Corrections to Gutzwiller s Trace Formula for Systems with Discrete Symmetries

First-Order Corrections to Gutzwiller s Trace Formula for Systems with Discrete Symmetries c 26 Noninear Phenomena in Compex Systems First-Order Corrections to Gutzwier s Trace Formua for Systems with Discrete Symmetries Hoger Cartarius, Jörg Main, and Günter Wunner Institut für Theoretische

More information

Componentwise Determination of the Interval Hull Solution for Linear Interval Parameter Systems

Componentwise Determination of the Interval Hull Solution for Linear Interval Parameter Systems Componentwise Determination of the Interva Hu Soution for Linear Interva Parameter Systems L. V. Koev Dept. of Theoretica Eectrotechnics, Facuty of Automatics, Technica University of Sofia, 1000 Sofia,

More information

Partial permutation decoding for MacDonald codes

Partial permutation decoding for MacDonald codes Partia permutation decoding for MacDonad codes J.D. Key Department of Mathematics and Appied Mathematics University of the Western Cape 7535 Bevie, South Africa P. Seneviratne Department of Mathematics

More information

Bayesian Learning. You hear a which which could equally be Thanks or Tanks, which would you go with?

Bayesian Learning. You hear a which which could equally be Thanks or Tanks, which would you go with? Bayesian Learning A powerfu and growing approach in machine earning We use it in our own decision making a the time You hear a which which coud equay be Thanks or Tanks, which woud you go with? Combine

More information

arxiv: v1 [cs.db] 25 Jun 2013

arxiv: v1 [cs.db] 25 Jun 2013 Communication Steps for Parae Query Processing Pau Beame, Paraschos Koutris and Dan Suciu {beame,pkoutris,suciu}@cs.washington.edu University of Washington arxiv:1306.5972v1 [cs.db] 25 Jun 2013 June 26,

More information

Online Load Balancing on Related Machines

Online Load Balancing on Related Machines Onine Load Baancing on Reated Machines ABSTRACT Sungjin Im University of Caifornia at Merced Merced, CA, USA sim3@ucmerced.edu Debmaya Panigrahi Duke University Durham, NC, USA debmaya@cs.duke.edu We give

More information

Moreau-Yosida Regularization for Grouped Tree Structure Learning

Moreau-Yosida Regularization for Grouped Tree Structure Learning Moreau-Yosida Reguarization for Grouped Tree Structure Learning Jun Liu Computer Science and Engineering Arizona State University J.Liu@asu.edu Jieping Ye Computer Science and Engineering Arizona State

More information

NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION

NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION Hsiao-Chang Chen Dept. of Systems Engineering University of Pennsyvania Phiadephia, PA 904-635, U.S.A. Chun-Hung Chen

More information

17 Lecture 17: Recombination and Dark Matter Production

17 Lecture 17: Recombination and Dark Matter Production PYS 652: Astrophysics 88 17 Lecture 17: Recombination and Dark Matter Production New ideas pass through three periods: It can t be done. It probaby can be done, but it s not worth doing. I knew it was

More information

Multi-server queueing systems with multiple priority classes

Multi-server queueing systems with multiple priority classes Muti-server queueing systems with mutipe priority casses Mor Harcho-Bater Taayui Osogami Aan Scheer-Wof Adam Wierman Abstract We present the first near-exact anaysis of an M/PH/ queue with m > 2 preemptive-resume

More information

The influence of temperature of photovoltaic modules on performance of solar power plant

The influence of temperature of photovoltaic modules on performance of solar power plant IOSR Journa of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vo. 05, Issue 04 (Apri. 2015), V1 PP 09-15 www.iosrjen.org The infuence of temperature of photovotaic modues on performance

More information

Coded Caching for Files with Distinct File Sizes

Coded Caching for Files with Distinct File Sizes Coded Caching for Fies with Distinct Fie Sizes Jinbei Zhang iaojun Lin Chih-Chun Wang inbing Wang Department of Eectronic Engineering Shanghai Jiao ong University China Schoo of Eectrica and Computer Engineering

More information

Pairwise RNA Edit Distance

Pairwise RNA Edit Distance Pairwise RNA Edit Distance In the foowing: Sequences S 1 and S 2 associated structures P 1 and P 2 scoring of aignment: different edit operations arc atering arc removing 1) ACGUUGACUGACAACAC..(((...)))...

More information

Model Calculation of n + 6 Li Reactions Below 20 MeV

Model Calculation of n + 6 Li Reactions Below 20 MeV Commun. Theor. Phys. (Beijing, China) 36 (2001) pp. 437 442 c Internationa Academic Pubishers Vo. 36, No. 4, October 15, 2001 Mode Cacuation of n + 6 Li Reactions Beow 20 MeV ZHANG Jing-Shang and HAN Yin-Lu

More information

Limited magnitude error detecting codes over Z q

Limited magnitude error detecting codes over Z q Limited magnitude error detecting codes over Z q Noha Earief choo of Eectrica Engineering and Computer cience Oregon tate University Corvais, OR 97331, UA Emai: earief@eecsorstedu Bea Bose choo of Eectrica

More information

Rate-Distortion Theory of Finite Point Processes

Rate-Distortion Theory of Finite Point Processes Rate-Distortion Theory of Finite Point Processes Günther Koiander, Dominic Schuhmacher, and Franz Hawatsch, Feow, IEEE Abstract We study the compression of data in the case where the usefu information

More information

Density-Based Clustering for Real-Time Stream Data

Density-Based Clustering for Real-Time Stream Data Density-Based Custering for Rea-Time Stream Data Yixin Chen Department of Computer Science and Engineering Washington University in St. Louis St. Louis, USA chen@cse.wust.edu Li Tu Institute of Information

More information

Multiple Beam Interference

Multiple Beam Interference MutipeBeamInterference.nb James C. Wyant 1 Mutipe Beam Interference 1. Airy's Formua We wi first derive Airy's formua for the case of no absorption. ü 1.1 Basic refectance and transmittance Refected ight

More information

Inductive Bias: How to generalize on novel data. CS Inductive Bias 1

Inductive Bias: How to generalize on novel data. CS Inductive Bias 1 Inductive Bias: How to generaize on nove data CS 478 - Inductive Bias 1 Overfitting Noise vs. Exceptions CS 478 - Inductive Bias 2 Non-Linear Tasks Linear Regression wi not generaize we to the task beow

More information

Throughput Optimal Scheduling for Wireless Downlinks with Reconfiguration Delay

Throughput Optimal Scheduling for Wireless Downlinks with Reconfiguration Delay Throughput Optima Scheduing for Wireess Downinks with Reconfiguration Deay Vineeth Baa Sukumaran vineethbs@gmai.com Department of Avionics Indian Institute of Space Science and Technoogy. Abstract We consider

More information

Scalable Spectrum Allocation for Large Networks Based on Sparse Optimization

Scalable Spectrum Allocation for Large Networks Based on Sparse Optimization Scaabe Spectrum ocation for Large Networks ased on Sparse Optimization innan Zhuang Modem R&D Lab Samsung Semiconductor, Inc. San Diego, C Dongning Guo, Ermin Wei, and Michae L. Honig Department of Eectrica

More information

8 APPENDIX. E[m M] = (n S )(1 exp( exp(s min + c M))) (19) E[m M] n exp(s min + c M) (20) 8.1 EMPIRICAL EVALUATION OF SAMPLING

8 APPENDIX. E[m M] = (n S )(1 exp( exp(s min + c M))) (19) E[m M] n exp(s min + c M) (20) 8.1 EMPIRICAL EVALUATION OF SAMPLING 8 APPENDIX 8.1 EMPIRICAL EVALUATION OF SAMPLING We wish to evauate the empirica accuracy of our samping technique on concrete exampes. We do this in two ways. First, we can sort the eements by probabiity

More information

Lecture Note 3: Stationary Iterative Methods

Lecture Note 3: Stationary Iterative Methods MATH 5330: Computationa Methods of Linear Agebra Lecture Note 3: Stationary Iterative Methods Xianyi Zeng Department of Mathematica Sciences, UTEP Stationary Iterative Methods The Gaussian eimination (or

More information

Discrete Applied Mathematics

Discrete Applied Mathematics Discrete Appied Mathematics 159 (2011) 812 825 Contents ists avaiabe at ScienceDirect Discrete Appied Mathematics journa homepage: www.esevier.com/ocate/dam A direct barter mode for course add/drop process

More information

(This is a sample cover image for this issue. The actual cover is not yet available at this time.)

(This is a sample cover image for this issue. The actual cover is not yet available at this time.) (This is a sampe cover image for this issue The actua cover is not yet avaiabe at this time) This artice appeared in a journa pubished by Esevier The attached copy is furnished to the author for interna

More information

Do Schools Matter for High Math Achievement? Evidence from the American Mathematics Competitions Glenn Ellison and Ashley Swanson Online Appendix

Do Schools Matter for High Math Achievement? Evidence from the American Mathematics Competitions Glenn Ellison and Ashley Swanson Online Appendix VOL. NO. DO SCHOOLS MATTER FOR HIGH MATH ACHIEVEMENT? 43 Do Schoos Matter for High Math Achievement? Evidence from the American Mathematics Competitions Genn Eison and Ashey Swanson Onine Appendix Appendix

More information

$, (2.1) n="# #. (2.2)

$, (2.1) n=# #. (2.2) Chapter. Eectrostatic II Notes: Most of the materia presented in this chapter is taken from Jackson, Chap.,, and 4, and Di Bartoo, Chap... Mathematica Considerations.. The Fourier series and the Fourier

More information

Some Measures for Asymmetry of Distributions

Some Measures for Asymmetry of Distributions Some Measures for Asymmetry of Distributions Georgi N. Boshnakov First version: 31 January 2006 Research Report No. 5, 2006, Probabiity and Statistics Group Schoo of Mathematics, The University of Manchester

More information

Melodic contour estimation with B-spline models using a MDL criterion

Melodic contour estimation with B-spline models using a MDL criterion Meodic contour estimation with B-spine modes using a MDL criterion Damien Loive, Ney Barbot, Oivier Boeffard IRISA / University of Rennes 1 - ENSSAT 6 rue de Kerampont, B.P. 80518, F-305 Lannion Cedex

More information

BALANCING REGULAR MATRIX PENCILS

BALANCING REGULAR MATRIX PENCILS BALANCING REGULAR MATRIX PENCILS DAMIEN LEMONNIER AND PAUL VAN DOOREN Abstract. In this paper we present a new diagona baancing technique for reguar matrix pencis λb A, which aims at reducing the sensitivity

More information

Minimum Spanning Trees in Temporal Graphs

Minimum Spanning Trees in Temporal Graphs Minimum Spanning Trees in Tempora Graphs Siu Huang Chinese University of Hong Kong shuang@cse.cuhk.edu.hk Ada Wai-Chee Fu Chinese University of Hong Kong adafu@cse.cuhk.edu.hk Ruifeng Liu Chinese University

More information

A proposed nonparametric mixture density estimation using B-spline functions

A proposed nonparametric mixture density estimation using B-spline functions A proposed nonparametric mixture density estimation using B-spine functions Atizez Hadrich a,b, Mourad Zribi a, Afif Masmoudi b a Laboratoire d Informatique Signa et Image de a Côte d Opae (LISIC-EA 4491),

More information

arxiv: v2 [cond-mat.stat-mech] 14 Nov 2008

arxiv: v2 [cond-mat.stat-mech] 14 Nov 2008 Random Booean Networks Barbara Drosse Institute of Condensed Matter Physics, Darmstadt University of Technoogy, Hochschustraße 6, 64289 Darmstadt, Germany (Dated: June 27) arxiv:76.335v2 [cond-mat.stat-mech]

More information

Emmanuel Abbe Colin Sandon

Emmanuel Abbe Colin Sandon Detection in the stochastic bock mode with mutipe custers: proof of the achievabiity conjectures, acycic BP, and the information-computation gap Emmanue Abbe Coin Sandon Abstract In a paper that initiated

More information

A Branch and Cut Algorithm to Design. LDPC Codes without Small Cycles in. Communication Systems

A Branch and Cut Algorithm to Design. LDPC Codes without Small Cycles in. Communication Systems A Branch and Cut Agorithm to Design LDPC Codes without Sma Cyces in Communication Systems arxiv:1709.09936v1 [cs.it] 28 Sep 2017 Banu Kabakuak 1, Z. Caner Taşkın 1, and Ai Emre Pusane 2 1 Department of

More information

Separation of Variables and a Spherical Shell with Surface Charge

Separation of Variables and a Spherical Shell with Surface Charge Separation of Variabes and a Spherica She with Surface Charge In cass we worked out the eectrostatic potentia due to a spherica she of radius R with a surface charge density σθ = σ cos θ. This cacuation

More information

New Efficiency Results for Makespan Cost Sharing

New Efficiency Results for Makespan Cost Sharing New Efficiency Resuts for Makespan Cost Sharing Yvonne Beischwitz a, Forian Schoppmann a, a University of Paderborn, Department of Computer Science Fürstenaee, 3302 Paderborn, Germany Abstract In the context

More information

12.2. Maxima and Minima. Introduction. Prerequisites. Learning Outcomes

12.2. Maxima and Minima. Introduction. Prerequisites. Learning Outcomes Maima and Minima 1. Introduction In this Section we anayse curves in the oca neighbourhood of a stationary point and, from this anaysis, deduce necessary conditions satisfied by oca maima and oca minima.

More information

A Solution to the 4-bit Parity Problem with a Single Quaternary Neuron

A Solution to the 4-bit Parity Problem with a Single Quaternary Neuron Neura Information Processing - Letters and Reviews Vo. 5, No. 2, November 2004 LETTER A Soution to the 4-bit Parity Probem with a Singe Quaternary Neuron Tohru Nitta Nationa Institute of Advanced Industria

More information

arxiv: v1 [quant-ph] 23 Jun 2017

arxiv: v1 [quant-ph] 23 Jun 2017 A generic construction of quantum-obivious-key-transfer-based private query with idea database security and zero faiure Chun-Yan Wei 1,2, Xiao-Qiu Cai 2, Bin Liu 3, Tian-Yin Wang 1, and Fei Gao 2 1 Coege

More information

Multilayer Kerceptron

Multilayer Kerceptron Mutiayer Kerceptron Zotán Szabó, András Lőrincz Department of Information Systems, Facuty of Informatics Eötvös Loránd University Pázmány Péter sétány 1/C H-1117, Budapest, Hungary e-mai: szzoi@csetehu,

More information

DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM

DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM MIKAEL NILSSON, MATTIAS DAHL AND INGVAR CLAESSON Bekinge Institute of Technoogy Department of Teecommunications and Signa Processing

More information

Candidate Number. General Certificate of Education Advanced Level Examination June 2010

Candidate Number. General Certificate of Education Advanced Level Examination June 2010 Centre Number Surname Candidate Number For Examiner s Use Other Names Candidate Signature Examiner s Initias Genera Certificate of Education Advanced Leve Examination June 2010 Question 1 2 Mark Physics

More information

Recursive Constructions of Parallel FIFO and LIFO Queues with Switched Delay Lines

Recursive Constructions of Parallel FIFO and LIFO Queues with Switched Delay Lines Recursive Constructions of Parae FIFO and LIFO Queues with Switched Deay Lines Po-Kai Huang, Cheng-Shang Chang, Feow, IEEE, Jay Cheng, Member, IEEE, and Duan-Shin Lee, Senior Member, IEEE Abstract One

More information