Efficient Parallel Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints

1 Efficient Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints
Yu Jiang, Jiannan Wang, Guoliang Li, and Jianhua Feng
Tsinghua University
Similarity Search & Join Competition on EDBT/ICDT 2013

2 Outline
1. Problem Definition, Application
2.
3. Evaluating Pruning Techniques, Evaluating Parallelism, Evaluating Scalability

3 Problem Definition: String Similarity Joins
Given a set of strings S, the task is to find all pairs of τ-similar strings in S, i.e., pairs whose edit distance is at most τ.
A program must output all matches with both string identifiers and distance τ (Track II).

4 An Example
Table 1: A string dataset
ID    String               Length
s1    vankatesh            9
s2    avataresha           10
s3    kaushic chaduri      15
s4    kaushik chakrab      15
s5    kaushuk chadhui      15
s6    caushik chakrabar    17
Consider the string dataset in Table 1. Suppose τ = 3. ⟨s4, s6⟩ is a similar pair, as ED(s4, s6) ≤ τ.
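
The claim ED(s4, s6) ≤ τ can be checked with the textbook dynamic-programming edit distance. The sketch below is only an illustration; the function name `edit_distance` and the dictionary `strings` are assumptions introduced here, not names from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))          # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# The dataset from the table on this slide.
strings = {
    "s1": "vankatesh",
    "s2": "avataresha",
    "s3": "kaushic chaduri",
    "s4": "kaushik chakrab",
    "s5": "kaushuk chadhui",
    "s6": "caushik chakrabar",
}

tau = 3
print(edit_distance(strings["s4"], strings["s6"]) <= tau)  # True
```

One substitution (k to c) and two appended characters (a, r) turn s4 into s6, so the distance is exactly 3.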

5 Application
Data cleaning
Information extraction
Comparison of biological sequences
...

6 Basic Idea
Lemma: Given a string r with τ + 1 segments and a string s, if s is similar to r within threshold τ, s must contain a segment of r.
Example: τ = 1, r = EDBT has two segments, ED and BT. s = ICDT cannot be similar to r, as s contains neither of the two segments.
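
The lemma is a pigeonhole argument: τ edit operations can alter at most τ of the τ + 1 segments, so at least one segment survives unchanged in s. A minimal sketch of the resulting filter follows; the helper name `contains_any_segment` and the string "EDBS" are hypothetical and used only for illustration.

```python
def contains_any_segment(segments, s):
    """True if s contains at least one segment of r as a substring."""
    return any(seg in s for seg in segments)


r_segments = ["ED", "BT"]   # the tau + 1 = 2 segments of r = "EDBT" (tau = 1)

print(contains_any_segment(r_segments, "ICDT"))  # False: ICDT is pruned
print(contains_any_segment(r_segments, "EDBS"))  # True: a candidate, still to be verified
```

A pair that passes this test is only a candidate; it still has to be verified with an edit distance computation.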

7 Even Partition Scheme
Definition: In the even partition scheme, each segment has almost the same length (⌊|s|/(τ+1)⌋ or ⌈|s|/(τ+1)⌉).
Example: τ = 3, we partition s1 = vankatesh into four segments: va, nk, at, esh.
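
A possible way to compute such a partition is sketched below. It assumes that the longer segments are placed last, which reproduces the slide's example, and that the string has at least τ + 1 characters. The function name `even_partition` and the 0-based start positions are choices made for this sketch.

```python
def even_partition(s: str, tau: int):
    """Split s into tau + 1 segments; returns (start, segment) pairs with 0-based starts."""
    n_seg = tau + 1
    base, extra = divmod(len(s), n_seg)
    segments, pos = [], 0
    for i in range(n_seg):
        # Assumed convention: the last `extra` segments get one extra character.
        length = base + (1 if i >= n_seg - extra else 0)
        segments.append((pos, s[pos:pos + length]))
        pos += length
    return segments


print(even_partition("vankatesh", 3))
# [(0, 'va'), (2, 'nk'), (4, 'at'), (6, 'esh')]
```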

8 Substring Selection: Basic Methods
Enumeration: enumerate all substrings for each segment.
Length-based: for each segment, select only substrings with the same length.
Shift-based: for a segment with start position p_i, select substrings with start positions in [p_i − τ, p_i + τ].
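
The length-based and shift-based methods can be sketched as follows. Positions are 0-based here, which is an assumption of this sketch, and the function names are hypothetical. The example probes s = avataresha with the segment nk of r = vankatesh (start 2, length 2).

```python
def length_based(s, seg_len):
    """All substrings of s that have the same length as the segment."""
    return [(p, s[p:p + seg_len]) for p in range(len(s) - seg_len + 1)]


def shift_based(s, seg_start, seg_len, tau):
    """Substrings of length seg_len that start within tau positions of seg_start."""
    lo = max(0, seg_start - tau)
    hi = min(len(s) - seg_len, seg_start + tau)
    return [(p, s[p:p + seg_len]) for p in range(lo, hi + 1)]


s = "avataresha"
print(len(length_based(s, 2)))          # 9 substrings
print(len(shift_based(s, 2, 2, 3)))     # 6 substrings
```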

9 Substring Selection: Position-aware Substring Selection
Theorem (Position-aware Substring Selection): For a segment with start position p_i, select substrings with start positions in [p_i − ⌊(τ − Δ)/2⌋, p_i + ⌊(τ + Δ)/2⌋], where Δ = |s| − |r|.

11 Substring Selection: Position-aware Substring Selection (Example)
τ = 3, Δ = 1: [p_i − ⌊(τ − Δ)/2⌋, p_i + ⌊(τ + Δ)/2⌋] = [p_i − 1, p_i + 2]
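
The window [p_i − ⌊(τ − Δ)/2⌋, p_i + ⌊(τ + Δ)/2⌋] with Δ = |s| − |r| can be turned into a small selection routine. The sketch assumes 0 ≤ Δ ≤ τ and 0-based positions; `position_aware` is a name chosen here, not one from the slides.

```python
def position_aware(s, seg_start, seg_len, tau, r_len):
    """Substrings of s to probe for a segment of r that starts at seg_start.

    Assumes 0 <= len(s) - r_len <= tau; positions are 0-based.
    """
    delta = len(s) - r_len
    lo = max(0, seg_start - (tau - delta) // 2)
    hi = min(len(s) - seg_len, seg_start + (tau + delta) // 2)
    return [(p, s[p:p + seg_len]) for p in range(lo, hi + 1)]


# Segment "nk" of r = "vankatesh" (start 2, length 2), probed in s = "avataresha".
print(position_aware("avataresha", 2, 2, tau=3, r_len=9))
# [(1, 'va'), (2, 'at'), (3, 'ta'), (4, 'ar')]
```

With τ = 3 and Δ = 1 the window is [p_i − 1, p_i + 2], so four substrings are probed instead of the six that a ±τ shift window gives for the same segment.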

12 Substring Selection: Multi-match-aware Substring Selection
Observation: There must be another matching between r_r and s_r.
Theorem (Multi-match-aware Substring Selection): For the i-th segment with start position p_i, select substrings with start positions in [p_i − i, p_i + i] ∩ [p_i + Δ − (τ + 1 − i), p_i + Δ + (τ + 1 − i)].

14 Substring Selection Multi-match-aware Substring Selection Example
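
As a stand-in for the example, the sketch below evaluates the multi-match-aware window as an intersection of a left-side interval [p_i − i, p_i + i] and a right-side interval [p_i + Δ − (τ + 1 − i), p_i + Δ + (τ + 1 − i)], with Δ = |s| − |r|. It assumes that i is counted from 1 and that positions are 0-based, and it does not clamp the window to the bounds of s. The function name `multi_match_window` is hypothetical.

```python
def multi_match_window(p_i, i, tau, delta):
    """Inclusive (lo, hi) range of start positions for the i-th segment, or None if empty.

    i is assumed to be 1-based; delta = len(s) - len(r).
    """
    left = (p_i - i, p_i + i)
    right = (p_i + delta - (tau + 1 - i), p_i + delta + (tau + 1 - i))
    lo, hi = max(left[0], right[0]), min(left[1], right[1])
    return (lo, hi) if lo <= hi else None


# Segments of r = "vankatesh" start at 0, 2, 4, 6; s = "avataresha", so delta = 1.
tau, delta = 3, 1
for i, p_i in enumerate([0, 2, 4, 6], start=1):
    print(i, multi_match_window(p_i, i, tau, delta))
```

For the last segment the right-side interval collapses to a single position, which is where most of the saving over a fixed-width window comes from.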

15 Substring Selection: Theoretical Results
1. The number of substrings selected by the multi-match-aware method is minimum.
2. For strings longer than 2(τ + 1), our selection method is the only way to select the minimum number of substrings.

16 Substring Selection: Experimental Results
Figure: Numbers of selected substrings. Panels: (a) Author Name (Avg Len = 15), (b) Query Log (Avg Len = 45), (c) Author+Title (Avg Len = 105). Series: Length, Shift, Position, Multi-Match. Axes: # of selected substrings (log scale) vs. threshold τ.

17 Substring Selection: Experimental Results
Figure: Elapsed time for generating substrings. Panels: (a) Author Name (Avg Len = 15), (b) Query Log (Avg Len = 45), (c) Author+Title (Avg Len = 105). Series: Length, Shift, Position, Multi-Match. Axes: selection time (s) vs. threshold τ.

18 Verification: Length-aware Verification
Inspired by the position-aware substring selection.
Saves at least half of the computation compared with the traditional dynamic programming method.
Saves even more with improved early termination.
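
One way to realize this idea is a banded dynamic program that only fills cells (i, j) with i − ⌊(τ − Δ)/2⌋ ≤ j ≤ i + ⌊(τ + Δ)/2⌋, where Δ is the length difference, mirroring the position-aware window. The sketch below is an illustration under that assumption. The early-termination test, which adds the remaining length difference to each cell value, is a plausible lower-bound check and not necessarily the exact rule the authors use. The name `length_aware_ed` is hypothetical.

```python
def length_aware_ed(r: str, s: str, tau: int):
    """Return ED(r, s) if it is at most tau, otherwise None."""
    if len(r) > len(s):
        r, s = s, r                       # make s the longer string
    delta = len(s) - len(r)
    if delta > tau:
        return None                       # length filter
    left, right = (tau - delta) // 2, (tau + delta) // 2
    INF = tau + 1                         # stands for "more than tau"
    prev = {j: j for j in range(min(len(s), right) + 1)}      # row i = 0
    for i in range(1, len(r) + 1):
        curr = {}
        for j in range(max(0, i - left), min(len(s), i + right) + 1):
            if j == 0:
                curr[j] = i
                continue
            cost = 0 if r[i - 1] == s[j - 1] else 1
            curr[j] = min(prev.get(j, INF) + 1,         # cells outside the band
                          curr.get(j - 1, INF) + 1,     # count as "more than tau"
                          prev.get(j - 1, INF) + cost)
        # Early termination: every cell, plus the length difference of the
        # remaining suffixes, already exceeds tau.
        if all(v + abs((len(s) - j) - (len(r) - i)) > tau for j, v in curr.items()):
            return None
        prev = curr
    d = prev.get(len(s), INF)
    return d if d <= tau else None


print(length_aware_ed("kaushik chakrab", "caushik chakrabar", 3))  # 3
print(length_aware_ed("kitten", "sitting", 2))                     # None
```

Each row holds at most τ + 1 cells, compared with 2τ + 1 cells for a symmetric band of width τ around the diagonal.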

21 Verification: Extension-based Verification
Inspired by the multi-match-aware substring selection.
Uses tighter thresholds to verify the candidate pairs.
Verify whether ED(r_r, s_r) ≤ τ + 1 − i and ED(r_l, s_l) ≤ i − 1.
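
The two conditions can be sketched as follows: once the i-th segment of r matches a substring of s, the parts to the left of the match are verified with threshold i − 1 and the parts to the right with threshold τ + 1 − i. The function names, the 0-based match positions `p` and `q`, and the use of a plain (unbounded) edit distance are simplifications made for this sketch.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance; a thresholded variant would be used in practice."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def extension_verify(r, s, p, q, seg_len, i, tau):
    """r[p:p+seg_len] == s[q:q+seg_len] is the matched i-th segment of r (i is 1-based)."""
    left_ok = edit_distance(r[:p], s[:q]) <= i - 1
    right_ok = edit_distance(r[p + seg_len:], s[q + seg_len:]) <= tau + 1 - i
    return left_ok and right_ok


r, s, tau = "kaushik chakrab", "caushik chakrabar", 3
# Segments of r under even partition: "kau", "shik", " cha", "krab".
print(extension_verify(r, s, p=3, q=3, seg_len=4, i=2, tau=tau))  # True
print(extension_verify(r, s, p=7, q=7, seg_len=4, i=3, tau=tau))  # False
```

If both conditions hold, the left and right distances sum to at most τ, so the pair is a result. In this example the match on the third segment is rejected, but the pair is still found through the match on the second segment.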

24 Verification: Experimental Results
Figure: Elapsed time for verification. Panels: (a) Author Name (Avg Len 15), (b) Query Log (Avg Len 45), (c) Author+Title (Avg Len 105). Series: τ+1, τ+1, Extension, SharePrefix. Axes: elapsed time (s) vs. threshold τ.

25 Effective Indexing Strategy
Partition longer strings into segments.
Select substrings from shorter strings.
Longer segments decrease the probability of matching and thus the number of candidates.

29 Content Filter
Observation
Let H_r denote the character frequency vector of r.
Example: r = abyyyy, s = axxyyyxy. H_r = {{a, 1}, {b, 1}, {y, 4}}, H_s = {{a, 1}, {x, 3}, {y, 4}}.
Let ΔH = |H_r − H_s|, the sum of the absolute frequency differences. Here ΔH = 4 (b differs by 1 and x differs by 3).
A deletion or insertion changes ΔH by at most 1.
A substitution changes ΔH by at most 2.
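
The frequency-vector difference in this example can be computed directly. The sketch below uses `collections.Counter` as the frequency vector and sums the absolute per-character differences; the function name `content_distance` is an assumption of this sketch.

```python
from collections import Counter


def content_distance(r: str, s: str) -> int:
    """Sum of absolute differences between the character frequency vectors of r and s."""
    h_r, h_s = Counter(r), Counter(s)
    return sum(abs(h_r[c] - h_s[c]) for c in h_r.keys() | h_s.keys())


print(content_distance("abyyyy", "axxyyyxy"))  # 4
```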

33 Content Filter
Observation
With at most τ edit operations, ΔH ≤ 2τ.
With at most τ − ||r| − |s|| substitutions, ΔH ≤ 2τ − ||r| − |s||.
Group symbols to improve the content-filter running time.
Integrate the content filter with the extension-based verification.
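
The tighter bound follows from a short count: at least ||r| − |s|| of the τ operations must be insertions or deletions, each changing the frequency difference by at most 1, and the remaining operations change it by at most 2 each, giving 2τ − ||r| − |s||. A minimal sketch of the resulting filter is shown below; the function names are hypothetical, and symbol grouping is not included.

```python
from collections import Counter


def content_distance(r: str, s: str) -> int:
    """Sum of absolute differences between the character frequency vectors."""
    h_r, h_s = Counter(r), Counter(s)
    return sum(abs(h_r[c] - h_s[c]) for c in h_r.keys() | h_s.keys())


def passes_content_filter(r: str, s: str, tau: int) -> bool:
    """False means the pair cannot be within edit distance tau and can be pruned."""
    length_diff = abs(len(r) - len(s))
    if length_diff > tau:
        return False
    return content_distance(r, s) <= 2 * tau - length_diff


print(passes_content_filter("abyyyy", "axxyyyxy", tau=2))  # False: 4 > 2*2 - 2
print(passes_content_filter("abyyyy", "axxyyyxy", tau=3))  # True:  4 <= 2*3 - 2
```

Passing the filter does not prove similarity; it only means the pair still needs verification.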

37
1. Sorting: group strings by length using an existing parallel algorithm.
2. Building indexes: build an index for each group.
3. Joins: perform similarity joins on each group.
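
The three steps above can be sketched as a task decomposition in which each length group is joined independently. This is only an illustration under several assumptions: the segment index of step 2 is replaced by brute-force verification to keep the sketch short, a thread pool stands in for whatever parallel runtime the authors used, and all names are hypothetical. In standard CPython, threads will not speed up this pure-Python computation; the sketch only shows how the work is split.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def join_group(length, groups, tau):
    """Join strings of one length against groups of equal or smaller length."""
    results = []
    for sid, s in groups[length]:
        for other_len in range(max(0, length - tau), length + 1):
            for rid, r in groups.get(other_len, []):
                # Within the same group, report each unordered pair once.
                if other_len < length or rid < sid:
                    d = edit_distance(r, s)   # stands in for index probe + verification
                    if d <= tau:
                        results.append((rid, sid, d))
    return results


def parallel_join(strings, tau, workers=4):
    groups = defaultdict(list)
    for sid, s in enumerate(strings):          # step 1: group strings by length
        groups[len(s)].append((sid, s))
    groups = dict(groups)
    with ThreadPoolExecutor(max_workers=workers) as pool:   # step 3: one task per group
        parts = list(pool.map(lambda n: join_group(n, groups, tau), sorted(groups)))
    return sorted(pair for part in parts for pair in part)


data = ["vankatesh", "avataresha", "kaushic chaduri",
        "kaushik chakrab", "kaushuk chadhui", "caushik chakrabar"]
print(parallel_join(data, tau=3))
```

Because a string of length n is only compared with strings of length n − τ to n, the groups form independent tasks and no pair is reported twice.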

38 Setup
Table: Datasets (columns: cardinality, average len, max len, min len)
GeoNames          400,
GeoNames Query    100,
Reads             750,
Reads Query       100,

39 Setup
Figure: Length distribution. Panels: (a) GeoNames, (b) Reads. Axes: number of strings vs. string length.

40 Evaluating Pruning Techniques
Figure: Evaluating pruning techniques for similarity joins (8 threads). Panels: (a) GeoNames, (b) Reads. Series: Basic, Content, Longer, ParaJoin. Axes: elapsed time (s) vs. edit distance threshold.

41 Evaluating Pruning Techniques
Figure: Evaluating pruning techniques for similarity search (8 threads). Panels: (a) GeoNames, (b) Reads. Series: BasicSearch, ParaSearch. Axes: elapsed time (s) vs. edit distance threshold.

42 Evaluating Parallelism
Figure: Evaluating running time of similarity join by varying number of threads. Panels: (a) GeoNames (tau=4, tau=3, tau=2, tau=), (b) Reads (tau=16, tau=12, tau=8, tau=). Axes: elapsed time (s) vs. number of threads.

43 Evaluating Speedup
Figure: Evaluating speedup of similarity join. Panels: (a) GeoNames (tau=4, tau=3, tau=2, tau=1, Ideal), (b) Reads (tau=16, tau=12, tau=8, tau=4, Ideal). Axes: speedup vs. number of threads.

44 Evaluating Parallelism
Figure: Evaluating running time of similarity search by varying number of threads. Panels: (a) GeoNames (tau=4, tau=3, tau=2, tau=), (b) Reads (tau=16, tau=12, tau=8, tau=). Axes: elapsed time (s) vs. number of threads.

45 Evaluating Speedup
Figure: Evaluating speedup of similarity search. Panels: (a) GeoNames (tau=4, tau=3, tau=2, tau=1, Ideal), (b) Reads (tau=16, tau=12, tau=8, tau=4, Ideal). Axes: speedup vs. number of threads.

46 Evaluating Scalability
Figure: Evaluating the scalability of the similarity join algorithm (8 threads). Panels: (a) GeoNames (tau=4, tau=3, tau=2, tau=), (b) Reads (tau=16, tau=12, tau=8, tau=). Axes: elapsed time (s) vs. number of strings (×1,000,000).

47 Evaluating Scalability
Figure: Evaluating the scalability of the similarity search algorithm (8 threads). Panels: (a) GeoNames (tau=4, tau=3, tau=2, tau=), (b) Reads (tau=16, tau=12, tau=8, tau=). Axes: elapsed time (s) vs. number of strings (×1,000,000).

48 Appendix: About our team I
We are from Tsinghua University, Beijing, China.
Yu Jiang, Jiannan Wang, Guoliang Li, Jianhua Feng and.

50 Appendix: Thank You
Q & A
Pass-Join: A Partition based Method for Similarity Joins. Guoliang Li, Jiannan Wang, Jianhua Feng. VLDB 2012.
