Efficient Parallel Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints

Similar documents
An Efficient Partition Based Method for Exact Set Similarity Joins

MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance

META: An Efficient Matching-Based Method for Error-Tolerant Autocompletion

An Efficient Sliding Window Approach for Approximate Entity Extraction with Synonyms

Efficient Approximate Entity Matching Using Jaro-Winkler Distance

Multi-Approximate-Keyword Routing Query

Outline. Approximation: Theory and Algorithms. Application Scenario. 3 The q-gram Distance. Nikolaus Augsten. Definition and Properties

Chapter 6. Frequent Pattern Mining: Concepts and Apriori. Meng Jiang CSE 40647/60647 Data Science Fall 2017 Introduction to Data Mining

Enumeration and symmetry of edit metric spaces. Jessie Katherine Campbell. A dissertation submitted to the graduate faculty

PartSS: An Efficient Partition-based Filtering for Edit Distance Constraints

A Transformation-based Framework for KNN Set Similarity Search

Spatial Database. Ahmad Alhilal, Dimitris Tsaras

Similarity Joins for Uncertain Strings

Statistical Substring Reduction in Linear Time

Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts

ParaGraphE: A Library for Parallel Knowledge Graph Embedding

CSE 5243 INTRO. TO DATA MINING

Efficient Haplotype Inference with Boolean Satisfiability

Two notes on subshifts

TASM: Top-k Approximate Subtree Matching

Scalable Processing of Snapshot and Continuous Nearest-Neighbor Queries over One-Dimensional Uncertain Data

Two Birds With One Stone: An Efficient Hierarchical Framework for Top-k and Threshold-based String Similarity Search

Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques

Window-aware Load Shedding for Aggregation Queries over Data Streams

Mining Positive and Negative Fuzzy Association Rules

Finding Pareto Optimal Groups: Group based Skyline

CS 347 Parallel and Distributed Data Processing

Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques

Finding High-Order Correlations in High-Dimensional Biological Data

Database Systems CSE 514

Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VCF Files On-line

Analysis and Design of Algorithms Dynamic Programming

PetaBricks: Variable Accuracy and Online Learning

On the Monotonicity of the String Correction Factor for Words with Mismatches

Improved Hamming Distance Search using Variable Length Hashing

Crowdsourcing Pareto-Optimal Object Finding by Pairwise Comparisons

Decision Diagrams: Tutorial

Searching Dimension Incomplete Databases

Machine Learning 3. week

arxiv: v1 [stat.ml] 23 Oct 2016

Constraint-based Subspace Clustering

Discovering Most Classificatory Patterns for Very Expressive Pattern Classes

ECS289: Scalable Machine Learning

Pavel Zezula, Giuseppe Amato,

Online Appendix for Discovery of Periodic Patterns in Sequence Data: A Variance Based Approach

Notes for Comp 497 (Comp 454) Week 12 4/19/05. Today we look at some variations on machines we have already seen. Chapter 21

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007.

Optical Character Recognition of Jutakshars within Devanagari Script

D B M G Data Base and Data Mining Group of Politecnico di Torino

The Inclusion Exclusion Principle

Association Rules. Fundamentals

LEARNING SPARSE STRUCTURED ENSEMBLES WITH STOCASTIC GTADIENT MCMC SAMPLING AND NETWORK PRUNING

D B M G. Association Rules. Fundamentals. Fundamentals. Elena Baralis, Silvia Chiusano. Politecnico di Torino 1. Definitions.

D B M G. Association Rules. Fundamentals. Fundamentals. Association rules. Association rule mining. Definitions. Rule quality metrics: example

Multiple-Site Distributed Spatial Query Optimization using Spatial Semijoins

Bloom Filters, Minhashes, and Other Random Stuff

ONLINE SCHEDULING OF MALLEABLE PARALLEL JOBS

Binary Decision Diagrams and Symbolic Model Checking

Cell-Probe Proofs and Nondeterministic Cell-Probe Complexity

Efficient Longest Common Subsequence Computation using Bulk-Synchronous Parallelism

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline

Association Rules. Acknowledgements. Some parts of these slides are modified from. n C. Clifton & W. Aref, Purdue University

Behavioral Simulations in MapReduce

CSE 5243 INTRO. TO DATA MINING

Reorganized and Compact DFA for Efficient Regular Expression Matching

Height, Size Performance of Complete and Nearly Complete Binary Search Trees in Dictionary Applications

ICM-Chemist How-To Guide. Version 3.6-1g Last Updated 12/01/2009

Chapter 2: Finite Automata

Open Access A New Optimization Algorithm for Checking and Sorting Project Schedules

Associa'on Rule Mining

CONSTRUCTION PROBLEMS

Did you know that Multiple Alignment is NP-hard? Isaac Elias Royal Institute of Technology Sweden

A metric approach for. comparing DNA sequences

Compressed Index for Dynamic Text

Average Case Analysis of QuickSort and Insertion Tree Height using Incompressibility

Filtering with the Crowd

OBLIVIOUS STRING EMBEDDINGS AND EDIT DISTANCE APPROXIMATIONS

FROM QUERIES TO TOP-K RESULTS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

Implementing Approximate Regularities

arxiv: v1 [cs.db] 2 Sep 2014

Minimizing Clock Latency Range in Robust Clock Tree Synthesis

Disconnecting Networks via Node Deletions

Numerical Characterization of Multi-Dielectric Green s Function for 3-D Capacitance Extraction with Floating Random Walk Algorithm

A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS

Data Analytics Beyond OLAP. Prof. Yanlei Diao

Unsupervised Vocabulary Induction

0-1 Knapsack Problem in parallel Progetto del corso di Calcolo Parallelo AA

Su Liu 1, Alexandros Papakonstantinou 2, Hongjun Wang 1,DemingChen 2

Mining Emerging Substrings

Bio nformatics. Lecture 3. Saad Mneimneh

Counting Palindromic Binary Strings Without r-runs of Ones

Trace Reconstruction Revisited

Skylines. Yufei Tao. ITEE University of Queensland. INFS4205/7205, Uni of Queensland

Binary Convolutional Neural Network on RRAM

A Laplacian of Gaussian-based Approach for Spot Detection in Two-Dimensional Gel Electrophoresis Images

SP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay

DATA MINING LECTURE 3. Frequent Itemsets Association Rules

Photometry of Supernovae with Makali i

Computability Theory

Induction of Decision Trees

Transcription:

Efficient Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints Yu Jiang,, Jiannan Wang, Guoliang Li, and Jianhua Feng Tsinghua University Similarity Search&Join Competition on EDBT/ICDT 2013

Outline 1 Problem Definition Application 2 3 Evaluating Pruning Techniques Evaluating ism Evaluating Scalability

Problem Definition STRING SIMILARITY JOINS Problem Definition Application Given a set of strings S, the task is to find all pairs of τ-similar strings from S. A program must output all matches with both string identifiers and distance τ.(track II)

An Example Problem Definition Application Table: A string dataset ID Strings Length s 1 vankatesh 9 s 2 avataresha 10 s 3 kaushic chaduri 15 s 4 kaushik chakrab 15 s 5 kaushuk chadhui 15 s 6 caushik chakrabar 17 Consider the string dataset in Table 1. Suppose τ = 3. s 4, s 6 is a similar pair as ED(s 4, s 6 ) τ

Application Problem Definition Application Data cleaning Information Extraction Comparison of biological sequences...

Basic Idea Lemma Given a string r with τ + 1 segments and a string s, if s is similar to r within threshold τ, s must contain a segment of r. Example τ = 1, r = EDBT has two segments ED and BT. s = ICDT cannot similar to r as s contains none of the two segemtns.

Even Partition Scheme Definition In even partition scheme, each segment has almost the same length. ( s s τ+1 or τ+1 ) Example τ = 3, we partition s 1 = vankatesh into four segments va, nk, at, esh.

Substring Selection Basic Methods Enumeration: Enumerate all substrings for each of the segment. Length-based: For each segment, only select substrings with same length. Shift-based: For segment with start position p i, select substrings with start position in [p i τ, p i + τ]

Substring Selection Position-aware Substring Selection Observation Theorem (Position-aware Substring Selection) For segment with start position p i, select substrings with start position in [p i τ 2, p i + τ+ 2 ] where = s r.

Substring Selection Position-aware Substring Selection Observation Theorem (Position-aware Substring Selection) For segment with start position p i, select substrings with start position in [p i τ 2, p i + τ+ 2 ] where = s r.

Substring Selection Position-aware Substring Selection Example τ = 3, = 1, [p i τ 2, p i + τ+ 2 ] = [p i 1, p i + 2]

Substring Selection Multi-match-aware Substring Selection Observation There must be another matching between r r and s r. Theorem (Multi-match-aware Substring Selection) For the i-th segment with start position p i, select substrings within [p i i, p i +i] [p i + (τ+1 i), p i + +(τ+1 i)].

Substring Selection Multi-match-aware Substring Selection Observation There must be another matching between r r and s r. Theorem (Multi-match-aware Substring Selection) For the i-th segment with start position p i, select substrings within [p i i, p i +i] [p i + (τ+1 i), p i + +(τ+1 i)].

Substring Selection Multi-match-aware Substring Selection Example

Substring Selection Theoretical Results 1 The number of selected substrings by the multi-match-aware method is minimum. 2 For strings longer than 2 (τ + 1), our selection method is the only way to select minimum number of substrings.

Substring Selection al Results # of selected substrings 1e+009 1e+008 1e+007 1e+006 Length Shift Positon Multi-Match 1 2 3 4 Threshold τ (a) Author Name (Avg Len = 15) # of selected substrings 1e+010 1e+009 1e+008 1e+007 1e+006 Length Shift Positon Multi-Match 4 5 6 7 8 Threshold τ (b) Query Log (Avg Len = 45) # of selected substrings 1e+011 1e+010 1e+009 1e+008 1e+007 Length Shift Positon Multi-Match 5 6 7 8 9 10 Threshold τ (c) Author+Title (Avg Len = 105) Figure: Numbers of selected substrings

Substring Selection al Results Selection Time (s) 100 10 1 0.1 Length Shift Positon Multi-Match 1 2 3 4 Threshold τ (a) Author Name (Avg Len = 15) Selection Time (s) 1000 100 10 1 Length Shift Positon Multi-Match 4 5 6 7 8 Threshold τ (b) Query Log (Avg Len = 45) Selection Time (s) 10000 1000 100 10 1 Length Shift Positon Multi-Match 5 6 7 8 9 10 Threshold τ (c) Author+Title (Avg Len = 105) Figure: Elapsed time for generating substrings

Verification Length-aware Verification Inspired by the position-aware substring selection. Save at least half computation than traditional dynamic method. Save even more using improved early termination.

Verification Length-aware Verification Inspired by the position-aware substring selection. Save at least half computation than traditional dynamic method. Save even more using improved early termination.

Verification Length-aware Verification Inspired by the position-aware substring selection. Save at least half computation than traditional dynamic method. Save even more using improved early termination.

Verification Extension-based Verification Inspired by the multi-match-aware substring selection. Using tighter thresholds to verify the candidate pairs. Verify if ED(r r, s r ) τ + 1 i and ED(r l, s l ) i 1.

Verification Extension-based Verification Inspired by the multi-match-aware substring selection. Using tighter thresholds to verify the candidate pairs. Verify if ED(r r, s r ) τ + 1 i and ED(r l, s l ) i 1.

Verification Extension-based Verification Inspired by the multi-match-aware substring selection. Using tighter thresholds to verify the candidate pairs. Verify if ED(r r, s r ) τ + 1 i and ED(r l, s l ) i 1.

Verification al Results Elapsed Time (s) 100000 10000 1000 100 10 2τ+1 τ+1 Extension SharePrefix Elapsed Time (s) 10000 1000 100 2τ+1 τ+1 Extension SharePrefix Elapsed Time (s) 10000 1000 100 2τ+1 τ+1 Extension SharePrefix 1 1 2 3 4 10 4 5 6 7 8 10 5 6 7 8 9 10 Threshold τ Threshold τ Threshold τ (a) Author Name (Avg Len 15) (b) Query Log (Avg Len 45) (c) Author+Title (Avg Len 105) Figure: Elapsed time for verification

Effective Indexing Strategy Partition longer strings into segments. Select substrings from shorter strings. Longer segments decrease the possibility of matching. Thus decrease the number of candidates.

Effective Indexing Strategy Partition longer strings into segments. Select substrings from shorter strings. Longer segments decrease the possibility of matching. Thus decrease the number of candidates.

Effective Indexing Strategy Partition longer strings into segments. Select substrings from shorter strings. Longer segments decrease the possibility of matching. Thus decrease the number of candidates.

Effective Indexing Strategy Partition longer strings into segments. Select substrings from shorter strings. Longer segments decrease the possibility of matching. Thus decrease the number of candidates.

Content Filter Observation Let H r denote the character frequency vector of r. r = abyyyy, s = axxyyyxy. H r = {{a, 1}, {b, 1}, {y, 4}}, H s = {{a, 1}, {x, 3}, {y, 4}} Let H = H r H s. H = H r H s = 1 + 3 = 4. A deletion or insertion changes H by 1 at most. An substitution changes H by 2 at most.

Content Filter Observation Let H r denote the character frequency vector of r. r = abyyyy, s = axxyyyxy. H r = {{a, 1}, {b, 1}, {y, 4}}, H s = {{a, 1}, {x, 3}, {y, 4}} Let H = H r H s. H = H r H s = 1 + 3 = 4. A deletion or insertion changes H by 1 at most. An substitution changes H by 2 at most.

Content Filter Observation Let H r denote the character frequency vector of r. r = abyyyy, s = axxyyyxy. H r = {{a, 1}, {b, 1}, {y, 4}}, H s = {{a, 1}, {x, 3}, {y, 4}} Let H = H r H s. H = H r H s = 1 + 3 = 4. A deletion or insertion changes H by 1 at most. An substitution changes H by 2 at most.

Content Filter Observation Let H r denote the character frequency vector of r. r = abyyyy, s = axxyyyxy. H r = {{a, 1}, {b, 1}, {y, 4}}, H s = {{a, 1}, {x, 3}, {y, 4}} Let H = H r H s. H = H r H s = 1 + 3 = 4. A deletion or insertion changes H by 1 at most. An substitution changes H by 2 at most.

Content Filter Observation At most τ edit operations, H 2τ. At most τ r s substitutions, H 2τ r s. Group symbols to improve the content-filter running time. Integrate the content filter with the extension-based verification.

Content Filter Observation At most τ edit operations, H 2τ. At most τ r s substitutions, H 2τ r s. Group symbols to improve the content-filter running time. Integrate the content filter with the extension-based verification.

Content Filter Observation At most τ edit operations, H 2τ. At most τ r s substitutions, H 2τ r s. Group symbols to improve the content-filter running time. Integrate the content filter with the extension-based verification.

Content Filter Observation At most τ edit operations, H 2τ. At most τ r s substitutions, H 2τ r s. Group symbols to improve the content-filter running time. Integrate the content filter with the extension-based verification.

1 Sorting. Group strings by lengths using existing parallel algorithm. 2 Building Indexes. building indexes for each group. 3 Joins. perform similarity joins on each groups.

Setup Evaluating Pruning Techniques Evaluating ism Evaluating Scalability Table: Datasets Datasets cardinality average len max len min len GeoNames 400,000 11.106 1 60 GeoNames Query 100,000 10.7 2 43 Reads 750,000 101.388 86 106 Reads Query 100,000 101.2 88 116

Setup Evaluating Pruning Techniques Evaluating ism Evaluating Scalability Numbers of strings 50000 40000 30000 20000 10000 0 0 10 20 30 40 50 60 String Lengths (a) GeoNames Numbers of strings 400000 300000 200000 100000 0 85 90 95 100 105 String Lengths (b) Reads Figure: Length Distribution.

Evaluating Pruning Techniques Evaluating Pruning Techniques Evaluating ism Evaluating Scalability Elapsed Time (s) 50 40 30 20 10 Basic Content Longer ParaJoin 0 1 2 3 4 Edit Distance Threshold (a) GeoNames Elapsed Time (s) 800 600 400 200 Basic Content Longer ParaJoin 0 4 8 12 16 Edit Distance Threshold (b) Reads Figure: Evaluating pruning techniques for similarity joins(8 threads).

Evaluating Pruning Techniques Evaluating Pruning Techniques Evaluating ism Evaluating Scalability Elapsed Time (s) 40 30 20 10 BasicSearch ParaSearch 0 1 2 3 4 Edit Distance Threshold (a) GeoNames Elapsed Time (s) 200 150 100 50 BasicSearch ParaSearch 0 4 8 12 16 Edit Distance Threshold (b) Reads Figure: Evaluating pruning techniques for similarity search(8 threads).

Evaluating ism Evaluating Pruning Techniques Evaluating ism Evaluating Scalability Elapsed Time (s) 100 80 60 40 20 tau=4 tau=3 tau=2 tau=1 0 2 4 6 8 Number of Threads (a) GeoNames Elapsed Time (s) 750 600 450 300 150 tau=16 tau=12 tau=8 tau=4 0 2 4 6 8 Number of Threads (b) Reads Figure: Evaluating running time of similarity join by varying number of threads.

Evaluating Speedup Evaluating Pruning Techniques Evaluating ism Evaluating Scalability Speedup 8 6 4 tau=4 tau=3 tau=2 tau=1 Ideal Speedup 8 6 4 tau=16 tau=12 tau=8 tau=4 Ideal 2 2 2 4 6 8 Number of Threads (a) GeoNames 2 4 6 8 Number of Threads (b) Reads Figure: Evaluating speedup of similarity join.

Evaluating ism Evaluating Pruning Techniques Evaluating ism Evaluating Scalability Elapsed Time (s) 150 120 90 60 30 tau=4 tau=3 tau=2 tau=1 0 2 4 6 8 Number of Threads (a) GeoNames Elapsed Time (s) 200 150 100 50 tau=16 tau=12 tau=8 tau=4 0 2 4 6 8 Number of Threads (b) Reads Figure: Evaluating running time of similarity search by varying number of threads.

Evaluating Speedup Evaluating Pruning Techniques Evaluating ism Evaluating Scalability Speedup 8 6 4 tau=4 tau=3 tau=2 tau=1 Ideal Speedup 8 6 4 tau=16 tau=12 tau=8 tau=4 Ideal 2 2 2 4 6 8 Number of Threads (a) GeoNames 2 4 6 8 Number of Threads (b) Reads Figure: Evaluating speedup of similarity search.

Evaluating Scalability Evaluating Pruning Techniques Evaluating ism Evaluating Scalability Elapsed Time (s) 120 80 40 tau=4 tau=3 tau=2 tau=1 0 0.25 0.5 0.75 1 Number of Strings(*1,000,000) (a) GeoNames Elapsed Time (s) 320 240 160 80 tau=16 tau=12 tau=8 tau=4 0 0.25 0.5 0.75 1 Number of Strings(*1,000,000) (b) Reads Figure: Evaluating the scalability of the similarity join algorithm(8 threads).

Evaluating Scalability Evaluating Pruning Techniques Evaluating ism Evaluating Scalability Elapsed Time (s) 90 60 30 tau=4 tau=3 tau=2 tau=1 0 0.25 0.5 0.75 1 Number of Strings(*1,000,000) (a) GeoNames Elapsed Time (s) 40 30 20 10 tau=16 tau=12 tau=8 tau=4 0 0.25 0.5 0.75 1 Number of Strings(*1,000,000) (b) Reads Figure: Evaluating the scalability of the similarity search algorithm(8 threads).

Appendix Our Team About our team I We are from Tsinghua University, Beijing, China. Yu Jiang, Jiannan Wang, Guoliang Li, Jianhua Feng and.

Appendix Our Team About our team II

Appendix Our Team Thank You Q & A http://dbgroup.cs.tsinghua.edu.cn/dd Pass-Join: A Partition based Method for Similarity Joins. Guoliang Li,, Jiannan Wang, Jianhua Feng. VLDB 2012.