Efficient Parallel Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints

1 Efficient Parallel Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints. Yu Jiang, Jiannan Wang, Guoliang Li, and Jianhua Feng. Tsinghua University. Similarity Search & Join Competition at EDBT/ICDT 2013.

2 Outline 1 Problem Definition, Application 2 3 Evaluating Pruning Techniques, Evaluating Parallelism, Evaluating Scalability

3 Problem Definition: String Similarity Joins. Given a set of strings S, the task is to find all pairs of τ-similar strings in S, that is, all pairs whose edit distance is at most τ. A program must output all matches with both string identifiers and distance τ (Track II).

4 An Example

Table: A string dataset

ID   String              Length
s1   vankatesh           9
s2   avataresha          10
s3   kaushic chaduri     15
s4   kaushik chakrab     15
s5   kaushuk chadhui     15
s6   caushik chakrabar   17

Consider the string dataset in the table above. Suppose τ = 3. Then (s4, s6) is a similar pair, as ED(s4, s6) ≤ τ, where ED denotes the edit distance.
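
As a baseline for the example above, here is a minimal sketch of a brute-force similarity join that verifies every pair with the classic dynamic-programming edit distance. The names `edit_distance`, `similarity_join`, and `DATASET` are illustrative and do not come from the slides; the partition-based method described next exists precisely to avoid this all-pairs verification.

```python
def edit_distance(a: str, b: str) -> int:
    """Edit distance via the classic dynamic program, one row at a time."""
    prev = list(range(len(b) + 1))            # row for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]                             # cost of deleting the first i chars of a
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def similarity_join(strings, tau):
    """Brute force: verify every pair and return (i, j, distance) with i < j."""
    result = []
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            d = edit_distance(strings[i], strings[j])
            if d <= tau:
                result.append((i, j, d))
    return result


# The six strings of the example table (index 0 is s1, index 5 is s6).
DATASET = ["vankatesh", "avataresha", "kaushic chaduri",
           "kaushik chakrab", "kaushuk chadhui", "caushik chakrabar"]

print(edit_distance(DATASET[3], DATASET[5]))
```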

5 Application: data cleaning, information extraction, comparison of biological sequences, ...

6 Basic Idea. Lemma: Given a string r partitioned into τ + 1 segments and a string s, if s is similar to r within threshold τ, then s must contain at least one segment of r as a substring. Example: for τ = 1, r = EDBT has two segments, ED and BT. The string s = ICDT cannot be similar to r, as s contains neither of the two segments.
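
The lemma can be read as a cheap necessary condition: if no segment of r occurs in s, the pair can be discarded without computing any edit distance. A minimal sketch of that check follows, with the hypothetical helper name `may_be_similar`; it ignores segment positions, which the later selection methods exploit.

```python
def may_be_similar(r_segments, s):
    """Pigeonhole filter: with tau + 1 segments and at most tau edits,
    at least one segment of r must survive unchanged inside s."""
    return any(seg in s for seg in r_segments)


# Slide example: tau = 1, r = EDBT is split into ED and BT.
print(may_be_similar(["ED", "BT"], "ICDT"))
```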

7 Even Partition Scheme. Definition: In the even partition scheme, all segments have almost the same length: each segment of a string s has length $\lfloor |s|/(\tau+1) \rfloor$ or $\lceil |s|/(\tau+1) \rceil$. Example: for τ = 3, we partition s1 = vankatesh into four segments: va, nk, at, esh.
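
A minimal sketch of the even partition scheme is shown below. The function name `even_partition` is illustrative, and placing the longer segments last is an assumption inferred from the slide example (va, nk, at, esh); positions are 0-based.

```python
def even_partition(s, tau):
    """Split s into tau + 1 consecutive segments whose lengths differ by at most 1.

    Returns a list of (start_position, segment) pairs with 0-based positions.
    If len(s) < tau + 1, some segments are empty.
    """
    n_seg = tau + 1
    base, extra = divmod(len(s), n_seg)
    segments, pos = [], 0
    for k in range(n_seg):
        # The last `extra` segments get one additional character.
        length = base + (1 if k >= n_seg - extra else 0)
        segments.append((pos, s[pos:pos + length]))
        pos += length
    return segments


print(even_partition("vankatesh", 3))
```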

8 Substring Selection: Basic Methods. Enumeration: enumerate all substrings for each segment. Length-based: for each segment, select only substrings with the same length as the segment. Shift-based: for a segment with start position $p_i$, select only substrings with start position in $[p_i - \tau,\ p_i + \tau]$.

9 Substring Selection: Position-aware Substring Selection. Theorem (Position-aware Substring Selection): For a segment with start position $p_i$, select only substrings with start position in $[p_i - \lfloor (\tau - \Delta)/2 \rfloor,\ p_i + \lfloor (\tau + \Delta)/2 \rfloor]$, where $\Delta = |s| - |r|$.

11 Substring Selection: Position-aware Substring Selection. Example: for τ = 3 and Δ = 1, $[p_i - \lfloor (\tau - \Delta)/2 \rfloor,\ p_i + \lfloor (\tau + \Delta)/2 \rfloor] = [p_i - 1,\ p_i + 2]$.
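
To make the difference between the shift-based and position-aware windows concrete, here is a small sketch that computes both sets of candidate start positions. The function names are illustrative; positions are 0-based, and clamping the window to valid start positions inside s is an added assumption rather than something stated on the slides.

```python
def shift_window(p_i, seg_len, len_s, tau):
    """Shift-based selection: start positions in [p_i - tau, p_i + tau]."""
    lo = max(0, p_i - tau)
    hi = min(len_s - seg_len, p_i + tau)
    return range(lo, hi + 1)


def position_aware_window(p_i, seg_len, len_r, len_s, tau):
    """Position-aware selection with delta = |s| - |r|."""
    delta = len_s - len_r
    if abs(delta) > tau:          # the length difference alone exceeds tau
        return range(0)
    lo = max(0, p_i - (tau - delta) // 2)
    hi = min(len_s - seg_len, p_i + (tau + delta) // 2)
    return range(lo, hi + 1)


# tau = 3, |r| = 9, |s| = 10 (delta = 1), segment of length 2 starting at p_i = 4.
print(list(shift_window(4, 2, 10, 3)))
print(list(position_aware_window(4, 2, 9, 10, 3)))
```

In this instance the position-aware window is [p_i - 1, p_i + 2], which matches the slide example and contains four start positions instead of seven.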

12 Substring Selection: Multi-match-aware Substring Selection. Observation: There must be another matching between $r_r$ and $s_r$, the parts of r and s to the right of the matching segment. Theorem (Multi-match-aware Substring Selection): For the i-th segment with start position $p_i$, select only substrings with start position in $[p_i - i,\ p_i + i] \cap [p_i + \Delta - (\tau + 1 - i),\ p_i + \Delta + (\tau + 1 - i)]$.
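
The following sketch intersects the two intervals of the multi-match-aware theorem as transcribed on this slide. It assumes i is the 1-based segment number, that the two intervals are combined by intersection, and that positions are 0-based and clamped to valid starts inside s; the function name is illustrative.

```python
def multi_match_window(i, p_i, seg_len, len_r, len_s, tau):
    """Start positions in s to probe for the i-th segment (1-based) of r."""
    delta = len_s - len_r
    if abs(delta) > tau:
        return range(0)
    # Interval seen from the left side of the strings.
    left_lo, left_hi = p_i - i, p_i + i
    # Interval seen from the right side of the strings.
    right_lo = p_i + delta - (tau + 1 - i)
    right_hi = p_i + delta + (tau + 1 - i)
    lo = max(0, left_lo, right_lo)
    hi = min(len_s - seg_len, left_hi, right_hi)
    return range(lo, hi + 1)      # empty when hi < lo


# tau = 3, r = vankatesh (|r| = 9), |s| = 10: first and last segment of r.
print(list(multi_match_window(1, 0, 2, 9, 10, 3)))
print(list(multi_match_window(4, 6, 3, 9, 10, 3)))
```

For the last segment the window collapses to a single position: the segment can only match at the very end of s, because no edits remain for the right-hand parts.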

14 Substring Selection Multi-match-aware Substring Selection Example

15 Substring Selection: Theoretical Results. 1. The number of substrings selected by the multi-match-aware method is minimum. 2. For strings longer than 2(τ + 1), our selection method is the only way to select the minimum number of substrings.

16 Substring Selection: Experimental Results. Figure: Numbers of selected substrings versus threshold τ for the Length, Shift, Position, and Multi-Match methods on (a) Author Name (Avg Len = 15), (b) Query Log (Avg Len = 45), and (c) Author+Title (Avg Len = 105).

17 Substring Selection: Experimental Results. Figure: Elapsed time for generating substrings (selection time in seconds versus threshold τ) for the Length, Shift, Position, and Multi-Match methods on (a) Author Name (Avg Len = 15), (b) Query Log (Avg Len = 45), and (c) Author+Title (Avg Len = 105).

18 Verification: Length-aware Verification. Inspired by the position-aware substring selection. Saves at least half of the computation compared with the traditional dynamic-programming method. Saves even more with improved early termination.
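
The slide gives no formulas for length-aware verification, so the sketch below is one plausible reading rather than the authors' implementation. It assumes the dynamic program is restricted to the diagonal band that mirrors the position-aware window, and that "early termination" means stopping once every cell in a row already needs more than τ edits after adding the remaining length difference. The function name is illustrative.

```python
def length_aware_verify(r, s, tau):
    """Return ED(r, s) if it is at most tau, otherwise None."""
    delta = len(s) - len(r)
    if abs(delta) > tau:
        return None
    left = (tau - delta) // 2      # band extent for columns j < i
    right = (tau + delta) // 2     # band extent for columns j > i
    INF = tau + 1                  # any value above tau is treated as "too far"
    prev = [j if j <= right else INF for j in range(len(s) + 1)]
    for i in range(1, len(r) + 1):
        cur = [INF] * (len(s) + 1)
        lo, hi = max(0, i - left), min(len(s), i + right)
        if lo == 0:
            cur[0] = i
        for j in range(max(1, lo), hi + 1):
            cur[j] = min(prev[j] + 1,                       # delete r[i-1]
                         cur[j - 1] + 1,                    # insert s[j-1]
                         prev[j - 1] + (r[i - 1] != s[j - 1]),
                         INF)
        # Early termination: a lower bound for any alignment through (i, j) is
        # cur[j] plus the length difference of the two remaining suffixes.
        if all(cur[j] + abs((len(s) - j) - (len(r) - i)) > tau
               for j in range(lo, hi + 1)):
            return None
        prev = cur
    return prev[len(s)] if prev[len(s)] <= tau else None


print(length_aware_verify("kaushik chakrab", "caushik chakrabar", 3))
```

The band holds about τ + 1 cells per row, compared with 2τ + 1 for the usual threshold-based band, which is consistent with the "at least half" claim on the slide.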

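21 Verification: Extension-based Verification. Inspired by the multi-match-aware substring selection. Uses tighter thresholds to verify candidate pairs: when the i-th segment of r matches a substring of s, verify that $ED(r_r, s_r) \le \tau + 1 - i$ and $ED(r_l, s_l) \le i - 1$, where $r_l, s_l$ are the parts to the left of the match and $r_r, s_r$ the parts to the right.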
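
A minimal sketch of the extension-based check for one matching segment is given below. The helper names and the argument layout (1-based segment number i, 0-based start positions p in r and q in s) are assumptions made for illustration, and a plain dynamic program stands in for whatever thresholded verifier the authors use.

```python
def edit_distance(a, b):
    """Plain dynamic-programming edit distance, used for both extensions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def extension_verify(r, s, i, p, q, seg_len, tau):
    """Check the left and right parts around a matching i-th segment.

    Returns left + right distance (an upper bound on ED(r, s) that is at
    most tau) when both tighter thresholds hold, otherwise None.
    """
    left = edit_distance(r[:p], s[:q])
    if left > i - 1:                      # left parts may use at most i - 1 edits
        return None
    right = edit_distance(r[p + seg_len:], s[q + seg_len:])
    if right > tau + 1 - i:               # right parts may use at most tau + 1 - i
        return None
    return left + right


r, s = "kaushik chakrab", "caushik chakrabar"
print(extension_verify(r, s, 2, 3, 3, 4, 3))     # segment "shik"
print(extension_verify(r, s, 4, 11, 11, 4, 3))   # segment "krab"
```

The second call rejects the pair for that particular match because the right parts would need two edits while the threshold for the last segment is zero; the pair is still reported through the match of the second segment.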

24 Verification: Experimental Results. Figure: Elapsed time for verification (elapsed time in seconds versus threshold τ; series include τ+1, Extension, and SharePrefix) on (a) Author Name (Avg Len 15), (b) Query Log (Avg Len 45), and (c) Author+Title (Avg Len 105).

25 Effective Indexing Strategy. Partition the longer strings into segments and select substrings from the shorter strings. Longer segments are less likely to match, which decreases the number of candidates.
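
One way to realise this strategy is to visit strings from longest to shortest, probe the segment index with substrings of the current string, and only then index the current string's own segments. The sketch below does that with a plain dictionary keyed by (string length, segment number, segment text). All names are illustrative, the simple shift-based window is used for brevity, and the output is a candidate set that still needs verification.

```python
from collections import defaultdict


def segment_layout(length, tau):
    """(start, seg_len) of each of the tau + 1 even segments for this length."""
    base, extra = divmod(length, tau + 1)
    layout, pos = [], 0
    for k in range(tau + 1):
        seg_len = base + (1 if k >= tau + 1 - extra else 0)
        layout.append((pos, seg_len))
        pos += seg_len
    return layout


def candidate_pairs(strings, tau):
    """Candidate (id, id) pairs: segments come from the longer string."""
    order = sorted(range(len(strings)), key=lambda k: len(strings[k]), reverse=True)
    index = defaultdict(list)     # (length, segment number, segment) -> string ids
    candidates = set()
    for sid in order:
        s = strings[sid]
        # Probe the indexes of all strings that are at most tau longer.
        for length in range(len(s), len(s) + tau + 1):
            for seg_no, (p, seg_len) in enumerate(segment_layout(length, tau)):
                lo, hi = max(0, p - tau), min(len(s) - seg_len, p + tau)
                for q in range(lo, hi + 1):
                    for rid in index.get((length, seg_no, s[q:q + seg_len]), ()):
                        candidates.add((min(sid, rid), max(sid, rid)))
        # Index the segments of s so that shorter strings can probe them.
        for seg_no, (p, seg_len) in enumerate(segment_layout(len(s), tau)):
            index[(len(s), seg_no, s[p:p + seg_len])].append(sid)
    return candidates


DATASET = ["vankatesh", "avataresha", "kaushic chaduri",
           "kaushik chakrab", "kaushuk chadhui", "caushik chakrabar"]
print((3, 5) in candidate_pairs(DATASET, 3))
```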

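29 Content Filter. Observation: Let $H_r$ denote the character frequency vector of r. For r = abyyyy and s = axxyyyxy, $H_r$ = {{a, 1}, {b, 1}, {y, 4}} and $H_s$ = {{a, 1}, {x, 3}, {y, 4}}. Let $\Delta H = |H_r - H_s|$, the sum of the absolute differences of the character frequencies. Here $\Delta H = 4$: b contributes 1 and x contributes 3. A deletion or insertion changes $\Delta H$ by at most 1. A substitution changes $\Delta H$ by at most 2.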

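33 Content Filter. Observation: With at most τ edit operations, $\Delta H \le 2\tau$. Since at least $\big||r| - |s|\big|$ of the operations must be insertions or deletions, at most $\tau - \big||r| - |s|\big|$ can be substitutions, so $\Delta H \le 2\tau - \big||r| - |s|\big|$. Group symbols to improve the content-filter running time. Integrate the content filter with the extension-based verification.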
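
The content filter can be sketched with character counters as follows. The function names are illustrative, and the symbol grouping mentioned on the slide is left out to keep the example short.

```python
from collections import Counter


def content_distance(r, s):
    """Sum of absolute differences of the character frequency vectors."""
    h_r, h_s = Counter(r), Counter(s)
    return sum(abs(h_r[c] - h_s[c]) for c in h_r.keys() | h_s.keys())


def passes_content_filter(r, s, tau):
    """Keep the pair only if the frequency difference is small enough."""
    return content_distance(r, s) <= 2 * tau - abs(len(r) - len(s))


r, s = "abyyyy", "axxyyyxy"
print(content_distance(r, s))
print(passes_content_filter(r, s, 2), passes_content_filter(r, s, 3))
```

For τ = 2 the bound is 2τ - 2 = 2, so the slide's example pair is pruned without any edit distance computation; for τ = 3 the bound is 4 and the pair has to be verified.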

37 1. Sorting: group strings by length using an existing parallel algorithm. 2. Building indexes: build indexes for each group. 3. Joins: perform similarity joins on each group.
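
The three steps suggest a task decomposition by string length. The skeleton below is only a sketch of that decomposition under several assumptions: it groups strings by length, runs one join task per length group on a thread pool, and replaces the per-group indexes of step 2 with brute-force verification to stay short. The names are illustrative, and in CPython threads mainly show the structure rather than deliver the speedups reported in the evaluation.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def join_group(length, groups, tau):
    """Join one length group with itself and with groups up to tau longer."""
    own, out = groups[length], []
    for x in range(len(own)):
        for y in range(x + 1, len(own)):
            (i, a), (j, b) = own[x], own[y]
            d = edit_distance(a, b)
            if d <= tau:
                out.append((min(i, j), max(i, j), d))
    for other_len in range(length + 1, length + tau + 1):
        for i, a in own:
            for j, b in groups.get(other_len, ()):
                d = edit_distance(a, b)
                if d <= tau:
                    out.append((min(i, j), max(i, j), d))
    return out


def parallel_join(strings, tau, workers=8):
    groups = defaultdict(list)                 # step 1: group by length
    for sid, s in enumerate(strings):
        groups[len(s)].append((sid, s))
    groups = dict(groups)
    with ThreadPoolExecutor(max_workers=workers) as pool:   # step 3: one task per group
        parts = pool.map(lambda length: join_group(length, groups, tau), sorted(groups))
        return sorted(pair for part in parts for pair in part)


DATASET = ["vankatesh", "avataresha", "kaushic chaduri",
           "kaushik chakrab", "kaushuk chadhui", "caushik chakrabar"]
print((3, 5, 3) in parallel_join(DATASET, 3))
```

Each pair is examined by exactly one task, the one owning the shorter string's length, so the tasks need no coordination beyond collecting their results.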

38 Setup. Table: Datasets. Datasets cardinality average len max len min len GeoNames 400, GeoNames Query 100, Reads 750, Reads Query 100,

39 Setup. Figure: Length distribution (number of strings versus string length) for (a) GeoNames and (b) Reads.

40 Evaluating Pruning Techniques. Figure: Evaluating pruning techniques for similarity joins (8 threads): elapsed time (s) versus edit distance threshold for Basic, Content, Longer, and ParaJoin on (a) GeoNames and (b) Reads.

41 Evaluating Pruning Techniques. Figure: Evaluating pruning techniques for similarity search (8 threads): elapsed time (s) versus edit distance threshold for BasicSearch and ParaSearch on (a) GeoNames and (b) Reads.

42 Evaluating Parallelism. Figure: Evaluating running time of similarity join by varying number of threads: elapsed time (s) versus number of threads, one curve per threshold tau, on (a) GeoNames and (b) Reads.

43 Evaluating Speedup. Figure: Evaluating speedup of similarity join: speedup versus number of threads, compared with the ideal speedup, on (a) GeoNames (tau = 1, 2, 3, 4) and (b) Reads (tau = 4, 8, 12, 16).

44 Evaluating Parallelism. Figure: Evaluating running time of similarity search by varying number of threads: elapsed time (s) versus number of threads, one curve per threshold tau, on (a) GeoNames and (b) Reads.

45 Evaluating Speedup. Figure: Evaluating speedup of similarity search: speedup versus number of threads, compared with the ideal speedup, on (a) GeoNames (tau = 1, 2, 3, 4) and (b) Reads (tau = 4, 8, 12, 16).

46 Evaluating Scalability. Figure: Evaluating the scalability of the similarity join algorithm (8 threads): elapsed time (s) versus number of strings (×1,000,000), one curve per threshold tau, on (a) GeoNames and (b) Reads.

47 Evaluating Scalability. Figure: Evaluating the scalability of the similarity search algorithm (8 threads): elapsed time (s) versus number of strings (×1,000,000), one curve per threshold tau, on (a) GeoNames and (b) Reads.

48 Appendix Our Team About our team I We are from Tsinghua University, Beijing, China. Yu Jiang, Jiannan Wang, Guoliang Li, Jianhua Feng and.

49 Appendix Our Team About our team II

50 Appendix Our Team Thank You Q & A Pass-Join: A Partition based Method for Similarity Joins. Guoliang Li,, Jiannan Wang, Jianhua Feng. VLDB 2012.

An Efficient Partition Based Method for Exact Set Similarity Joins

An Efficient Partition Based Method for Exact Set Similarity Joins An Efficient Partition Based Method for Exact Set Similarity Joins Dong Deng Guoliang Li He Wen Jianhua Feng Department of Computer Science, Tsinghua University, Beijing, China. {dd11,wenhe1}@mails.tsinghua.edu.cn;{liguoliang,fengjh}@tsinghua.edu.cn

More information

MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance

MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance Jingbo Shang, Jian Peng, Jiawei Han University of Illinois, Urbana-Champaign May 6, 2016 Presented by Jingbo Shang 2 Outline

More information

META: An Efficient Matching-Based Method for Error-Tolerant Autocompletion

META: An Efficient Matching-Based Method for Error-Tolerant Autocompletion : An Efficient Matching-Based Method for Error-Tolerant Autocompletion Dong Deng Guoliang Li He Wen H. V. Jagadish Jianhua Feng Department of Computer Science, Tsinghua National Laboratory for Information

More information

An Efficient Sliding Window Approach for Approximate Entity Extraction with Synonyms

An Efficient Sliding Window Approach for Approximate Entity Extraction with Synonyms An Efficient Sliding Window Approach for Approximate Entity Extraction with Synonyms Jin Wang (UCLA) Chunbin Lin (Amazon AWS) Mingda Li (UCLA) Carlo Zaniolo (UCLA) OUTLINE Motivation Preliminaries Framework

More information

Efficient Approximate Entity Matching Using Jaro-Winkler Distance

Efficient Approximate Entity Matching Using Jaro-Winkler Distance Efficient Approximate Entity Matching Using Jaro-Winkler Distance Yaoshu Wang (B), Jianbin Qin, and Wei Wang School of Computer Science and Engineering, Univeristy of New South Wales, Sydney, Australia

More information

Multi-Approximate-Keyword Routing Query

Multi-Approximate-Keyword Routing Query Bin Yao 1, Mingwang Tang 2, Feifei Li 2 1 Department of Computer Science and Engineering Shanghai Jiao Tong University, P. R. China 2 School of Computing University of Utah, USA Outline 1 Introduction

More information

Outline. Approximation: Theory and Algorithms. Application Scenario. 3 The q-gram Distance. Nikolaus Augsten. Definition and Properties

Outline. Approximation: Theory and Algorithms. Application Scenario. 3 The q-gram Distance. Nikolaus Augsten. Definition and Properties Outline Approximation: Theory and Algorithms Nikolaus Augsten Free University of Bozen-Bolzano Faculty of Computer Science DIS Unit 3 March 13, 2009 2 3 Nikolaus Augsten (DIS) Approximation: Theory and

More information

Chapter 6. Frequent Pattern Mining: Concepts and Apriori. Meng Jiang CSE 40647/60647 Data Science Fall 2017 Introduction to Data Mining

Chapter 6. Frequent Pattern Mining: Concepts and Apriori. Meng Jiang CSE 40647/60647 Data Science Fall 2017 Introduction to Data Mining Chapter 6. Frequent Pattern Mining: Concepts and Apriori Meng Jiang CSE 40647/60647 Data Science Fall 2017 Introduction to Data Mining Pattern Discovery: Definition What are patterns? Patterns: A set of

More information

Enumeration and symmetry of edit metric spaces. Jessie Katherine Campbell. A dissertation submitted to the graduate faculty

Enumeration and symmetry of edit metric spaces. Jessie Katherine Campbell. A dissertation submitted to the graduate faculty Enumeration and symmetry of edit metric spaces by Jessie Katherine Campbell A dissertation submitted to the graduate faculty in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

More information

PartSS: An Efficient Partition-based Filtering for Edit Distance Constraints

PartSS: An Efficient Partition-based Filtering for Edit Distance Constraints : An Efficient Partition-based Filtering for Constraints Zhixu Li Laurianne Sitbon Xiaofang Zhou School of Information Technology & Electrical Engineering The University of Queensland, QLD 407 Australia

More information

A Transformation-based Framework for KNN Set Similarity Search

A Transformation-based Framework for KNN Set Similarity Search SUBMITTED TO IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 1 A Transformation-based Framewor for KNN Set Similarity Search Yong Zhang Member, IEEE, Jiacheng Wu, Jin Wang, Chunxiao Xing Member, IEEE

More information

Spatial Database. Ahmad Alhilal, Dimitris Tsaras

Spatial Database. Ahmad Alhilal, Dimitris Tsaras Spatial Database Ahmad Alhilal, Dimitris Tsaras Content What is Spatial DB Modeling Spatial DB Spatial Queries The R-tree Range Query NN Query Aggregation Query RNN Query NN Queries with Validity Information

More information

Similarity Joins for Uncertain Strings

Similarity Joins for Uncertain Strings Similarity Joins for Uncertain Strings Manish Patil Louisiana State University USA mpatil@csc.lsu.edu Rahul Shah Louisiana State University USA rahul@csc.lsu.edu ABSTRACT Astringsimilarityjoinfindsallsimilarstringpairsbetween

More information

Statistical Substring Reduction in Linear Time

Statistical Substring Reduction in Linear Time Statistical Substring Reduction in Linear Time Xueqiang Lü Institute of Computational Linguistics Peking University, Beijing lxq@pku.edu.cn Le Zhang Institute of Computer Software & Theory Northeastern

More information

Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts

Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts Philip Bille IT University of Copenhagen Rolf Fagerberg University of Southern Denmark Inge Li Gørtz

More information

ParaGraphE: A Library for Parallel Knowledge Graph Embedding

ParaGraphE: A Library for Parallel Knowledge Graph Embedding ParaGraphE: A Library for Parallel Knowledge Graph Embedding Xiao-Fan Niu, Wu-Jun Li National Key Laboratory for Novel Software Technology Department of Computer Science and Technology, Nanjing University,

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Mining Frequent Patterns and Associations: Basic Concepts (Chapter 6) Huan Sun, CSE@The Ohio State University Slides adapted from Prof. Jiawei Han @UIUC, Prof. Srinivasan

More information

Efficient Haplotype Inference with Boolean Satisfiability

Efficient Haplotype Inference with Boolean Satisfiability Efficient Haplotype Inference with Boolean Satisfiability Joao Marques-Silva 1 and Ines Lynce 2 1 School of Electronics and Computer Science University of Southampton 2 INESC-ID/IST Technical University

More information

Two notes on subshifts

Two notes on subshifts Two notes on subshifts Joseph S. Miller Special Session on Logic and Dynamical Systems Joint Mathematics Meetings, Washington, DC January 6, 2009 First Note Every Π 0 1 Medvedev degree contains a Π0 1

More information

TASM: Top-k Approximate Subtree Matching

TASM: Top-k Approximate Subtree Matching TASM: Top-k Approximate Subtree Matching Nikolaus Augsten 1 Denilson Barbosa 2 Michael Böhlen 3 Themis Palpanas 4 1 Free University of Bozen-Bolzano, Italy augsten@inf.unibz.it 2 University of Alberta,

More information

Scalable Processing of Snapshot and Continuous Nearest-Neighbor Queries over One-Dimensional Uncertain Data

Scalable Processing of Snapshot and Continuous Nearest-Neighbor Queries over One-Dimensional Uncertain Data VLDB Journal manuscript No. (will be inserted by the editor) Scalable Processing of Snapshot and Continuous Nearest-Neighbor Queries over One-Dimensional Uncertain Data Jinchuan Chen Reynold Cheng Mohamed

More information

Two Birds With One Stone: An Efficient Hierarchical Framework for Top-k and Threshold-based String Similarity Search

Two Birds With One Stone: An Efficient Hierarchical Framework for Top-k and Threshold-based String Similarity Search Two Birds With One Stone: An Efficient Hierarchica Framework for Top-k and Threshod-based String Simiarity Search Jin Wang Guoiang Li Dong Deng Yong Zhang Jianhua Feng Department of Computer Science and

More information

Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques

Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques Mahsa Orang Nematollaah Shiri 27th International Conference on Scientific and Statistical Database

More information

Window-aware Load Shedding for Aggregation Queries over Data Streams

Window-aware Load Shedding for Aggregation Queries over Data Streams Window-aware Load Shedding for Aggregation Queries over Data Streams Nesime Tatbul Stan Zdonik Talk Outline Background Load shedding in Aurora Windowed aggregation queries Window-aware load shedding Experimental

More information

Mining Positive and Negative Fuzzy Association Rules

Mining Positive and Negative Fuzzy Association Rules Mining Positive and Negative Fuzzy Association Rules Peng Yan 1, Guoqing Chen 1, Chris Cornelis 2, Martine De Cock 2, and Etienne Kerre 2 1 School of Economics and Management, Tsinghua University, Beijing

More information

Finding Pareto Optimal Groups: Group based Skyline

Finding Pareto Optimal Groups: Group based Skyline Finding Pareto Optimal Groups: Group based Skyline Jinfei Liu Emory University jinfei.liu@emory.edu Jun Luo Lenovo; CAS jun.luo@siat.ac.cn Li Xiong Emory University lxiong@emory.edu Haoyu Zhang Emory University

More information

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 4: Query Optimization Query Optimization Cost estimation Strategies for exploring plans Q min CS 347 Notes 4 2 Cost Estimation Based on

More information

Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques

Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques Mahsa Orang Nematollaah Shiri 27th International Conference on Scientific and Statistical Database

More information

Finding High-Order Correlations in High-Dimensional Biological Data

Finding High-Order Correlations in High-Dimensional Biological Data Finding High-Order Correlations in High-Dimensional Biological Data Xiang Zhang, Feng Pan, and Wei Wang Department of Computer Science University of North Carolina at Chapel Hill 1 Introduction Many real

More information

Database Systems CSE 514

Database Systems CSE 514 Database Systems CSE 514 Lecture 8: Data Cleaning and Sampling CSEP514 - Winter 2017 1 Announcements WQ7 was due last night (did you remember?) HW6 is due on Sunday Weston will go over it in the section

More information

Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VCF Files On-line

Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VCF Files On-line Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VF Files On-line MatBio 18 Solon P. Pissis and Ahmad Retha King s ollege London 02-Aug-2018 Solon P. Pissis and Ahmad Retha

More information

Analysis and Design of Algorithms Dynamic Programming

Analysis and Design of Algorithms Dynamic Programming Analysis and Design of Algorithms Dynamic Programming Lecture Notes by Dr. Wang, Rui Fall 2008 Department of Computer Science Ocean University of China November 6, 2009 Introduction 2 Introduction..................................................................

More information

PetaBricks: Variable Accuracy and Online Learning

PetaBricks: Variable Accuracy and Online Learning PetaBricks: Variable Accuracy and Online Learning Jason Ansel MIT - CSAIL May 4, 2011 Jason Ansel (MIT) PetaBricks May 4, 2011 1 / 40 Outline 1 Motivating Example 2 PetaBricks Language Overview 3 Variable

More information

On the Monotonicity of the String Correction Factor for Words with Mismatches

On the Monotonicity of the String Correction Factor for Words with Mismatches On the Monotonicity of the String Correction Factor for Words with Mismatches (extended abstract) Alberto Apostolico Georgia Tech & Univ. of Padova Cinzia Pizzi Univ. of Padova & Univ. of Helsinki Abstract.

More information

Improved Hamming Distance Search using Variable Length Hashing

Improved Hamming Distance Search using Variable Length Hashing Improved Hamming istance Search using Variable Length Hashing Eng-Jon Ong and Miroslaw Bober Centre for Vision, Speech and Signal Processing University of Surrey, Guildford, UK e.ong,m.bober@surrey.ac.uk

More information

Crowdsourcing Pareto-Optimal Object Finding by Pairwise Comparisons

Crowdsourcing Pareto-Optimal Object Finding by Pairwise Comparisons 2015 The University of Texas at Arlington. All Rights Reserved. Crowdsourcing Pareto-Optimal Object Finding by Pairwise Comparisons Abolfazl Asudeh, Gensheng Zhang, Naeemul Hassan, Chengkai Li, Gergely

More information

Decision Diagrams: Tutorial

Decision Diagrams: Tutorial Decision Diagrams: Tutorial John Hooker Carnegie Mellon University CP Summer School Cork, Ireland, June 2016 Decision Diagrams Used in computer science and AI for decades Logic circuit design Product configuration

More information

Searching Dimension Incomplete Databases

Searching Dimension Incomplete Databases IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 6, NO., JANUARY 3 Searching Dimension Incomplete Databases Wei Cheng, Xiaoming Jin, Jian-Tao Sun, Xuemin Lin, Xiang Zhang, and Wei Wang Abstract

More information

Machine Learning 3. week

Machine Learning 3. week Machine Learning 3. week Entropy Decision Trees ID3 C4.5 Classification and Regression Trees (CART) 1 What is Decision Tree As a short description, decision tree is a data classification procedure which

More information

arxiv: v1 [stat.ml] 23 Oct 2016

arxiv: v1 [stat.ml] 23 Oct 2016 Formulas for counting the sizes of Markov Equivalence Classes Formulas for Counting the Sizes of Markov Equivalence Classes of Directed Acyclic Graphs arxiv:1610.07921v1 [stat.ml] 23 Oct 2016 Yangbo He

More information

Constraint-based Subspace Clustering

Constraint-based Subspace Clustering Constraint-based Subspace Clustering Elisa Fromont 1, Adriana Prado 2 and Céline Robardet 1 1 Université de Lyon, France 2 Universiteit Antwerpen, Belgium Thursday, April 30 Traditional Clustering Partitions

More information

Discovering Most Classificatory Patterns for Very Expressive Pattern Classes

Discovering Most Classificatory Patterns for Very Expressive Pattern Classes Discovering Most Classificatory Patterns for Very Expressive Pattern Classes Masayuki Takeda 1,2, Shunsuke Inenaga 1,2, Hideo Bannai 3, Ayumi Shinohara 1,2, and Setsuo Arikawa 1 1 Department of Informatics,

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 18, 2016 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass

More information

Pavel Zezula, Giuseppe Amato,

Pavel Zezula, Giuseppe Amato, SIMILARITY SEARCH The Metric Space Approach Pavel Zezula Giuseppe Amato Vlastislav Dohnal Michal Batko Table of Content Part I: Metric searching in a nutshell Foundations of metric space searching Survey

More information

Online Appendix for Discovery of Periodic Patterns in Sequence Data: A Variance Based Approach

Online Appendix for Discovery of Periodic Patterns in Sequence Data: A Variance Based Approach Online Appendix for Discovery of Periodic Patterns in Sequence Data: A Variance Based Approach Yinghui (Catherine) Yang Graduate School of Management, University of California, Davis AOB IV, One Shields

More information

Notes for Comp 497 (Comp 454) Week 12 4/19/05. Today we look at some variations on machines we have already seen. Chapter 21

Notes for Comp 497 (Comp 454) Week 12 4/19/05. Today we look at some variations on machines we have already seen. Chapter 21 Notes for Comp 497 (Comp 454) Week 12 4/19/05 Today we look at some variations on machines we have already seen. Errata (Chapter 21): None known Chapter 21 So far we have seen the equivalence of Post Machines

More information

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007.

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007. Proofs, Strings, and Finite Automata CS154 Chris Pollett Feb 5, 2007. Outline Proofs and Proof Strategies Strings Finding proofs Example: For every graph G, the sum of the degrees of all the nodes in G

More information

Optical Character Recognition of Jutakshars within Devanagari Script

Optical Character Recognition of Jutakshars within Devanagari Script Optical Character Recognition of Jutakshars within Devanagari Script Sheallika Singh Shreesh Ladha Supervised by : Dr. Harish Karnick, Dr. Amit Mitra UGP Presentation, 10 April 2016 OCR of Jutakshars within

More information

D B M G Data Base and Data Mining Group of Politecnico di Torino

D B M G Data Base and Data Mining Group of Politecnico di Torino Data Base and Data Mining Group of Politecnico di Torino Politecnico di Torino Association rules Objective extraction of frequent correlations or pattern from a transactional database Tickets at a supermarket

More information

The Inclusion Exclusion Principle
