Efficient Parallel Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints
1 Efficient Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints Yu Jiang,, Jiannan Wang, Guoliang Li, and Jianhua Feng Tsinghua University Similarity Search&Join Competition on EDBT/ICDT 2013
2 Outline. 1: Problem Definition, Application. 2. 3: Evaluating Pruning Techniques, Evaluating Parallelism, Evaluating Scalability.
3 Problem Definition: String Similarity Joins. Given a set of strings S, the task is to find all pairs of τ-similar strings in S. A program must output all matches with both string identifiers and distance τ (Track II).
4 An Example
Table: A string dataset
ID    Strings              Length
s1    vankatesh            9
s2    avataresha           10
s3    kaushic chaduri      15
s4    kaushik chakrab      15
s5    kaushuk chadhui      15
s6    caushik chakrabar    17
Consider the string dataset in Table 1. Suppose τ = 3. ⟨s4, s6⟩ is a similar pair, as ED(s4, s6) ≤ τ.
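The claim ED(s4, s6) ≤ τ can be checked with the textbook dynamic-programming edit distance. The sketch below is only an illustration; the function name `edit_distance` is an assumption and is not taken from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, keeping only two rows of the DP matrix."""
    prev = list(range(len(b) + 1))          # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (ca != cb),   # match or substitution
                           prev[j] + 1,                # deletion from a
                           cur[j - 1] + 1))            # insertion into a
        prev = cur
    return prev[-1]


s4, s6 = "kaushik chakrab", "caushik chakrabar"
print(edit_distance(s4, s6))  # 3: one substitution (k -> c) and two insertions (a, r)
```

With τ = 3 the pair is therefore reported as similar.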
5 Application. Data cleaning; information extraction; comparison of biological sequences; ...
6 Basic Idea. Lemma: Given a string r with τ + 1 segments and a string s, if s is similar to r within threshold τ, s must contain a segment of r. Example: τ = 1, r = EDBT has two segments, ED and BT. s = ICDT cannot be similar to r, as s contains neither of the two segments.
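The lemma yields a cheap necessary condition that can be tested before any edit-distance computation. A minimal sketch, assuming the segments of r are already given as a list (the helper name `may_be_similar` is hypothetical):

```python
def may_be_similar(segments_of_r, s):
    """Necessary condition from the lemma: s must contain at least one segment of r.

    Returning True does not prove similarity; returning False rules it out.
    """
    return any(seg in s for seg in segments_of_r)


# Slide example: tau = 1, r = "EDBT" is split into "ED" and "BT".
print(may_be_similar(["ED", "BT"], "ICDT"))  # False, so ICDT is pruned
```

A string that passes this test is only a candidate and still has to be verified.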
7 Even Partition Scheme. Definition: In the even partition scheme, each segment has almost the same length (⌊|s|/(τ+1)⌋ or ⌈|s|/(τ+1)⌉). Example: τ = 3, we partition s1 = vankatesh into four segments: va, nk, at, esh.
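The even partition can be sketched in a few lines. The function name `even_partition` is an assumption, and so is the choice to put the longer segments last; that choice is inferred only from the slide's example (va, nk, at, esh).

```python
def even_partition(s: str, tau: int) -> list[str]:
    """Split s into tau + 1 segments of length floor or ceil of |s| / (tau + 1).

    Assumption: the longer segments come last, as in the slide's example.
    """
    k = tau + 1
    base, extra = divmod(len(s), k)   # 'extra' segments get one more character
    segments, pos = [], 0
    for i in range(k):
        length = base + (1 if i >= k - extra else 0)
        segments.append(s[pos:pos + length])
        pos += length
    return segments


print(even_partition("vankatesh", 3))  # ['va', 'nk', 'at', 'esh']
```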
8 Substring Selection: Basic Methods. Enumeration: enumerate all substrings for each segment. Length-based: for each segment, select only substrings with the same length as the segment. Shift-based: for a segment with start position p_i, select substrings with start position in [p_i − τ, p_i + τ].
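To make the difference between the three basic methods concrete, the sketch below generates the substrings each one would probe for a single segment. Function names and the use of 0-based positions are assumptions made for illustration.

```python
def enumeration(s):
    """All non-empty substrings of s, regardless of length or position."""
    return [s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)]


def length_based(s, seg_len):
    """Only substrings whose length equals the segment length."""
    return [s[q:q + seg_len] for q in range(len(s) - seg_len + 1)]


def shift_based(s, seg_len, p, tau):
    """Substrings of the segment length starting in [p - tau, p + tau] (0-based, clipped)."""
    lo, hi = max(0, p - tau), min(len(s) - seg_len, p + tau)
    return [s[q:q + seg_len] for q in range(lo, hi + 1)]


s = "vankatesh"
print(len(enumeration(s)))         # 45
print(len(length_based(s, 2)))     # 8
print(shift_based(s, 2, 2, 1))     # ['an', 'nk', 'ka']
```

Each method keeps a subset of the previous one, which is why the number of index probes shrinks from enumeration to shift-based selection.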
9 Substring Selection: Position-aware Substring Selection. Theorem (Position-aware Substring Selection): For a segment with start position p_i, select substrings with start position in [p_i − ⌊(τ − Δ)/2⌋, p_i + ⌊(τ + Δ)/2⌋], where Δ = |s| − |r|.
11 Substring Selection: Position-aware Substring Selection. Example: τ = 3, Δ = 1, [p_i − ⌊(τ − Δ)/2⌋, p_i + ⌊(τ + Δ)/2⌋] = [p_i − 1, p_i + 2].
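The position-aware window can be computed directly from τ and Δ. The sketch below assumes |Δ| ≤ τ (otherwise the length filter already rules the pair out); the function name is hypothetical and the window is not clipped to the string boundaries.

```python
def position_aware_window(p, tau, delta):
    """Start positions [lo, hi] to probe for a segment starting at p.

    delta = |s| - |r|; assumes abs(delta) <= tau.
    """
    return p - (tau - delta) // 2, p + (tau + delta) // 2


# Slide example: tau = 3, delta = 1 gives [p - 1, p + 2].
print(position_aware_window(5, 3, 1))  # (4, 7)
```

For the same segment the shift-based window [p − τ, p + τ] would cover 2τ + 1 = 7 start positions, whereas this window covers 4.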
12 Substring Selection: Multi-match-aware Substring Selection. Observation: There must be another matching between r_r and s_r. Theorem (Multi-match-aware Substring Selection): For the i-th segment with start position p_i, select substrings with start position within [p_i − i, p_i + i] ∩ [p_i + Δ − (τ + 1 − i), p_i + Δ + (τ + 1 − i)].
14 Substring Selection Multi-match-aware Substring Selection Example
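As a worked illustration, the sketch below evaluates the multi-match-aware window for concrete numbers. It follows the two intervals as they appear in the theorem on the slide, under the assumption that they are combined by intersection and that i is the 1-based segment index; the function name is hypothetical.

```python
def multi_match_window(p, i, tau, delta):
    """Intersection of the left-side and right-side windows for the i-th segment.

    Left-side window:  [p - i, p + i]
    Right-side window: [p + delta - (tau + 1 - i), p + delta + (tau + 1 - i)]
    Returns (lo, hi); the window is empty when lo > hi.
    """
    right_slack = tau + 1 - i
    lo = max(p - i, p + delta - right_slack)
    hi = min(p + i, p + delta + right_slack)
    return lo, hi


# tau = 3, delta = 1, segment starting at position 5
print(multi_match_window(5, 1, 3, 1))  # (4, 6): the left-side window is the tighter one
print(multi_match_window(5, 4, 3, 1))  # (6, 6): the right-side window is the tighter one
```

Early segments are constrained mainly by the left-side window and late segments by the right-side one, so the intersection is narrow at both ends of the string.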
15 Substring Selection: Theoretical Results. 1. The number of substrings selected by the multi-match-aware method is minimum. 2. For strings longer than 2(τ + 1), our selection method is the only way to select the minimum number of substrings.
16 Substring Selection: Experimental Results. Figure: Numbers of selected substrings. Methods: Length, Shift, Position, Multi-Match. Axes: threshold τ vs. number of selected substrings. Panels: (a) Author Name (Avg Len = 15), (b) Query Log (Avg Len = 45), (c) Author+Title (Avg Len = 105).
17 Substring Selection: Experimental Results. Figure: Elapsed time for generating substrings. Methods: Length, Shift, Position, Multi-Match. Axes: threshold τ vs. selection time (s). Panels: (a) Author Name (Avg Len = 15), (b) Query Log (Avg Len = 45), (c) Author+Title (Avg Len = 105).
18 Verification: Length-aware Verification. Inspired by position-aware substring selection. Saves at least half of the computation compared with the traditional dynamic-programming method. Saves even more with improved early termination.
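One way to read length-aware verification is as a banded dynamic program: only cells (i, j) with i − ⌊(τ − Δ)/2⌋ ≤ j ≤ i + ⌊(τ + Δ)/2⌋ are computed, and a row is abandoned when every cell's value plus the remaining length difference already exceeds τ. The sketch below is written under that reading; it is not the authors' code, and the function name is an assumption.

```python
def length_aware_verify(r, s, tau):
    """Return ED(r, s) if it is <= tau, otherwise None (banded DP with early exit)."""
    if len(r) > len(s):
        r, s = s, r                        # edit distance is symmetric; make s the longer one
    n, m = len(r), len(s)
    delta = m - n
    if delta > tau:
        return None                        # length filter
    left, right = (tau - delta) // 2, (tau + delta) // 2
    inf = tau + 1                          # stands for "more than tau"
    prev = {j: j for j in range(min(m, right) + 1)}   # row 0, restricted to the band
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - left), min(m, i + right) + 1):
            if j == 0:
                cost = i
            else:
                cost = min(prev.get(j - 1, inf) + (r[i - 1] != s[j - 1]),
                           cur.get(j - 1, inf) + 1,
                           prev.get(j, inf) + 1)
            cur[j] = min(cost, inf)
        # Early termination: expected distance = cost so far + remaining length difference.
        if all(c + abs((m - j) - (n - i)) > tau for j, c in cur.items()):
            return None
        prev = cur
    d = prev.get(m, inf)
    return d if d <= tau else None


print(length_aware_verify("kaushik chakrab", "caushik chakrabar", 3))  # 3
print(length_aware_verify("EDBT", "ICDT", 1))                          # None
```

The band holds at most τ + 1 cells per row, compared with 2τ + 1 for a band of half-width τ around the diagonal.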
21 Verification: Extension-based Verification. Inspired by multi-match-aware substring selection. Uses tighter thresholds to verify candidate pairs: with r_l, s_l the parts to the left of the matching i-th segment and r_r, s_r the parts to its right, check whether ED(r_r, s_r) ≤ τ + 1 − i and ED(r_l, s_l) ≤ i − 1.
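The two tighter checks can be sketched as follows. The function signature (1-based segment index `i`, 0-based start positions, a plain dynamic-programming `edit_distance` helper) is an assumption made for illustration; a real implementation would use a thresholded verifier for each side.

```python
def edit_distance(a, b):
    """Plain two-row Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (ca != cb), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def extension_verify(r, s, tau, i, r_start, s_start, seg_len):
    """Verify a candidate found through the i-th segment of r (i is 1-based).

    The left parts may differ by at most i - 1 edits and the right parts
    by at most tau + 1 - i edits.
    """
    if r[r_start:r_start + seg_len] != s[s_start:s_start + seg_len]:
        return False
    r_l, r_r = r[:r_start], r[r_start + seg_len:]
    s_l, s_r = s[:s_start], s[s_start + seg_len:]
    return (edit_distance(r_l, s_l) <= i - 1
            and edit_distance(r_r, s_r) <= tau + 1 - i)


r, s = "kaushik chakrab", "caushik chakrabar"   # s4 and s6, tau = 3
print(extension_verify(r, s, 3, 2, 3, 3, 4))    # True: matched through segment 2, "shik"
print(extension_verify(r, s, 3, 3, 7, 7, 4))    # False: segment 3, " cha", right part needs 2 edits
```

A rejection through one segment does not mean the pair is dissimilar; here the pair is still reported through segment 2.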
24 Verification: Experimental Results. Figure: Elapsed time for verification. Methods include Extension and SharePrefix. Axes: threshold τ vs. elapsed time (s). Panels: (a) Author Name (Avg Len 15), (b) Query Log (Avg Len 45), (c) Author+Title (Avg Len 105).
25 Effective Indexing Strategy. Partition longer strings into segments. Select substrings from shorter strings. Longer segments decrease the probability of matching, and thus decrease the number of candidates.
29 Content Filter. Observation: Let H_r denote the character frequency vector of r. For r = abyyyy and s = axxyyyxy, H_r = {{a, 1}, {b, 1}, {y, 4}} and H_s = {{a, 1}, {x, 3}, {y, 4}}. Let ΔH = |H_r − H_s|, the sum of the absolute frequency differences; here ΔH = 1 + 3 = 4. A deletion or insertion changes ΔH by at most 1. A substitution changes ΔH by at most 2.
33 Content Filter. Observation: With at most τ edit operations, ΔH = |H_r − H_s| ≤ 2τ. At least ||r| − |s|| of the operations must be insertions or deletions, so there are at most τ − ||r| − |s|| substitutions and ΔH ≤ 2τ − ||r| − |s||. Group symbols to improve the content filter's running time. Integrate the content filter with extension-based verification.
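The content filter can be sketched with `collections.Counter`. The function names are assumptions; symbol grouping and the integration with extension-based verification mentioned on the slide are not shown.

```python
from collections import Counter


def freq_distance(r, s):
    """Sum of absolute differences between the character frequency vectors of r and s."""
    h_r, h_s = Counter(r), Counter(s)
    return sum(abs(h_r[c] - h_s[c]) for c in h_r.keys() | h_s.keys())


def passes_content_filter(r, s, tau):
    """Necessary condition for ED(r, s) <= tau: the frequency distance is at most 2*tau - ||r| - |s||."""
    return freq_distance(r, s) <= 2 * tau - abs(len(r) - len(s))


r, s = "abyyyy", "axxyyyxy"
print(freq_distance(r, s))             # 4
print(passes_content_filter(r, s, 3))  # True:  4 <= 2*3 - 2
print(passes_content_filter(r, s, 2))  # False: 4 >  2*2 - 2
```

Pairs that fail the filter can be discarded without running the edit-distance computation.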
37 1. Sorting: group strings by length using an existing parallel algorithm. 2. Building indexes: build an index for each group. 3. Joins: perform similarity joins on each group.
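The three steps can be sketched end to end as below. This is a simplified illustration and not the competition implementation: it uses shift-based substring selection and a plain edit-distance check rather than the optimized selection and verification methods from the earlier slides, and every name in it is an assumption. In standard CPython, threads typically give little speedup for CPU-bound work like this, so the thread pool only mirrors the structure of the steps.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (ca != cb), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def segment_slots(length, tau):
    """(start, seg_len) of the tau + 1 even-partition segments for a string of this length."""
    base, extra = divmod(length, tau + 1)
    slots, pos = [], 0
    for i in range(tau + 1):
        seg_len = base + (1 if i >= tau + 1 - extra else 0)
        slots.append((pos, seg_len))
        pos += seg_len
    return slots


def build_index(ids, strings, tau):
    """Step 2: inverted index (segment number, segment text) -> string ids, for one length group."""
    index = defaultdict(list)
    for sid in ids:
        s = strings[sid]
        for i, (start, seg_len) in enumerate(segment_slots(len(s), tau)):
            index[(i, s[start:start + seg_len])].append(sid)
    return index


def join_group(length, groups, indexes, strings, tau):
    """Step 3: probe every string of one length group against groups within tau of its length."""
    pairs = set()
    for a in groups[length]:
        s = strings[a]
        for other_len in range(length - tau, length + tau + 1):
            index = indexes.get(other_len)
            if index is None:
                continue
            for i, (start, seg_len) in enumerate(segment_slots(other_len, tau)):
                lo, hi = max(0, start - tau), min(len(s) - seg_len, start + tau)
                for q in range(lo, hi + 1):            # shift-based selection
                    for b in index.get((i, s[q:q + seg_len]), ()):
                        if a < b and edit_distance(s, strings[b]) <= tau:
                            pairs.add((a, b))
    return pairs


def similarity_join(strings, tau, threads=4):
    groups = defaultdict(list)                         # Step 1: group string ids by length
    for sid, s in enumerate(strings):
        groups[len(s)].append(sid)
    lengths = sorted(groups)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        built = pool.map(lambda n: build_index(groups[n], strings, tau), lengths)
        indexes = dict(zip(lengths, built))
        results = pool.map(lambda n: join_group(n, groups, indexes, strings, tau), lengths)
        return set().union(*results)


data = ["vankatesh", "avataresha", "kaushic chaduri",
        "kaushik chakrab", "kaushuk chadhui", "caushik chakrabar"]
print((3, 5) in similarity_join(data, 3))  # True: s4 and s6 (0-based ids 3 and 5)
```

Each length group is an independent unit of work for both index building and joining, which is what makes the per-group steps easy to distribute over threads.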
38 Setup
Table: Datasets
Datasets          cardinality    average len    max len    min len
GeoNames          400,
GeoNames Query    100,
Reads             750,
Reads Query       100,
39 Setup. Figure: Length distribution. Axes: string lengths vs. numbers of strings. Panels: (a) GeoNames, (b) Reads.
40 Evaluating Pruning Techniques. Figure: Evaluating pruning techniques for similarity joins (8 threads). Methods: Basic, Content, Longer, ParaJoin. Axes: edit distance threshold vs. elapsed time (s). Panels: (a) GeoNames, (b) Reads.
41 Evaluating Pruning Techniques. Figure: Evaluating pruning techniques for similarity search (8 threads). Methods: BasicSearch, ParaSearch. Axes: edit distance threshold vs. elapsed time (s). Panels: (a) GeoNames, (b) Reads.
42 Evaluating Parallelism. Figure: Evaluating running time of similarity join by varying number of threads. Axes: number of threads vs. elapsed time (s), one curve per threshold τ. Panels: (a) GeoNames, (b) Reads.
43 Evaluating Speedup. Figure: Evaluating speedup of similarity join. Curves: tau = 1, 2, 3, 4 and Ideal for (a) GeoNames; tau = 4, 8, 12, 16 and Ideal for (b) Reads. Axes: number of threads vs. speedup.
44 Evaluating Parallelism. Figure: Evaluating running time of similarity search by varying number of threads. Axes: number of threads vs. elapsed time (s), one curve per threshold τ. Panels: (a) GeoNames, (b) Reads.
45 Evaluating Speedup. Figure: Evaluating speedup of similarity search. Curves: tau = 1, 2, 3, 4 and Ideal for (a) GeoNames; tau = 4, 8, 12, 16 and Ideal for (b) Reads. Axes: number of threads vs. speedup.
46 Evaluating Scalability. Figure: Evaluating the scalability of the similarity join algorithm (8 threads). Axes: number of strings (×1,000,000) vs. elapsed time (s), one curve per threshold τ. Panels: (a) GeoNames, (b) Reads.
47 Evaluating Scalability. Figure: Evaluating the scalability of the similarity search algorithm (8 threads). Axes: number of strings (×1,000,000) vs. elapsed time (s), one curve per threshold τ. Panels: (a) GeoNames, (b) Reads.
48 Appendix Our Team About our team I We are from Tsinghua University, Beijing, China. Yu Jiang, Jiannan Wang, Guoliang Li, Jianhua Feng and.
49 Appendix Our Team About our team II
50 Appendix Our Team Thank You Q & A Pass-Join: A Partition based Method for Similarity Joins. Guoliang Li,, Jiannan Wang, Jianhua Feng. VLDB 2012.
More informationFROM QUERIES TO TOP-K RESULTS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS
FROM QUERIES TO TOP-K RESULTS Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Retrieval models Retrieval evaluation Link
More informationImplementing Approximate Regularities
Implementing Approximate Regularities Manolis Christodoulakis Costas S. Iliopoulos Department of Computer Science King s College London Kunsoo Park School of Computer Science and Engineering, Seoul National
More informationarxiv: v1 [cs.db] 2 Sep 2014
An LSH Index for Computing Kendall s Tau over Top-k Lists Koninika Pal Saarland University Saarbrücken, Germany kpal@mmci.uni-saarland.de Sebastian Michel Saarland University Saarbrücken, Germany smichel@mmci.uni-saarland.de
More informationMinimizing Clock Latency Range in Robust Clock Tree Synthesis
Minimizing Clock Latency Range in Robust Clock Tree Synthesis Wen-Hao Liu Yih-Lang Li Hui-Chi Chen You have to enlarge your font. Many pages are hard to view. I think the position of Page topic is too
More informationDisconnecting Networks via Node Deletions
1 / 27 Disconnecting Networks via Node Deletions Exact Interdiction Models and Algorithms Siqian Shen 1 J. Cole Smith 2 R. Goli 2 1 IOE, University of Michigan 2 ISE, University of Florida 2012 INFORMS
More informationNumerical Characterization of Multi-Dielectric Green s Function for 3-D Capacitance Extraction with Floating Random Walk Algorithm
Numerical Characterization of Multi-Dielectric Green s Function for 3-D Capacitance Extraction with Floating Random Walk Algorithm Hao Zhuang 1, 2, Wenjian Yu 1 *, Gang Hu 1, Zuochang Ye 3 1 Department
More informationA PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS
A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS CRYSTAL L. KAHN and BENJAMIN J. RAPHAEL Box 1910, Brown University Department of Computer Science & Center for Computational Molecular Biology
More informationData Analytics Beyond OLAP. Prof. Yanlei Diao
Data Analytics Beyond OLAP Prof. Yanlei Diao OPERATIONAL DBs DB 1 DB 2 DB 3 EXTRACT TRANSFORM LOAD (ETL) METADATA STORE DATA WAREHOUSE SUPPORTS OLAP DATA MINING INTERACTIVE DATA EXPLORATION Overview of
More informationUnsupervised Vocabulary Induction
Infant Language Acquisition Unsupervised Vocabulary Induction MIT (Saffran et al., 1997) 8 month-old babies exposed to stream of syllables Stream composed of synthetic words (pabikumalikiwabufa) After
More information0-1 Knapsack Problem in parallel Progetto del corso di Calcolo Parallelo AA
0-1 Knapsack Problem in parallel Progetto del corso di Calcolo Parallelo AA 2008-09 Salvatore Orlando 1 0-1 Knapsack problem N objects, j=1,..,n Each kind of item j has a value p j and a weight w j (single
More informationSu Liu 1, Alexandros Papakonstantinou 2, Hongjun Wang 1,DemingChen 2
Real-Time Object Tracking System on FPGAs Su Liu 1, Alexandros Papakonstantinou 2, Hongjun Wang 1,DemingChen 2 1 School of Information Science and Engineering, Shandong University, Jinan, China 2 Electrical
More informationMining Emerging Substrings
Mining Emerging Substrings Sarah Chan Ben Kao C.L. Yip Michael Tang Department of Computer Science and Information Systems The University of Hong Kong {wyschan, kao, clyip, fmtang}@csis.hku.hk Abstract.
More informationBio nformatics. Lecture 3. Saad Mneimneh
Bio nformatics Lecture 3 Sequencing As before, DNA is cut into small ( 0.4KB) fragments and a clone library is formed. Biological experiments allow to read a certain number of these short fragments per
More informationCounting Palindromic Binary Strings Without r-runs of Ones
1 3 47 6 3 11 Journal of Integer Sequences, Vol. 16 (013), Article 13.8.7 Counting Palindromic Binary Strings Without r-runs of Ones M. A. Nyblom School of Mathematics and Geospatial Science RMIT University
More informationTrace Reconstruction Revisited
Trace Reconstruction Revisited Andrew McGregor 1, Eric Price 2, Sofya Vorotnikova 1 1 University of Massachusetts Amherst 2 IBM Almaden Research Center Problem Description Take original string x of length
More informationSkylines. Yufei Tao. ITEE University of Queensland. INFS4205/7205, Uni of Queensland
Yufei Tao ITEE University of Queensland Today we will discuss problems closely related to the topic of multi-criteria optimization, where one aims to identify objects that strike a good balance often optimal
More informationBinary Convolutional Neural Network on RRAM
Binary Convolutional Neural Network on RRAM Tianqi Tang, Lixue Xia, Boxun Li, Yu Wang, Huazhong Yang Dept. of E.E, Tsinghua National Laboratory for Information Science and Technology (TNList) Tsinghua
More informationA Laplacian of Gaussian-based Approach for Spot Detection in Two-Dimensional Gel Electrophoresis Images
A Laplacian of Gaussian-based Approach for Spot Detection in Two-Dimensional Gel Electrophoresis Images Feng He 1, Bangshu Xiong 1, Chengli Sun 1, Xiaobin Xia 1 1 Key Laboratory of Nondestructive Test
More informationSP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay
SP-CNN: A Scalable and Programmable CNN-based Accelerator Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay Motivation Power is a first-order design constraint, especially for embedded devices. Certain
More informationDATA MINING LECTURE 3. Frequent Itemsets Association Rules
DATA MINING LECTURE 3 Frequent Itemsets Association Rules This is how it all started Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases.
More informationPhotometry of Supernovae with Makali i
Photometry of Supernovae with Makali i How to perform photometry specifically on supernovae targets using the free image processing software, Makali i This worksheet describes how to use photometry to
More informationComputability Theory
CS:4330 Theory of Computation Spring 2018 Computability Theory Decidable Problems of CFLs and beyond Haniel Barbosa Readings for this lecture Chapter 4 of [Sipser 1996], 3rd edition. Section 4.1. Decidable
More informationInduction of Decision Trees
Induction of Decision Trees Peter Waiganjo Wagacha This notes are for ICS320 Foundations of Learning and Adaptive Systems Institute of Computer Science University of Nairobi PO Box 30197, 00200 Nairobi.
More information