TASM: Top-k Approximate Subtree Matching


Nikolaus Augsten (1), Denilson Barbosa (2), Michael Böhlen (3), Themis Palpanas (4)
(1) Free University of Bozen-Bolzano, Italy, augsten@inf.unibz.it
(2) University of Alberta, Canada, denilson@cs.ualberta.ca
(3) University of Zurich, Switzerland, boehlen@ifi.uzh.ch
(4) University of Trento, Italy, themis@disi.unitn.eu
ICDE 2010, March 3, Long Beach, CA, USA
Nikolaus Augsten (Bolzano, Italy) TASM: Top-k Approx. Subtree Matching ICDE 2010

Outline
1. Motivation and Problem Definition
2. TASM-Postorder
   Upper Bound on Subtree Size
3. Experiments
4. Conclusion and Future Work


Motivation
Query (XML fragment): article(authors(author(Tim), author(John)), booktitle(ICDE))
Document (very large XML): DBLP, 28M nodes, 531MB
Rank the top-k matches for the article query in the DBLP document!
Example answer for k = 3:
[Figure: three matching subtrees, rooted at inproceedings, article, and inproceedings, with 1, 2, and 3 errors respectively]

TASM: Top-k Approximate Subtree Matching
Definition (TASM: Top-k Approximate Subtree Matching)
Given: query tree $Q$, document tree $T$, size $k$ of the ranking.
Goal: compute a top-$k$ ranking $R = (T_1, T_2, \dots, T_k)$ of all subtrees $T_i$ of document $T$ with respect to query $Q$, using the tree edit distance for the ranking.
Subtree $T_i$:
- a node and all its descendants
- the largest subtree is the document itself
Top-$k$ ranking $R = (T_1, T_2, \dots, T_k)$:
- subtrees sorted by distance to the query
- best $k$ subtrees: $T_i \notin R \Rightarrow \mathrm{ted}(Q, T_k) \le \mathrm{ted}(Q, T_i)$

Ranking Function: Tree Edit Distance (TED)
Tree Edit Distance: minimum number of node edit operations (insert, rename, delete) that transform one tree into the other.
Example (two edit operations):
  article(authors(author(Tim), author(John)), booktitle(ICDE))
  del(authors): article(author(Tim), author(John), booktitle(ICDE))
  ren(ICDE): article(author(Tim), author(John), booktitle(TKDE))
- TASM computes TED between the query and document subtrees
- Size and number of computed subtrees define TASM complexity
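As an illustration of the definition, here is a minimal sketch of a unit-cost tree edit distance for ordered trees. It uses the textbook forest recursion with memoization, not the Zhang and Shasha or Demaine et al. algorithms cited in these slides, and it is only meant for tiny trees. The helper names (`tree`, `forest_size`, `forest_distance`, `ted`) and the nested-tuple tree encoding are assumptions made for this example.

```python
from functools import lru_cache

def tree(label, *children):
    """Build an immutable ordered tree as (label, (child, ...))."""
    return (label, tuple(children))

@lru_cache(maxsize=None)
def forest_size(forest):
    """Number of nodes in a forest (a tuple of trees)."""
    return sum(1 + forest_size(kids) for _, kids in forest)

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f:
        return forest_size(g)          # insert every node of g
    if not g:
        return forest_size(f)          # delete every node of f
    (v_label, v_kids), (w_label, w_kids) = f[-1], g[-1]
    return min(
        # delete the rightmost root of f; its children move up
        forest_distance(f[:-1] + v_kids, g) + 1,
        # insert the rightmost root of g
        forest_distance(f, g[:-1] + w_kids) + 1,
        # map the two rightmost roots onto each other (rename if labels differ)
        forest_distance(v_kids, w_kids)
        + forest_distance(f[:-1], g[:-1])
        + (v_label != w_label),
    )

def ted(t1, t2):
    """Tree edit distance between two trees."""
    return forest_distance((t1,), (t2,))

query = tree("article",
             tree("authors",
                  tree("author", tree("Tim")),
                  tree("author", tree("John"))),
             tree("booktitle", tree("ICDE")))
match = tree("article",
             tree("author", tree("Tim")),
             tree("author", tree("John")),
             tree("booktitle", tree("TKDE")))

print(ted(query, match))  # 2: del(authors) and ren(ICDE)
```

The memoized recursion keeps the sketch short; a production implementation would use one of the dynamic programming algorithms referenced in the slides.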

State of the Art
TASM-Dynamic: dynamic programming solution [1]
- computes the distance to every subtree of the document
- uses smaller subtrees to compute larger ones
- ranks subtrees by visiting the memoization table
Space complexity: $O(mn)$, where $m$ is the query size and $n$ is the document size
Space complexity limits application to databases:
- in database applications $n$ is huge (database size!)
- TASM-Dynamic maintains two $m \times n$ matrices in RAM
- more than 6GB of RAM for our tiny query ($m = 8$) on DBLP ($n = 28 \times 10^6$)
At database scale, dynamic programming is too expensive. State-of-the-art algorithms do not scale!
[1] Zhang and Shasha 1989, Demaine et al. 2007

Problem Definition
Find a solution for TASM (Top-k Approximate Subtree Matching) that
- scales to very large documents
- runs in small memory
- ranks subtrees correctly (no heuristics!)


Subtree Size Upper Bound in Three Steps
1. Rank the first $k$ subtrees of $T$ in postorder: $R' = (T'_1, T'_2, \dots, T'_k)$, with worst match $T'_k$.
   (i) $\mathrm{ted}(Q, T'_k) \le |Q| + |T'_k|$ (delete $Q$, insert $T'_k$), and $|T'_k| \le k$ (the $k$-th node in postorder has at most $k - 1$ descendants)
2. Final ranking $R = (T_1, T_2, \dots, T_k)$ (= TASM result): the $T_i$ in $R$ are at least as good as the worst match $T'_k$ of $R'$.
   (ii) $\mathrm{ted}(Q, T_i) \le \mathrm{ted}(Q, T'_k) \le |Q| + |T'_k|$
3. Size upper bound for subtree $T_i$: at least the missing nodes must be inserted, so $|T_i| - |Q| \le \mathrm{ted}(Q, T_i)$, and therefore
   $|T_i| \le \mathrm{ted}(Q, T_i) + |Q| \le 2|Q| + |T'_k| \le 2|Q| + k$

Upper Bound on Subtree Size
Theorem (Upper Bound on Subtree Size): TASM needs to consider only small document subtrees of size $\tau$ or less:
$\tau = 2|Q| + k$
The upper bound is very powerful:
- independent of document size and structure!
- linear in query size and $k$
Example: top-10 with the example query ($|Q| = 8$) on DBLP (28M nodes)
- with bound: maximum subtree size $\tau = 2 \cdot 8 + 10 = 26$
- without bound: maximum subtree size is 28M (whole document)!
Document-independent upper bound on subtree size!
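The arithmetic of the theorem can be checked with a few lines of Python. The function name `subtree_size_bound` is made up for this sketch, and unit edit costs are assumed, as in the slides.

```python
def subtree_size_bound(query_size: int, k: int) -> int:
    """Upper bound tau = 2|Q| + k on the size of subtrees TASM must consider."""
    return 2 * query_size + k

# Example from the slide: top-10 ranking for a query with 8 nodes.
tau = subtree_size_bound(query_size=8, k=10)
print(tau)  # 26, regardless of how many nodes the document has
```

The document size does not appear as a parameter, which is exactly the point of the bound.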


Document Format: Postorder Queue
Example document:
  dblp(article(auth(John), title(X1)), proceedings(conf(VLDB), article(auth(Peter), title(X3)), article(auth(Mike), title(X4))), book(title(X2)))
Postorder queue of the example document:
  (John,1) (auth,2) (X1,1) (title,2) (article,5) (VLDB,1) (conf,2) (Peter,1) (auth,2) (X3,1) (title,2) (article,5) (Mike,1) (auth,2) (X4,1) (title,2) (article,5) (proc,13) (X2,1) (title,2) (book,3) (dblp,22)
Postorder queue: queue of (label, size) pairs
- dequeue removes the leftmost element, e.g., (John, 1)
- no random access!
Relevant and state of the art for XML:
- Parsing: the full subtree is known only at the closing tag, and closing tags appear in postorder
- Implementation is efficient and heavily used for XML streams, plain XML files (e.g., SAX), and XML in databases (Dewey, interval encoding, ...)
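A minimal sketch of how such a queue could be produced from an in-memory tree, assuming trees are encoded as `(label, [children])` pairs. The encoding and the function name `postorder_queue` are choices made for this example. A streaming XML parser would emit the same pairs at closing tags without materializing the tree.

```python
def postorder_queue(node):
    """Return the (label, size) pairs of a tree in postorder.

    A node is a pair (label, [child, ...]); size counts the node itself
    plus all of its descendants.
    """
    label, children = node
    pairs = []
    size = 1
    for child in children:
        child_pairs = postorder_queue(child)
        size += child_pairs[-1][1]   # last pair is the child's root
        pairs.extend(child_pairs)
    pairs.append((label, size))
    return pairs

def leaf(label):
    return (label, [])

def article(author, title):
    return ("article", [("auth", [leaf(author)]), ("title", [leaf(title)])])

# The example document from the slide.
dblp = ("dblp", [
    article("John", "X1"),
    ("proceedings", [
        ("conf", [leaf("VLDB")]),
        article("Peter", "X3"),
        article("Mike", "X4"),
    ]),
    ("book", [("title", [leaf("X2")])]),
])

queue = postorder_queue(dblp)
print(queue[:5])   # [('John', 1), ('auth', 2), ('X1', 1), ('title', 2), ('article', 5)]
print(queue[-1])   # ('dblp', 22)
```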

Candidate Subtrees
Candidate subtrees are all subtrees $T_i$ of the document with $|T_i| \le \tau$ AND $T_i$ is not contained in a larger subtree $T_j$ with $|T_j| \le \tau$.
Pruning: find the candidate subtrees

Simple Pruning Approach
[Figure: example document tree with postorder numbers: John 1, auth 2, X1 3, title 4, article 5, VLDB 6, conf 7, Peter 8, auth 9, X3 10, title 11, article 12, Mike 13, auth 14, X4 15, title 16, article 17, proceedings 18, X2 19, title 20, book 21, dblp 22]
Simple pruning approach ($\tau = 6$ in the example above):
- add nodes to a memory buffer until a non-candidate ($|T_i| > \tau$) is added
- subtrees of the non-candidate with $|T_i| \le \tau$ are candidate subtrees
Problem: the memory buffer can grow very large!
- must keep subtrees in memory until a non-candidate ancestor is read
- worst case: the memory buffer stores $O(n)$ nodes (frequent in data-centric XML!)
Example: DBLP, $\tau = 50$: 99% of the nodes are still in the buffer when the root node is read!
Simple pruning is not feasible for large documents!
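The following sketch shows one way the simple pruning approach could be coded, and it reports the peak buffer size to make the memory problem visible. The function `simple_pruning`, the `(position, label, size)` buffer entries, and the hard-coded `QUEUE` (the postorder queue of the example document) are illustrative assumptions, not the authors' implementation.

```python
# Postorder queue of the example document as (label, size) pairs.
QUEUE = [("John", 1), ("auth", 2), ("X1", 1), ("title", 2), ("article", 5),
         ("VLDB", 1), ("conf", 2),
         ("Peter", 1), ("auth", 2), ("X3", 1), ("title", 2), ("article", 5),
         ("Mike", 1), ("auth", 2), ("X4", 1), ("title", 2), ("article", 5),
         ("proc", 13),
         ("X2", 1), ("title", 2), ("book", 3),
         ("dblp", 22)]

def simple_pruning(postorder, tau):
    """Return (candidate subtrees, peak number of buffered nodes)."""
    buffer = []       # pending (position, label, size) entries
    candidates = []
    peak = 0
    for pos, (label, size) in enumerate(postorder, start=1):
        if size <= tau:
            # Might still be part of a larger subtree within tau: keep it.
            buffer.append((pos, label, size))
            peak = max(peak, len(buffer))
            continue
        # Non-candidate node: every pending subtree below it is a candidate.
        leftmost = pos - size + 1          # postorder position of its first leaf
        found = []
        while buffer and buffer[-1][0] >= leftmost:
            child_size = buffer[-1][2]     # last pending node roots a child subtree
            found.append([(l, s) for _, l, s in buffer[-child_size:]])
            del buffer[-child_size:]
        candidates.extend(reversed(found)) # restore document order
    if buffer:                             # the whole document fits within tau
        candidates.append([(l, s) for _, l, s in buffer])
    return candidates, peak

candidates, peak = simple_pruning(QUEUE, tau=6)
print([subtree[-1][0] for subtree in candidates])  # roots of the candidates
print(peak)  # 17 of the 22 nodes are buffered before 'proc' arrives
```

Even on this tiny document the buffer holds 17 nodes although no candidate has more than 5, which mirrors the DBLP observation on the slide.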

Efficient Pruning is Tricky!
Problem: when can we remove a node from the buffer?
- when we see $|T_i| \le \tau$, we don't yet know about the parent (postorder!)
- the subtree of the parent might also have size at most $\tau$!
Our solution does not wait for the parent:
- prefix ring buffer: fixed-size buffer
- pruning rule: prune based on the following nodes

Pruning in Small Memory
[Figure: prefix ring buffer ($\tau = 6$) holding the nodes John, auth, X1, title, article, VLDB, with start pointer s and end pointer e]
Prefix ring buffer of size $\tau + 1$ (main memory):
- stores a prefix ($\tau$ nodes in postorder) of the document
- two operations: append a new node, remove the leftmost subtree/node
Pruning rule: if the leftmost node in the full ring buffer is a
- leaf: the leftmost subtree is a candidate subtree
- non-leaf: the leftmost node is a non-candidate node

Pruning Rule Intuition
Candidate subtree: the leftmost node is a leaf
- $T_i$: leftmost subtree, starts with the leftmost node
- $T_j$: smallest subtree that contains $T_i$
- due to postorder: $T_j$ contains all nodes in the buffer
- since $|T_i| \le \tau$ and $|T_j| > \tau$: $T_i$ is a candidate
Non-candidate node: the leftmost node is a non-leaf
- the leftmost non-leaf is the parent of previously removed nodes
- we remove either candidate subtrees or non-candidate nodes
- in both cases: the parent is a non-candidate

Example ($\tau = 6$)
1. fill the ring buffer (append nodes from the postorder queue)
2. check the leftmost node:
   - leaf: move the candidate subtree to the result
   - non-leaf: remove the non-candidate node
3. repeat until queue and buffer are empty
[Figure: snapshot of the example run. Postorder queue (input): (article,5) (Mike,1) (auth,2) ...; prefix ring buffer (main memory): (VLDB,1) (conf,2) (Peter,1) (auth,2) (X3,1) (title,2), with start pointer s and end pointer e; candidate subtrees (output): article(auth(John), title(X1))]
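The three steps can be sketched in Python as follows. For brevity the sketch uses a `collections.deque` capped at $\tau$ nodes instead of a true ring buffer with start and end pointers, and it finds the leftmost subtree by scanning the buffer. Both are simplifications made for this example, and the names `prefix_buffer_pruning` and `QUEUE` are hypothetical.

```python
from collections import deque

# Postorder queue of the example document as (label, size) pairs.
QUEUE = [("John", 1), ("auth", 2), ("X1", 1), ("title", 2), ("article", 5),
         ("VLDB", 1), ("conf", 2),
         ("Peter", 1), ("auth", 2), ("X3", 1), ("title", 2), ("article", 5),
         ("Mike", 1), ("auth", 2), ("X4", 1), ("title", 2), ("article", 5),
         ("proc", 13),
         ("X2", 1), ("title", 2), ("book", 3),
         ("dblp", 22)]

def prefix_buffer_pruning(postorder, tau):
    """Yield the candidate subtrees, each as a list of (label, size) pairs."""
    stream = iter(postorder)
    buffer = deque()          # never holds more than tau nodes
    exhausted = False
    while True:
        # Step 1: fill the buffer from the postorder queue.
        while not exhausted and len(buffer) < tau:
            try:
                buffer.append(next(stream))
            except StopIteration:
                exhausted = True
        if not buffer:
            return            # Step 3: queue and buffer are empty
        # Step 2: check the leftmost node.
        if buffer[0][1] == 1:
            # Leaf: the largest buffered subtree starting at this leaf is a
            # candidate. A node at offset i spans offsets 0..i iff size == i + 1.
            end = max(i for i, (_, size) in enumerate(buffer) if size == i + 1)
            yield [buffer.popleft() for _ in range(end + 1)]
        else:
            # Non-leaf: its descendants were already removed, so it is a
            # non-candidate node and is dropped on its own.
            buffer.popleft()

candidates = list(prefix_buffer_pruning(QUEUE, tau=6))
print([subtree[-1][0] for subtree in candidates])
# ['article', 'conf', 'article', 'article', 'book']
```

The first candidate produced is the article subtree of John, and after it is removed the buffer holds VLDB through title of X3, matching the snapshot in the figure.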

TASM-Postorder
TASM-postorder:
1. start with an empty ranking $R$ and a tightening upper bound $\tau' = \tau$
2. for each candidate subtree $T_i$:
   a. if $|R| = k$: update $\tau' = \min(\tau, \max(R) + |Q|)$, where $\max(R)$ is the largest distance in $R$
   b. compute the tree edit distance for all subtrees of $T_i$ within $\tau'$
   c. update the ranking $R$
Theorem (TASM-Postorder): the space complexity of TASM-postorder is independent of the document size: $O(m^2 + mk)$ ($m$: query size, $k$: result size)
TASM-postorder scales to very large documents!
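The ranking loop with the tightening bound can be sketched with a bounded max-heap. Everything about the interface is an assumption for illustration: `tasm_rank` receives each candidate as a list of `(subtree_id, size)` pairs covering all of its subtrees, and `distance` is a caller-supplied function returning the tree edit distance to the query. The distances in the demo are invented numbers chosen to be consistent with $|T_i| - |Q| \le \mathrm{ted}(Q, T_i)$.

```python
import heapq

def tasm_rank(candidates, distance, query_size, k):
    """Return the top-k (distance, subtree_id) pairs, best first."""
    tau = 2 * query_size + k
    tau_prime = tau
    heap = []        # max-heap on distance via negation: (-dist, -arrival, id)
    arrival = 0
    for candidate in candidates:
        if len(heap) == k:
            # Step a: tighten the bound using the current worst match.
            tau_prime = min(tau, -heap[0][0] + query_size)
        for subtree_id, size in candidate:
            if size > tau_prime:
                continue                  # too large to enter the ranking
            # Step b: compute the distance only for subtrees within tau'.
            dist = distance(subtree_id)
            arrival += 1
            entry = (-dist, -arrival, subtree_id)
            # Step c: update the ranking.
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif dist < -heap[0][0]:
                heapq.heapreplace(heap, entry)
    ranked = sorted((-d, -a, s) for d, a, s in heap)
    return [(dist, subtree_id) for dist, _, subtree_id in ranked]

# Demo with made-up distances for a query of size 4 and k = 2 (tau = 10).
DISTANCES = {"a": 2, "b": 4, "c": 6, "d": 1, "e": 5, "f": 3}
CANDIDATES = [[("a", 3), ("b", 1)], [("c", 9), ("d", 4)], [("e", 7), ("f", 2)]]
computed = []

def lookup(subtree_id):
    computed.append(subtree_id)           # record which distances were needed
    return DISTANCES[subtree_id]

ranking = tasm_rank(CANDIDATES, lookup, query_size=4, k=2)
print(ranking)   # [(1, 'd'), (2, 'a')]
print(computed)  # ['a', 'b', 'd', 'f']: 'c' and 'e' were pruned by tau'
```

The heap never holds more than $k$ entries, so the ranking itself contributes space that depends on $k$ only, not on the document size.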


Pruning Effectiveness
Prefix ring buffer pruning is very effective! Maximum subtree reduced from 37M to 18 nodes.
Dataset: PSD protein sequences, 37M nodes, 683MB
Compute TASM ($|Q| = 4$, $k = 1$) with
- TASM-dynamic (state of the art)
- TASM-postorder (our solution)
[Figure: histograms of computed subtrees, number of subtrees vs. subtree size (nodes), both axes logarithmic. TASM-Dynamic: largest subtree 37M (entire document). TASM-Postorder: largest subtree 18.]

Scalability: TASM-Postorder vs. TASM-Dynamic
TASM-postorder is much faster than TASM-dynamic.
Dataset: XMark (synthetic XML for benchmark)
Vary query size and document size
Compute TASM ($k = 5$) with
- TASM-dynamic (state of the art)
- TASM-postorder (our solution)
Measure wall clock time
[Figure: two log-scale plots of wall clock time (seconds). Left: time vs. query size (4 to 64 nodes) for dyn and pos on documents of 112MB and 224MB. Right: time vs. document size (112MB to 1792MB) for dyn and pos with $|Q| = 4$ and $|Q| = 8$.]

Scalability with Result Size k
TASM-postorder scales well with $k$. Increasing $k$ by 4 orders of magnitude only doubles the runtime.
Dataset: XMark (synthetic XML for benchmark)
Vary $k$ (size of ranking)
Compute TASM ($|Q| = 16$) with
- TASM-dynamic (state of the art)
- TASM-postorder (our solution)
Measure wall clock time
[Figure: time (seconds, 0 to 300) vs. $k$ (1 to 1e4, log scale) for dyn and pos on documents of 112MB and 224MB.]

Space Complexity: TASM-Postorder vs. TASM-Dynamic
TASM-postorder: space independent of document size!
Dataset: XMark (synthetic XML for benchmark)
Vary document size
Compute TASM ($k = 5$) with
- TASM-dynamic (state of the art)
- TASM-postorder (our solution)
Measure main memory usage
[Figure: memory (MB, log scale) vs. document size (112MB to 1792MB) for dyn and pos with $|Q| = 4$ and $|Q| = 16$; the plot is annotated with 3GB and 8MB.]


Conclusion and Future Work
Conclusion:
- Dynamic programming does not scale to database-size documents.
- Upper bound $\tau$: limits the maximum subtree size for TASM
- Prefix ring buffer for space-efficient pruning
- TASM-postorder: highly scalable TASM algorithm
TASM-postorder makes TASM feasible.
Future work, new research opportunities:
- tune the tree edit distance to different applications
- index the document: can we avoid a document scan?
- parallel TASM algorithm: where to split the document?

Erik D. Demaine, Shay Mozes, Benjamin Rossman, and Oren Weimann. An optimal decomposition algorithm for tree edit distance. In ICALP, volume 4596 of LNCS, pages 146-157, Wroclaw, Poland, July 2007. Springer.
K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. on Computing, 18(6):1245-1262, 1989.