TASM: Top-k Approximate Subtree Matching

Nikolaus Augsten (1), Denilson Barbosa (2), Michael Böhlen (3), Themis Palpanas (4)

(1) Free University of Bozen-Bolzano, Italy, augsten@inf.unibz.it
(2) University of Alberta, Canada, denilson@cs.ualberta.ca
(3) University of Zurich, Switzerland, boehlen@ifi.uzh.ch
(4) University of Trento, Italy, themis@disi.unitn.eu

ICDE 2010, March 3, Long Beach, CA, USA

Outline

1. Motivation and Problem Definition
2. TASM-Postorder
   - Upper Bound on Subtree Size
3. Experiments
4. Conclusion and Future Work

Motivation and Problem Definition

Motivation

Query (XML fragment):

  article
    authors
      author
        Tim
      author
        John
    booktitle
      ICDE

Document (very large XML): DBLP, 28M nodes, 531MB

Rank the top-k matches for the article query in the DBLP document!

Example answer (k = 3):
[Figure: three matching subtrees with 1 error, 2 errors, and 3 errors, respectively. Node labels shown: inproceedings, article, authors, author, booktitle, ICDE, TKDE, Tim, John, Peter.]

TASM: Top-k Approximate Subtree Matching

Definition (TASM: Top-k Approximate Subtree Matching)
Given: query tree $Q$, document tree $T$, size $k$ of the ranking.
Goal: compute a top-$k$ ranking $R = (T_1, T_2, \ldots, T_k)$ of all subtrees $T_i$ of document $T$ with respect to query $Q$, using the tree edit distance for the ranking.

- Subtree $T_i$: a node and all its descendants
  - the largest subtree is the document itself
- Top-$k$ ranking $R = (T_1, T_2, \ldots, T_k)$
  - subtrees sorted by distance to the query
  - best $k$ subtrees: $T_i \notin R \Rightarrow \mathrm{ted}(Q, T_k) \le \mathrm{ted}(Q, T_i)$

Ranking Function: Tree Edit Distance (TED)

[Figure: the article query tree is edited in two steps, del(authors) and ren(ICDE); the final tree has the children author (Tim), author (John), and booktitle (TKDE).]

Tree Edit Distance: minimum number of node edit operations (insert, rename, delete) that transform one tree into the other.

- TASM computes the TED between the query and document subtrees
- Size and number of the computed subtrees define the TASM complexity
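
To make the definition concrete, here is a minimal sketch of a unit-cost tree edit distance for ordered labeled trees. It uses the textbook forest recursion with memoization, not the optimized algorithms cited on the next slide, and is only meant for small trees. The `tree`, `forest_distance`, and `ted` names are illustrative, and the example assumes that the rename in the figure turns ICDE into TKDE.

```python
from functools import lru_cache

def tree(label, *children):
    """Build an ordered labeled tree as a hashable (label, children) pair."""
    return (label, tuple(children))

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not f:                                   # insert the rightmost root of g
        _, w_kids = g[-1]
        return forest_distance(f, g[:-1] + w_kids) + 1
    if not g:                                   # delete the rightmost root of f
        _, v_kids = f[-1]
        return forest_distance(f[:-1] + v_kids, g) + 1
    (v_label, v_kids), (w_label, w_kids) = f[-1], g[-1]
    delete = forest_distance(f[:-1] + v_kids, g) + 1
    insert = forest_distance(f, g[:-1] + w_kids) + 1
    # Map the two rightmost roots onto each other: rename costs 1, equal labels cost 0.
    match = (forest_distance(v_kids, w_kids)
             + forest_distance(f[:-1], g[:-1])
             + (v_label != w_label))
    return min(delete, insert, match)

def ted(t1, t2):
    """Tree edit distance between two single trees."""
    return forest_distance((t1,), (t2,))

# The 8-node query from the motivation slide.
query = tree("article",
             tree("authors", tree("author", tree("Tim")), tree("author", tree("John"))),
             tree("booktitle", tree("ICDE")))
# Assumed result of del(authors) followed by renaming ICDE to TKDE.
edited = tree("article",
              tree("author", tree("Tim")), tree("author", tree("John")),
              tree("booktitle", tree("TKDE")))
print(ted(query, edited))  # 2
```

The recursion always works on the rightmost root of each forest, which keeps the sibling order intact; memoization avoids recomputing shared sub-forests.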

State of the Art

TASM-Dynamic: dynamic programming solution [1]
- computes the distance to every subtree of the document
- uses smaller subtrees to compute larger ones
- ranks subtrees by visiting the memoization table
- space complexity: $O(mn)$ ($m$: query size, $n$: document size)

Space complexity limits the application to databases:
- in database applications $n$ is huge (database size!)
- TASM-Dynamic maintains two $m \times n$ matrices in RAM
- more than 6GB RAM for our tiny query ($m = 8$) on DBLP ($n = 28 \cdot 10^6$)

For database-size documents, dynamic programming is too expensive. State-of-the-art algorithms do not scale!

[1] Zhang and Shasha 1989, Demaine et al. 2007

Problem Definition

Find a solution for TASM (Top-k Approximate Subtree Matching) that
- scales to very large documents
- runs in small memory
- ranks subtrees correctly (no heuristics!)

TASM-Postorder

Upper Bound on Subtree Size

Subtree Size Upper Bound in Three Steps

1. Rank the first $k$ subtrees of $T$ in postorder: $R' = (T'_1, T'_2, \ldots, T'_k)$, with worst match $T'_k$.
   (i) $\mathrm{ted}(Q, T'_k) \le |Q| + |T'_k|$ (delete $Q$, insert $T'_k$), and $|T'_k| \le k$.
2. Final ranking $R = (T_1, T_2, \ldots, T_k)$ (= TASM result): the $T_i$ in $R$ are no worse than the worst match $T'_k$ of $R'$.
   (ii) $\mathrm{ted}(Q, T_i) \le \mathrm{ted}(Q, T'_k) \le |Q| + |T'_k|$
3. Size upper bound for subtree $T_i$: at least the missing nodes must be inserted, so $|T_i| - |Q| \le \mathrm{ted}(Q, T_i)$, and therefore
   $|T_i| \le \mathrm{ted}(Q, T_i) + |Q| \le 2|Q| + |T'_k| \le 2|Q| + k$.

Upper Bound on Subtree Size

Theorem (Upper Bound on Subtree Size)
TASM needs to consider only small document subtrees of size $\tau$ or less: $\tau = 2|Q| + k$

The upper bound is very powerful:
- independent of document size and structure!
- linear in query size and $k$

Example: top-10 with the example query ($|Q| = 8$) on DBLP (28M nodes)
- with bound: maximum subtree size $\tau = 2 \cdot 8 + 10 = 26$
- without bound: maximum subtree size is 28M (whole document)!

Document-independent upper bound on subtree size!
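
As a quick check of the arithmetic in the example, the bound can be written as a one-line function. The function name is illustrative and the document size is the rounded 28M figure from the slide.

```python
def subtree_size_bound(query_size, k):
    """Upper bound tau = 2|Q| + k on the size of subtrees TASM must consider."""
    return 2 * query_size + k

tau = subtree_size_bound(query_size=8, k=10)
dblp_nodes = 28_000_000            # rounded document size from the slide
print(tau)                         # 26
print(dblp_nodes // tau)           # how many times smaller the largest subtree becomes
```

The bound grows only with the query size and k, so the same value applies to a document of any size.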

Document Format: Postorder Queue

Example document:

  dblp
    article
      auth
        John
      title
        X1
    proceedings
      conf
        VLDB
      article
        auth
          Peter
        title
          X3
      article
        auth
          Mike
        title
          X4
    book
      title
        X2

Postorder queue of the example document:
John,1 auth,2 X1,1 title,2 article,5 VLDB,1 conf,2 Peter,1 auth,2 X3,1 title,2 article,5 Mike,1 auth,2 X4,1 title,2 article,5 proc,13 X2,1 title,2 book,3 dblp,22

Postorder queue: queue of (label, size) pairs
- dequeue removes the leftmost element, e.g., (John, 1)
- no random access!

Relevant and state of the art for XML
- Parsing:
  - the full subtree is known only at the closing tag
  - closing tags appear in postorder
- The implementation is efficient and heavily used for
  - XML streams
  - plain XML files (e.g., SAX)
  - XML in databases (Dewey, interval encoding, ...)
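
The (label, size) pairs can be produced by a single postorder traversal. The sketch below assumes the document is already available as nested `(label, children)` pairs, which is an illustrative in-memory format; a streaming parser would emit the same pairs at each closing tag.

```python
def postorder_queue(node):
    """Return the (label, size) pairs of a (label, children) tree in postorder."""
    label, children = node
    queue, size = [], 1
    for child in children:
        sub = postorder_queue(child)
        queue.extend(sub)
        size += sub[-1][1]        # the last pair of a child is its root and subtree size
    queue.append((label, size))
    return queue

# The example document from the slide.
dblp = ("dblp", [
    ("article", [("auth", [("John", [])]), ("title", [("X1", [])])]),
    ("proceedings", [
        ("conf", [("VLDB", [])]),
        ("article", [("auth", [("Peter", [])]), ("title", [("X3", [])])]),
        ("article", [("auth", [("Mike", [])]), ("title", [("X4", [])])]),
    ]),
    ("book", [("title", [("X2", [])])]),
])

queue = postorder_queue(dblp)
print(queue[:5])    # [('John', 1), ('auth', 2), ('X1', 1), ('title', 2), ('article', 5)]
print(queue[-1])    # ('dblp', 22)
```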

Candidate Subtrees

Candidate subtrees are all subtrees $T_i$ of the document with
- $|T_i| \le \tau$, AND
- $T_i$ is not contained in a larger subtree $T_j$ with $|T_j| \le \tau$.

Pruning: find the candidate subtrees.

Simple Pruning Approach

[Figure: the example document with each node annotated by its postorder position, from John (1) to dblp (22).]

Simple pruning approach ($\tau = 6$ in the example above):
- add nodes to a memory buffer until a non-candidate ($|T_i| > \tau$) is added
- subtrees of the non-candidate with $|T_i| \le \tau$ are candidate subtrees

Problem: the memory buffer can grow very large!
- subtrees must be kept in memory until a non-candidate ancestor is read
- worst case: the memory buffer stores $O(n)$ nodes (frequent in data-centric XML!)

Example: DBLP, $\tau = 50$: 99% of the nodes are still in the buffer when the root node is read!

Simple pruning is not feasible for large documents!
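
The following is a minimal sketch of the simple pruning approach on the example postorder queue, written to show how many nodes it has to buffer. The stack-based bookkeeping and the names `simple_pruning`, `pending`, and `peak` are assumptions made for illustration, not taken from the talk.

```python
QUEUE = [("John", 1), ("auth", 2), ("X1", 1), ("title", 2), ("article", 5),
         ("VLDB", 1), ("conf", 2), ("Peter", 1), ("auth", 2), ("X3", 1),
         ("title", 2), ("article", 5), ("Mike", 1), ("auth", 2), ("X4", 1),
         ("title", 2), ("article", 5), ("proc", 13), ("X2", 1), ("title", 2),
         ("book", 3), ("dblp", 22)]

def simple_pruning(postorder, tau):
    """Return (candidates, peak): the candidate subtrees as lists of
    (label, size) pairs and the largest number of nodes buffered at once."""
    pending = []       # stack of (root_size, nodes); nodes is [] for non-candidates
    candidates = []
    buffered = peak = 0
    for label, size in postorder:
        covered, children = 1, []
        while covered < size:               # pop the children of the new node
            child = pending.pop()
            covered += child[0]
            children.append(child)
        children.reverse()                  # restore document order
        if size <= tau:                     # still small: keep the whole subtree buffered
            nodes = [pair for _, sub in children for pair in sub]
            nodes.append((label, size))
            pending.append((size, nodes))
            buffered += 1
        else:                               # non-candidate: its small children are candidates
            for _, sub in children:
                if sub:
                    candidates.append(sub)
                    buffered -= len(sub)
            pending.append((size, []))
        peak = max(peak, buffered)
    candidates.extend(sub for _, sub in pending if sub)   # document no larger than tau
    return candidates, peak

candidates, peak = simple_pruning(QUEUE, tau=6)
print([sub[-1] for sub in candidates])
print(peak, "of", len(QUEUE), "nodes buffered at the peak")
```

On this 22-node example the buffer peaks at 17 nodes, just before the proceedings node is read, which illustrates why the approach does not work for large documents.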

Efficient Pruning is Tricky!

Problem: when can we remove a node from the buffer?
- when we see $|T_i| \le \tau$, we don't yet know about the parent (postorder!)
- the subtree of the parent might also be no larger than $\tau$!

Our solution does not wait for the parent:
- prefix ring buffer: fixed-size buffer
- pruning rule: prune based on the following nodes

Pruning in Small Memory

[Figure: prefix ring buffer ($\tau = 6$) holding John,1 auth,2 X1,1 title,2 article,5 VLDB,1, with start (s) and end (e) pointers.]

Prefix ring buffer of size $\tau + 1$ (main memory)
- stores a prefix ($\tau$ nodes in postorder) of the document
- two operations:
  - append a new node
  - remove the leftmost subtree/node

Pruning rule: if the leftmost node in the full ring buffer is a
- leaf: the leftmost subtree is a candidate subtree
- non-leaf: the leftmost node is a non-candidate node

Pruning Rule Intuition

Candidate subtree: the leftmost node is a leaf
- $T_i$: leftmost subtree, starts with the leftmost node
- $T_j$: smallest subtree that contains $T_i$
- due to postorder: $T_j$ contains all nodes in the buffer
- since $|T_i| \le \tau$ and $|T_j| > \tau$: $T_i$ is a candidate

Non-candidate node: the leftmost node is a non-leaf
- the leftmost non-leaf is the parent of previously removed nodes
- we remove only candidate subtrees or non-candidate nodes
- in both cases: the parent is a non-candidate

Example

Example document as in the postorder queue slide, $\tau = 6$.

1. fill the ring buffer (append)
2. check the leftmost node
   - leaf: candidate subtree goes to the result
   - non-leaf: non-candidate, remove
3. repeat until queue and buffer are empty

[Figure: snapshot of the run. Postorder queue (input): article,5 Mike,1 auth,2 ... Prefix ring buffer (main memory): Peter,1 auth,2 X3,1 title,2, with VLDB,1 conf,2 being removed. Candidate subtrees (output): the first article subtree (auth John, title X1).]
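
The three steps can be sketched as a generator over the postorder queue. This is a simplified illustration under stated assumptions: a `collections.deque` holding at most τ nodes stands in for the fixed-size ring buffer, and the leftmost subtree is found with a linear scan of the buffer rather than with any pointer structure the actual implementation may use. The function name is illustrative.

```python
from collections import deque

QUEUE = [("John", 1), ("auth", 2), ("X1", 1), ("title", 2), ("article", 5),
         ("VLDB", 1), ("conf", 2), ("Peter", 1), ("auth", 2), ("X3", 1),
         ("title", 2), ("article", 5), ("Mike", 1), ("auth", 2), ("X4", 1),
         ("title", 2), ("article", 5), ("proc", 13), ("X2", 1), ("title", 2),
         ("book", 3), ("dblp", 22)]

def candidate_subtrees(postorder, tau):
    """Yield candidate subtrees as lists of (label, size) pairs, buffering at most tau nodes."""
    source = iter(postorder)
    buffer = deque()
    exhausted = False
    while True:
        # 1. Fill the buffer until it holds tau nodes or the input is exhausted.
        while not exhausted and len(buffer) < tau:
            pair = next(source, None)
            if pair is None:
                exhausted = True
            else:
                buffer.append(pair)
        # 3. Stop when both the queue and the buffer are empty.
        if not buffer:
            return
        # 2. Check the leftmost node.
        if buffer[0][1] == 1:
            # Leaf: the leftmost subtree is the largest subtree that starts at the
            # leftmost node and lies entirely in the buffer. A node at buffer
            # index i starts at index 0 exactly when its size is i + 1.
            length = max(i + 1 for i, (_, size) in enumerate(buffer) if size == i + 1)
            yield [buffer.popleft() for _ in range(length)]
        else:
            buffer.popleft()   # non-leaf: non-candidate node, drop it

result = list(candidate_subtrees(QUEUE, tau=6))
for subtree in result:
    print(subtree[-1], len(subtree))
```

For the example document this yields the three article subtrees, the conf subtree, and the book subtree in document order, while never holding more than six nodes.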

TASM-Postorder

TASM-postorder
1. start with an empty ranking $R$ and a tightening upper bound $\tau' = \tau$
2. for each candidate subtree $T_i$:
   a. if $|R| = k$: update $\tau' = \min(\tau, \max(R) + |Q|)$
   b. compute the tree edit distance for all subtrees of $T_i$ within $\tau'$
   c. update the ranking $R$

Theorem (TASM-Postorder)
The space complexity of TASM-postorder is independent of the document size: $O(m^2 + mk)$ ($m$: query size, $k$: result size)

TASM-postorder scales to very large documents!
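
Steps 2a and 2c can be illustrated with a bounded max-heap. This sketch assumes that max(R) denotes the largest distance currently in the ranking; the `Ranking` class and `tightened_bound` function are hypothetical helpers, and the distances fed in below are made-up values, not results from the talk.

```python
import heapq

class Ranking:
    """Keeps the k smallest distances seen so far (max-heap via negated distances)."""

    def __init__(self, k):
        self.k = k
        self._heap = []                     # entries: (-distance, subtree_id)

    def full(self):
        return len(self._heap) == self.k

    def worst(self):
        """Largest distance currently in the ranking."""
        return -self._heap[0][0]

    def offer(self, distance, subtree_id):
        """Step 2c: insert the subtree if it beats the current worst match."""
        item = (-distance, subtree_id)
        if not self.full():
            heapq.heappush(self._heap, item)
        elif distance < self.worst():
            heapq.heapreplace(self._heap, item)

    def items(self):
        return sorted((-neg, sid) for neg, sid in self._heap)

def tightened_bound(tau, ranking, query_size):
    """Step 2a: tau' = min(tau, max(R) + |Q|) once the ranking holds k subtrees."""
    if ranking.full():
        return min(tau, ranking.worst() + query_size)
    return tau

query_size, k = 8, 2
tau = 2 * query_size + k                    # 18
ranking = Ranking(k)
ranking.offer(5, "a")
ranking.offer(3, "b")
bound_after_two = tightened_bound(tau, ranking, query_size)
ranking.offer(1, "c")
bound_after_three = tightened_bound(tau, ranking, query_size)
print(bound_after_two, bound_after_three, ranking.items())
```

The tightening is safe because a subtree with more than max(R) + |Q| nodes needs more than max(R) insertions and therefore cannot enter the ranking.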

Experiments

Pruning Effectiveness

Prefix ring buffer pruning is very effective! The maximum subtree size is reduced from 37M to 18 nodes.

- Dataset: PSD protein sequences, 37M nodes, 683MB
- Compute TASM ($|Q| = 4$, $k = 1$)
  - TASM-dynamic (state of the art)
  - TASM-postorder (our solution)

[Figure: histograms of the computed subtrees, number of subtrees vs. subtree size (nodes), both axes logarithmic. TASM-Dynamic: largest subtree 37M (entire document). TASM-Postorder: largest subtree 18.]

Scalability: TASM-Postorder vs. TASM-Dynamic

TASM-postorder is much faster than TASM-dynamic.

- Dataset: XMark (synthetic XML for benchmarks)
- Vary query size and document size
- Compute TASM ($k = 5$)
  - TASM-dynamic (state of the art)
  - TASM-postorder (our solution)
- Measure wall clock time

[Figure, left: time (seconds, log scale) vs. query size (4 to 64 nodes) for TASM-dynamic and TASM-postorder on documents of 112MB and 224MB. Right: time (seconds, log scale) vs. document size (112MB to 1792MB) for both algorithms with $|Q| = 4$ and $|Q| = 8$.]

Scalability with Result Size k

TASM-postorder scales well with $k$. Increasing $k$ by 4 orders of magnitude only doubles the runtime.

- Dataset: XMark (synthetic XML for benchmarks)
- Vary $k$ (size of ranking)
- Compute TASM ($|Q| = 16$)
  - TASM-dynamic (state of the art)
  - TASM-postorder (our solution)
- Measure wall clock time

[Figure: time (seconds, 0 to 300) vs. $k$ (1e0 to 1e4, log scale) for TASM-dynamic and TASM-postorder on documents of 112MB and 224MB.]

Space Complexity: TASM-Postorder vs. TASM-Dynamic

TASM-postorder: space independent of the document!

- Dataset: XMark (synthetic XML for benchmarks)
- Vary document size
- Compute TASM ($k = 5$)
  - TASM-dynamic (state of the art)
  - TASM-postorder (our solution)
- Measure main memory usage

[Figure: memory (MB, log scale) vs. document size (112MB to 1792MB) for TASM-dynamic and TASM-postorder with $|Q| = 4$ and $|Q| = 16$; annotated values: 3GB and 8MB.]

Conclusion and Future Work

Conclusion

- Prefix ring buffer for space-efficient pruning
- Dynamic programming does not scale to database-size documents.
- Upper bound $\tau$: limits the maximum subtree size for TASM
- TASM-postorder: highly scalable TASM algorithm

TASM-postorder makes TASM feasible.

Future Work

New research opportunities:
- tune the tree edit distance to different applications
- index the document: can we avoid a document scan?
- parallel TASM algorithm: where to split the document?

References

Erik D. Demaine, Shay Mozes, Benjamin Rossman, and Oren Weimann. An optimal decomposition algorithm for tree edit distance. In ICALP, volume 4596 of LNCS, pages 146-157, Wroclaw, Poland, July 2007. Springer.

K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. on Computing, 18(6):1245-1262, 1989.