Mining Frequent Closed Unordered Trees Through Natural Representations
José L. Balcázar, Albert Bifet and Antoni Lozano
Universitat Politècnica de Catalunya
Pascal Workshop: Learning from and with Graphs, 2007, Alicante
Trees are sanctuaries. Whoever knows how to listen to them, can learn the truth. (Hermann Hesse)
Mining frequent trees is becoming an important task.
Applications: chemical informatics, computer vision, text retrieval, bioinformatics, Web analysis.
Many link-based structures may be studied formally by means of unordered trees.
Introduction Unordered Trees One unordered tree with two different drawings, each of which corresponds to a different ordered tree.
Introduction Induced subtrees: obtained by repeatedly removing leaf nodes Embedded subtrees: obtained by contracting some of the edges
Introduction
What Is Tree Pattern Mining?
Given a dataset of trees, find the complete set of frequent subtrees.
  Frequent Tree Pattern (FT): includes all the trees whose support is no less than min_sup.
  Closed Frequent Tree Pattern (CT): includes no tree that has a super-tree with the same support.
  $CT \subseteq FT$
Closed frequent tree mining provides a compact representation of frequent trees without loss of information.
Introduction
Ordered Subtree Mining
D = {A, B}, min_sup = 2
  Number of closed subtrees: 2
  Number of frequent subtrees: 8
  Closed subtrees: X, Y
[Figure: the frequent subtrees]
Introduction
Unordered Subtree Mining
[Figure: the trees A and B, and the closed subtrees X and Y]
D = {A, B}, min_sup = 2
  Number of closed subtrees: 2
  Number of frequent subtrees: 9
  Closed subtrees: X, Y
[Figure: the frequent subtrees]
Related Work
Yun Chi, Richard Muntz, Siegfried Nijssen, Joost Kok. Frequent Subtree Mining - An Overview. 2005
FREQUENT
  Labelled and Rooted Trees, Unordered, Induced:
    Unot [Asai 2003]
    UFreqT [Nijssen 2003]
    HybridTreeMiner [Chi 2004]
    PathJoin [Xiao 2003]
CLOSED
  Labelled and Induced Trees: CMTREEMINER [Chi, Yang, Xia, Muntz 2004]
  Labelled and relaxed included Trees: DRYADE [Termier, Rousset, Sebag 2004]
  Labelled and Attribute Trees: CLOATT [Arimura, Uno 2005]
Natural Representation
Definition. Given two sequences of natural numbers $x$, $y$:
  $x \cdot y$: concatenation of $x$ and $y$
  $x + i$: addition of $i$ to each component of $x$
  $x^{+} = x + 1$
Definition. A natural sequence is a sequence $(x_1, \ldots, x_n)$ of natural numbers such that
  $x_1 = 0$
  each subsequent number $x_{i+1}$ belongs to the range $1 \le x_{i+1} \le x_i + 1$.
Example. $x = (0, 1, 2, 3, 1, 2) = (0) \cdot (0, 1, 2)^{+} \cdot (0, 1)^{+}$
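As a small illustration of these definitions, the following Python sketch implements the shift $x + i$ and a validity check for natural sequences. The names `shift` and `is_natural_sequence` are illustrative choices, not taken from the slides; sequences are modelled as tuples, so that `+` on tuples plays the role of concatenation.

```python
def shift(x, i=1):
    """Return x + i: add i to every component of the sequence x."""
    return tuple(v + i for v in x)


def is_natural_sequence(x):
    """Check that x_1 = 0 and 1 <= x_{i+1} <= x_i + 1 for every later entry."""
    if not x or x[0] != 0:
        return False
    return all(1 <= b <= a + 1 for a, b in zip(x, x[1:]))


# The example from the slide: (0) . (0, 1, 2)^+ . (0, 1)^+
x = (0,) + shift((0, 1, 2)) + shift((0, 1))
print(x)                       # (0, 1, 2, 3, 1, 2)
print(is_natural_sequence(x))  # True
```

A sequence such as `(0, 2)` is rejected, because a number may exceed its predecessor by at most one.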
Natural Representation
Definition. Let $t$ be an ordered tree. If $t$ is a single node, then $\langle t \rangle = (0)$. Otherwise, if $t$ is composed of the trees $t_1, \ldots, t_k$ joined to a common root $r$ (where the ordering $t_1, \ldots, t_k$ is the same as that of the children of $r$), then
  $\langle t \rangle = (0) \cdot \langle t_1 \rangle^{+} \cdot \langle t_2 \rangle^{+} \cdots \langle t_k \rangle^{+}$
is the natural representation of $t$.
Example. $x = (0, 1, 2, 2, 3, 1) = (0) \cdot (0, 1, 1, 2)^{+} \cdot (0)^{+}$
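The recursive definition translates almost directly into code. The sketch below assumes, purely for illustration, that an ordered tree is stored as a nested Python list in which every node is the list of its children (so a leaf is `[]`); the function names are hypothetical. It also shows the inverse direction, rebuilding the tree from its sequence of depths.

```python
def natural_representation(tree):
    """Return <t> = (0) . <t_1>^+ ... <t_k>^+ for a nested-list ordered tree."""
    seq = (0,)
    for child in tree:
        seq += tuple(v + 1 for v in natural_representation(child))
    return seq


def tree_from_sequence(seq):
    """Rebuild the nested-list tree from a natural sequence of node depths."""
    root = []
    path = [root]  # path[d] is the most recently created node at depth d
    for depth in seq[1:]:
        node = []
        path[depth - 1].append(node)  # attach to the latest node one level up
        del path[depth:]              # forget deeper nodes of earlier branches
        path.append(node)
    return root


# The slide's example: subtrees (0, 1, 1, 2) and (0) joined to a common root.
t = [[[], [[]]], []]
print(natural_representation(t))  # (0, 1, 2, 2, 3, 1)
```

Each entry of the sequence is the depth of a node in preorder, which is why the two functions are inverses of each other on natural sequences.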
Mining frequent subtrees in the ordered case
Definition. $y$ is a one-step extension of $x$ (in symbols, $x \vdash_1 y$) if $x$ is a prefix of $y$ and $|y| = |x| + 1$.
A series of one-step extensions from $(0)$ to a natural sequence $x$,
  $(0) \vdash_1 x_1 \vdash_1 \cdots \vdash_1 x_{k-1} \vdash_1 x$,
always exists and must be unique, since the $x_i$ can only be the prefixes of $x$.
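A minimal sketch of one-step extensions, assuming natural sequences are stored as tuples (the function names are illustrative). Appending a number between 1 and the last entry plus one keeps the sequence natural, and the unique chain leading to a sequence is simply the list of its prefixes.

```python
def one_step_extensions(x):
    """All natural sequences y that have x as a prefix and one more entry."""
    return [x + (d,) for d in range(1, x[-1] + 2)]


def extension_chain(x):
    """The unique chain (0), x_1, ..., x: every non-empty prefix of x."""
    return [x[:k] for k in range(1, len(x) + 1)]


print(one_step_extensions((0, 1, 2)))
# [(0, 1, 2, 1), (0, 1, 2, 2), (0, 1, 2, 3)]
print(extension_chain((0, 1, 2, 3, 1)))
# [(0,), (0, 1), (0, 1, 2), (0, 1, 2, 3), (0, 1, 2, 3, 1)]
```

In tree terms, an extension adds one new rightmost leaf, attached to some node on the path from the root to the last node in preorder.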
Mining frequent subtrees in the ordered case
FREQUENT_SUBTREE_MINING(t, D, min_sup, T)
Input: A tree t, a tree dataset D, and min_sup.
Output: The frequent tree set T.
  insert t into T
  for every t' that can be extended from t in one step
    do if support(t') ≥ min_sup
      then FREQUENT_SUBTREE_MINING(t', D, min_sup, T)
  return T
Mining frequent subtrees in the ordered case
FREQUENT_SUBTREE_MINING(t, D, min_sup, T)
Input: A tree t, a tree dataset D, and min_sup.
Output: The frequent tree set T.
  insert t into T
  C ← ∅
  for every t' that can be extended from t in one step
    do if support(t') ≥ min_sup
      then insert t' into C
  for each t' in C
    do T ← FREQUENT_SUBTREE_MINING(t', D, min_sup, T)
  return T
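The recursion above can be sketched in Python as follows. This is only an illustration under stated assumptions: trees are natural sequences stored as tuples, and support counts the dataset trees that contain the pattern as a root-anchored subtree with sibling order preserved (what remains after repeatedly removing leaves). The helpers `tree_from_sequence`, `is_ordered_subtree` and `support` are hypothetical names, and the inclusion test is a simple greedy check, not an optimised one.

```python
def tree_from_sequence(seq):
    """Decode a natural sequence into nested lists (a node is its child list)."""
    root = []
    path = [root]  # path[d] is the most recently created node at depth d
    for depth in seq[1:]:
        node = []
        path[depth - 1].append(node)
        del path[depth:]
        path.append(node)
    return root


def is_ordered_subtree(p, t):
    """True if p fits into t with roots matched and sibling order preserved."""
    remaining = iter(t)  # shared iterator: matches must go left to right
    return all(any(is_ordered_subtree(pc, tc) for tc in remaining) for pc in p)


def support(x, dataset):
    """Number of dataset trees that contain the pattern x."""
    pattern = tree_from_sequence(x)
    return sum(is_ordered_subtree(pattern, tree_from_sequence(d)) for d in dataset)


def frequent_subtree_mining(t, dataset, min_sup, found):
    """Depth-first search over one-step extensions, as in the pseudocode."""
    found.append(t)
    candidates = [t + (d,) for d in range(1, t[-1] + 2)]
    candidates = [c for c in candidates if support(c, dataset) >= min_sup]
    for c in candidates:
        frequent_subtree_mining(c, dataset, min_sup, found)
    return found


A = (0, 1, 2, 3, 2, 1)
B = (0, 1, 2, 3, 1, 2, 2)
frequent = frequent_subtree_mining((0,), [A, B], 2, [])
print(len(frequent))  # 8
```

Because every natural sequence is reached by exactly one chain of one-step extensions, no pattern is generated twice, and an infrequent pattern is never extended.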
Canonical Forms
Definition. Let $t$ be an unordered tree, and let $t_1, \ldots, t_n$ be all the ordered trees obtained from $t$ by ordering in all possible ways all the sets of siblings of $t$. The canonical representative of $t$ is the ordered tree $t_0$ whose natural representation is maximal (according to lexicographic ordering) among the natural representations of the trees $t_i$, that is, such that
  $\langle t_0 \rangle = \max\{\langle t_i \rangle \mid 1 \le i \le n\}$.
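To make the definition concrete, the sketch below computes the canonical representative in two ways, assuming trees are nested Python lists (a node is the list of its children). `canonical_bruteforce` follows the definition literally by enumerating every ordering of every set of siblings; `canonical_sequence` obtains the same result by recursively sorting the children's representations in decreasing order. Both names are illustrative, and the sorting shortcut is an implementation choice of this example rather than something stated on the slides.

```python
from itertools import permutations, product


def orderings(tree):
    """Yield the natural representation of every ordered version of tree."""
    child_options = [list(orderings(c)) for c in tree]
    for perm in permutations(range(len(tree))):
        for choice in product(*(child_options[i] for i in perm)):
            yield (0,) + tuple(v + 1 for seq in choice for v in seq)


def canonical_bruteforce(tree):
    """Lexicographically maximal representation over all sibling orderings."""
    return max(orderings(tree))


def canonical_sequence(tree):
    """Same result, obtained by sorting child representations in decreasing order."""
    kids = sorted((canonical_sequence(c) for c in tree), reverse=True)
    return (0,) + tuple(v + 1 for k in kids for v in k)


t = [[], [[], [[]]]]  # one unordered tree, written in an arbitrary sibling order
print(canonical_bruteforce(t))  # (0, 1, 2, 3, 2, 1)
print(canonical_sequence(t))    # (0, 1, 2, 3, 2, 1)
```

Sorting is enough here because, inside a child's block, every entry after the first is at least 2, whereas the next sibling starts with 1; a longer block that shares a prefix with a shorter one therefore always wins when placed first.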
Mining frequent subtrees in the unordered case
FREQUENT_SUBTREE_MINING(t, D, min_sup, T)
Input: A tree t, a tree dataset D, and min_sup.
Output: The frequent tree set T.
  insert t into T
  C ← ∅
  for every t' that can be extended from t in one step
    do if support(t') ≥ min_sup
      then insert t' into C
  for each t' in C
    do T ← FREQUENT_SUBTREE_MINING(t', D, min_sup, T)
  return T
Mining frequent subtrees in the unordered case
FREQUENT_SUBTREE_MINING(t, D, min_sup, T)
Input: A tree t, a tree dataset D, and min_sup.
Output: The frequent tree set T.
  if not CANONICAL_REPRESENTATIVE(t)
    then return T
  insert t into T
  C ← ∅
  for every t' that can be extended from t in one step
    do if support(t') ≥ min_sup
      then insert t' into C
  for each t' in C
    do T ← FREQUENT_SUBTREE_MINING(t', D, min_sup, T)
  return T
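A possible Python rendering of the unordered variant is sketched below. It assumes trees are natural sequences stored as tuples, that support counts root-anchored subtrees with sibling order ignored, and that the canonical test compares a sequence with the maximal reordering of its tree. The helper names and the backtracking inclusion test are illustrative assumptions; the backtracking is exponential in the worst case and is meant only for small examples.

```python
def tree_from_sequence(seq):
    """Decode a natural sequence into nested lists (a node is its child list)."""
    root = []
    path = [root]
    for depth in seq[1:]:
        node = []
        path[depth - 1].append(node)
        del path[depth:]
        path.append(node)
    return root


def canonical_sequence(tree):
    """Lexicographically maximal natural representation over sibling orderings."""
    kids = sorted((canonical_sequence(c) for c in tree), reverse=True)
    return (0,) + tuple(v + 1 for k in kids for v in k)


def is_unordered_subtree(p, t):
    """True if the children of p map injectively into children of t, recursively."""
    def assign(i, used):
        if i == len(p):
            return True
        return any(
            j not in used
            and is_unordered_subtree(p[i], t[j])
            and assign(i + 1, used | {j})
            for j in range(len(t))
        )
    return assign(0, frozenset())


def support(x, dataset):
    pattern = tree_from_sequence(x)
    return sum(is_unordered_subtree(pattern, tree_from_sequence(d)) for d in dataset)


def frequent_unordered_mining(t, dataset, min_sup, found):
    if canonical_sequence(tree_from_sequence(t)) != t:
        return found  # another ordering of the same unordered tree represents it
    found.append(t)
    candidates = [t + (d,) for d in range(1, t[-1] + 2)]
    candidates = [c for c in candidates if support(c, dataset) >= min_sup]
    for c in candidates:
        frequent_unordered_mining(c, dataset, min_sup, found)
    return found


A = (0, 1, 2, 3, 2, 1)
B = (0, 1, 2, 3, 1, 2, 2)
frequent = frequent_unordered_mining((0,), [A, B], 2, [])
print(len(frequent))  # 9
```

On this dataset the sequence (0, 1, 1, 2) passes the support test but is discarded by the canonical check, since the same unordered tree is already represented by (0, 1, 2, 1).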
Closure-based mining
CLOSED_SUBTREE_MINING(t, D, min_sup, T)
  if not CANONICAL_REPRESENTATIVE(t)
    then return T
  C ← ∅
  for every t' that can be extended from t in one step
    do if support(t') ≥ min_sup
      then insert t' into C
  for each t' in C
    do T ← CLOSED_SUBTREE_MINING(t', D, min_sup, T)
  return T
Closure-based mining
CLOSED_SUBTREE_MINING(t, D, min_sup, T)
  if not CANONICAL_REPRESENTATIVE(t)
    then return T
  C ← ∅
  for every t' that can be extended from t in one step
    do if support(t') ≥ min_sup
      then insert t' into C
    do if support(t') = support(t)
      then t is not closed
  if t is closed
    then insert t into T
  for each t' in C
    do T ← CLOSED_SUBTREE_MINING(t', D, min_sup, T)
  return T
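For comparison, closedness can also be checked directly from its definition: a frequent tree is closed when no other frequent tree contains it with the same support. The sketch below applies that definition as a filter after mining, in the ordered setting, so it is a naive reference check rather than the in-search test of the pseudocode above. All names are illustrative, and the subtree test assumes root-anchored, order-preserving inclusion.

```python
def tree_from_sequence(seq):
    """Decode a natural sequence into nested lists (a node is its child list)."""
    root = []
    path = [root]
    for depth in seq[1:]:
        node = []
        path[depth - 1].append(node)
        del path[depth:]
        path.append(node)
    return root


def is_ordered_subtree(p, t):
    """True if p fits into t with roots matched and sibling order preserved."""
    remaining = iter(t)
    return all(any(is_ordered_subtree(pc, tc) for tc in remaining) for pc in p)


def includes(x, y):
    """True if the tree with sequence x is a subtree of the tree with sequence y."""
    return is_ordered_subtree(tree_from_sequence(x), tree_from_sequence(y))


def support(x, dataset):
    return sum(includes(x, d) for d in dataset)


def frequent_with_support(t, dataset, min_sup, found):
    """Collect every frequent pattern together with its support."""
    found[t] = support(t, dataset)
    for d in range(1, t[-1] + 2):
        y = t + (d,)
        if support(y, dataset) >= min_sup:
            frequent_with_support(y, dataset, min_sup, found)
    return found


def closed_only(found):
    """Keep the patterns that have no frequent proper supertree of equal support."""
    return [
        x for x, s in found.items()
        if not any(y != x and sy == s and includes(x, y) for y, sy in found.items())
    ]


A = (0, 1, 2, 3, 2, 1)
B = (0, 1, 2, 3, 1, 2, 2)
found = frequent_with_support((0,), [A, B], 2, {})
print(closed_only(found))  # [(0, 1, 2, 2), (0, 1, 2, 3, 1)]
```

Filtering within the frequent set is sufficient, because any supertree with the same support as a frequent tree is itself frequent and has therefore been enumerated.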
Example: Ordered Case
min_sup = 2
A: (0, 1, 2, 3, 2, 1), B: (0, 1, 2, 3, 1, 2, 2)
  $(0) \vdash_1 (0, 1) \vdash_1 (0, 1, 2)$
  $(0, 1, 2) \vdash_1 (0, 1, 2, 2)$
  $(0, 1, 2) \vdash_1 (0, 1, 2, 3) \vdash_1 (0, 1, 2, 3, 1)$
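Example: Unordered Case
min_sup = 2
A: (0, 1, 2, 3, 2, 1), B: (0, 1, 2, 3, 1, 2, 2)
[Figure: the trees A and B, and the closed subtrees X and Y]
  $(0) \vdash_1 (0, 1) \vdash_1 (0, 1, 2)$
  $(0, 1, 2) \vdash_1 (0, 1, 2, 2) \vdash_1 (0, 1, 2, 2, 1)$
  $(0, 1, 2) \vdash_1 (0, 1, 2, 3) \vdash_1 (0, 1, 2, 3, 1)$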
Experiments: Gazelle, Unordered Trees
[Plot: time versus support (x 1000) for CMTreeMiner and our method]
Conclusions and Future Work
Through our proposed representation of ordered trees, we have presented efficient algorithms for mining ordered and unordered frequent closed trees.
The sequential form of our representation, where the number-encoded depth furnishes the two-dimensional information, is key to the fast processing of the data.
Future work:
  Consider labelled subtrees
  Consider embedded subtrees
Future Work
Tree Kernels
Definition (Subset Trees). Set of connected nodes of a tree $T$.
Definition (Collins and Duffy 2001). Denote by $T$, $T'$ trees and by $t \models T$ a subset tree of $T$; then
  $k(T, T') = \sum_{t \models T,\, t' \models T'} w_t \, \delta_{t,t'}$
Definition (Vishwanathan and Smola 2002). In case we count matching subtrees, $t \models T$ denotes that $t$ is a subtree of $T$, and
  $k(T, T') = \sum_{t \models T,\, t' \models T'} w_t \, \delta_{t,t'}$
Tree Kernels
S. V. N. Vishwanathan and Alexander J. Smola. Fast Kernels for String and Tree Matching. 2002
We can compute the tree kernel by
  converting trees to strings
  computing string kernels
Advantages
  Simple storage and simple implementation (a dynamic array suffices)
  All speedups for strings work for tree kernels, too (XML documents, etc.)
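The idea of reducing a tree kernel to string matching can be illustrated with a deliberately naive sketch. It assumes unlabelled trees stored as nested Python lists, unit weights $w_t = 1$, and "subtree" meaning a node together with all of its descendants. Each such subtree is encoded as a bracket string with children sorted, and the kernel counts pairs of equal strings. This is not the suffix-tree algorithm of the cited paper, and the function names are hypothetical.

```python
from collections import Counter


def subtree_strings(tree, out):
    """Append the sorted bracket string of every complete subtree to out.

    Returns the string of the subtree rooted at the given node.
    """
    s = "(" + "".join(sorted(subtree_strings(c, out) for c in tree)) + ")"
    out.append(s)
    return s


def subtree_kernel(t1, t2):
    """Count pairs (t, t') of equal complete subtrees, i.e. all weights equal 1."""
    a, b = [], []
    subtree_strings(t1, a)
    subtree_strings(t2, b)
    count_a, count_b = Counter(a), Counter(b)
    return sum(n * count_b[s] for s, n in count_a.items())


A = [[[[]], []], []]      # the tree with natural representation (0, 1, 2, 3, 2, 1)
B = [[[[]]], [[], []]]    # the tree with natural representation (0, 1, 2, 3, 1, 2, 2)
print(subtree_kernel(A, B))  # 10
```

Sorting the child strings makes the encoding independent of sibling order, so two unordered trees receive the same string exactly when they are isomorphic.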