Fast Frequent Free Tree Mining in Graph Databases
Peixiang Zhao, Jeffrey Xu Yu
The Chinese University of Hong Kong
December 18th, 2006
ICDM Workshop MCD06

Synopsis
- Introduction
- Existing Approaches
- Our Algorithm: F3TM
- Performance Studies
- Conclusions

Introduction
- Graph, a general data structure to represent relations among entities, has been widely used in a broad range of areas
  - Computational biology
  - Chemistry
  - Pattern recognition
  - Computer networks
  - etc.
- Mining frequent sub-graphs in a graph database
  - Does a large graph contain another small graph? The sub-graph isomorphism problem (NP-complete)
  - Are two graphs isomorphic? The graph isomorphism problem (in NP, but not known to be in P or to be NP-complete)

Introduction
- Free Tree (ftree)
  - Connected, acyclic and undirected graph
  - Widely used in bioinformatics, computer vision, networks, etc.
  - Specialization of the general graph that avoids the undesirable theoretical properties and algorithmic complexity incurred by graphs
    - Determining whether a tree t_1 is contained in another tree t_2 can be solved in $O(m^{3/2} n / \log m)$ time
    - Determining whether t_1 is isomorphic to t_2 can be solved in O(n)
    - Determining whether a tree is isomorphic to some sub-tree of a graph requires costly tree-in-graph testing, which is still NP-complete

Introduction
- Frequent free tree mining
  - Given a graph database D = {g_1, g_2, ..., g_N}, the problem of frequent free tree mining is to find the set of all frequent free trees, where an ftree t is frequent if the ratio of graphs in D that have t as a sub-tree is greater than or equal to a user-given threshold Φ
- Two key concepts
  - Candidate generation
  - Frequency counting
- Our focus
  - The fewer candidates generated, the fewer times the costly tree-in-graph testing has to be applied
  - The cost of candidate generation itself can be high
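
The frequency definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the graph representation (a `(labels, adj)` pair), the function names `occurs_in`, `support` and `is_frequent`, and the choice to ignore edge labels are all assumptions made here. The tree-in-graph test is a plain backtracking search, which makes its worst-case cost visible.

```python
def occurs_in(tree, graph):
    """Brute-force tree-in-graph test (exponential in the worst case).

    Both arguments are (labels, adj) pairs: labels maps vertex -> label,
    adj maps vertex -> set of neighbours. Edge labels are ignored (assumption).
    """
    t_labels, t_adj = tree
    g_labels, g_adj = graph
    # Order tree vertices so every non-root vertex comes after its parent.
    root = next(iter(t_labels))
    order, parent, stack = [], {}, [(root, None)]
    while stack:
        v, p = stack.pop()
        order.append(v)
        parent[v] = p
        stack.extend((w, v) for w in t_adj[v] if w != p)

    def extend(i, mapping, used):
        if i == len(order):
            return True
        v = order[i]
        # The root may map anywhere; other vertices must be adjacent
        # to the image of their parent.
        pool = g_labels if parent[v] is None else g_adj[mapping[parent[v]]]
        for u in pool:
            if u not in used and g_labels[u] == t_labels[v]:
                mapping[v] = u
                used.add(u)
                if extend(i + 1, mapping, used):
                    return True
                used.discard(u)
                del mapping[v]
        return False

    return extend(0, {}, set())


def support(tree, database):
    """Fraction of graphs in the database that contain the tree."""
    if not database:
        return 0.0
    return sum(occurs_in(tree, g) for g in database) / len(database)


def is_frequent(tree, database, phi):
    """An ftree is frequent if its support reaches the threshold phi."""
    return support(tree, database) >= phi


# Hypothetical toy data: one C-O edge tested against three small graphs.
edge_CO = ({0: "C", 1: "O"}, {0: {1}, 1: {0}})
database = [
    ({0: "C", 1: "C", 2: "O"}, {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}),  # triangle
    ({0: "O", 1: "C"}, {0: {1}, 1: {0}}),                           # single edge
    ({0: "N"}, {0: set()}),                                         # isolated vertex
]
```

With this toy database the C-O edge occurs in two of the three graphs, so it is frequent for Φ = 0.5 but not for Φ = 0.7. Every call to `occurs_in` is one tree-in-graph test, which is why reducing the number of candidates matters.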

Existing Approaches
- FT-Algorithm
  - Apriori-based algorithm
  - Builds a conceptual enumeration lattice to enumerate frequent ftrees in the database
  - Follows a pattern-join approach to generate candidate frequent ftrees
- FG-Algorithm
  - A vertical mining algorithm
  - Builds an enumeration tree and traverses it in depth-first fashion
  - Takes a pattern-growth approach to generate candidate frequent ftrees

Our Algorithm: F3TM
- F3TM (Fast Frequent Free Tree Mining)
  - A vertical mining algorithm
  - Requires relatively small memory to maintain the frequent ftrees being found
  - Uses the pattern-growth approach for candidate generation
- Two pruning algorithms are proposed to facilitate candidate generation, and they contribute a dramatic speedup to the final performance of our ftree mining algorithm
  - Automorphism-based pruning
  - Canonical mapping-based pruning

Canonical Form of Free Tree
- A unique representation of an ftree
  - Two ftrees, t_1 and t_2, share the same canonical form if and only if t_1 is isomorphic to t_2
  - Only free trees in their canonical form need to be considered in the frequent ftree mining process
- A two-step algorithm
  - Normalizing an ftree to be a rooted ordered tree
  - Assigning a string, as its code, to represent the normalized rooted ordered tree
- Both steps of the algorithm are O(n) for an n-ftree
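
The two steps above can be sketched as follows. This is an illustrative variant rather than the slide's algorithm: it roots the tree at its center (a standard way to normalize a free tree) and builds a parenthesized string by sorting child codes. Sorting keeps the code short but costs more than the O(n) stated on the slide. The names `tree_centers`, `rooted_code` and `canonical_code`, and the adjacency-list representation, are assumptions; vertex labels must not contain parentheses.

```python
def tree_centers(adj):
    """Return the one or two center vertices of a free tree by peeling leaves."""
    degree = {v: len(ns) for v, ns in adj.items()}
    remaining = len(adj)
    leaves = [v for v, d in degree.items() if d <= 1]
    while remaining > 2:
        remaining -= len(leaves)
        next_leaves = []
        for leaf in leaves:
            for w in adj[leaf]:
                degree[w] -= 1
                if degree[w] == 1:
                    next_leaves.append(w)
        leaves = next_leaves
    return leaves


def rooted_code(adj, labels, root, parent=None):
    """Step 2: encode a rooted tree as a string, ordering children by their codes."""
    child_codes = sorted(
        rooted_code(adj, labels, w, root) for w in adj[root] if w != parent
    )
    return "(" + labels[root] + "".join(child_codes) + ")"


def canonical_code(adj, labels):
    """Step 1 + step 2: root at the center; with two centers keep the smaller code."""
    return min(rooted_code(adj, labels, c) for c in tree_centers(adj))


# Two drawings of the same labeled tree with different vertex numbering.
adj_1, labels_1 = {0: [1, 2], 1: [0], 2: [0]}, {0: "a", 1: "b", 2: "c"}
adj_2, labels_2 = {5: [7], 7: [5, 9], 9: [7]}, {7: "a", 5: "c", 9: "b"}
print(canonical_code(adj_1, labels_1))  # (a(b)(c))
print(canonical_code(adj_2, labels_2))  # (a(b)(c))
```

Because isomorphic ftrees yield the same string, candidates can be compared by string equality instead of a pairwise isomorphism test.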

Candidate Generation
- Theorem: the completeness of frequent ftrees is ensured if we grow vertices from predefined positions of an ftree, called the extension frontier
- The extension frontier represents all legal positions of an n-ftree t at which a new vertex can be appended to obtain a new (n+1)-ftree, while no ftrees are omitted during this frontier-extending process
[Figure: example ftree with labeled vertices; drawing not recoverable]

Automorphism-Based Pruning
- Given a candidate ftree t in T (the candidate set), in order to reduce the cost of frequency counting, we first check whether there is a candidate ftree t' in T such that t = t'
  - There is no need to count redundant candidates
  - When T becomes large, the cost of checking t = t' for every t' in T can become the dominating cost
[Figure: example ftrees with numbered vertices; drawing not recoverable]
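
The redundancy described above is easy to reproduce with a naive baseline. The sketch below grows a new leaf at every vertex (a simplification; the slides restrict growth to the extension frontier, whose exact definition is not given here) and then removes duplicates by comparing canonical strings. The helper names and the canonical code used (minimum rooted code over all roots, simple but quadratic) are assumptions for illustration.

```python
def rooted_code(adj, labels, root, parent=None):
    """Encode a rooted labeled tree as a string with children sorted by code."""
    child_codes = sorted(
        rooted_code(adj, labels, w, root) for w in adj[root] if w != parent
    )
    return "(" + labels[root] + "".join(child_codes) + ")"


def canonical_code(adj, labels):
    """Isomorphism-invariant code: the smallest rooted code over all roots."""
    return min(rooted_code(adj, labels, r) for r in adj)


def grow(adj, labels, alphabet):
    """Yield every tree obtained by attaching one new labeled leaf to some vertex."""
    new_v = max(adj) + 1
    for v in adj:
        for label in alphabet:
            new_adj = {u: list(ns) for u, ns in adj.items()}
            new_adj[v].append(new_v)
            new_adj[new_v] = [v]
            yield new_adj, {**labels, new_v: label}


def distinct_candidates(adj, labels, alphabet):
    """Return (number generated, dict of canonical code -> one representative)."""
    seen, generated = {}, 0
    for cand_adj, cand_labels in grow(adj, labels, alphabet):
        generated += 1
        seen.setdefault(canonical_code(cand_adj, cand_labels), (cand_adj, cand_labels))
    return generated, seen


# A path of three vertices that all carry label "a".
path_adj = {0: [1], 1: [0, 2], 2: [1]}
path_labels = {0: "a", 1: "a", 2: "a"}
generated, seen = distinct_candidates(path_adj, path_labels, ["a"])
print(generated, len(seen))  # 3 2
```

Growing at either end of the path gives the same 4-vertex path, so one of the three generated candidates is redundant. Here the duplicate is found only after it has been built and encoded, which is the cost the pruning on the next slide aims to avoid.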

Automorphism-Based Pruning
- Automorphism-based pruning efficiently prunes redundant candidates in T and avoids repeatedly checking whether an ftree already exists in T
  - All vertices of a free tree can be partitioned into different equivalence classes based on automorphism
  - We only need to grow vertices from one representative of an equivalence class, if the vertices of that equivalence class are in the extension frontier of the ftree
[Figure: example ftrees with vertices numbered by equivalence class; drawing not recoverable]
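
One simple way to obtain the equivalence classes above relies on a property of trees: two vertices are equivalent under automorphism exactly when the tree rooted at one is isomorphic to the tree rooted at the other. The sketch below groups vertices by their rooted code and then picks one representative per class. It is a quadratic illustration, not the slide's method; the function names are assumptions, and `frontier` is a hypothetical input because the slides do not spell out how the extension frontier is computed.

```python
from collections import defaultdict


def rooted_code(adj, labels, root, parent=None):
    """Encode a rooted labeled tree as a string with children sorted by code."""
    child_codes = sorted(
        rooted_code(adj, labels, w, root) for w in adj[root] if w != parent
    )
    return "(" + labels[root] + "".join(child_codes) + ")"


def equivalence_classes(adj, labels):
    """Partition vertices: same rooted code means same automorphism class."""
    classes = defaultdict(list)
    for v in adj:
        classes[rooted_code(adj, labels, v)].append(v)
    return sorted(sorted(members) for members in classes.values())


def growth_representatives(adj, labels, frontier):
    """Keep one frontier vertex per equivalence class as a growth position."""
    representatives = []
    for members in equivalence_classes(adj, labels):
        in_frontier = [v for v in members if v in frontier]
        if in_frontier:
            representatives.append(in_frontier[0])
    return representatives


# A star: center labeled "a" with three leaves labeled "b".
star_adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
star_labels = {0: "a", 1: "b", 2: "b", 3: "b"}
print(equivalence_classes(star_adj, star_labels))                    # [[0], [1, 2, 3]]
print(growth_representatives(star_adj, star_labels, {0, 1, 2, 3}))   # [0, 1]
```

For the star, growing from leaf 1 alone covers what growing from leaves 2 and 3 would produce, so two of the four growth positions are skipped before any candidate is built.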

Canonical Mapping-based Pruning
- How to select potential labels to be grown on the frequent ftrees during candidate generation?
  - Existing algorithms maintain mappings from an ftree t to all its k occurrences in g_i
  - Based on these mappings, it is possible to know which labels that appear in graph g_i can be selected and assigned to generate a candidate (n+1)-ftree
- There are a lot of redundant mappings between an ftree t and its occurrences in g_i

Canonical Mapping-based Pruning
[Figure: graphs g_1 and g_2 and ftree t; drawings not recoverable]
Mapping list: (1;1,2,4) (1;1,4,2) (1;3,2,4) (1;3,4,2) (2;2,3,4) (2;2,4,3)

Canonical Mapping-based Pruning
- Canonical mapping efficiently avoids multiple mappings from an ftree to the same occurrence of the tree in graph g_i of D
- After orienting a frequent ftree t to its canonical mapping in g_i of D, we can select potential labels from graph g_i for candidate generation
- Given an n-ftree t, assume that the number of equivalence classes of t is c and that the number of vertices in each equivalence class C_i is n_i (1 ≤ i ≤ c)
  - The number of mappings between t and an occurrence t' in graph g_i is up to $\prod_{i=1}^{c} (n_i)!$
  - With canonical mapping, we only need to consider one out of the $\prod_{i=1}^{c} (n_i)!$ mappings for candidate generation
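
As a rough illustration, the sketch below computes the bound on the number of mappings and collapses the mapping list from the previous slide to one mapping per occurrence. Several things here are assumptions, because the drawings are not recoverable: each entry is read as (graph index; images of t's vertices), t is taken to be a 3-vertex tree whose first vertex is adjacent to the other two, and the lexicographically smallest mapping is kept as the "canonical" one. The slides do not state how the canonical mapping is actually chosen.

```python
from math import factorial, prod


def max_mappings_per_occurrence(class_sizes):
    """Upper bound from the slide: product of n_i! over the equivalence classes."""
    return prod(factorial(n) for n in class_sizes)


# Assumed shape of t: vertex 0 is adjacent to vertices 1 and 2.
TREE_EDGES = [(0, 1), (0, 2)]


def canonical_mappings(mapping_list, tree_edges=TREE_EDGES):
    """Keep one mapping per occurrence (same graph, same set of matched edges)."""
    best = {}
    for graph_id, vertices in mapping_list:
        occurrence = frozenset(
            frozenset((vertices[a], vertices[b])) for a, b in tree_edges
        )
        key = (graph_id, occurrence)
        # Assumption: the lexicographically smallest mapping is the canonical one.
        if key not in best or vertices < best[key]:
            best[key] = vertices
    return sorted((graph_id, vertices) for (graph_id, _), vertices in best.items())


# Mapping list copied from the slide, as (graph index, vertex images).
mapping_list = [
    (1, (1, 2, 4)), (1, (1, 4, 2)),
    (1, (3, 2, 4)), (1, (3, 4, 2)),
    (2, (2, 3, 4)), (2, (2, 4, 3)),
]
print(max_mappings_per_occurrence([1, 2]))  # 2
print(canonical_mappings(mapping_list))
# [(1, (1, 2, 4)), (1, (3, 2, 4)), (2, (2, 3, 4))]
```

Under the assumed shape, t has one class of size 1 and one of size 2, so each occurrence has up to 1! · 2! = 2 mappings, and the six listed mappings reduce to three.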

Performance Studies
- The Real Dataset
  - The AIDS antiviral screen dataset from the Developmental Therapeutics Program in NCI/NIH
  - 42390 compounds retrieved from DTP's Drug Information System
  - 63 kinds of atoms in this dataset, most of which are C, H, O, S, etc.
  - Three kinds of bonds are popular in these compounds: single-bond, double-bond and aromatic-bond
  - On average, compounds in the dataset have 43 vertices and 45 edges. The graph of maximum size has 221 vertices and 234 edges

Real Data Set
- Performance comparisons (with different minimum thresholds: 10%, 20%, 50%)
[Figure: three plots of total running time (sec) against size of datasets (0 to 10000) for F3TM, FG and FT; plotted values not recoverable]

Conclusion
- The free tree has computational advantages over the general graph, which makes it a suitable candidate for computational biology, pattern recognition, computer networks, XML databases, etc.
- F3TM discovers all frequent free trees in a graph database, with the focus on reducing the cost of candidate generation
- F3TM outperforms the up-to-date existing free tree mining algorithms by an order of magnitude
- F3TM is scalable to mine frequent free trees in a large graph dataset with a low minimum support threshold

Thank you