Fast Frequent Free Tree Mining in Graph Databases

Peixiang Zhao, Jeffrey Xu Yu
The Chinese University of Hong Kong
December 18th, 2006
ICDM Workshop MCD06

Synopsis
  Introduction
  Existing Approaches
  Our Algorithm: F3TM
  Performance Studies
  Conclusions

Introduction
  Graph, a general data structure for representing relations among entities, has been widely used in a broad range of areas
    Computational biology
    Chemistry
    Pattern recognition
    Computer networks, etc.
  Mining frequent sub-graphs in a graph database
    Whether a large graph contains another small graph: the sub-graph isomorphism problem (NP-complete)
    Whether two graphs are isomorphic: the graph isomorphism problem (in NP, but not known to be in P or to be NP-complete)

Introduction
  Free Tree (ftree)
    A connected, acyclic and undirected graph
    Widely used in bioinformatics, computer vision, networks, etc.
    A specialization of the general graph that avoids the undesirable theoretical properties and algorithmic complexity incurred by graphs
      Determining whether a tree $t_1$ is contained in another tree $t_2$ can be solved in $O(m^{3/2} n / \log m)$ time
      Determining whether $t_1$ is isomorphic to $t_2$ can be solved in $O(n)$ time
      Determining whether a tree is isomorphic to some sub-tree of a graph requires the costly tree-in-graph test, which is still NP-complete

Introduction
  Frequent free tree mining
    Given a graph database $D = \{g_1, g_2, \ldots, g_N\}$, the problem of frequent free tree mining is to find the set of all frequent free trees, where an ftree $t$ is frequent if the ratio of graphs in $D$ that have $t$ as a sub-tree is greater than or equal to a user-given threshold $\Phi$
  Two key concepts
    Candidate generation
    Frequency counting
  Our focus
    The fewer candidates generated, the fewer times the costly tree-in-graph test has to be applied
    The cost of candidate generation itself can be high
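The frequency definition above can be sketched in a few lines of Python. This is only an illustration: the names `support`, `is_frequent` and `contains` are hypothetical, and the tree-in-graph test itself is not implemented here but passed in as a predicate, since the slides do not spell it out.

```python
from typing import Callable, Iterable, TypeVar

G = TypeVar("G")  # graph type, left abstract in this sketch
T = TypeVar("T")  # tree type, left abstract in this sketch


def support(tree: T, database: Iterable[G], contains: Callable[[G, T], bool]) -> float:
    """Fraction of graphs in the database that contain `tree` as a sub-tree."""
    graphs = list(database)
    if not graphs:
        return 0.0
    # `contains` stands in for the costly tree-in-graph test (an assumption).
    hits = sum(1 for g in graphs if contains(g, tree))
    return hits / len(graphs)


def is_frequent(tree: T, database: Iterable[G], contains: Callable[[G, T], bool], phi: float) -> bool:
    """A tree is frequent when its support is at least the threshold phi."""
    return support(tree, database, contains) >= phi
```

Because every call to `contains` is expensive, each candidate that is never generated saves one pass over the database, which is why candidate generation is the focus.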

Existing Approaches
  FT-Algorithm
    An Apriori-based algorithm
    Builds a conceptual enumeration lattice to enumerate frequent ftrees in the database
    Follows a pattern-join approach to generate candidate frequent ftrees
  FG-Algorithm
    A vertical mining algorithm
    Builds an enumeration tree and traverses it in a depth-first fashion
    Takes a pattern-growth approach to generate candidate frequent ftrees

Our Algorithm: F3TM
  F3TM (Fast Frequent Free Tree Mining)
    A vertical mining algorithm
    Requires relatively small memory to maintain the frequent ftrees being found
    Uses the pattern-growth approach for candidate generation
  Two pruning algorithms are proposed to facilitate candidate generation, and they contribute a dramatic speedup to the final performance of our ftree mining algorithm
    Automorphism-based pruning
    Canonical mapping-based pruning

Canonical Form of Free Tree
  A unique representation of an ftree
    Two ftrees, $t_1$ and $t_2$, share the same canonical form if and only if $t_1$ is isomorphic to $t_2$
  Only free trees in their canonical form need to be considered in the frequent ftree mining process
  A two-step algorithm
    Normalizing an ftree to be a rooted ordered tree
    Assigning a string, as its code, to represent the normalized rooted ordered tree
  Both steps of the algorithm are $O(n)$ for an n-ftree
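The two steps can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it roots the tree at its center (or the better of its two centers) and uses an AHU-style parenthesised string with sorted child codes, so it runs in roughly O(n log n) or worse rather than the O(n) stated above. It also assumes vertex labels only (no edge labels), labels that are strings without parentheses, and an adjacency-dictionary representation; the function names are hypothetical.

```python
def tree_centers(adj):
    """Return the one or two center vertices of a free tree by peeling leaves."""
    n = len(adj)
    if n <= 2:
        return sorted(adj)
    degree = {v: len(ns) for v, ns in adj.items()}
    leaves = [v for v, d in degree.items() if d == 1]
    remaining = n
    while remaining > 2:
        remaining -= len(leaves)
        next_leaves = []
        for leaf in leaves:
            for u in adj[leaf]:
                degree[u] -= 1
                if degree[u] == 1:
                    next_leaves.append(u)
        leaves = next_leaves
    return sorted(leaves)


def rooted_code(adj, labels, root):
    """String code of the tree rooted at `root`; children are ordered by code."""
    def encode(v, parent):
        child_codes = sorted(encode(u, v) for u in adj[v] if u != parent)
        return "(" + labels[v] + "".join(child_codes) + ")"
    return encode(root, None)


def canonical_form(adj, labels):
    """Smallest rooted code over the center(s); equal for isomorphic free trees."""
    return min(rooted_code(adj, labels, c) for c in tree_centers(adj))


# Two differently numbered copies of the same labelled star.
star_a = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
labels_a = {0: "a", 1: "b", 2: "a", 3: "c"}
star_b = {0: [3], 1: [3], 2: [3], 3: [0, 1, 2]}
labels_b = {0: "c", 1: "a", 2: "a", 3: "b"}
same = canonical_form(star_a, labels_a) == canonical_form(star_b, labels_b)
```

Rooting at the center removes the freedom in choosing a root, and sorting child codes removes the freedom in ordering children, so the resulting string depends only on the isomorphism class of the tree.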

Candidate Generation
  Theorem: the completeness of frequent ftrees is ensured if we grow vertices from the predefined positions of an ftree, called the extension frontier
  The extension frontier represents all legal positions of an n-ftree $t$ at which a new vertex can be appended to obtain a new (n+1)-ftree $t'$, while no ftrees are omitted during this frontier-extending process

Automorphism-Based Pruning
  Given a candidate ftree $t$ in $T$ (the candidate set), in order to reduce the cost of frequency counting, we first check whether there is a candidate ftree $t'$ in $T$ such that $t = t'$
    There is no need to count redundancies
  When $T$ becomes large, the cost of checking $t = t'$ for every $t'$ in $T$ can possibly become the dominating cost

Automorphism-Based Pruning
  Automorphism-based pruning efficiently prunes redundant candidates in $T$ while avoiding repeated checks of whether an ftree already exists in $T$
  All vertices of a free tree can be partitioned into different equivalence classes based on automorphism
  We only need to grow vertices from one representative of an equivalence class, if vertices of the equivalence class are in the extension frontier of the ftree
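A small Python sketch can make the equivalence classes concrete. It relies on a standard fact about trees: two vertices are exchanged by some automorphism exactly when the tree rooted at one is isomorphic to the tree rooted at the other. The quadratic-time approach, the function names, and the `frontier` argument (taken as a given set of vertices, since the slides do not say how it is computed) are all assumptions made for illustration, not the authors' method.

```python
from collections import defaultdict


def rooted_code(adj, labels, root):
    """String code of the tree rooted at `root`; children are ordered by code."""
    def encode(v, parent):
        child_codes = sorted(encode(u, v) for u in adj[v] if u != parent)
        return "(" + labels[v] + "".join(child_codes) + ")"
    return encode(root, None)


def equivalence_classes(adj, labels):
    """Group vertices whose rooted codes coincide (same automorphism class)."""
    classes = defaultdict(list)
    for v in adj:
        classes[rooted_code(adj, labels, v)].append(v)
    return sorted(sorted(members) for members in classes.values())


def frontier_representatives(adj, labels, frontier):
    """Keep one vertex per equivalence class among the frontier vertices."""
    representatives = []
    for members in equivalence_classes(adj, labels):
        in_frontier = [v for v in members if v in frontier]
        if in_frontier:
            representatives.append(in_frontier[0])
    return representatives


# A star with center 1 (label "b") and leaves 0, 2 (label "a") and 3 (label "c").
adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
labels = {0: "a", 1: "b", 2: "a", 3: "c"}
classes = equivalence_classes(adj, labels)
reps = frontier_representatives(adj, labels, frontier={0, 2, 3})
```

Here leaves 0 and 2 fall into one class, so growing a new vertex from either of them yields isomorphic candidates and only one of the two needs to be extended.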

Canonical Mapping-Based Pruning
  How to select potential labels to be grown on the frequent ftrees during candidate generation?
  Existing algorithms maintain mappings from an ftree $t$ to all its $k$ occurrences in $g_i$
    Based on these mappings, it is possible to know which labels that appear in graph $g_i$ can be selected and assigned to generate a candidate (n+1)-ftree
    There are a lot of redundant mappings between ftree $t$ and its occurrences in $g_i$

Canonical Mapping-Based Pruning
  [Figure: graphs $g_1$ and $g_2$, an ftree $t$, and the resulting mapping list]
  Mapping list: (1;1,2,4), (1;1,4,2), (1;3,2,4), (1;3,4,2), (2;2,3,4), (2;2,4,3)

Canonical Mapping-Based Pruning
  Canonical mapping efficiently avoids multiple mappings from an ftree to the same occurrence of the tree in a graph $g_i$ of $D$
  After orienting a frequent ftree $t$ to its canonical mapping in $g_i$ of $D$, we can select potential labels from graph $g_i$ for candidate generation
  Given an n-ftree $t$, assume that the number of equivalence classes of $t$ is $c$, and the number of vertices in each equivalence class $C_i$ is $n_i$ ($1 \le i \le c$)
    The number of mappings between $t$ and an occurrence $t'$ in graph $g_i$ is up to $\prod_{i=1}^{c} (n_i)!$
    With canonical mapping, we only need to consider one out of $\prod_{i=1}^{c} (n_i)!$ mappings for candidate generation
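The bound on redundant mappings is easy to evaluate. The following sketch computes the product of factorials of the equivalence class sizes; the function name `max_mappings` and the example class sizes are hypothetical and chosen only for illustration.

```python
from math import factorial, prod


def max_mappings(class_sizes):
    """Upper bound on mappings per occurrence: product of n_i! over all classes."""
    return prod(factorial(n) for n in class_sizes)


# A 3-vertex ftree with equivalence classes of sizes 1 and 2: 1! * 2! = 2.
small = max_mappings([1, 2])

# A 6-vertex ftree with classes of sizes 1, 2 and 3: 1! * 2! * 3! = 12.
larger = max_mappings([1, 2, 3])
```

Since the bound grows factorially with the size of each equivalence class, keeping a single canonical mapping per occurrence can remove most of the stored mappings for highly symmetric ftrees.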

Performance Studies
  The Real Dataset
    The AIDS antiviral screen dataset from the Developmental Therapeutics Program in NCI/NIH
    42390 compounds retrieved from DTP's Drug Information System
    63 kinds of atoms in this dataset, most of which are C, H, O, S, etc.
    Three kinds of bonds are popular in these compounds: single-bond, double-bond and aromatic-bond
    On average, compounds in the dataset have 43 vertices and 45 edges. The graph of maximum size has 221 vertices and 234 edges

Real Data Set
  Performance comparisons (with different minimum thresholds: 10%, 20%, 50%)
  [Figure: three panels, each plotting total running time (sec) against size of datasets (0 to 10000) for F3TM, FG and FT]

Conclusion
  The free tree has computational advantages over the general graph, which makes it a suitable candidate for computational biology, pattern recognition, computer networks, XML databases, etc.
  F3TM discovers all frequent free trees in a graph database, with the focus on reducing the cost of candidate generation
  F3TM outperforms the up-to-date existing free tree mining algorithms by an order of magnitude
  F3TM is scalable to mine frequent free trees in a large graph dataset with a low minimum support threshold

Thank you