An Efficient Algorithm for Discovering Frequent Subgraphs


To appear in IEEE Transactions on Knowledge and Data Engineering

Michihiro Kuramochi and George Karypis, Member, IEEE
Department of Computer Science, University of Minnesota
4-192 EE/CS Building, 200 Union St SE, Minneapolis, MN 55455
{kuram, karypis}@cs.umn.edu

Abstract

Over the years, frequent itemset discovery algorithms have been used to find interesting patterns in various application areas. However, as data mining techniques are being increasingly applied to non-traditional domains, existing frequent pattern discovery approaches cannot be used. This is because the transaction framework that is assumed by these algorithms cannot be used to effectively model the datasets in these domains. An alternate way of modeling the objects in these datasets is to represent them using graphs. Within that model, one way of formulating the frequent pattern discovery problem is as that of discovering subgraphs that occur frequently over the entire set of graphs. In this paper we present a computationally efficient algorithm, called FSG, for finding all frequent subgraphs in large graph datasets. We experimentally evaluate the performance of FSG using a variety of real and synthetic datasets. Our results show that despite the underlying complexity associated with frequent subgraph discovery, FSG is effective in finding all frequently occurring subgraphs in datasets containing over 200,000 graph transactions and scales linearly with respect to the size of the dataset.

Index Terms: Data mining, scientific datasets, frequent pattern discovery, chemical compound datasets.

I. INTRODUCTION

Efficient algorithms for finding frequent patterns, both sequential and non-sequential, in very large datasets have been one of the key success stories of data mining research [1], [2], [20], [36], [41], [49]. Nevertheless, as data mining techniques have been increasingly applied to non-traditional domains, there is a need to develop efficient and general-purpose frequent pattern discovery algorithms that are capable of capturing the strong spatial, topological, geometric, and/or relational nature of the datasets that characterize these domains.
In recent years, labeled topological graphs have emerged as a promising abstraction to capture the characteristics of these datasets. In this approach, each object to be analyzed is represented via a separate graph whose vertices correspond to the entities in the object and whose edges correspond to the relations between them. Within that model, one way of formulating the frequent pattern discovery problem is as that of discovering subgraphs that occur frequently over the entire set of graphs.

The power of graphs to model complex datasets has been recognized by various researchers [3], [6], [10], [14], [19], [23], [26], [30], [37], [43], [46], as it allows us to represent arbitrary relations among entities and solve problems that we could not previously solve. For instance, consider the problem of mining chemical compounds to find recurrent substructures. We can achieve that using a graph-based pattern discovery algorithm by creating a graph for each one of the compounds whose vertices correspond to different atoms, and whose edges correspond to bonds between them. We can assign to each vertex a label corresponding to the atom involved (and potentially its charge), and assign to each edge a label corresponding to the type of the bond (and potentially information about their relative 3D orientation). Once these graphs have been created, recurrent substructures across different compounds become frequently occurring subgraphs. In fact, within the context of chemical compound classification, such techniques have been used to mine chemical compounds and identify the substructures that best discriminate between the different classes [5], [11], [27], [42], and were shown to produce superior classifiers than more traditional methods [21].

Developing algorithms that discover all frequently occurring subgraphs in a large graph dataset is particularly challenging and computationally intensive, as graph and subgraph isomorphisms play a key role throughout the computations.

(This work was supported by NSF CCR-9972519, EIA-9986042, ACI-9982274 and ACI-0133464, by Army Research Office contract DA/DAAG55-98-1-0441, and by Army High Performance Computing Research Center contract number DAAH04-95-C-0008. Access to computing facilities was provided by the Minnesota Supercomputing Institute.)
In this paper we present a new algorithm, called FSG, for finding all connected subgraphs that appear frequently in a large graph dataset. Our algorithm finds frequent subgraphs using the level-by-level expansion strategy adopted by Apriori [2]. The key features of FSG are the following: (i) it uses a sparse graph representation that minimizes both storage and computation; (ii) it increases the size of frequent subgraphs by adding one edge at a time, allowing it to generate the candidates efficiently; (iii) it incorporates various optimizations for candidate generation and frequency counting which enable it to scale to large graph datasets; and (iv) it uses sophisticated algorithms for canonical labeling to uniquely identify the various generated subgraphs without having to resort to computationally expensive graph- and subgraph-isomorphism computations.

We experimentally evaluated FSG on three types of datasets. The first two datasets correspond to various chemical compounds containing over 200,000 transactions and frequent patterns whose size is large, and the third type corresponds to various graph datasets that were synthetically generated using a framework similar to that used for market-basket transaction generation [2]. Our results illustrate that FSG can operate on very large graph datasets, find all frequently occurring subgraphs in a reasonable amount of time, and scale linearly with the dataset size. For example, in a dataset containing over 200,000 chemical compounds, FSG can discover all subgraphs that occur in at least 1% of the transactions in approximately one hour. Furthermore, our detailed evaluation using the synthetically generated graphs shows that for datasets that have a moderately large number of different vertex and edge labels, FSG is able to achieve good performance as the transaction size increases.

The rest of the paper is organized as follows. Section II provides some definitions and introduces the notation that is used in the paper. Section III formally defines the problem of frequent subgraph discovery and discusses the modeling strengths of the discovered patterns and the challenges associated with finding them in a computationally efficient manner. Section IV describes the algorithm in detail. Section V describes the various optimizations that we developed for efficiently computing the canonical label of the patterns. Section VI provides a detailed experimental evaluation of FSG on a large number of real and synthetic datasets. Section VII describes the related research in this area, and finally, Section VIII provides some concluding remarks.

II. DEFINITIONS AND NOTATION

A graph $G = (V, E)$ is made of two sets, the set of vertices $V$ and the set of edges $E$. Each edge itself is a pair of vertices, and throughout this paper we assume that the graph is undirected, i.e., each edge is an unordered pair of vertices. Furthermore, we will assume that the graph is labeled.
That is, each vertex and edge has a label associated with it that is drawn from a predefined set of vertex labels ($L_V$) and edge labels ($L_E$). Each vertex (or edge) of the graph is not required to have a unique label, and the same label can be assigned to many vertices (or edges) in the same graph.

Given a graph $G = (V, E)$, a graph $G_s = (V_s, E_s)$ will be a subgraph of $G$ if and only if $V_s \subseteq V$ and $E_s \subseteq E$, and it will be an induced subgraph of $G$ if $V_s \subseteq V$ and $E_s$ contains all the edges of $E$ that connect vertices in $V_s$. A graph is connected if there is a path between every pair of vertices in the graph. Two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ are isomorphic if they are topologically identical to each other, that is, there is a mapping from $V_1$ to $V_2$ such that each edge in $E_1$ is mapped to a single edge in $E_2$ and vice versa. In the case of labeled graphs, this mapping must also preserve the labels on the vertices and edges. An automorphism is an isomorphism mapping where $G_1 = G_2$. Given two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, the problem of subgraph isomorphism is to find an isomorphism between $G_2$ and a subgraph of $G_1$, i.e., to determine whether or not $G_2$ is included in $G_1$.

TABLE I
NOTATION USED THROUGHOUT THE PAPER

Notation | Description
--- | ---
$k$-subgraph | A connected subgraph with $k$ edges (also written as a size-$k$ subgraph)
$G^k$, $H^k$ | (Sub)graphs of size $k$
$E(G)$ | Edges of a (sub)graph $G$
$V(G)$ | Vertices of a (sub)graph $G$
$\mathrm{cl}(G)$ | A canonical label of a graph $G$
$a, b, c, e, f$ | Edges
$u, v$ | Vertices
$d(v)$ | Degree of a vertex $v$
$l(v)$ | The label of a vertex $v$
$l(e)$ | The label of an edge $e$
$H = G - e$ | $H$ is a graph obtained by the deletion of edge $e \in E(G)$
$D$ | A dataset of graph transactions
$\{D_1, D_2, \ldots, D_N\}$ | Disjoint $N$ partitions of $D$ (for any $i$ and $j$, $i \neq j$, $D_i \cap D_j = \emptyset$ and $\bigcup_i D_i = D$)
$T$ | A graph transaction
$C$ | A candidate subgraph
$C^k$ | A set of candidates with $k$ edges
$\mathcal{C}$ | A set of all candidates
$F$ | A frequent subgraph
$F^k$ | A set of frequent $k$-subgraphs
$\mathcal{F}$ | A set of all frequent subgraphs
$k^{*}$ | The size of the largest frequent subgraph in $D$
$L_E$ | A set of all edge labels in $D$
$L_V$ | A set of all vertex labels in $D$
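The definitions of a labeled undirected graph and of subgraph isomorphism can be made concrete with a small sketch. This is an illustrative brute-force check, not the procedure used by FSG; the representation (a dictionary of vertex labels plus a dictionary of edge labels keyed by unordered vertex pairs) and the names `make_graph` and `is_subgraph_isomorphic` are assumptions introduced here.

```python
from itertools import permutations

def make_graph(vertex_labels, edges):
    """Build a labeled undirected graph (hypothetical representation).

    vertex_labels: {vertex_id: label}
    edges: iterable of (u, v, edge_label); each edge is an unordered pair.
    """
    edge_labels = {frozenset((u, v)): lab for u, v, lab in edges}
    return vertex_labels, edge_labels

def is_subgraph_isomorphic(pattern, graph):
    """Return True if `pattern` is isomorphic to some subgraph of `graph`.

    Tries every injective mapping of pattern vertices onto graph vertices,
    so the cost grows factorially; this is only meant for tiny examples.
    """
    p_vl, p_el = pattern
    g_vl, g_el = graph
    p_vertices = list(p_vl)
    for image in permutations(g_vl, len(p_vertices)):
        mapping = dict(zip(p_vertices, image))
        # The mapping must preserve vertex labels ...
        if any(p_vl[v] != g_vl[mapping[v]] for v in p_vertices):
            continue
        # ... and every pattern edge must map to an edge with the same label.
        if all(
            g_el.get(frozenset(mapping[v] for v in edge)) == lab
            for edge, lab in p_el.items()
        ):
            return True
    return False

# A C-C-O chain as the transaction, and two single-edge patterns.
chain = make_graph({0: "C", 1: "C", 2: "O"},
                   [(0, 1, "single"), (1, 2, "single")])
c_o = make_graph({0: "C", 1: "O"}, [(0, 1, "single")])
c_n = make_graph({0: "C", 1: "N"}, [(0, 1, "single")])

print(is_subgraph_isomorphic(c_o, chain))  # True
print(is_subgraph_isomorphic(c_n, chain))  # False
```

Because the check only requires pattern edges to be present in the host graph, it tests for an ordinary subgraph rather than an induced one, matching the first of the two definitions above.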
The canonical label of a graph $G = (V, E)$, $\mathrm{cl}(G)$, is defined to be a unique code (i.e., a sequence of bits, a string, or a sequence of numbers) that is invariant on the ordering of the vertices and edges in the graph [15]. As a result, two graphs will have the same canonical label if they are isomorphic. Examples of different canonical label codes and details on how they are computed are presented in Section V. Neither canonical labeling nor determining graph isomorphism is known to be in P or to be NP-complete [15].

The size of a graph $G = (V, E)$ is defined to be equal to $|E|$. Given a size-$k$ connected graph $G = (V, E)$, by adding an edge we will refer to the operation in which an edge $e = (u, v)$ is added to the graph so that the resulting size-$(k + 1)$ graph remains connected. Similarly, by deleting an edge we refer to the operation in which $e = (u, v)$ such that $e \in E$ is deleted from the graph and the resulting size-$(k - 1)$ graph remains connected. Note that depending on the particular choice of $e$, the deletion of the edge may result in deleting at most one of its incident vertices, if that vertex has only $e$ as its incident edge.

Finally, the notation that we will be using throughout the paper is shown in Table I.

III. FREQUENT SUBGRAPH DISCOVERY: PROBLEM DEFINITION

The problem of finding frequently occurring connected subgraphs in a set of graphs is defined as follows:

Definition 1 (Subgraph Discovery): Given a set of graphs $D$, each of which is an undirected labeled graph, and a parameter $\sigma$ such that $0 < \sigma \leq 1$, find all connected undirected graphs that are subgraphs in at least $\sigma |D|$ of the input graphs.

We will refer to each of the graphs in $D$ as a graph transaction, or simply a transaction when the context is clear, to $D$ as the graph transaction dataset, and to $\sigma$ as the support threshold.
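As a small illustration of Definition 1, the sketch below applies the support threshold to the simplest possible patterns, single-edge subgraphs, for which no isomorphism test is needed. The function name `frequent_single_edges` and the transaction layout are assumptions made for this example only.

```python
from collections import Counter

def frequent_single_edges(transactions, sigma):
    """Return the size-1 subgraphs that occur in at least sigma * |D| transactions.

    transactions: list of (vertex_labels, edges), where vertex_labels maps a
    vertex id to its label and edges is a list of (u, v, edge_label).
    A size-1 subgraph is identified by its edge label and the unordered
    pair of endpoint labels.
    """
    support = Counter()
    for vertex_labels, edges in transactions:
        patterns = set()
        for u, v, edge_label in edges:
            a, b = sorted((vertex_labels[u], vertex_labels[v]))
            patterns.add((a, edge_label, b))
        # Each transaction contributes at most once to a pattern's support.
        support.update(patterns)
    threshold = sigma * len(transactions)
    return {p: n for p, n in support.items() if n >= threshold}

# Three toy "compounds"; with sigma = 0.6 a pattern needs 2 of 3 transactions.
D = [
    ({0: "C", 1: "C", 2: "O"}, [(0, 1, "single"), (1, 2, "single")]),
    ({0: "C", 1: "O"}, [(0, 1, "single")]),
    ({0: "O", 1: "C"}, [(0, 1, "double")]),
]
print(frequent_single_edges(D, 0.6))  # {('C', 'single', 'O'): 2}
```

Support is counted per transaction, not per occurrence, so a pattern that appears several times inside one graph still adds only one to its count.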

There are two key aspects in the above problem statement. First, we are only interested in subgraphs that are connected. This is motivated by the fact that the resulting frequent subgraphs will be encapsulating relations (or edges) between some of the entities (or vertices) of various objects. Within this context, connectivity is a natural property of frequent patterns. An additional benefit of this restriction is that it reduces the complexity of the problem, as we do not need to consider disconnected combinations of frequent connected subgraphs. Second, we allow the graphs to be labeled, and as discussed in Section II, input graph transactions and discovered frequent patterns can contain multiple vertices and edges carrying the same label. This greatly increases our modeling ability, as it allows us to find patterns involving multiple occurrences of the same entities and relations, but at the same time makes the problem of finding such frequently occurring subgraphs non-trivial. This is because in such cases, any frequent subgraph discovery algorithm needs to correctly identify how a particular subgraph maps to the vertices and edges of each graph transaction, which can only be done by solving many instances of the subgraph isomorphism problem, which has been shown to be NP-complete [16].

IV. FSG: FREQUENT SUBGRAPH DISCOVERY ALGORITHM

In developing our frequent subgraph discovery algorithm, we decided to follow the level-by-level structure of the Apriori [2] algorithm used for finding frequent itemsets. The motivation behind this choice is the fact that the level-by-level structure of Apriori requires the smallest number of subgraph isomorphism computations during frequency counting, as it allows it to take full advantage of the downward closed property of the minimum support constraint and achieves the highest amount of pruning when compared with the most recently developed depth-first-based approaches such as dEclat [49], Tree Projection [1], and FP-growth [20].
In fact, despite the extra overhead due to candidate generation that is incurred by the level-by-level approach, recent studies have shown that because of its effective pruning, it achieves comparable performance with that achieved by the various depth-first-based approaches, as long as the data set is not dense or the support value is not extremely small [18], [22].

The overall flow of our algorithm, called FSG, is similar to that of Apriori, and works as follows. FSG starts by enumerating all frequent single- and double-edge subgraphs. Then, it enters its main computational phase, which consists of a main iteration loop. During each iteration, FSG first generates all candidate subgraphs whose size is greater than the previous frequent ones by one edge, and then counts the frequency for each of these candidates and prunes subgraphs that do not satisfy the support constraint. FSG stops when no frequent subgraphs are generated for a particular iteration. Details on how FSG generates the candidate subgraphs, and on how it computes their frequency, are provided in Section IV-A and Section IV-B, respectively.

To ensure that the various graph-related operations are performed efficiently, FSG stores the various input graphs and the various candidate and frequent subgraphs that it generates using an adjacency list representation.

A. Candidate Generation

FSG generates candidate subgraphs of size $k + 1$ by joining two frequent size-$k$ subgraphs. In order for two such frequent size-$k$ subgraphs to be eligible for joining they must contain the same size-$(k - 1)$ connected subgraph. The simplest way to generate the complete set of candidate subgraphs is to join all pairs of size-$k$ frequent subgraphs that have a common size-$(k - 1)$ subgraph. Unfortunately, the problem with this approach is that a particular size-$k$ subgraph can have up to $k$ different size-$(k - 1)$ subgraphs. As a result, if we consider all such possible subgraphs and perform the resulting join operations, we will end up generating the same candidate pattern multiple times, and generating a large number of candidate patterns that are not downward closed.
The net effect of this is that the resulting algorithm spends a significant amount of time identifying unique candidates and eliminating non-downward closed candidates (both of which are non-trivial operations, as they require determining the canonical label of the generated subgraphs). Note that candidate generation approaches in the context of frequent itemsets (e.g., Apriori [2]) do not suffer from this problem because they use a consistent way to order the items within an itemset (e.g., lexicographically). Using this ordering, they only join two size-$k$ itemsets if they have the same $(k - 1)$-prefix. For example, a particular itemset $\{A, B, C, D\}$ will only be generated once (by joining $\{A, B, C\}$ and $\{A, B, D\}$), and if that itemset is not downward closed, it will never be generated if only its $\{A, B, C\}$ and $\{B, C, D\}$ subsets were frequent.

Fortunately, the situation for subgraph candidate generation is not as severe as the above discussion seems to indicate, and FSG addresses both of these problems by joining two frequent subgraphs if and only if they share a certain, properly selected, size-$(k - 1)$ subgraph. Specifically, for each frequent size-$k$ subgraph $F_i$, let $P(F_i) = \{H_{i,1}, H_{i,2}\}$ be the two size-$(k - 1)$ connected subgraphs of $F_i$ such that $H_{i,1}$ has the smallest canonical label and $H_{i,2}$ has the second smallest canonical label among the various connected size-$(k - 1)$ subgraphs of $F_i$. We will refer to these subgraphs as the primary subgraphs of $F_i$. Note that if every size-$(k - 1)$ subgraph of $F_i$ is isomorphic to each other, $H_{i,1} = H_{i,2}$ and $|P(F_i)| = 1$. FSG will join two frequent subgraphs $F_i$ and $F_j$ if and only if $P(F_i) \cap P(F_j) \neq \emptyset$, and the join operation will be done with respect to the common size-$(k - 1)$ subgraph(s). The proof that this approach will correctly generate all valid candidate subgraphs is presented in the Appendix. This candidate generation approach dramatically reduces the number of redundant and non-downward closed patterns that are generated and leads to significant performance improvements over the naive approach (originally implemented in [29]).
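The itemset case that the discussion above uses as its point of comparison can be sketched in a few lines. This is a minimal version of the prefix-based join with the usual downward-closure check, written for itemsets rather than subgraphs; the function name `apriori_join` is an assumption, and the sketch says nothing about how FSG itself implements the primary-subgraph join.

```python
from itertools import combinations

def apriori_join(frequent_k):
    """Join size-k itemsets that share the same (k-1)-prefix.

    frequent_k: iterable of itemsets, each given as a lexicographically
    sorted tuple. Returns the size-(k+1) candidates that are downward
    closed, i.e. whose size-k subsets are all frequent.
    """
    frequent = set(frequent_k)
    candidates = []
    for a, b in combinations(sorted(frequent), 2):
        if a[:-1] != b[:-1]:
            continue  # different prefixes: this pair is never joined
        cand = a + (b[-1],)
        # Downward closure: every size-k subset must itself be frequent.
        if all(s in frequent for s in combinations(cand, len(a))):
            candidates.append(cand)
    return candidates

all_four = [("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"), ("B", "C", "D")]
print(apriori_join(all_four))
# [('A', 'B', 'C', 'D')]  (generated once, from ABC and ABD)

only_two = [("A", "B", "C"), ("B", "C", "D")]
print(apriori_join(only_two))
# []  (ABC and BCD do not share a 2-prefix, so ABCD is never generated)
```

Graphs have no such natural ordering of their edges, which is why FSG replaces the shared prefix with shared primary subgraphs selected by canonical label.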
The actual join operation of two frequent size-$k$ subgraphs $F_i$ and $F_j$ that have a common primary subgraph $H$ is performed by generating a candidate size-$(k + 1)$ subgraph that contains $H$ plus the two edges that were deleted from $F_i$ and $F_j$ to obtain $H$. However, unlike the joining of itemsets, in which two frequent size-$k$ itemsets lead to a unique size-$(k + 1)$ itemset, the joining of two size-$k$ subgraphs may produce multiple distinct size-$(k + 1)$ candidates. This happens for the following two reasons. First, the difference between the common primary subgraph and the two frequent subgraphs can be a vertex that has the same label. In this case, the joining of such size-$k$ subgraphs will generate two distinct subgraphs of size $k + 1$. Fig. 1(a) shows such an example, in which a pair of size-4 graphs generates two different size-5 candidates. Second, the primary subgraph itself may have multiple automorphisms, and each of them can lead to a different size-$(k + 1)$ candidate. In the worst case, when the primary subgraph is an unlabeled clique, the number of automorphisms is $k!$. An example for this case is shown in Fig. 1(b), in which the primary subgraph, a square of four identically labeled vertices, has four automorphisms, resulting in three different candidates of size six.

[Fig. 1. Two cases of joining: (a) by vertex labeling; (b) by multiple automorphisms of a single core.]

Finally, in addition to joining two different subgraphs, FSG also needs to perform a self join. This happens, for example, when the two graphs $G^k_i$ and $G^k_j$ in Fig. 1 are identical. To see why it is necessary, consider graph transactions without any labels. Then, there will be only one frequent size-1 subgraph and one frequent size-2 subgraph regardless of the support threshold, because those are the only allowed structures, and edges and vertices do not have labels assigned. In general, whenever $|F^k| = 1$, a self join is necessary to obtain a set of valid $(k + 1)$-candidates.

B. Frequency Counting

The simplest way to determine the frequency of each candidate subgraph is to scan each one of the dataset transactions and determine whether or not it contains the candidate using subgraph isomorphism. Nonetheless, having to compute these isomorphisms is particularly expensive and this approach is not feasible for large datasets.
In the context of frequent itemset discovery by Apriori, the frequency counting is performed substantially faster by building a hash-tree of candidate itemsets and scanning each transaction to determine which of the itemsets in the hash-tree it supports. Developing such an algorithm for frequent subgraphs, however, is challenging as there is no natural way to build the hash-tree for graphs.

For this reason, FSG instead uses transaction identifier (TID) lists, proposed by [13], [40], [47]. In this approach, for each frequent subgraph FSG keeps a list of transaction identifiers that support it. Now when FSG needs to compute the frequency of $G^{k+1}$, it first computes the intersection of the TID lists of its frequent $k$-subgraphs. If the size of the intersection is below the support, $G^{k+1}$ is pruned; otherwise FSG computes the frequency of $G^{k+1}$ using subgraph isomorphism by limiting the search only to the set of transactions in the intersection of the TID lists. The advantages of this approach are two-fold. First, in the cases where the intersection of the TID lists is below the minimum support level, FSG is able to prune the candidate subgraph without performing any subgraph isomorphism computations. Second, when the intersection set is sufficiently large, FSG only needs to compute subgraph isomorphisms for those graphs that can potentially contain the candidate subgraph and not for all the graph transactions.

1) Reducing Memory Requirements of TID lists: The computational advantages of TID lists come at the expense of higher memory requirements for maintaining them. To address this limitation we implemented a database-partitioning-based scheme that was motivated by a similar scheme developed for mining frequent itemsets [39]. In this approach, the database is partitioned into $N$ disjoint parts $D = \{D_1, D_2, \ldots, D_N\}$. Each of these sub-databases $D_i$ is mined to find a set of frequent subgraphs $\mathcal{F}_i$, called local frequent subgraphs. The union of the local frequent subgraphs $\mathcal{C} = \bigcup_i \mathcal{F}_i$, called global candidates, is determined and their frequency in the entire database is computed by reading each graph transaction and finding the set of subgraphs that it supports.
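The TID-list test described earlier in this subsection can be sketched as follows. This is a hedged illustration rather than FSG's code: the function name `count_with_tid_lists`, its parameters, and the `contains` callback are all assumptions. To keep the example runnable, the subgraph isomorphism test is replaced by a plain subset test on sets of edge names, which is only a valid stand-in when every edge has a globally unique identity.

```python
def count_with_tid_lists(candidate, parents, tid_lists, transactions,
                         min_count, contains):
    """Return the TID set supporting `candidate`, or None if it is infrequent.

    parents: the frequent k-subgraphs of the (k+1)-candidate.
    tid_lists: {subgraph: set of transaction ids that support it}.
    contains(transaction, candidate): stand-in for the isomorphism test.
    """
    tids = set.intersection(*(tid_lists[p] for p in parents))
    if len(tids) < min_count:
        return None  # pruned without a single containment test
    # Only transactions in the intersection can possibly hold the candidate.
    supporting = {t for t in tids if contains(transactions[t], candidate)}
    return supporting if len(supporting) >= min_count else None

# Toy data: transactions are sets of edge names (an assumption, see above).
transactions = [
    {"ab", "bc", "cd"},
    {"ab", "bc"},
    {"bc", "cd"},
    {"ab", "cd"},
]
tid_lists = {
    frozenset({"ab"}): {0, 1, 3},
    frozenset({"bc"}): {0, 1, 2},
}

calls = []  # records how many containment tests were actually run

def contains(transaction, candidate):
    calls.append(candidate)
    return candidate <= transaction

candidate = frozenset({"ab", "bc"})
parents = [frozenset({"ab"}), frozenset({"bc"})]

kept = count_with_tid_lists(candidate, parents, tid_lists, transactions, 2, contains)
pruned = count_with_tid_lists(candidate, parents, tid_lists, transactions, 3, contains)
print(kept, pruned, len(calls))  # {0, 1} None 2
```

With a minimum count of 2, only the two transactions in the intersection are tested; with a minimum count of 3, the candidate is discarded from the intersection size alone, so the number of containment tests stays at two.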
The subset of $\mathcal{C}$ that satisfies the minimum support constraint is output as the final set of frequent patterns $\mathcal{F}$. Since the memory required for storing the TID lists depends on the size of the database, their overall memory requirements can be reduced by partitioning the database into a sufficiently large number of partitions.

One of the problems with a naive implementation of the above algorithm is that it can dramatically increase the number of subgraph isomorphism operations that are required to determine the frequency of the global candidate set. In order to address this problem, FSG incorporates three techniques: (i) a priori pruning of the number of candidate subgraphs that need to be considered; (ii) using bitmaps to limit the frequency counting of a particular candidate subgraph to only those partitions for which this frequency has not already been determined locally; and (iii) taking advantage of the lattice structure of $\mathcal{C}$ to check each graph transaction only against the subgraphs that are descendants of patterns that are already supported by that transaction. The net effect of these optimizations is that, as shown in Section VI-A.1, FSG's overall run-time increases slowly as the number of partitions increases.

The a priori pruning of the candidate subgraphs is achieved as follows. For each partition $D_i$, FSG finds the set of local frequent subgraphs and the set of local negative border subgraphs (footnote 1), and stores them into a file $S_i$ along with their associated frequencies. Then, it organizes the union of the local frequent and local negative border subgraphs across the various partitions into a lattice structure (called the pattern lattice), by incrementally incorporating the information from each file $S_i$. Then, for each node $v$ of the pattern lattice it computes an upper bound $f^{*}(v)$ of its occurrence frequency by adding the corresponding upper bounds for each one of the $N$ partitions, $f^{*}(v) = f^{*}_1(v) + \cdots + f^{*}_N(v)$. For each partition $D_i$, $f^{*}_i(v)$ is determined using the following equation:

$$f^{*}_i(v) = \begin{cases} f_i(v), & \text{if } v \in S_i \\ \min_{u} f^{*}_i(u), & \text{otherwise} \end{cases}$$

where $f_i(v)$ is the actual frequency of the pattern corresponding to node $v$ in $D_i$, and $u$ is a connected subgraph of $v$ that is smaller than it by one edge (i.e., it is its parent in the lattice). Note that the various $f^{*}_i(v)$ values can be computed in a bottom-up fashion by a single scan of $S_i$, and used directly to update the overall $f^{*}(v)$ values. Now, given this set of frequency upper bounds, FSG proceeds to prune the nodes of the pattern lattice that are either infrequent or fail the downward closure property.

V. CANONICAL LABELING

FSG relies on canonical labeling to efficiently check if a particular pattern satisfies the downward closure property of the support condition and to eliminate duplicate candidate subgraphs. Developing algorithms that can efficiently compute the canonical label of the various subgraphs is critical to ensure that FSG can scale to very large graph datasets.

Recall from Section II that the canonical label of a graph is nothing more than a code that uniquely identifies the graph such that if two graphs are isomorphic to each other, they will be assigned the same code. A simple way of defining the canonical label of a graph is as the string obtained by concatenating the upper triangular entries of the graph's adjacency matrix when this matrix has been symmetrically permuted so that this string becomes the lexicographically largest (or smallest) over the strings that can be obtained from all such permutations. This is illustrated in Fig.
2, which shows a graph $G^3$ and the permutation of its adjacency matrix (footnote 2) that leads to its canonical label. In this code, one part was obtained by concatenating the vertex labels in the order in which they appear in the adjacency matrix, and the other part was obtained by concatenating the columns of the upper triangular portion of the matrix. Note that any other permutation of $G^3$'s adjacency matrix will lead to a code that is lexicographically smaller than (or equal to) this one. If a graph has $|V|$ vertices, the complexity of determining its canonical label using this scheme is in $O(|V|!)$, making it impractical even for moderate size graphs.

[Fig. 2. Simple examples of codes and canonical adjacency matrices: (a) the graph $G^3$; (b), (c) two adjacency matrices and their codes.]

In practice, the complexity of finding the canonical label of a graph can be reduced by using various heuristics to narrow down the search space or by using alternate canonical label definitions that take advantage of special properties that may exist in a particular set of graphs [15], [31], [32]. In particular, the Nauty program [31] developed by Brendan McKay implements a number of such heuristics and has been shown to scale reasonably well to moderate size graphs. Unfortunately, Nauty does not allow graphs to have edge labels and as such it cannot be used directly by FSG. As a result we developed our own canonical labeling algorithm that incorporates some of the existing heuristics extended to vertex- and edge-labeled graphs as well as a number of new heuristics that are well-suited for our particular problem. Details of our canonical labeling algorithm are provided in the rest of this section.

Footnote 1: A local negative border subgraph is one that is generated as a local candidate subgraph but does not satisfy the minimum threshold for the partition.

Footnote 2: The symbol $v_i$ in the figure is a vertex ID, not a vertex label, and blank elements in the adjacency matrix mean there is no edge between the corresponding pair of vertices. This notation will be used in the rest of the section.

Note that our canonical labeling algorithm operates on the adjacency matrix representation of a graph. For this reason, FSG converts its internal adjacency list representation of each candidate or frequent subgraph into its corresponding adjacency matrix representation, prior to computing its canonical label.
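The permutation-based definition of the canonical label given at the start of this section can be sketched directly. This is the exhaustive $O(|V|!)$ scheme, not FSG's optimized algorithm; the function name `canonical_label`, the use of single-character labels, and the choice of `"0"` for a missing edge are assumptions made for the example.

```python
from itertools import permutations

def canonical_label(vertex_labels, edge_labels):
    """Brute-force canonical label of a small labeled undirected graph.

    vertex_labels: list of single-character labels, indexed by vertex id.
    edge_labels: {(u, v): single-character label} for each undirected edge.
    The code is the vertex labels in matrix order followed by the upper
    triangular entries read column by column; the lexicographically
    largest code over all vertex orderings is returned.
    """
    n = len(vertex_labels)

    def entry(u, v):
        # "0" marks the absence of an edge (an assumption of this sketch).
        return edge_labels.get((u, v)) or edge_labels.get((v, u)) or "0"

    best = None
    for order in permutations(range(n)):
        vertex_part = "".join(vertex_labels[v] for v in order)
        matrix_part = "".join(
            entry(order[i], order[j]) for j in range(1, n) for i in range(j)
        )
        code = vertex_part + matrix_part
        if best is None or code > best:
            best = code
    return best

# The same path a -x- a -y- b under two different vertex numberings.
g1 = canonical_label(["a", "a", "b"], {(0, 1): "x", (1, 2): "y"})
g2 = canonical_label(["b", "a", "a"], {(0, 1): "y", (1, 2): "x"})
# A different graph: both edges carry the label x.
g3 = canonical_label(["a", "a", "b"], {(0, 1): "x", (0, 2): "x"})

print(g1 == g2)  # True
print(g1 == g3)  # False
```

Since every code has the same length and the same layout, comparing the concatenated strings is enough to pick the largest one, and isomorphic graphs end up with identical labels regardless of how their vertices were numbered.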
One the nonil lel hs een otined, the djen mtri representtion is disrded. A. Verte Invrints Verte invrints [15] re some inherent properties of the verties tht do not hnge ross isomorphism mppings. An emple of suh n isomorphism-invrint propert is the degree or lel of verte, whih remins the sme regrdless of the mpping (i.e., verte ordering). Verte invrints n e used to prtition the verties of the grph into equivlene lsses suh tht ll the verties ssigned to the sme prtition hve the sme vlues for the verte invrints. Using these prtitions we n define the nonil lel of grph to e the leiogrphill lrgest ode otined ontenting the olumns of the upper tringulr djen mtri (s it ws done erlier), over ll possile permuttions of the verties sujet to the onstrint tht the verties of eh one of the prtitions re numered onseutivel. Thus, the onl modifition over our erlier definition is tht insted of mimiing over ll permuttions of the verties, we onl mimie over those permuttions tht keep the verties in eh prtition together. Note tht two grphs tht re isomorphi will led to the sme prtitioning of the verties nd the will e ssigned the sme nonil lel. If m is the numer of prtitions reted using verte invrints, ontining p 1, p 2,..., p m verties, respetivel, then the numer of different permuttions tht we need to onsider is m i=1 (p i!), whih n e sustntill smller thn the V! permuttions required the erlier pproh. We

have incorporated in FSG three types of vertex invariants that utilize information about the degrees and labels of the vertices, the labels and degrees of their adjacent vertices, and information about their adjacent partitions.

To appear in IEEE Transactions on Knowledge and Data Engineering

Fig. 3. A sample graph of size three and its adjacency matrices: (a) the graph on vertices $v_0, \ldots, v_3$; (b) its adjacency matrix in the order $v_0, v_1, v_2, v_3$; (c) the matrix in the order $v_1, v_0, v_3, v_2$, grouped into partitions $p_0$, $p_1$, $p_2$; (d) the matrix in the order $v_1, v_3, v_0, v_2$ with the same partitions. (Figure not reproduced.)

a) Vertex Degrees and Labels: This invariant partitions vertices into disjoint groups such that each partition contains vertices with the same label and the same degree. Fig. 3 illustrates the partitioning induced by this set of invariants for an example graph with four vertices. Based on their degree and their labels, the vertices are partitioned into three groups $p_0 = \{v_1\}$, $p_1 = \{v_0, v_3\}$ and $p_2 = \{v_2\}$ as shown in Fig. 3(c). Fig. 3 shows the adjacency matrix corresponding to the partition-constrained permutation that leads to the canonical label of the graph. Using the partitioning based on vertex invariants, we try only $1! \times 2! \times 1! = 2$ permutations, although the total number of permutations for four vertices is $4! = 24$.

b) Neighbor Lists: Invariants that lead to finer-grain partitioning can be created by incorporating information about the labels of the edges incident on each vertex, the degrees of the adjacent vertices, and their labels. In particular, we describe an adjacent vertex $v$ by a tuple $(l(e), d(v), l(v))$ where $l(e)$ is the label of the incident edge $e$, $d(v)$ is the degree of the adjacent vertex $v$, and $l(v)$ is its vertex label. Now, for each vertex $u$, we construct its neighbor list $nl(u)$ that contains the tuples for each one of its adjacent vertices. Using these neighbor lists, we then partition the vertices into disjoint sets such that two vertices $u$ and $v$ will be in the same partition if and only if $nl(u) = nl(v)$. Note that this partitioning is performed within the partitions already computed by the previous set of invariants. Fig. 4 illustrates the partitioning produced by also incorporating the neighbor list invariant on the graph of Fig. 4(a). Specifically, Fig.
4(b) shows the partitioning produced by the vertex degrees and labels, and Fig. 4(c) shows the partitioning that is produced by also incorporating neighbor lists. The neighbor lists are shown in Fig. 4(d). For this example we were able to reduce the number of permutations that needs to be considered from $4! \times 2!$ to $2!$.

Fig. 4. Use of neighbor lists: (a) the example graph on vertices $v_0, \ldots, v_5$; (b) the partitioning by vertex degrees and labels; (c) the partitioning after adding neighbor lists; (d) the neighbor lists. (Figure not reproduced.)

Fig. 5. An example of iterative partitioning on a graph with vertices $v_0, \ldots, v_7$, panels (a) to (d). (Figure not reproduced.)

c) Iterative Partitioning: Iterative partitioning generalizes the idea of the neighbor lists, by incorporating the partition information [15]. This time, instead of a tuple $(l(e), d(v), l(v))$, we use a pair $(p(v), l(e))$ for representing the neighbor lists where $p(v)$ is the identifier of the partition to which a neighbor vertex $v$ belongs and $l(e)$ is the label of the incident edge to the neighbor vertex $v$. The effect of iterative partitioning is illustrated in Fig. 5. In this example graph, all edges have the same label and all vertices have the same label. Initially the vertices are partitioned into two groups only by their degrees, and in each

partition they are sorted by their neighbor lists (Fig. 5(b)). The ordering of those partitions is based on the degrees and the labels of each vertex and its neighbors. Then, we split the first partition $p_0$ into two, because the neighbor list of $v_1$ is different from those of $v_0$ and $v_2$. By renumbering all the partitions, updating the neighbor lists, and sorting the vertices based on their neighbor lists, we obtain the matrix as shown in Fig. 5(c). Now, because the partition $p_2$ becomes non-uniform in terms of the neighbor lists, we again divide $p_2$ to factor out $v_5$, renumber partitions, update and sort the neighbor lists, and sort vertices to obtain the matrix in Fig. 5(d).

B. Degree-based Partition Ordering

In addition to using the vertex invariants to compute a fine-grain partitioning of the vertices, the overall run-time of the canonical labeling can be further reduced by properly ordering the various partitions. This is because a proper ordering of the partitions may allow us to quickly determine whether a set of permutations can potentially lead to a code that is smaller than the current best code or not; thus, allowing us to prune large parts of the search space. Recall from Section V-A that we obtain the code of a graph by concatenating its adjacency matrix in a column-wise fashion. As a result, when we permute the rows and the columns of a particular partition, the code corresponding to the columns of the preceding partitions is not affected. Now, while we explore a particular set of within-partition permutations, if we obtain a prefix of the final code that is larger than the corresponding prefix of the currently best code, then we know that regardless of the permutations of the subsequent partitions, this code will never be smaller than the currently best code, and the exploration of this set of permutations can be terminated. The critical property that allows us to prune such unpromising permutations is our ability to obtain a bad code prefix.
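Since the ordering and pruning above operate on the partitions produced in Section V-A, it may help to see how such partitions could be computed. The Python sketch below starts from the (degree, label) invariant and then refines repeatedly with neighbor lists of (partition ID, edge label) pairs until no partition splits further. The function names, the adjacency-dictionary input, and the arbitrary numbering of partition IDs are assumptions of this illustration; in particular, it does not reproduce FSG's degree-based ordering of the partitions.

```python
from collections import Counter
from math import factorial, prod


def _ids(keys):
    """Map each distinct key to a small integer partition ID."""
    ranking = {k: i for i, k in enumerate(sorted(set(keys)))}
    return [ranking[k] for k in keys]


def refine_partitions(vlabels, adj):
    """Return a partition ID per vertex.

    vlabels[u] is the label of vertex u; adj[u] is a dict {v: edge_label}.
    """
    n = len(vlabels)
    # Initial invariant: vertices with the same degree and label share a partition.
    part = _ids([(len(adj[u]), vlabels[u]) for u in range(n)])
    while True:
        # Neighbor list of (partition ID of neighbor, edge label) pairs, sorted
        # so that it is independent of the order in which neighbors are stored.
        keys = [
            (part[u], tuple(sorted((part[v], lab) for v, lab in adj[u].items())))
            for u in range(n)
        ]
        new_part = _ids(keys)
        # Keys include the old ID, so partitions can only split, never merge.
        if len(set(new_part)) == len(set(part)):
            return new_part
        part = new_part


def permutations_to_try(part):
    """Product of p_i! over the partition sizes p_i."""
    return prod(factorial(size) for size in Counter(part).values())


if __name__ == "__main__":
    # Path v0 - v1 - v2 - v3 - v4 with identical vertex and edge labels.
    path = {0: {1: "x"}, 1: {0: "x", 2: "x"}, 2: {1: "x", 3: "x"},
            3: {2: "x", 4: "x"}, 4: {3: "x"}}
    part = refine_partitions(["a"] * 5, path)
    print(part, permutations_to_try(part), factorial(5))
```

On this five-vertex path the degree invariant alone yields two partitions, and one round of refinement separates the middle vertex, leaving far fewer orderings to examine than the unrestricted $5!$.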
Ideally, we would like to order the partitions in a way such that the permutations of the vertices in the initial partitions lead to dramatically different code prefixes, which in turn will allow us to prune parts of the search space. In general, the likelihood of this happening depends on the density (i.e., the number of edges) of each partition, and for this reason we sort the partitions in decreasing order of the degree of their vertices.

C. Vertex Stabilization

Vertex stabilization is effective for finding isomorphism of graphs with regular or symmetric structures [31]. The key idea is to break the topological symmetry of a graph by forcing a particular vertex into its own partition, when the iterative partitioning leaves a large vertex partition which cannot be decomposed into smaller partitions anymore. For example, consider a cycle $G = (V, E)$ of $k$ edges where all the edges and the vertices have the same label. Each vertex is equivalent to any other since they are identical in terms of their degree, label, neighbors, and resulting partitions. As a result, a vertex cannot be distinguished from the others and there will be only a single partition containing all the $k$ vertices. To obtain a canonical label under such partitioning with the iterative partitioning only, it would require $O(k!)$ operations. Vertex stabilization breaks such a regular structure by assuming that a particular vertex in a large partition with many equivalent vertices is different from the others. The selected vertex forms a new singleton partition for itself, which triggers for the rest of the vertices the successive iterative partitioning, the details of which are described in Section V-A.0.c. Because we have chosen the vertex arbitrarily, we have to repeat the same process for the remaining vertices in the original partition. During the successive iterative partitioning, the vertex stabilization may be applied repeatedly if the iterative partitioning cannot decompose a large partition effectively. For example, in the case of a cycle with $k$ edges, once a particular vertex $v$ is chosen from the initial partition with all the $k$ vertices, it breaks the symmetry and we immediately obtain $\lceil (k-1)/2 \rceil + 1$ partitions based on the distance from $v$ to each vertex.
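The partition count in the cycle example can be checked with a small Python sketch. It singles out one vertex of a $k$-cycle and groups the remaining vertices by their breadth-first distance from it, which is the grouping that the distance-based argument above relies on. The helper names are hypothetical, and grouping by distance is used here as a stand-in for running the full iterative partitioning after the vertex has been stabilized.

```python
from collections import deque


def cycle(k):
    """Adjacency lists of an unlabeled cycle on vertices 0..k-1 (k >= 3)."""
    return {i: [(i - 1) % k, (i + 1) % k] for i in range(k)}


def distance_classes(adj, source):
    """Group vertices by breadth-first distance from the stabilized vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    classes = {}
    for v, d in dist.items():
        classes.setdefault(d, []).append(v)
    return [sorted(classes[d]) for d in sorted(classes)]


if __name__ == "__main__":
    for k in (5, 6, 7):
        classes = distance_classes(cycle(k), 0)
        # The number of classes is floor(k/2) + 1, i.e. ceil((k-1)/2) + 1.
        print(k, len(classes), classes)
```

Every class other than the singleton containing the chosen vertex (and, for even $k$, the antipodal vertex) holds exactly two vertices, so the large symmetric partition is replaced by many small ones.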
Thus, the necessary number of permutations to compute the canonical label after this partitioning is $(\lceil (k-1)/2 \rceil + 1)!$. Because there are $k$ such choices for the first vertex $v$, the entire computational complexity for the canonical labeling of $G$ is bounded by $O(k\,(k/2)!)$, which is significantly smaller than $O(k!)$. Note that vertex stabilization is not limited to cycles and that it is applicable to any type of graph. Once a partition becomes small enough, the straightforward permutation can be simpler and faster than vertex stabilization for obtaining a canonical label. Thus, our canonical labeling algorithm applies vertex stabilization only if the size of a vertex partition is greater than five. For further details on vertex stabilization the readers should refer to a textbook on permutation groups such as [12].

VI. EXPERIMENTAL EVALUATION

We experimentally evaluated the performance of FSG using actual graphs derived from the molecular structure of chemical compounds, and graphs generated synthetically. The first type of datasets allows us to evaluate the effectiveness of FSG for finding rather large patterns and its scalability to large real datasets, whereas the second one, a set of synthetic datasets, allows us to evaluate the performance of FSG on datasets whose characteristics (e.g., number of graph transactions, average graph size, average number of vertex and edge labels, and average length of patterns) differ dramatically; thus, providing insights on how well FSG scales with respect to these characteristics. All experiments were done on dual AMD Athlon MP 1800+ (1.53 GHz) machines with 2 Gbytes main memory, running the Linux operating system. All the times reported are in seconds.

A. Chemical Compound Datasets

We derived graph datasets from two publicly available datasets of chemical compounds. The first dataset (footnote 3: ftp://ftp.comlab.ox.ac.uk/pub/packages/ilp/datasets/carcinogenesis/progol/carcinogenesis.tar.z) contains 340 chemical compounds and was originally provided for the Predictive Toxicology Evaluation (PTE) Challenge [43], and the second dataset (footnote 4: http://dtp.nci.nih.gov/docs/3d_database/structural_information/structural_data.html) contains 223,644 chemical compounds and

is available from the Developmental Therapeutics Program (DTP) at the National Cancer Institute. From the description of chemical compounds in those two datasets, we created a transaction for a compound, a vertex for an atom, and an edge for a bond. Each vertex has a vertex label assigned for its atom type and each edge has an edge label assigned for its bond type. In the PTE dataset there are 66 atom types and 4 bond types, and in the DTP dataset there are 104 atom types and 3 bond types. Each graph transaction obtained from the PTE and the DTP datasets has 27 and 22 edges on the average, respectively.

d) Results: Table II shows the results obtained by FSG on four datasets derived from the PTE and DTP datasets. The first dataset was obtained by using all the compounds of the PTE dataset, whereas the remaining three datasets were obtained by randomly selecting 50,000, 100,000, and 200,000 compounds from the DTP dataset. There are three types of results shown in the table: the run-time in seconds ($t$), the size of the largest discovered frequent subgraph ($k$), and the total number of frequent patterns ($|F|$) that were generated. The minimum support threshold ranged from 10% down to 1.0%. Dashes in the table correspond to experiments that were aborted due to high computational requirements. All the results in this table were obtained using a single partition of the dataset.

FSG is able to effectively operate on datasets containing 200,000 transactions and discover all frequent connected subgraphs which occur in 1% of the transactions in approximately one hour. With respect to the number of transactions, the run-time scales almost linearly. For instance, with the 2% support, the run-time for 50,000 transactions is 263 seconds, whereas the corresponding run-time for 200,000 transactions is 1,343 seconds, an increase by a factor of 5.1. As the support decreases, the run-time increases, reflecting the increase of the number of frequent subgraphs found from the input dataset.
For example, with 200,000 transactions, the run-time for the 1% support is 4.2 times longer than that for the 3% support, and the number of found frequent subgraphs for the 1% support was 8.2 times more than that for the 3% support.

Comparing the performance on the PTE and DTP-derived datasets we notice that the run-time for the PTE dataset dramatically increases as the minimum support decreases, and eventually overtakes the run-time for most of the DTP-derived datasets. This behavior is due to the maximum size and the total number of frequent subgraphs that are discovered in this dataset (both of which are shown in Table II). For lower support values the PTE dataset contains both more and longer frequent subgraphs than the DTP-derived datasets do. This is due to the inherent characteristics of the PTE dataset because it contains larger and more similar compounds. For example, the PTE dataset contains 26 compounds with over 50 edges and the largest compound has 214 edges. Despite that, FSG requires 459 seconds for a support value of 2.0%, and is able to discover patterns containing over 22 edges.

1) Reducing Memory Requirement of TID lists: To evaluate the effectiveness of the database-partitioning-based approach (described in Section IV-B.1) for reducing the amount of memory required by TID lists (TID list memory), we performed a set of experiments in which we used two datasets derived from the DTP dataset containing 100,000 and 200,000 chemical compounds, respectively. For each dataset we used FSG to find all frequent patterns that occur in at least 1% of the transactions by partitioning the dataset in 2, 3, 4, 5, 10, 20, 30, 40, and 50 partitions. These results are shown in Table III. For each experiment, this table shows the total run-time, the maximum amount of TID list memory, and the maximum amount of memory required to store the pattern lattice (pattern lattice memory). From these results we can see that the database-partitioning-based approach is quite effective in reducing the TID list memory, because it shrinks almost in inverse proportion to the number of partitions (e.g., from 53.8 Mbytes with one partition to 5.6 Mbytes with ten partitions for the 100,000-compound dataset).
Moreover, the various optimizations described in Section IV-B.1 are quite effective in limiting the degradation in runtime of the resulting algorithm. For example, for the 200,000-compound dataset and 50 partitions, the runtime increases only by a factor of 3.4 over that for a single partition. Also, the pattern lattice memory increases slowly as the number of partitions increases, and unless the number of partitions is quite large, it is still dominated by the memory required to store the TID lists. Note that these results suggest that there is an optimal point for the number of partitions that leads to the least amount of memory, as the pattern lattice memory will eventually exceed the TID list memory as the number of partitions increases.

B. Synthetic Datasets

To evaluate the performance of FSG on datasets with different characteristics we developed a synthetic graph generator which can control the number of transactions $|D|$, the average number of edges in each transaction $|T|$, the average number of edges $|I|$ of the potentially frequent subgraphs, the number of potentially frequent subgraphs $|S|$, the number of distinct edge labels $|L_E|$, and the number of distinct vertex labels $|L_V|$ of the generated dataset. The design of our generator was inspired by the synthetic transaction generator developed by the Quest group at IBM and used extensively to evaluate algorithms that find frequent itemsets [1], [2], [20].

The actual generator works as follows. First, it generates a set of $|S|$ potentially frequent connected subgraphs called seed patterns whose size is determined by a Poisson distribution with mean $|I|$. For each seed pattern, the topology and the labels of the edges and the vertices are chosen randomly. Each seed pattern has a weight assigned, which becomes the probability that the seed pattern is selected to be included in a graph transaction. The weights are calculated by dividing a random variable which obeys an exponential distribution with unit mean by the number of edges in the seed pattern, and the sum of the weights of all the seed patterns is normalized to one. We call this set $S$ of seed patterns the seed pool.
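The weight assignment just described can be sketched in a few lines of Python. The sketch assumes that the seed-pattern sizes (numbers of edges) have already been drawn and are all at least one; the function name and the explicit random-number-generator argument are choices made for this illustration, not part of the original generator.

```python
import random


def seed_weights(edge_counts, rng):
    """Selection probabilities for the seed patterns in the seed pool.

    edge_counts[i] is the number of edges of seed pattern i (assumed >= 1).
    Each raw weight is a unit-mean exponential variate divided by the edge
    count; the raw weights are then normalized so that they sum to one.
    """
    raw = [rng.expovariate(1.0) / m for m in edge_counts]
    total = sum(raw)
    return [w / total for w in raw]


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only to make the run repeatable
    sizes = [3, 5, 8, 13]
    for size, weight in zip(sizes, seed_weights(sizes, rng)):
        print(size, round(weight, 3))
```

Dividing by the edge count biases the pool so that, on average, a pattern with twice as many edges receives half the raw weight before normalization.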
The reason that we divide the exponential random variable by the number of edges is to reduce the chance that larger weights are assigned to larger seed patterns. Otherwise, once a large weight was assigned to a large seed pattern, the resulting dataset would contain an exponentially large number of frequent patterns.

Next, the generator creates $|D|$ transactions. First, the generator determines the target size of each transaction, which is

a Poisson random variable whose mean is equal to $|T|$. Then, the generator selects a seed pattern from the seed pool, by rolling an $|S|$-sided die. Each face of this die corresponds to the probability assigned to a seed pattern in the seed pool.

TABLE II
RUN-TIME IN SECONDS FOR THE PTE AND DTP CHEMICAL COMPOUND DATASETS
Run-time t [sec], size of largest frequent pattern k, and number of frequent patterns |F|

Support  PTE, |D| = 340       DTP, |D| = 50,000    DTP, |D| = 100,000   DTP, |D| = 200,000
[%]      t[sec]   k     |F|   t[sec]   k     |F|   t[sec]   k     |F|   t[sec]   k     |F|
10.0          3  11     844       74   9     351      156   9     360      337   9     373
9.0           3  11     977       80   9     400      169  10     420      366  10     442
8.0           4  11    1323       87  11     473      184  11     490      401  11     512
7.0           4  12    1770       94  11     562      200  11     591      437  11     635
6.0           6  13    2326      109  12     782      230  12     813      503  12     860
5.0           9  14    3608      122  12    1017      259  12    1068      570  12    1140
4.0          16  15    5935      146  13    1523      316  13    1676      705  13    1855
3.0          60  22   22758      186  14    2705      398  14    2810      894  14    3004
2.0         459  25  136927      263  14    5295      571  14    5633     1343  15    6240
1.0           -   -       -      658  16   19373     1458  16   20939     3776  17   24683

Note. Dashes indicate the computation was aborted because of the too long run-time. |D|: Number of transactions.

TABLE III
RUN-TIME AND TID LIST MEMORY WITH PARTITIONING

Run-time [sec]
|D|       Number of partitions
               1      2      3      4      5     10     20     30     40     50
100,000     1432   1878   2032   2189   2356   2924   3899   4842   6122   7459
200,000     3698   4494   5095   5064   5538   6418   7856   9516  11165  12670

Maximum amount of memory for storing TID lists [Mbytes]
|D|       Number of partitions
               1      2      3      4      5     10     20     30     40     50
100,000     53.8   27.0   18.1   13.6   11.0    5.6    2.9    2.0    1.5    1.2
200,000      118   59.1   39.5   29.6   23.9   12.1    6.2    4.2    3.2    2.6

Maximum amount of memory for storing pattern lattice [Mbytes]
|D|       Number of partitions
               1      2      3      4      5     10     20     30     40     50
100,000             1.4    1.5    1.5    1.6    1.9    2.5    3.2    3.8    4.3
200,000             1.7    1.8    1.8    1.8    2.0    2.4    2.8    3.2    3.6

Note. The two datasets are generated from the DTP dataset by sampling 100,000 and 200,000 chemical compounds. The minimum support is σ = 1.0%. Pattern lattice memory is left blank for a single partition because the lattice is not built. |D|: Number of transactions.
If the size of the selected seed pattern fits in the target transaction size, the generator adds it to the transaction. If the size of the current intermediate transaction does not reach its target size, we keep selecting and putting another seed pattern into it. When adding the selected seed pattern makes the intermediate transaction size greater than the target transaction size, we add it in half of the cases, and discard it and move on to the next transaction in the other half. The generator adds a seed pattern into a transaction by connecting a randomly selected pair of vertices, one from the transaction and the other from the seed pattern.

a) Results: Using this generator, we obtained a number of different datasets by varying the number of vertex labels $|L_V|$, the average size of the potentially frequent subgraphs $|I|$, and the average size of each transaction $|T|$, while keeping fixed the total number of transactions $|D| = 10{,}000$, seed patterns $|S| = 200$, and edge labels $|L_E| = 1$, respectively. Despite our best efforts in designing the generator, we observed that as both $|T|$ and $|I|$ increase, different datasets created under the same parameter combination lead to different run-times because some may contain harder seed patterns (e.g., regular patterns with similar labels) than others do. To reduce this variability, we created ten different datasets for each parameter combination with different seeds for the pseudo random number generator and ran FSG on all of them. The median of the ten run-times for each parameter combination is shown in Fig. 6. Note that these results were obtained using 2% as the minimum support threshold.

Fig. 6. Median of 10 run-times in seconds for synthetic data sets. $|T|$ is the average size of transactions, $|I|$ is the average size of seed patterns, and $|L_V|$ is the number of distinct vertex labels. Three panels ($|I| = 5$, $7$, $9$) plot the median run-time [s] on a logarithmic axis against $|L_V|$ from 0 to 20, with one curve per value of $|T|$. (Figure not reproduced.)
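The transaction-filling step of the synthetic generator described in Section VI-B can be sketched as follows in Python. This is a size-only illustration under several assumptions: the target size is passed in rather than drawn from a Poisson distribution, seed-pattern sizes are positive edge counts, the transaction size is taken to be the sum of the sizes of the patterns placed in it, and the actual attachment of a pattern to the growing graph is omitted. The function name and return value are hypothetical.

```python
import random


def fill_transaction(target_size, seed_sizes, weights, rng):
    """Pick seed patterns for one transaction; return (indices, total size).

    seed_sizes[i] is the edge count of seed pattern i (assumed >= 1) and
    weights[i] is its selection probability in the seed pool.
    """
    chosen, size = [], 0
    while size < target_size:
        # Weighted "die roll" over the seed pool.
        idx = rng.choices(range(len(seed_sizes)), weights=weights, k=1)[0]
        if size + seed_sizes[idx] <= target_size:
            # The pattern fits: add it and keep filling.
            chosen.append(idx)
            size += seed_sizes[idx]
        else:
            # The pattern overshoots the target: keep it in half of the
            # cases, discard it otherwise, and finish this transaction.
            if rng.random() < 0.5:
                chosen.append(idx)
                size += seed_sizes[idx]
            break
    return chosen, size


if __name__ == "__main__":
    rng = random.Random(7)  # fixed seed only to make the run repeatable
    print(fill_transaction(20, [3, 5, 7], [0.5, 0.3, 0.2], rng))
```

Under these assumptions a transaction ends up within one seed pattern of its target size, either slightly below it (overshooting pattern discarded) or slightly above it (overshooting pattern kept).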
In general, FSG's run-time decreases as the number of vertex labels $|L_V|$ increases, whereas it increases when the average size of the seed patterns $|I|$ or the average transaction

size $|T|$ increases. These trends are consistent with the inherent characteristics of the datasets because of the following reasons:
(i) As the number of vertex labels increases, the space of possible automorphisms and subgraph isomorphisms decreases, leading to faster candidate generation and frequency counting.
(ii) As the size of the average seed pattern increases, because of the combinatorial nature of the problem, the total number of frequent patterns to be found from the dataset increases exponentially, increasing the overall run-time.
(iii) As the size of the average transaction $|T|$ increases, frequency counting by subgraph isomorphism becomes expensive, regardless of the size of candidate subgraphs. Moreover, the total number of frequent patterns to be found from the dataset also increases because more seed patterns can be put into each transaction. Both of these factors contribute to increasing the overall runtime.

VII. RELATED WORK

Over the years, a number of different algorithms have been developed to find frequent patterns corresponding to frequent subgraphs in graph datasets. Developing such algorithms is particularly challenging and computationally intensive, as graph and subgraph isomorphisms play a key role throughout the computations. For this reason, a considerable amount of work has been focused on approximate algorithms [23], [28], [35], [46] that use various heuristics to prune the search space. However, a number of exact algorithms have been developed [5], [10], [17], [24], [25], [45] that guarantee to find all subgraphs that satisfy certain minimum support or other constraints.

Probably the most well-known heuristic-based approach is the SUBDUE system, which was originally developed in 1994 but has been improved over the years [8], [23]. SUBDUE finds patterns which can effectively compress the original input data based on the minimum description length principle, by substituting those patterns with a single vertex. To narrow the search space and improve its computational efficiency, SUBDUE uses a heuristic beam search approach, which quite often results in failing to find subgraphs that are frequent.
Nevertheless, despite its heuristic nature, its computational performance is considerably worse compared to some of the recent frequent subgraph discovery algorithms. Experiments reported in [17] for the PTE dataset [43] show that SUBDUE spends about 80 seconds on a Pentium III 900 MHz computer to find the five most frequent substructures. In contrast, the FSG algorithm developed by our group [29] takes only 20 seconds on a Pentium III 450 MHz to find all 3,608 frequent subgraphs that occur in at least 5% of the compounds.

A number of approaches for finding commonly occurring subgraphs have been developed in the context of inductive logic programming (ILP) systems [19], [33], [34], [38], [44], as graphs can be easily expressed using first-order logic. Each vertex and edge is represented as a predicate and a subgraph corresponds to a conjunction of such predicates. The goal of ILP-based approaches is to induce a set of rules capable of correctly classifying a set of positive and negative examples. In the case of graphs modeled by ILP systems, these rules usually correspond to subgraphs. Most ILP-based approaches are greedy in nature and use various heuristics to prune the space of possible hypotheses. Thus, they tend to find subgraphs that have high support and can act as good discriminators between classes. However, they are not guaranteed to discover all frequent subgraphs. A notable exception is the ILP system WARMR developed by Dehaspe and De Raedt [9], which is capable of finding all frequently occurring subgraphs. However, WARMR is not specialized for handling graphs: it does not employ any graph-specific optimizations and, as such, it has high computational requirements.

In the last three years, three different algorithms have been developed that are capable of finding all frequently occurring subgraphs with reasonable computational efficiency. These are AGM by Inokuchi et al. [24], [25], the chemical substructure discovery algorithm developed by Borgelt and Berthold [5], and the gSpan algorithm developed by Yan and Han [45]. Among them, the early version of AGM [24] was developed prior to FSG, whereas the other algorithms were developed after the initial development of FSG [29].
AGM, initially developed to find frequently induced subgraphs [24] and later extended to find arbitrary frequent subgraphs [25], discovers the frequent subgraphs using a breadth-first approach, and grows the frequent subgraphs one vertex at a time. To distinguish a subgraph from another, it uses a canonical labeling scheme based on the adjacency matrix representation. Experiments reported in [24] show that AGM achieves good performance for synthetic dense datasets, and it required 40 minutes to 8 days to find all frequent induced subgraphs in the PTE dataset, as the minimum support threshold varied from 20% to 10%. Their modified algorithm [25] uses previously found embeddings of a frequent pattern in a transaction to save the subgraph isomorphism computation and improves the performance significantly at the expense of increased memory requirements.

The chemical substructure mining algorithm developed by Borgelt and Berthold [5] finds frequent substructures (connected subgraphs) using a depth-first approach similar to that used by dEclat [49] in the context of frequent itemset discovery. In this algorithm, once a frequent subgraph has been identified, it then proceeds to explore the input dataset for frequent subgraphs all of which contain the frequent subgraph. To reduce the number of subgraph isomorphism operations, it keeps the embeddings of previously discovered subgraphs and tries to extend the embeddings by one edge, which is similar to the modified version of AGM [25]. In addition, since all the embeddings of the frequent subgraph are known, they project the original dataset into a smaller one by removing edges and vertices that are not used by any embeddings. Nevertheless, despite these optimizations, the reported speed of the algorithm is slower than that achieved by FSG. This is primarily due to two reasons. First, their candidate subgraph generation scheme does not ensure that the same subgraph is generated only once; as a result, they end up generating and determining the frequency of the same subgraph multiple times. Second, in chemical datasets, the same subgraph tends to have many embeddings (in the range of 20 to 200); as a result, the cost of keeping track of them outweighs any benefits.