Neighborhood Based Fast Graph Search in Large Networks


Arijit Khan, Dept. of Computer Science, University of California, Santa Barbara, CA 93106
Ziyu Guan, Dept. of Computer Science, University of California, Santa Barbara, CA 93106
Nan Li, Dept. of Computer Science, University of California, Santa Barbara, CA 93106
Supriyo Chakraborty, Dept. of Electrical Engineering, University of California, Los Angeles, CA
Xifeng Yan, Dept. of Computer Science, University of California, Santa Barbara, CA 93106
Shu Tao, IBM T.J. Watson, 19 Skyline Drive, Hawthorne, NY 10532, shutao@us.ibm.com

ABSTRACT

Complex social and information network search becomes important with a variety of applications. In the core of these applications lies a common and critical problem: given a labeled network and a query graph, how to efficiently search the query graph in the target network. The presence of noise and the incomplete knowledge about the structure and content of the target network make it unrealistic to find an exact match. Rather, it is more appealing to find the top-k approximate matches. In this paper, we propose a neighborhood-based similarity measure that could avoid costly graph isomorphism and edit distance computation. Under this new measure, we prove that subgraph similarity search is NP hard, while graph similarity match is polynomial. By studying the principles behind this measure, we found an information propagation model that is able to convert a large network into a set of multidimensional vectors, where sophisticated indexing and similarity search algorithms are available. The proposed method, called Ness (Neighborhood Based Similarity Search), is appropriate for graphs with low automorphism and high noise, which are common in many social and information networks. Ness is not only efficient, but also robust against structural noise and information loss. Empirical results show that it can quickly and accurately find high-quality matches in large networks, with negligible cost.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Search process; I.2.8 [Problem Solving, Control Methods, and Search]: Graph and tree search strategies

General Terms
Algorithms, Performance

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD'11, June 12-16, 2011, Athens, Greece. Copyright 2011 ACM ...$10.00.

Keywords
Graph Query, Graph Search, Graph Alignment, RDF

1. INTRODUCTION

Recent advances in social and information science have shown that linked data pervade our society and the natural world around us [36]. Graphs become increasingly important to represent complicated structures and schema-less data such as Wikipedia, Freebase [5] and various social networks. Given an attributed network and a small query graph, how to efficiently search the query graph in the target network is a critical task for many graph applications. It has been extensively studied in chemi-informatics, bioinformatics, XML and Semantic Web. SPARQL [27] is the state-of-the-art RDF query language for Semantic Web. SPARQL requires accurate knowledge about the graph structure to write a query, and it performs exact graph pattern matching. However, due to the noise and the incomplete information (structure and content) in many networks, it is not realistic to find exact matches for a given query. It is more appealing to find the top-k approximate matches.

Unfortunately, graph similarity measures such as subgraph isomorphism, maximum common subgraphs, graph edit distance and missing edges, which are appropriate for chemical structures and biological networks, are not suitable for entity-relationship graphs and social networks. There are two challenging issues for these graph theoretic measures. First, entity-relationship graphs and social networks have quite different characteristics from physical networks.
They are not governed by physical laws and are often full of noise, thus making strict topological similarity examination nearly impossible. How the entities are connected in these networks is not as important as how closely these entities are connected. Second, these graphs are very large and complex, with a lot of attributes associated. If accuracy is to be ensured, the algorithms developed for edit distance and missing edges are not scalable. These two issues motivate us to invent new graph similarity measures that are less sensitive to structure changes, and have scalable indexing and search solutions.

Figure 1(a) shows a graph query to "Find the athlete who is from Romania and won gold in 3000m and bronze in 1500m both in 1984 Olympics". Comparing this query against a possible match in Freebase (Olympics) shown in Figure 1(b), it is observed that these two graphs are by no means similar under traditional graph similarity definitions. Graph edit distance between

these two graphs is 7. The size of their maximum common graph is 3. The number of maximum missing edges for the query graph is 4. However, Maricica Puica in Figure 1(b) is a good match for the query shown in Figure 1(a), because she has all these attributes quite close to her in Figure 1(b). In practice, it is hard to come up with a query that exactly conforms with the graph structures in the target network due to the lack of schemas in linked data. However, it is easy to write a query like Figure 1(a), where a user connects entities with possible links. As long as the proximity between these entities is approximately maintained in a query graph, the system shall be able to deliver matches like Figure 1(b).

Figure 1: Top-1 Match for Query (a) in Freebase. Panels: (a) Query; (b) Match in Freebase.

The above approximate query form can serve as a primitive for many advanced graph operators such as RDF query answering, network alignment, subgraph similarity search, name disambiguation and database schema matching. For example, based on partial information related to one person, e.g., his friends, one can align his physical social circle with his cyber social network on Facebook. In many cases, nodes in social or information networks have incomplete information or even anonymized information. Nevertheless, the partial neighborhood information available from a query graph will be helpful to identify entities in the target network. Clearly, there is a need to adopt approximate similarity search techniques to solve the above problem.

In bioinformatics, approximate graph alignment has been extensively studied, e.g., PathBlast [2], Saga [33]. These studies resort to a strict approximation definition such as graph edit distance, whose optimal solution is expensive to compute. Since they are targeting relatively small biological networks with less than 10k nodes, it is difficult to apply them in social and information networks with thousands or even millions of nodes. As illustrated in NetAlign [23], in order to handle large graphs with 10k nodes, one has to sacrifice accuracy to achieve better query response time.
Recently there have been other studies on approximate matching with large graphs, i.e., TALE [34], SIGMA [24] and G-Ray [35]. However, both TALE and SIGMA consider the number of missing edges as the qualitative measure of approximate matching and hence, the techniques cannot capture the notion of proximity among labels, as shown in Figure 1. G-Ray, on the other hand, tries to maintain the shape of the query by allowing some approximation in the match. Unfortunately, shape is not an important factor in entity-relationship graphs.

In this paper, we introduce a novel neighborhood-based similarity measure by vectorizing nodes according to the label distribution of their neighbors. We further extend the similarity notion to graphs by finding the embeddings in the target graph that maximize the sum of node matches. This graph matching technique avoids complicated subgraph isomorphism and graph edit distance calculation, which becomes infeasible for large graphs. It is observed that social/information networks usually have more diversified node labels and therefore less auto-isomorphic structure, but may contain more noise. Our objective function can provide better similarity semantics for graphs with various random noise. It simplifies the procedure of graph matching, leading to the development of an efficient graph search framework, called Ness (Neighborhood Based Similarity Search). With the introduction of scalable indices built on vectorized nodes and an intelligent query optimization technique, Ness can quickly and accurately find high-quality matches in large networks, with negligible time cost.

Our contributions. We propose a novel similarity search problem in graphs, neighborhood-based similarity search, which combines the topological structure and content information together during the search process. The similarity definition proposed in this work is able to avoid expensive isomorphism testing as much as possible. The principles to derive appropriate functions to fit this definition are carefully examined.
We found that the information propagation model satisfies these principles, where each node propagates a certain fraction of its labels to its neighbors, and thereby we could convert each node into a multidimensional vector, where sophisticated indexing and similarity search algorithms are available. That is, we successfully turn a graph search problem into a high-dimension index problem.

We first identify a set of rules to define approximate matches of nodes based on their neighborhood structure and labels. These rules are important since the query may not always have complete information about the exact neighborhood structure in the target graph. The approximate node match concept is further extended to subgraph similarity search, i.e., multiple node alignment for a given query graph. We prove that under this measure, subgraph similarity search is NP hard. However, in comparison with graph isomorphism, which is neither known to be solvable in polynomial time nor NP-hard, graph similarity match is proved to be polynomial.

We demonstrate that, without performing subgraph isomorphism testing, it is possible to prune unpromising nodes by iteratively propagating node information among a shrinking candidate set, which significantly reduces query execution time. We further analyze how to index the vector structure as well as optimize query processing to speed up similarity search. The information propagation model and the neighborhood vectorization approach keep the index structure much simpler than the graph itself, thus making it easy to be updated dynamically for graph changes arising from node/edge insertion and deletion.

In summary, we propose a completely new graph similarity search framework, Ness, to define and determine approximate matches in massive graphs. As tested in real and synthetic networks, Ness is able to find high-quality matches efficiently in large scale networks.

2. PRELIMINARIES

A labeled graph $G = (V_G, E_G, L_G)$ has a label set $L_G$, and each node $u \in V_G$ is attached with a set of labels. The label set of node $u$ in $G$ is denoted by $L(u) \subseteq L_G$. For the sake of simplicity, we assume there are no labels and weights on the edges.
Nevertheless, the proposed techniques could be extended for graphs with labeled or weighted edges.

Given two labeled graphs $G$ and $G'$, $G$ is called subgraph isomorphic to $G'$ if there exists a subgraph $H$ of $G'$ such that $G$ is isomorphic to $H$. Formally, we define subgraph isomorphism as follows.

DEFINITION 1 (SUBGRAPH ISOMORPHISM). A subgraph isomorphism is an injective function $f : V_G \rightarrow V_{G'}$, s.t., (1) $\forall u \in V_G$, $L(u) \subseteq L(f(u))$, and (2) $\forall (u, v) \in E_G$, $(f(u), f(v)) \in E_{G'}$.

DEFINITION 2 (EMBEDDING). Given a graph $G$ and a query graph $Q$, an embedding of $Q$ is an (injective) function $f : V_Q \rightarrow V_G$, such that $\forall v \in V_Q$, $L(v) \subseteq L(f(v))$, where $f(v) \in V(G)$.

In this work, we only study the one-to-one node matching for a query graph $Q$, and the node labels are preserved in the embedding. However, our cost function and algorithms can be extended to include other matching and node label similarity scenarios.

Given two graphs $G$ and $Q$, there might be many possible embeddings. Certainly, the quality of an embedding depends on whether it preserves the connections and labels in the query graph or not. Subgraph isomorphism actually defines an exact embedding, written as $f_e$. The quality of an embedding can be defined in various ways; e.g., for a given label-preserved embedding $f$, we can count the number of edge mismatches, $C_e = |\{(u, v) \in E_Q : (f(u), f(v)) \notin E_G\}|$, as the embedding's quality. In general, for a cost function $C : f \rightarrow \mathbb{R}$, we define the top-k graph similarity search problem as below.

PROBLEM STATEMENT 1. Given a graph $G$ and a query graph $Q$, find the top-k embeddings with respect to a cost function $C$.

The edge mismatch cost function $C_e$ has been studied in [38, 34, 24]. Unfortunately, it cannot differentiate the case where two nodes are close to each other but there is no direct edge between them.

Figure 2: Problem with Edge Mismatch Cost Function

Figure 3: Information Propagation Model

Figure 2 shows one example. There are two label-preserved embeddings $f_1$ and $f_2$ of the query graph $Q$ in the target graph $G$. In $f_1$ and $f_2$, there is no edge connecting $a$ and $b$. Thus, $C_e$ will assign equal cost to both embeddings. On the other hand, the graph edit distance between $f_1$ and $Q$ is 2, whereas it is only 1 between $f_2$ and $Q$. Intuitively, however, $f_1$ is a better match than $f_2$, because the nodes with labels $a$ and $b$ are only 2 hops away in $f_1$, whereas they are disconnected in $f_2$.
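The limitation of the edge mismatch cost can be illustrated with a minimal sketch. The graph below is a hypothetical toy example (it is not the graph of Figure 2), and the function name `edge_mismatch_cost` is introduced here only for illustration. It counts the query edges whose images are not edges of the target graph, following the definition of $C_e$.

```python
def edge_mismatch_cost(query_edges, target_edges, f):
    """Count query edges (u, v) whose image (f[u], f[v]) is not an edge of G."""
    target = {frozenset(e) for e in target_edges}  # undirected edges
    return sum(1 for u, v in query_edges
               if frozenset((f[u], f[v])) not in target)

# Hypothetical target graph: a path a1 - x - b1; nodes a2 and b2 are isolated.
target_edges = [("a1", "x"), ("x", "b1")]
# Query: one edge between a node labeled a and a node labeled b.
query_edges = [("va", "vb")]

f1 = {"va": "a1", "vb": "b1"}  # images are 2 hops apart
f2 = {"va": "a2", "vb": "b2"}  # images are disconnected

cost_f1 = edge_mismatch_cost(query_edges, target_edges, f1)
cost_f2 = edge_mismatch_cost(query_edges, target_edges, f2)
print(cost_f1, cost_f2)  # both embeddings miss exactly one edge
```

Both embeddings receive the same cost even though the matched nodes are close together in the first one and unreachable from each other in the second, which is the behavior the text argues against.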
This observation inspires us to develop a neighborhood-based similarity measure that discounts how nodes are exactly connected, but focuses on the proximity among the labels carried by these nodes. It needs to achieve the following two objectives: (1) the cost function should identify approximate embeddings, and (2) it must be easy to compute. In the next section, we will define the neighborhood-based similarity cost function and give the complexity analysis of that function.

3. NEIGHBORHOOD-BASED GRAPH SIMILARITY

In order to solve the problem raised by the edge mismatch cost function, we define a novel neighborhood-based similarity measure by comparing the h-hop neighbors of a node, defined as follows.

DEFINITION 3 (h-hop NEIGHBORS). Given a graph $G$ and a node $u \in V(G)$, the h-hop neighborhood of $u$ is the set of nodes $v$ whose distance from $u$ is less than or equal to $h$.

To compare the neighborhoods of two nodes, we resort to an information propagation model [22] that is able to transform neighborhoods into vectors in a multidimensional space, where sophisticated indexing and fast similarity search algorithms are available.

3.1 Information Propagation Model

Figure 3 shows the information propagation model to characterize the neighborhood information around a node $u$. The label information encoded in $u$'s neighbors is propagated to $u$ through different paths and accumulated at $u$. One could use the accumulated information and its strength as a vector to describe the neighborhood of $u$. The neighborhood vector of $u$ is denoted by $R(u)$, which consists of a set of tuples, $R(u) = \{\langle l, A(u, l) \rangle\}$, where $l$ is a label present in the neighborhood of $u$ and $A(u, l)$ represents the strength of label $l$ at node $u$ in a graph.

There are many different mechanisms to propagate information. However, not every one is valid for graph similarity search. Any valid one must comply with the following principle.

PROPERTY 1 (COST FUNCTION). For a graph similarity cost function $C$, given an exact embedding $f_e$, $C(f_e)$ must be equal to 0.
Here, we consider a simple but effective information propagation model so that the derived neighborhood-based similarity measure satisfies the above principle. It propagates information along the shortest paths between two nodes with exponential decay to the length. Eq. 1 describes the formula of $A(u, l)$ in $R(u) = \{\langle l, A(u, l) \rangle\}$ that represents the h-hop neighborhood of node $u$ in a graph.

$$A(u, l) = \sum_{i=1}^{h} \alpha^{i} \sum_{d(u,v)=i} I(l \in L(v)) \qquad (1)$$

where $I(l \in L(v))$ is an indicator function which takes value one when $l$ is in the label set of $v$ and zero otherwise, and $d(u, v)$ is the distance between $u$ and $v$. $\alpha$ is a constant called the propagation factor. It is between 0 and 1; its optimum value will be discussed later. Eq. 2 confines Eq. 1 to an embedding $f$ in $G$ by only considering the vertices and the shortest paths in $f$.

$$A_f(u, l) = \sum_{i=1}^{h} \alpha^{i} \sum_{v \in V_f,\, d(u,v)=i} I(l \in L(v)) \qquad (2)$$

Using this information propagation model, we shall formulate the neighborhood-based cost function.

3.2 Neighborhood-based Cost Function

Given a query graph $Q$ and its embedding $f$ in the target graph $G$, we can apply the information propagation model to propagate labels in $Q$ and $f$. Since vertices in $f$ might not be directly connected, we will consider all of the shortest paths connecting these vertices during propagation. To derive the neighborhood-based cost function $C_N(f)$, we first compute the difference between the neighborhood vectors $R_f(u)$ and $R_Q(v)$, representing the neighborhoods of $u$ and $v$ in the embedding and the query graph, respectively.

$$C_N(v, u) = \sum_{l \in R_Q(v)} M(A_Q(v, l), A_f(u, l)) \qquad (3)$$

where $M(x, y)$ is a positive difference function as given below.

$$M(x, y) = \begin{cases} x - y, & \text{if } x > y; \\ 0, & \text{otherwise.} \end{cases}$$

The reason to adopt a positive difference function is that if the embedding $f$ in $G$ carries more labels than $Q$, we shall not penalize it. Only when there are labels and edges missed in $f$ will $C_N(v, u)$ return a positive value. Note that the summation in Equation 3 is considered over all labels $l$ present in $R_Q(v)$, i.e., $\{l : A_Q(v, l) > 0\}$. For brevity, we simply denote this by $l \in R_Q(v)$ in Equation 3, and the same notation will be used in the remainder of the paper.

Given an embedding $f$, we aggregate the differences for all pairs $(v, u)$, where $u = f(v)$. The neighborhood-based graph similarity cost $C_N(f)$ is given as follows.

$$C_N(f) = \sum_{v \in V_Q} C_N(v, f(v)) \qquad (4)$$

Figure 4: Neighborhood Based Similarity Cost

Figure 5: Example of False Positive

Figure 4 provides an example of the neighborhood-based graph matching cost. In graph $G$, label $b$ is propagated to node $u_1$ from nodes $u_2$ and $u_2'$, via the corresponding shortest paths respectively. Assuming $\alpha = 0.5$ and $h = 2$, we have $A_G(u_1, b) = 0.5 + 0.25 = 0.75$. We can derive the neighborhood vectors for the other nodes in $G$: $R_G(u_1) = \{\langle b, 0.75 \rangle, \langle c, 0.5 \rangle\}$, $R_G(u_2) = \{\langle a, 0.5 \rangle, \langle c, 0.25 \rangle\}$, $R_G(u_3) = \{\langle a, 0.5 \rangle, \langle b, 0.75 \rangle\}$ and $R_G(u_2') = \{\langle c, 0.5 \rangle, \langle a, 0.25 \rangle\}$. Similarly, $R_Q(v_1) = \{\langle b, 0.5 \rangle\}$ and $R_Q(v_2) = \{\langle a, 0.5 \rangle\}$.

In Figure 4, we have two possible embeddings $f_1$ and $f_2$. For $f_1$, $R_{f_1}(u_1) = \{\langle b, 0.5 \rangle\}$ and $R_{f_1}(u_2) = \{\langle a, 0.5 \rangle\}$. Hence, $C_N(f_1) = (0.5 - 0.5) + (0.5 - 0.5) = 0$. For $f_2$, we match $v_1$ to $u_1$ and $v_2$ to $u_2'$. We have $R_{f_2}(u_1) = \{\langle b, 0.25 \rangle\}$ and $R_{f_2}(u_2') = \{\langle a, 0.25 \rangle\}$. Therefore, $C_N(f_2) = (0.5 - 0.25) + (0.5 - 0.25) = 0.5$. Note that for the embedding $f_2$, node $u_3$ will not contribute any labels to $R_{f_2}$ since it does not participate in the matching. However, it is on the shortest path from $u_2'$ to $u_1$, thus propagating labels between $u_2'$ and $u_1$.

We must mention that the vectorization of the neighborhoods and the comparison among these vectors can be done in various ways. However, the final cost function must satisfy the basic property of $C$ (Property 1) to avoid false negatives for exact embeddings.
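The propagation model and the cost function can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the target graph is a reading of the worked example for Figure 4 (edges $u_1$-$u_2$, $u_1$-$u_3$, $u_3$-$u_2'$ with labels a, b, c, b), which is an assumption since the figure itself is not reproduced here. `neighborhood_vector` follows Eq. 1, or Eq. 2 when restricted to the nodes of an embedding, and `embedding_cost` follows Eq. 3 and Eq. 4.

```python
from collections import deque, defaultdict

def distances(adj, src, h):
    """BFS shortest-path distances from src, truncated at h hops."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        x = queue.popleft()
        if dist[x] == h:
            continue  # do not expand beyond h hops
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist

def neighborhood_vector(adj, labels, u, h, alpha, members=None):
    """Label strengths A(u, l) of Eq. 1; with `members`, only nodes of the
    embedding contribute labels (Eq. 2), but distances are still taken in G."""
    vec = defaultdict(float)
    for v, d in distances(adj, u, h).items():
        if d == 0 or (members is not None and v not in members):
            continue
        for l in labels[v]:
            vec[l] += alpha ** d
    return dict(vec)

def embedding_cost(q_adj, q_labels, g_adj, g_labels, f, h, alpha):
    """C_N(f): sum of positive differences M over all query nodes (Eq. 3, 4)."""
    members = set(f.values())
    total = 0.0
    for v, u in f.items():
        r_q = neighborhood_vector(q_adj, q_labels, v, h, alpha)
        r_f = neighborhood_vector(g_adj, g_labels, u, h, alpha, members)
        total += sum(max(a - r_f.get(l, 0.0), 0.0) for l, a in r_q.items())
    return total

# Assumed topology of the Figure 4 example; "u2p" stands for u2'.
g_adj = {"u1": ["u2", "u3"], "u2": ["u1"], "u3": ["u1", "u2p"], "u2p": ["u3"]}
g_labels = {"u1": {"a"}, "u2": {"b"}, "u3": {"c"}, "u2p": {"b"}}
q_adj = {"v1": ["v2"], "v2": ["v1"]}
q_labels = {"v1": {"a"}, "v2": {"b"}}

f1 = {"v1": "u1", "v2": "u2"}
f2 = {"v1": "u1", "v2": "u2p"}
print(neighborhood_vector(g_adj, g_labels, "u1", 2, 0.5))
print(embedding_cost(q_adj, q_labels, g_adj, g_labels, f1, 2, 0.5))
print(embedding_cost(q_adj, q_labels, g_adj, g_labels, f2, 2, 0.5))
```

Under these assumptions the sketch gives strength 0.75 for label b at $u_1$, cost 0 for $f_1$ and cost 0.5 for $f_2$, in line with the worked example. Note that the unmatched node $u_3$ contributes no labels under $f_2$ but still lies on the shortest path used for the distance.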
The following theorem shows that $C_N$ follows this property.

THEOREM 1. For an exact embedding $f_e$, $C_N(f_e) = 0$.

PROOF. For an exact embedding $f_e$, if $(v_1, v_2) \in E_Q$, then $(f_e(v_1), f_e(v_2)) \in E_G$. Thus, the shortest distance between the node pair $f_e(v_1), f_e(v_2)$ in $f_e$ cannot be higher than the shortest distance between the node pair $v_1, v_2$ in $Q$. Hence, it follows from Eq. 1 that $\forall l, \forall v$, $A_f(f_e(v), l) \geq A_Q(v, l)$. Therefore, based on Eq. 3 and Eq. 4, $C_N(f_e) = 0$.

Theorem 1 ensures that there are no false negatives for exact embeddings. However, there might be some false positives, as shown in Figure 5. In this example, if $h = 1$, $C_N(f) = 0$, although $f$ is not an exact embedding of $Q$. Fortunately, if we increase $h$ to 2, $C_N(f) > 0$. In real-life graphs that have low automorphism and more distinct labels in nodes, false positives can mostly be avoided, as shown in our experiments and in the following lemma.

LEMMA 1. Given a graph $G$ and a query graph $Q$, if each of their nodes has a distinct label, then for any inexact embedding $f$, $\forall h > 0, \alpha > 0$, $C_N(f) > 0$.

PROOF. Omitted.

Our definition of the neighborhood-based cost function is robust against structural differences and other forms of noise. As long as two close labels in a query graph are close enough in the target graph, we consider it as a potential match. We can also rank the embeddings based on the proximity of their labels in the target graph compared to that in the query graph. Thus, even if there exists no exact embedding of the query graph, the cost function can identify the closely approximate matches and rank them based on their structural differences. We formally define our problem statement as follows.

PROBLEM STATEMENT 2. [Neighborhood-Based Top-k Similarity Search] Given a target graph $G$ and a query graph $Q$, find the top-k embeddings with respect to the cost function $C_N$.

In the following discussion, we show that the above problem is NP-hard by reducing the clique problem to it.

LEMMA 2. Given a graph $G$ and a query graph $Q$ with $\forall u \in V_G, \forall v \in V_Q$, $|L(u)| = 1$, $|L(v)| = 1$, if $Q$ is a complete graph, then for all inexact embeddings $f$, $C_N(f) > 0$.

PROOF.
Since $\forall u \in V_G, \forall v \in V_Q$, $|L(u)| = 1$, $|L(v)| = 1$, for any inexact embedding $f$, each node $u = f(v)$ has only one label, which is the same as the label of node $v$ in $Q$. Since $Q$ is a complete graph, there exists at least one node $f(v)$ in $f$ and a label $l$ such that the number of 1-hop neighbors of $v$ in $Q$ that have label $l$ is more than the number of 1-hop neighbors of $f(v)$ in $f$ with label $l$. Hence, $A_Q(v, l) > A_f(f(v), l)$. Therefore, it follows from the definition of $C_N$ that $C_N(f) > 0$.

THEOREM 2. Neighborhood-Based Top-k Similarity Search is NP-hard.

PROOF. Let us consider the case where $|L(u)| = 1$, $|L(v)| = 1$, $\forall u \in V_G, \forall v \in V_Q$, and $Q$ is a complete graph. Suppose the top-1 match $f$ can be identified in polynomial time. Given $f$, it can also be verified in polynomial time whether $C_N(f) = 0$. Now, if $C_N(f) = 0$, by Lemma 2, there exists a clique of size $|V_Q|$ in the target graph $G$. So, it is possible to solve the clique problem in polynomial time. However, we know that the clique decision problem is NP-hard [0]; therefore we have a contradiction. Hence, the similarity search problem is NP-hard.

The graph isomorphism problem is neither known to be solvable in polynomial time nor NP-complete. However, given two graphs $Q$ and $G$ of the same size, it is possible to determine in polynomial time if $G$ itself is an embedding of $Q$ with cost $C_N(f) = 0$. We call this problem the Graph Similarity Match problem. Thus, we suspect that neighborhood-based similarity search might have lower time complexity than graph theoretic measures such as graph isomorphism and edit distance.

THEOREM 3. Graph Similarity Match is polynomial in $n$, where $n = |V_Q|$.

PROOF. Since $G$ itself is an embedding $f$ of $Q$, we can determine the individual node matching costs $C_N(v, u)$ in polynomial time, for all $v \in V_Q$, $u \in V_G$. Next, we construct a flow network and determine the minimum cost of the maximum flow in that network (see Figure 6). From the source node $s$, add a directed edge to each node $v$ in $Q$. The capacity of each of these edges is 1 and the cost is 0. Similarly, from each node $u$ in $G$, add a directed edge to the sink node $t$. The capacity and cost of each of these edges are 1 and 0 respectively. From each node $v$ in $Q$, add a directed edge to each node $u$ in $G$ if $L(v) \subseteq L(u)$. The capacity and cost of this edge are 1 and $C_N(v, u)$ respectively. Due to the capacity constraints, each node in $Q$ can be matched with at most one node in $G$, and only one node of $Q$ can be matched with the same node in $G$. Clearly, if the maximum flow in this network is $n$ and the minimum cost of the maximum flow is 0, then $G$ is an embedding of $Q$ with cost $C_N(f) = 0$. This flow problem can be solved using the Ford and Fulkerson algorithm [] in $O(n^3)$ time. Therefore, given two graphs $Q$ and $G$ of the same size, it is possible to determine in polynomial time if $G$ itself is an embedding $f$ of $Q$ with cost $C_N(f) = 0$.

Figure 6: Flow Network to Solve Graph Similarity Match

3.3 Propagation Factor: α

In the information propagation model described in Eq. 1, the propagation factor, $\alpha$, should be less than 1 in order to reflect the relation that the strength $A(u, l)$ of label $l$ at node $u$ decreases with the increase of distance. However, we find the top-k embeddings by repeatedly matching the individual nodes from $G$ and $Q$ that satisfy a cost threshold $\epsilon$ (the detailed procedure will be discussed in the next section). Now, if $\alpha$ is large, each node will propagate a high fraction of labels to its neighbors, and this can increase the number of false positives at the initial node matching stage, thus slowing down the overall search process. In Figure 7, for $\alpha = 0.5$ and $h = 2$, we get $R_G(u) = \{\langle a, 0.5 \rangle\}$ and $R_Q(v) = \{\langle a, 0.5 \rangle\}$. Thus, node $u \in G$ will be reported as a match of node $v \in Q$ even for a cost threshold $\epsilon = 0$. Clearly, this is a false positive.

Figure 7: High α, False Positive ($A_G(u, a) = 0.5$ for $u \in G$; $A_Q(v, a) = 0.5$ for $v \in Q$)

To solve this problem, we do not employ a uniform propagation factor for different labels. Instead, for each label $l$, we select an optimum $\alpha(l)$. For a given label $l$, let us assume that the maximum number of one-hop neighbors with label $l$ of any node in $G$ is $n(l)$. To consider the worst case, let us assume that some node $u$ in $G$ has no one-hop neighbor with label $l$, but it has $n^2(l)$ two-hop neighbors with label $l$, $n^3(l)$ three-hop neighbors with label $l$ and so on. Therefore, the strength of label $l$ at node $u$ in $G$ will be as follows (the geometric-series bound assumes $n(l)\alpha(l) < 1$).

$$A_G(u, l) = \sum_{i=2}^{h} n^{i}(l)\,\alpha^{i}(l) < \frac{n^{2}(l)\,\alpha^{2}(l)}{1 - n(l)\,\alpha(l)} \qquad (5)$$

To avoid a false positive, we want $A_G(u, l) < A_Q(v, l) = \alpha(l)$, as shown in Figure 7. Hence,

$$\alpha(l) < \frac{1}{n(l) + n^{2}(l)}.$$

In the next section, we will introduce an iterative method to find the top-k embeddings in a large graph.

4. SEARCH ALGORITHM

In this section, we introduce a scalable iterative approach to find the top-k graph embeddings. Our goal is not to enumerate all the possible embeddings $f$ in $G$ for a given query graph, whose cost is prohibitive. Instead of enumerating $f$, we directly use $A_G(u, l)$ to bound $A_f(u, l)$, since $A_G(u, l) \geq A_f(u, l)$.

LEMMA 3. Given a query graph $Q$ and its embedding $f$ in $G$, $\forall l, \forall u \in V_f$, $A_G(u, l) \geq A_f(u, l)$.

PROOF. Omitted.

Lemma 3 shows that $A_G(u, l)$ in the neighborhood vector $R_G(u)$ cannot be lower than $A_f(u, l)$ of the same label $l$ in the neighborhood vector $R_f(u)$, where $f$ is a subgraph of $G$.

THEOREM 4. Given a query graph $Q$ and its embedding $f$ in $G$,

$$\sum_{v \in V_Q} \sum_{l \in R_Q(v)} M(A_Q(v, l), A_G(f(v), l)) \leq C_N(f).$$

PROOF. It follows from Lemma 3 that $M(A_Q(v, l), A_f(u, l)) \geq M(A_Q(v, l), A_G(u, l))$.

Theorem 4 shows that without enumerating embeddings of $Q$ in the target graph $G$, we can derive the lower bound $M(A_Q(v, l), A_G(u, l))$, where $u$ is a possible match of $v$ in $G$.

Figure 8: Node Matching Example

Our algorithm works by iteratively pruning unpromising nodes in the target graph.

1. Match the individual nodes of the query graph with the nodes in the target graph that satisfy a predefined cost threshold $\epsilon$ (see Eq. 7).
2. Discard the labels of the unmatched nodes in the target graph.
3. Propagate the labels only among the matched nodes from the previous step. Recompute the neighborhood vectors $R_G(u)$ only for the matched nodes.

Repeat Step 1 until convergence.
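The three pruning steps can be sketched as a loop that stops once the candidate lists no longer change. This is a minimal illustration under simplifying assumptions: the names `vector`, `node_cost` and `iterative_unlabel` are hypothetical, every query node is assumed to carry at least one label, and vectors are recomputed for all nodes in each round rather than updated incrementally. The toy graph is invented for the example.

```python
from collections import deque

def vector(adj, labels, u, h, alpha):
    """Neighborhood vector of u: label -> sum of alpha**distance within h hops."""
    dist, queue, vec = {u: 0}, deque([u]), {}
    while queue:
        x = queue.popleft()
        if dist[x] == h:
            continue
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
                for l in labels[y]:
                    vec[l] = vec.get(l, 0.0) + alpha ** dist[y]
    return vec

def node_cost(r_q, r_g):
    """Sum of positive differences between query and target strengths."""
    return sum(max(a - r_g.get(l, 0.0), 0.0) for l, a in r_q.items())

def iterative_unlabel(g_adj, g_labels, q_adj, q_labels, h, alpha, eps):
    labels = {u: set(ls) for u, ls in g_labels.items()}  # working copy
    r_q = {v: vector(q_adj, q_labels, v, h, alpha) for v in q_adj}
    prev = None
    while True:
        # Step 1: match query nodes against currently labeled target nodes.
        lists = {
            v: {u for u in g_adj
                if q_labels[v] <= labels[u]
                and node_cost(r_q[v], vector(g_adj, labels, u, h, alpha)) <= eps}
            for v in q_adj
        }
        if lists == prev:  # no further node became unmatched
            return lists
        # Step 2: discard the labels of nodes that match no query node.
        matched = set().union(*lists.values())
        for u in g_adj:
            if u not in matched:
                labels[u] = set()
        prev = lists  # Step 3 happens when vectors are recomputed next round

# Hypothetical data: query path a - b - c; target has one exact copy and a decoy.
q_adj = {"va": ["vb"], "vb": ["va", "vc"], "vc": ["vb"]}
q_labels = {"va": {"a"}, "vb": {"b"}, "vc": {"c"}}
g_adj = {"a1": ["b1"], "b1": ["a1", "c1"], "c1": ["b1"],
         "a2": ["b2", "c2"], "b2": ["a2"], "c2": ["a2"]}
g_labels = {"a1": {"a"}, "b1": {"b"}, "c1": {"c"},
            "a2": {"a"}, "b2": {"b"}, "c2": {"c"}}

result = iterative_unlabel(g_adj, g_labels, q_adj, q_labels, h=2, alpha=0.5, eps=0.0)
print(result)
```

In this toy graph the decoy node `a2` passes the first round because its neighbors carry labels b and c, but it is pruned in the second round once `b2` and `c2` have been unlabeled, which is the effect the iteration relies on.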

During each iteration, we remove the labels of the unmatched nodes in the target graph $G$ and then recompute the neighborhood vectors only for the matched nodes. Since the modified target graph has more unlabeled nodes compared to the previous iteration, $A_G(u, l)$ will decrease. With this new and reduced set of neighborhood vectors and using the same cost threshold $\epsilon$, we determine the individual node matches with the nodes of the query graph. Therefore, some additional nodes in $G$ will be unmatched at each iteration. The iteration continues until no additional unmatched nodes are found. For real-life graphs, with less automorphism and more distinct labels, we can unlabel most of the unpromising nodes using this technique. Thus, finding the top-k embeddings from the set of remaining matched nodes of $G$ becomes almost trivial.

To determine the runtime complexity of our iterative search algorithm, let us denote the number of promising nodes present before the $i$-th iteration as $n_i$ and the number of unpromising nodes discovered at the $i$-th iteration as $k_i$, where $i \geq 1$. Clearly, $n_1 = n$ and $n_{i+1} = n_i - k_i$. If there are $r$ iterations in total, $\sum_{i=1}^{r} k_i = O(n)$. Let the complexity of iteration $i$ be $T_i$. In the first iteration, each node needs to propagate its labels up to $h$ hops. Thus, $T_1 = O(n l d^h)$, where $l$ is the average number of labels and $d^h$ is the average number of h-hop neighbors for each node in $G$. However, for each of the subsequent iterations, it is not necessary to perform such propagation for all the nodes in the graph. Rather, the number of unpromising nodes at iteration $i + 1$, for $i \geq 1$, can be determined by either propagating the remaining $n_{i+1}$ nodes' labels, or by subtracting the effect of the $k_i$ unpromising nodes from the previous iteration. Hence, $T_{i+1} = O(\min\{n_{i+1}, k_i\}\, l d^h)$, for $i \geq 1$. Therefore, the overall runtime complexity of our search algorithm is given as follows.

$$T_1 + \sum_{i=2}^{r} T_i = O(n l d^h) + \sum_{i=1}^{r-1} O(\min\{n_{i+1}, k_i\}\, l d^h) = O(n l d^h) + \sum_{i=1}^{r-1} O(k_i\, l d^h) = O(n l d^h) \qquad (6)$$

In practice, it converges much faster.
Next, we shall discuss the details of the iterative algorithm and the algorithm to find the top-k embeddings from the nodes filtered by the iterative algorithm.

4.1 Node Match

Given the target graph $G$ and the query graph $Q$, we compute the vectors $R_G(u)$ and $R_Q(v)$ for all nodes $u \in V_G$, $v \in V_Q$, considering their h-hop neighborhoods. For each node pair $u \in V_G$, $v \in V_Q$, s.t. $L(v) \subseteq L(u)$, we calculate the node matching cost, $cost(u, v)$, as the difference of their neighborhood vectors.

$$cost(u, v) = \sum_{l \in R_Q(v)} M(A_Q(v, l), A_G(u, l)) \qquad (7)$$

Figure 8 shows an example. Assume $\alpha = 0.5$ and $h = 2$. We get $R_G(u) = \{\langle b, 0.5 \rangle, \langle c, 0.5 \rangle\}$, and similarly, $R_G(u') = \{\langle b, 1 \rangle, \langle c, 0.25 \rangle\}$. Meanwhile, for the query graph $Q$, we have $R_Q(v) = \{\langle b, 0.5 \rangle, \langle c, 0.25 \rangle\}$. Hence, $cost(u, v) = 0$ and also $cost(u', v) = 0$ following the above equation.

Now, for each node $v \in V_Q$, we maintain a list of nodes $u \in V_G$, such that $L(v) \subseteq L(u)$ and $cost(u, v) \leq \epsilon$. Here, $\epsilon$ is a predefined cost threshold. The value of $\epsilon$ will be discussed shortly.

4.2 Top-k Search

In order to find the top-k graph embeddings, we initialize the cost threshold $\epsilon$ to a small value $\epsilon_0 > 0$ and perform the above mentioned iterative procedure until it terminates. Given the matched nodes, if we cannot find at least $k$ embeddings from them, with cost $C_N(f) \leq \epsilon |V_Q|$ each, then the threshold cost $\epsilon$ is doubled and we repeat the above procedure until the $k$ embeddings are found. Otherwise, we find the top-k embeddings among the matched nodes. Note that, at this point, any embedding formed by all unmatched nodes will have cost $C_N(f) > \epsilon |V_Q|$. However, it is possible to have some embedding with a few matched and unmatched nodes, and the cost of such embeddings might also be $C_N(f) \leq \epsilon |V_Q|$. The problem is eliminated as follows. We set $\epsilon$ equal to the highest cost of the discovered top-k embeddings and then run the algorithm again (this step will find top-k embeddings whose node cost might be higher than $\epsilon$). In this case, any embedding formed by at least one of the unmatched nodes will have cost more than that of any of the top-k embeddings found earlier.
Hence, the top-k embeddings identified only using the matched nodes will be the best top-k embeddings. The complete algorithm is given below.

Algorithm 1 Top-k Search
Input: Target graph $G$, query graph $Q$, positive integer $k$.
Output: Top-k matches $f$ based on the cost metric $C_N$.
procedure
1: $\epsilon \leftarrow \epsilon_0$, compute $R_Q(v)$, $\forall v \in V_Q$
2: $list_0(v) = \{u : u \in V_G \wedge L(v) \subseteq L(u)\}$
3: $i \leftarrow 1$, start with original graph $G$ and compute $R_G(u)$, $\forall u \in V_G$
4: for all $v \in V_Q$ do
5:   $list_i(v) = \{u : u \in V_G \wedge L(v) \subseteq L(u) \wedge cost(u, v) \leq \epsilon\}$
6: end for
7: $(list, i)$ = Iterative Unlabel$(list, i, G, Q)$
8: if $k$ matches of cost $C_N(f) \leq \epsilon |V_Q|$ can be found in $\{u : u \in list_i(v) \wedge v \in V_Q\}$ then
9:   report top-k matches and stop
10: else
11:   $\epsilon \leftarrow 2\epsilon$
12:   go back to step 2
13: end if

Algorithm 2 Iterative Unlabel $(list, i, G, Q)$
procedure
1: if $|list_i(v)| < |list_{i-1}(v)|$ for some $v \in V_Q$ then
2:   for all $u \in V_G$ do
3:     if $u \notin list_i(v)$ $\forall v \in V_Q$ then
4:       unlabel $u$
5:     end if
6:   end for
7:   recompute $R(u)$ $\forall u \in V_G$
8:   $(list, i)$ = Iterative Unlabel$(list, i + 1, G, Q)$
9: else
10:   return $(list, i)$
11: end if

From the final list of matched nodes for each node in $V_Q$, how can we find embeddings with cost $C_N(f) \leq \epsilon |V_Q|$ each (line 8 of Algorithm 1)? One simple technique is to consider all possible combinations from the lists and verify their costs. When the number of matched nodes in each of the final lists is small, it is not time

consuming to check. However, when the lists are long, we can do better than brute force enumeration using dynamic programming. After the final list of matched nodes $list(v)$ for each $v \in V_Q$ is generated, we perform the propagation once more among the matched nodes; however, this time we propagate the node ids instead of labels. After this propagation, each matched node $u$ in $G$ will have its neighboring nodes (denoted as $neighbor(u)$) within $h$ hops that have influence on the cost (Eq. 1). The final embeddings can be formed as follows. We select a node $u \in list(v)$ for some $v \in V_Q$ and initialize a set $Possible\_match = neighbor(u)$. We have two situations: (1) within $h$ hops of $u$, there is no $f(v')$, $\forall v' \neq v$ in $Q$; (2) $\exists v' \neq v$ of $Q$, and we try to identify a match $u'$ inside $Possible\_match$ and extend this set by adding $neighbor(u')$ and also eliminating the node $u'$ from $Possible\_match$. For the first situation, we could derive the cost for node $u$, $\sum_{l \in L(v)} A_Q(v, l)$. We can recurse among these two situations to find the embeddings. In this way, we can find the low-cost embeddings without enumerating all possible combinations among the nodes in the final lists.

5. INDEXING

The most expensive parts of Ness are the computation of $R_G(u)$ for all $u$ in $G$ (Line 3 of Algorithm 1) and the determination of $list_1(v)$ for all $v$ in $V_Q$ (Line 5 of Algorithm 1). However, the computation of $R_G(u)$ can be done off-line by performing breadth first search up to $h$ hops from each node in $G$. Its time complexity is $O(|V_G| d^h)$, where $d$ is the average degree of each node.

To speed up the computation of $list_1(v)$ for all $v \in V_Q$, we use two types of simple index structures. In the first type of indexing, we build a hash table corresponding to each label. The nodes in $G$ are hashed based on their labels. Given a query node $v$, we use this hash structure to quickly identify the set of possible matches $u$, such that $L(v) \subseteq L(u)$. If the labels of $v$ are very selective, there will be a limited number of possible matches $u$ and we can quickly determine the nodes $u$ among these matches for which $cost(u, v) \leq \epsilon$.
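The label hash structure can be sketched as an inverted index from each label to the set of nodes carrying it. This is one plausible reading of the description, not necessarily how Ness stores it; the names `build_label_index` and `candidates` and the sample labels are hypothetical.

```python
from collections import defaultdict

def build_label_index(labels):
    """Map each label to the set of target nodes that carry it."""
    index = defaultdict(set)
    for node, node_labels in labels.items():
        for l in node_labels:
            index[l].add(node)
    return index

def candidates(index, query_labels):
    """Nodes u with L(v) a subset of L(u): intersect the per-label node sets,
    starting from the smallest one so selective labels prune early."""
    postings = sorted((index.get(l, set()) for l in query_labels), key=len)
    if not postings:
        return set()
    result = set(postings[0])
    for p in postings[1:]:
        result &= p
    return result

# Hypothetical target labels.
labels = {"u1": {"a", "b"}, "u2": {"a"}, "u3": {"b", "c"}}
index = build_label_index(labels)
print(candidates(index, {"a", "b"}))
print(candidates(index, {"a"}))
```

Only the nodes returned by `candidates` need their neighborhood vectors compared against the query node, which is why selective labels make this first index sufficient on its own.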
However, if the labels of v are not very selective and there are many possible matches using the hashing technique discussed above, we use the second index structure, which is built on the neighborhood vector $R_G(u)$ following the principle of the Threshold Algorithm [2]. The neighborhood vector $R_G(u) = \{\langle l, A_G(u, l) \rangle\}$ for each node $u \in V_G$ is precomputed. Next, for each label l, we generate a sorted list S(l) of nodes u in descending order of their $A_G(u, l)$ values. Let us denote the node at position i from the top of S(l) as $u_i(l)$. In the online phase, we start from the top of each sorted list S(l) in parallel and go to the next position in the subsequent iteration. For some position i from the top, we compute

$sum(i) = \sum_{l \in R_Q(v)} M[A_Q(v, l), A_G(u_i(l), l)]$.

Assume that at iteration $i = i'$, $sum(i')$ becomes greater than the cost threshold $\epsilon$. Then, we terminate this iterative procedure and verify, for all nodes $u_j(l)$ with $j < i'$ and $l \in R_Q(v)$, whether $cost(u_j(l), v) \le \epsilon$. For each $v \in V_Q$, we need to verify only $O((i' - 1) \cdot |l|)$ nodes for their cost, where $|l|$ denotes the number of labels in $R_Q(v)$. This can reduce the complexity of the online algorithm significantly. The complete procedure for neighborhood based indexing is given in Algorithm 3.

Algorithm 3 Neighborhood Based Indexing
Off-line Procedure
1: precompute $R_G(u) = \{\langle l, A_G(u, l) \rangle\}$ for all $u \in V_G$
2: for all label l do
3:   create a sorted list S(l) of nodes in descending order of $A_G(u, l)$, such that $u_i(l)$ is the i-th node in S(l)
4: end for
On-line Procedure
1: $i \leftarrow 1$
2: $sum(i) \leftarrow \sum_{l \in R_Q(v)} M(A_Q(v, l), A_G(u_i(l), l))$
3: if $sum(i) \le \epsilon$ then
4:   $i \leftarrow i + 1$
5:   go to step 2
6: else
7:   verify whether $cost(u_j(l), v) \le \epsilon$, $\forall j < i$, $\forall l \in R_Q(v)$
8: end if

Proof of Correctness. Let us denote by $S_i(l)$ all the nodes up to position i from the top of the sorted list S(l), i.e., $S_i(l) = \{u_j(l), j \le i\}$. The following lemma will be useful to prove the correctness of our indexing algorithm.

LEMMA 4. If $sum(i) > \epsilon$, then for all $u \notin \bigcup \{S_{i-1}(l) : l \in R_Q(v)\}$, $cost(u, v) > \epsilon$.

PROOF. It follows directly from the fact that each S(l) is a sorted list of nodes u in descending order of $A_G(u, l)$ values.
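The on-line scan over the sorted lists can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: the function names are hypothetical, strengths are assumed to be non-negative, a node missing from a list is treated as having strength 0 for that label, and M is assumed to be the positive difference (so it never increases as the target strength grows, which is the property the lemma relies on).

```python
def positive_difference(x, y):
    # Assumed form of M: only the shortfall of the target value is charged.
    return x - y if x > y else 0.0


def node_cost(query_vec, target_vec):
    """cost(u, v) summed over the labels of the query vector R_Q(v)."""
    return sum(
        positive_difference(a_q, target_vec.get(label, 0.0))
        for label, a_q in query_vec.items()
    )


def build_sorted_lists(target_vecs):
    """Off-line step: label -> [(A_G(u, l), u), ...] in descending order."""
    by_label = {}
    for u, vec in target_vecs.items():
        for label, strength in vec.items():
            by_label.setdefault(label, []).append((strength, u))
    return {l: sorted(pairs, key=lambda p: -p[0]) for l, pairs in by_label.items()}


def threshold_candidates(query_vec, sorted_lists, target_vecs, eps):
    """On-line step: scan all lists in parallel, stop once the bound exceeds eps."""
    lists = {l: sorted_lists.get(l, []) for l in query_vec}
    longest = max((len(s) for s in lists.values()), default=0)
    seen = set()
    for i in range(longest):
        # Lower bound on the cost of every node not yet seen in any list.
        bound = sum(
            positive_difference(a_q, lists[l][i][0] if i < len(lists[l]) else 0.0)
            for l, a_q in query_vec.items()
        )
        if bound > eps:
            break
        seen.update(lists[l][i][1] for l in lists if i < len(lists[l]))
    else:
        # All lists exhausted: unlisted nodes have strength 0 for every label.
        if sum(positive_difference(a_q, 0.0) for a_q in query_vec.values()) <= eps:
            seen = set(target_vecs)
    # Only the nodes met before the stopping position need an exact check.
    return {u for u in seen if node_cost(query_vec, target_vecs[u]) <= eps}
```

The early `break` is where the pruning happens: every node that has not been met in any list sits at or below the current position in all of them, so its cost is at least `bound`.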
Therefore, in Algorithm 3, we start from $i = 1$ and find the smallest i for which $sum(i) > \epsilon$. Following the previous lemma, any node $u \notin \bigcup \{S_{i-1}(l) : l \in R_Q(v)\}$ can be eliminated without actually computing $cost(u, v)$. We note that our indexing can be easily implemented in a disk-based manner for very large graphs. Also, we can apply external memory breadth first search algorithms, e.g., Ulrich Meyer [1] and Lars Arge [2], to compute the neighborhood vectors $R_G(u)$ for all the nodes.

Dynamic Update. Our indexing structure can efficiently accommodate dynamic updates in G, i.e., insertion/deletion of nodes, edges and labels. If a node u is added or deleted in G, it will only change the vectors of u's h-hop neighbors. We only need to propagate the labels of these nodes and modify their neighborhood vectors. They also need to be updated in the sorted lists of label l for all $l \in L(u)$. The addition/deletion of a label can be handled similarly. If an edge $(u_1, u_2)$ is added/deleted in G, we need to update the vectors for the h-hop neighbors of both $u_1$ and $u_2$.

6. QUERY OPTIMIZATION

In this section, we eliminate the non-discriminative labels from both the target and query graphs at the initial stage of our matching algorithm to make the technique more efficient. The efficiency of the algorithm Iterative Unlabel is related to the number of individual node matches for each node in the query graph. If there exists some node which is not very selective in terms of its own labels or the labels present in its neighborhood, there will be many matches corresponding to that node at the initial stage of our algorithm. In order to eliminate the problem posed by these nodes, we first eliminate all the non-discriminative labels from both the target graph and the query graph, and then we also ignore the nodes in the query graph that do not contain a sufficient number of discriminative labels in themselves and in their neighborhoods. These non-discriminative labels are considered at the last stage of our matching algorithm, i.e., when we search for the final matches.
In the following discussion, we shall clarify the notion of discriminative and non-discriminative labels from the perspective of node and graph matches.

[Figure 9: Discriminative (Heavy-Head) vs. Non-Discriminative (Heavy-Tail) Distribution. Panels: (a) heavy-head, (b) heavy-tail. X-axis: $A_G(u, l)$; Y-axis: # of nodes.]
[Figure 10: Top-2 Matches (Query 1). Panels: (a) Query, (b) Match_1, (c) Match_2.]
[Figure 11: Top-2 Matches (Query 2). Panels: (a) Query, (b) Match_1, (c) Match_2.]

Let us consider the distribution of $A_G(u, l)$ values of some label l, $\langle l, A_G(u, l) \rangle \in R_G(u)$, for different nodes $u \in V_G$. Figure 9 shows one example. For label l, we plot the different $A_G(u, l)$ values along the X-axis. The Y-axis shows the number of nodes u having that particular $A_G(u, l)$ value in their neighborhood vector $R_G(u)$. The distribution in Figure 9(a) is skewed towards the smaller values of $A_G(u, l)$, whereas Figure 9(b) is skewed towards the higher values of $A_G(u, l)$. We call them heavy-head and heavy-tail distributions respectively. Given a query node v, since we prune all the nodes u in G for which $\sum_{l \in R_Q(v)} M[A_Q(v, l), A_G(u, l)] > \epsilon$, the labels with a heavy-head distribution have more pruning power than those with a heavy-tail distribution. Therefore, we should retain labels with a heavy-head distribution for node match, as those labels are more discriminative.

7. EXPERIMENTAL RESULTS

In this section, we present the experimental results to demonstrate the effectiveness and the efficiency of the neighborhood based similarity search technique on a number of real-life and synthetic graph datasets including DBLP, Intrusion, Freebase and WebGraph. In order to evaluate the effectiveness, we show two possible applications: RDF query answering and network alignment. We test the robustness of our approach by providing the accuracy of the best matches for queries of different sizes and under the presence of random noise.
The efficiency and scalability of our approach are also investigated. All experiments are performed using a single core in a 40GB, 2.50GHz Xeon server.

7.1 Graph Data Sets

DBLP Collaboration Graph. The DBLP collaboration graph is downloaded from ley /db. There are 684K distinct authors and 7M co-author edges among them. We consider the name of each author as the label of that node. There are 683,927 distinct labels in DBLP. We use the DBLP dataset for the efficiency test.

Freebase Entity Relationship Graph. Freebase is a large collaborative knowledge base of structured data harvested from many sources including Wikipedia. We downloaded the film entity relationship graph data from /. This graph has 72K nodes, each representing an entity, i.e., actor, movie, director, producer and so on. An edge represents the relationship between two entities. Names of entities are treated as labels. There are in total 579K edges and 59, 54 distinct labels in this graph. The Freebase graph is used for effectiveness, robustness and efficiency analysis.

Intrusion Alert Network. This network contains the anonymous log data of intrusion alerts in a computer network. It has 200K nodes and 703K edges, where each node is a computer and an edge means a possible attack such as Denial-of-Service and TCP Service Sweep. Each node has 25 labels (computer generated alerts in this case) on average. There are around 1,000 types of alerts. We use this graph for robustness and efficiency experiments.

WebGraph with Synthetic Labels. We downloaded the uk web graph data from [4]. This web graph is a collection of UK web pages. For our experiments, we use a subset that contains 10M pages (i.e., nodes) and 23M hyperlinks (i.e., edges). We uniformly assign 10,000 synthetically generated labels across various nodes, such that each node gets one label. We test the scalability of our approach on this graph.

7.2 RDF Query Answering

In addition to the query shown in Figure 1, we show two more examples using the Freebase graph dataset.

Query 1: Who did cinematography for at least two Sheila McCarthy movies, one of them being Andre? The person was also the cinematographer of the movie Magic in the Water.
Here, we would like to emphasize that Sheila McCarthy did not act in the movie Andre. However, as discussed earlier, this type of inaccuracy is common, since the user may not have the accurate information, or there can be some noise in the target graph. Using our approach, we get the following top-2 answers for this query, as shown in Figure 10.

Query 2: Which actors have appeared in both a "John Waters" movie and a "Steven Spielberg" movie?

The query and the corresponding top-2 matches are shown in Figure 11. Here, we would like to emphasize that actors in the Freebase dataset are not directly connected with the directors and cinematographers, but rather via some movies. To write a SPARQL query, we need to maintain this structural property. However, given the query graph as shown in Figure 11, which does not maintain this structural property, we still obtain results where the embeddings are very close to the query graph.

[Figure 12: Robustness of Network Alignment. Panels: (a) Accuracy (Intrusion), (b) Error Ratio (Freebase), (c) Error Ratio (Intrusion).]
[Figure 13: Convergence of Online Search Algorithm (DBLP). Panels: (a) Top-k Search (Algorithm 1), avg # of iterations; (b) Iterative Unlabel (Algorithm 2), avg # of iterations; (c) Online Search Time, search time (sec).]

7.3 Network Alignment

We perform network alignment for query graphs of different sizes and in the presence of various amounts of noise. For these experiments, three different sets of query graphs are used, with diameters 2, 3, 4 and the number of nodes 100, 150, 200 respectively. These query sets simulate the situation when we align a small social network to a large one. In each query set, we randomly select 100 subgraphs with the specified diameters and nodes from the original graph datasets. Then we introduce noise by adding edges to the query graphs which are not present in the original graph. The noise ratio is defined as the number of edges added divided by the original number of edges present in the query graph. We use propagation depth 2, and α is selected as described earlier in Section 3.3. The robustness of our approach in the presence of random noise is measured using two metrics.
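The noise model of this experiment can be illustrated with a short Python sketch. It is only one way to simulate the setup: the function name, the uniform sampling of node pairs, the rounding of the number of noise edges and the seed parameter are all assumptions, and the graphs are taken to be simple and undirected.

```python
import random


def add_noise_edges(nodes, query_edges, original_edges, noise_ratio, seed=0):
    """Add round(noise_ratio * |E_Q|) edges that are absent from the original graph.

    Returns (noisy_edges, added_edges) as sets of frozenset({u, v}) pairs.
    """
    rng = random.Random(seed)
    nodes = list(nodes)
    node_set = set(nodes)
    query = {frozenset(e) for e in query_edges}
    if not query:
        raise ValueError("query graph has no edges")
    # Node pairs that must not be chosen as noise: anything already an edge.
    forbidden = {frozenset(e) for e in original_edges if set(e) <= node_set} | query
    n_new = round(noise_ratio * len(query))
    free_pairs = len(nodes) * (len(nodes) - 1) // 2 - len(forbidden)
    if n_new > free_pairs:
        raise ValueError("not enough absent node pairs for this noise ratio")
    added = set()
    while len(added) < n_new:
        pair = frozenset(rng.sample(nodes, 2))  # two distinct query nodes
        if pair not in forbidden:
            added.add(pair)
    return query | added, added
```

With this convention, `len(added) / len(query)` recovers the noise ratio up to rounding.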
The accuracy is defined as the number of correctly identified nodes of the target graph in all the top-1 matches divided by the total number of nodes in all query graphs in the corresponding query set. The accuracy is 1 for both the DBLP and Freebase datasets with different amounts of noise, since these graphs have a larger number of distinct labels. The accuracy vs. noise ratio plot for the Intrusion dataset is shown in Figure 12(a). The accuracy remains at a relatively high level when the noise ratio increases up to 0.2. We also measure the error ratio, which is defined as the number of incorrectly identified nodes of the target graph in all the top-1 matches divided by the total number of nodes in all query graphs in the corresponding query set. The lower the error ratio is, the more distinguishable the nodes are in terms of their neighborhood structure and contents. The error ratio remains close to 0 for the DBLP graph at different amounts of noise. The error ratio vs. noise ratio plots for Freebase and Intrusion are shown in Figures 12(b) and 12(c) respectively. It can be observed that the error ratio remains at a relatively low level for the Freebase graph when the noise ratio increases up to 0.2. Hence, these experiments indicate that DBLP and Freebase are less automorphic compared to the Intrusion network.

7.4 Efficiency Results

We provide the running time of our algorithm for different datasets in Table 1. For these experiments, we randomly select query graphs with 50 nodes and diameter 2 from the original graph datasets. The vectorization and indexing are performed with propagation depth 2, and the search algorithm is used to identify the top-1 matches. It can be observed that our algorithm is very efficient for large graph datasets. The on-line phase for the Intrusion graph requires more time because the average number of labels per node is much higher than that in other graphs. This leads to more time being used for cost computation (Eq. (7)). We also verify the convergence rate of our Top-k Search and Iterative Unlabel algorithms for the various network alignment experiments discussed earlier.
The convergence rate of these algorithms is measured as the average number of iterations required before they terminate. When the noise ratio is increased, our algorithm requires more iterations to satisfy the cost threshold. Thus, the corresponding running time also increases, as shown in Figure 13 for the DBLP dataset. Moreover, it requires more time to identify the matches of a larger query graph. The convergence plots for the Freebase and Intrusion networks are given in Figure 14.

[Figure 14: Convergence of Online Search Algorithm (Freebase & Intrusion). Panels: (a) Convergence (Freebase), avg # of iterations; (b) Search Time (Freebase), search time (sec); (c) Convergence (Intrusion), avg # of iterations; (d) Search Time (Intrusion), search time (sec).]

Dataset                      2-hop Indexing (Off-line)   Top-1 Search (Online)
DBLP (0.7M, 7M, 0.7M)        1,733 sec                   0.06 sec
Freebase (0.2M, 0.6M, 0.2M)  280 sec                     0.22 sec
Intrusion (0.2M, 0.5M, 1K)   227 sec                     1.6 sec
WebGraph (10M, 23M, 10K)     5, 25 sec                   0.26 sec
Table 1: Efficiency: Off-line Indexing and Online Search

7.5 Neighborhood-based Cost Function Properties

Recall that we proved in Theorem 1 that our neighborhood-based cost function ensures there are no false negatives when the cost threshold is set to 0. In this subsection, we investigate the false positive rate of our neighborhood-based cost function with the threshold set to 0. This experiment is performed on the DBLP, Freebase and Intrusion datasets. In particular, for each dataset, we select 100 small query subgraphs with 10 nodes each from the original graph. For each of the query graphs, by using 2-hop propagation, we identify all matches with cost = 0. Among these matches, we manually verify whether there are any false positives, i.e., matches that are not graph isomorphic with the query graph. The percentage of false positives is calculated as the number of false positives divided by the total number of matches obtained. We show the results in Table 2. It can be seen that, using our cost function with the cost threshold set to 0, the percentage of false positives on real-life social/information networks is very small.

Dataset     False Positive
DBLP        0%
Freebase    0%
Intrusion   0.3%
Table 2: False Positive Ratio

Dataset     Search with Index & Optimization   Search w/o Index & Optimization
DBLP        0.06 sec                           9.63 sec
Freebase    0.22 sec                           1.75 sec
Table 3: Benefits of Index and Optimization

As we have discussed earlier, the higher the value of h is, the lower the number of false positives will be.
Therefore, for a target graph, we can employ the error ratio as a cost function and learn a satisfactory value of h from training queries generated from the target graph. The DBLP graph is used in this experiment. We use a training set of 100 small query graphs (with 10 nodes each) generated from the DBLP graph. The queries are generated in such a way that the labels in the query nodes are mostly not unique. Some noise is also added to these query graphs, as explained earlier. Next, we start with h = 0 and gradually increase h until the error ratio becomes less than a small value. We show the results for the DBLP graph in Figure 15. It can be observed that, by setting h = 2, we can reduce the error ratio to an acceptable level when the noise ratio is below 0.1. This indicates that for real-life social/information networks with few automorphisms and many distinct labels, we only need a small propagation depth to make the error ratio close to zero.

7.6 Pruning Capacity of Search Algorithm

We verify the pruning capacity of our Top-k search algorithm with respect to the number of distinct labels present in the target graph. For this experiment, we use a subgraph extracted from the WebGraph dataset, which contains 1,000 nodes and 4,067 edges. We vary the number of distinct labels from 1 to 800. Given a randomly extracted query graph with the number of nodes $|V_Q|$ = 8, 10 and 12 respectively, we check how many subgraphs need to be verified during the final match phase of our approach. The smaller this number is, the more powerful the pruning of our algorithm is. We plot the number of subgraphs that need to be verified in the final match phase vs. the number of distinct labels in Figure 16. Note that the Y-axis is in log scale. It can be observed that, when there is only 1 distinct label in the entire graph, we need to verify about $10^{25}$ subgraphs for a query graph with 8 nodes during the final match phase. However, as the number of distinct labels increases, the number of subgraphs that we need to verify decreases rapidly. For 800 distinct labels, we only need to verify a very small number of subgraphs (e.g., 2 subgraphs when $|V_Q|$ = 8) in the final match phase of our approach.
Thus, our algorithm can be very efficient on graphs with few automorphisms and many distinct labels.

7.7 Indexing and Query Optimization

In Table 3, we compare the running time of our online search algorithm with that of a linear scan with no indexing and query optimization. Each of the query graphs has 50 nodes and diameter 2 for this experiment. It can be observed that our indexing and query optimization techniques can significantly speed up online search. We also compare the index construction time of dynamic update with the cost of rebuilding the whole index when the target graph is modified. The propagation depth is 2 for these experiments. The results for the DBLP dataset are shown in Figure 17. As we can see, for a wide range of updates in the target graph, it is more efficient to update the index structure rather than re-indexing the graph. The results also indicate that our index structure is very efficient against dynamic updates in the target graph.

[Figure 15: Satisfactory h Value (DBLP). X-axis: propagation depth; Y-axis: error ratio.]
[Figure 16: Pruning Capacity (WebGraph). X-axis: # of distinct labels; Y-axis: # of subgraphs ($10^x$); curves for $|V_Q|$ = 8, 10, 12.]
[Figure 17: Dynamic Update Index (DBLP). X-axis: % node update; Y-axis: time (sec); curves: Dynamic Update, Re-Index.]

7.8 Scalability

We show the scalability of our approach on the WebGraph dataset. The vectorization time as a function of the number of nodes in the graph is shown in Figure 18(a). Figure 18(b) shows the trend of the online search time with respect to the number of nodes. The propagation depth is 2 for indexing, and we identify the top-1 matches using our search algorithm. Each of the query graphs has 10 nodes and diameter 3 for this experiment. As can be observed, for a graph with 10 million nodes, our approach can return the top-1 match in 0.1 second. The corresponding index building time is also tolerable. Both the index building time and the online search time are roughly linear in the number of nodes. These results show that our technique is highly scalable for large scale information/social networks.

[Figure 18: Scalability Results (WebGraph). Panels: (a) Vectorization Time, (b) Search Time. X-axis: # of nodes (M); Y-axis: time (sec).]

8. RELATED WORK

Graph search has been studied in different contexts such as graph isomorphism, graph indexing, structure matching, etc. In XML, where the structures encountered are often trees and lattices, queries built on path expressions became popular [28] and their corresponding indices have been developed [9]. In bioinformatics, exact and approximate graph alignment has been extensively studied, e.g., PathBlast [2], Saga [33], NetAlign [23], IsoRank [32]. They target relatively small biological networks with fewer than 10k nodes. It is difficult to apply them to social and information networks with thousands or even millions of nodes. Kernel based graph matching techniques have also been proposed, e.g., common walks [6, 8], shortest paths [5], limited-size subgraphs [9] and subtree patterns [20]. Recently, Shervashidze et al. [25] proposed a fast subtree pattern kernel based on the Weisfeiler-Lehman method. Kernel methods do not support subgraph search well.

For subgraph search, Shasha et al. [3] extend the path-based technique for full-scale graph retrieval; Yan et al. propose gIndex [37] using frequent subgraphs. These studies inspired new graph index structures such as δ-tolerance Closed Frequent Subgraphs [8], Tree [40], and GCoding [4]. He et al. [7] develop a closure tree index to perform approximate graph search. Tian et al. [33] design a fragment based index to assemble an approximate match. Shang et al. introduce an efficient algorithm for testing subgraph isomorphism [29]. Ferro et al. propose a novel indexing scheme, SING [26], based on locality information. All these methods are built strictly on graph structures and are not well suited to the approximate search shown in Figure 1.

There have been significant studies on inexact graph matching on attributed graphs [30, 7]. Tong et al. [35] propose best-effort pattern matching in large attributed graphs. It finds the best match based not on the proximity among the labels, but rather on the shape of the query graph. Tian et al. [34] proposed an approximate subgraph matching tool, called TALE, with efficient indexing and high pruning capabilities. Mongiovì et al. introduce a set-cover-based inexact graph matching technique, called SIGMA [24]. Both techniques only use edge misses to measure the quality of graph matching. Therefore, they are not appropriate for the proximity based search scenario studied in this work. There has been some recent work on inexact graph matching, i.e., simulation based cubic time graph pattern matching [3], homomorphism based subgraph matching [4], belief propagation based net alignment [3], an edge-edit-distance based subgraph indexing technique [39] and a graph partition based subgraph identification scheme [6].

9. CONCLUSIONS

In this paper, we defined a new graph similarity measure, neighborhood based graph similarity, and proposed an information propagation model to convert a large network into a set of multidimensional vectors, where sophisticated indexing and similarity search algorithms are available. We proved, under this measure, that subgraph similarity search is NP hard, while graph similarity match is polynomial. We introduced a criterion to select the best propagation rate with respect to different node labels in a graph. We further investigated techniques to index the neighborhood vectors and to compress them by deleting non-discriminative labels, thus optimizing the query processing time. The proposed method, called Ness, is not only efficient, but also robust against structure changes and information loss. Empirical results show that it can quickly and accurately find high-quality matches in large networks, with negligible time cost. In future work, it will be interesting to consider the graph alignment problem when the node labels in two graphs are not exactly identical, e.g., when the same user has slightly different usernames on Facebook and Twitter.
