Querying Communities in Relational Databases

Lu Qin, Jeffrey Xu Yu, Lijun Chang, Yufei Tao
The Chinese University of Hong Kong, Hong Kong, China

Abstract

Keyword search on relational databases provides users with insights that they cannot easily observe using traditional RDBMS techniques. Here, an l-keyword query is specified by a set of l keywords, {k_1, k_2, ..., k_l}. It finds how the tuples that contain the keywords are connected in a relational database via the possible foreign key references. Conceptually, it is to find some structural information in a database graph, where nodes are tuples and edges are foreign key references. The existing work studied how to find connected trees for an l-keyword query. However, a tree may only show partial information about how those tuples that contain the keywords are connected. In this paper, we focus on finding communities for an l-keyword query. A community is an induced subgraph that contains all the l keywords within a given distance. We propose new efficient algorithms to find all/top-k communities, which consume small memory, for an l-keyword query. For top-k l-keyword queries, our algorithm allows users to interactively enlarge k at run time. We conducted extensive performance studies using two large real datasets to confirm the efficiency of our algorithms.

I. INTRODUCTION

Keyword search on relational databases has been widely studied in recent years. It treats a relational database as a database graph G_D by considering the tuples as nodes and the foreign key references as edges between nodes, and searches for the hidden connections between those tuples that contain the keywords specified in a user-given l-keyword query (k_1, k_2, ..., k_l). Almost all existing work aims at finding the minimal connected trees that contain all the l keywords in a database graph or in the underlying relational database [1], [2], [3], [4], [5], [6], [7]; some focused on finding all the minimal connected trees and some on finding the top-k minimal connected trees. In this paper, we explore two key issues.
The first is whether it is in the users' best interest to find minimal connected trees on a database graph G_D, and the second is how to efficiently find subgraphs (instead of trees) for user-given l-keyword queries, if doing so is highly desirable. We discuss the first issue in the introduction, and focus on the efficiency issue in the rest of the paper.

Consider a small graph, G, shown in Fig. 1(a). The graph G shows the co-authorship and the citation between two papers. There are 5 nodes (3 authors and 2 papers). The three authors are John Smith, Jim Smith, and Kate Green, and the two papers are paper1 and paper2. There are 6 edges. The paper paper1 was co-authored by John Smith and Kate Green, and the paper paper2 was co-authored by Kate Green, John Smith, and Jim Smith. In addition, paper1 cited paper2. The edges are weighted. The weights of the edges from paper1 to John Smith and Kate Green are 1 and 2, respectively, because John Smith was the first author and Kate Green was the second author. In a similar fashion, the author order for the three authors who wrote paper2 is indicated by the weights associated with the corresponding edges. We assume that the weight on the citation edge from paper1 to paper2 is 4.

[Fig. 1. Graph and Community: (a) Graph; (b) Community]

Next, consider a 2-keyword query, Kate and Smith, against the small graph. All the 5 trees, T_i, for 1 ≤ i ≤ 5, are listed in Fig. 2. T_1 (Fig. 2(a)) shows that John Smith and Kate Green wrote paper1. T_2 (Fig. 2(b)) indicates that John Smith wrote paper2, which was cited by paper1 written by Kate Green. Each of the first 4 trees, T_i, for 1 ≤ i ≤ 4, in Fig. 2 gives some pieces of information about John Smith and Kate Green. But none of the 4 trees gives a whole picture of the relationships between these two authors. There are two problems. One is that a user may find some information but at the same time miss some information he/she is really interested in when browsing the resulting trees. For example, a user may want to find how many papers John Smith and Kate Green co-authored. T_1 only shows that they co-authored one paper.
The other problem is that the number of resulting trees can be large for an l-keyword query, which makes it difficult for users to find all the information they need.

We propose to find communities (multi-center graphs) for an l-keyword query. Fig. 1(b) illustrates a community. It is an induced subgraph over a set of nodes, namely, keyword nodes, center nodes, and additional path nodes. (a) For a given l-keyword query, in a community, there are up to l keyword nodes. All the user-given l keywords appear in the keyword nodes. (b) There are center nodes, where each center connects to every keyword node within a threshold called the radius. In other words, the distance along the shortest path between a center node and a keyword node is less than or equal to the radius. (c) The additional nodes, called path nodes, are the nodes that appear on any path from a center node, v_c, to a keyword node, v_l, if the distance along such a path between v_c and v_l is less than or equal to the radius.

[Fig. 2. Five Trees: (a) T_1; (b) T_2; (c) T_3; (d) T_4; (e) T_5]
[Fig. 3. Two Communities: (a) R_1 (2-Center); (b) R_2 (1-Center)]

Fig. 3 shows two such communities for the same 2-keyword query with radius 6. Consider the community R_1 in Fig. 3(a). There are 2 keyword nodes that contain the two keywords, Kate and Smith, respectively. There are 2 center nodes, paper1 and paper2. Because the given radius is 6 and there is a path from paper1 to Kate Green via paper2 with total weight 5 (≤ 6), the edge from paper1 to paper2 is also included in the community. The community R_1 (Fig. 3(a)) includes all the information represented by the 4 trees, T_i, for 1 ≤ i ≤ 4, in Fig. 2.

The semantics of such communities have been studied. Similar concepts are also used in co-citation analysis and authority/hub analysis [8]. To the best of our knowledge, this is the first study of finding multi-center communities in relational databases. In addition, we find communities using a radius, which bounds the minimum total weight along a path from a center node to a keyword node. This is different from other reported studies that find the core of web communities as bipartite graphs [8].

Contributions of this paper:

(1) We propose a general concept called community as a multi-center directed graph in a relational database, when the relational database is considered as a database graph, G_D.

(2) We propose an algorithm to enumerate all communities in polynomial delay under query-and-data complexity [9]. We show that the communities found are complete and duplication-free. By complete, we mean that we find all communities. We introduce a weak duplication-free concept under which we design a polynomial delay algorithm with time complexity of O(l · (n log n + m)) and space complexity of O(l · n + m), where n and m are the number of nodes and the number of edges in G_D, and l is the number of user-given keywords. Polynomial delay enumeration algorithms are considered the best algorithms for enumerating results [9], [10], [4].
(3) We propose a polynomial delay algorithm with the same time complexity of O(l · (n log n + m)) and space complexity of O(l^2 · k + l · n + m), to find the exact top-k communities in ranking order. One main advantage of our algorithm is that we allow a user to interactively reset the value of k for finding the top-k communities during run time without overhead.

(4) We propose an efficient indexing method to index the database graph, G_D. With such an index, we show that we can project a small database graph for an l-keyword query, and find the same set of communities. The search space can be significantly reduced.

(5) We conduct extensive performance studies to confirm the efficiency of our algorithms using real datasets.

Organization: The remainder of the paper is organized as follows. Section II gives our problem statement. Section III discusses several possible solutions, and highlights the main ideas of our polynomial delay algorithms after introducing several categories of enumeration algorithms. In Section IV, we discuss our algorithm to find all communities, and in Section V, we discuss our algorithm to find top-k communities. We also introduce our indexing and give an algorithm to project a subgraph of G_D for an l-keyword query to be evaluated. Experimental studies are given in Section VII, followed by discussions on related work in Section VIII. Finally, Section IX concludes the paper.

II. PROBLEM STATEMENT

Following the notations used in [], [3], [4], [7], we model a relational database RDB as a weighted directed graph, G_D = (V, E), where V is the set of nodes (tuples) in RDB, and E is the set of edges between nodes based on the foreign key references between the corresponding tuples in RDB. Here, a node, v ∈ V, may contain keywords, and a directed edge, (u, v) ∈ E, is associated with a weight, denoted w_e((u, v)). Given two nodes u and v, we use dist(u, v) to denote the total weight along the shortest path between u and v. In the following discussions, we focus on a general directed graph. Our approach can be easily applied to undirected or bi-directed graphs [], [3], [7].
We use V(G) and E(G) to denote the set of nodes and the set of edges of a given graph G. We also denote the number of nodes and the number of edges in the database graph, G_D, using n = |V(G_D)| and m = |E(G_D)|. Fig. 4 shows an example of a database graph, G_D, which consists of 13 nodes, v_i, for 1 ≤ i ≤ 13. Some nodes contain keywords: v_4 and v_13 contain keyword a, v_2 and v_8 contain keyword b, and v_3, v_6, v_9 and v_11 contain keyword c. All the edges are weighted, for example, w_e((v_1, v_2)) = 5.

Given a database graph G_D, an l-keyword query consists of a set of l keywords, {k_1, k_2, ..., k_l}. An l-keyword query is to find a set of communities, R = {R_1(V, E), R_2(V, E), ...}, which we define below.
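As a minimal sketch of the graph model above, assuming an adjacency-dictionary representation and non-negative edge weights (both are illustrative choices, not something the paper prescribes), dist(u, v) can be computed with Dijkstra's algorithm. The node names in `toy_graph` are hypothetical.

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from source in a weighted directed graph.

    graph: dict mapping node -> {successor: edge weight} (weights assumed >= 0).
    Returns a dict mapping every reachable node to dist(source, node).
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Hypothetical toy database graph: tuples as nodes, references as weighted edges.
toy_graph = {
    "paper_x": {"paper_y": 4, "author_a": 1},
    "paper_y": {"author_a": 1, "author_b": 2},
    "author_a": {},
    "author_b": {},
}
toy_dist = dijkstra(toy_graph, "paper_x")
```

Here `toy_dist["author_b"]` is the weight of the path through `paper_y`, and a node that cannot be reached from the source simply does not appear in the result.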

[Fig. 4. A Simple Database Graph, G_D]
[Fig. 5. Five communities: (a) R_1; (b) R_2; (c) R_3; (d) R_4; (e) R_5]

Definition 2.1: (Community) A community, R_i(V, E), is a multi-center induced subgraph of the database graph G_D. Here, V is a union of three subsets, V = V_c ∪ V_l ∪ V_p. (1) V_l represents a set of nodes called knodes (keyword nodes). Every knode v_l ∈ V_l contains at least one keyword, and all l keywords must appear in at least one knode in V_l. (2) V_c represents a set of nodes called cnodes (center nodes). For any cnode v_c ∈ V_c, there exists at least a single path such that dist(v_c, v_l) ≤ Rmax (radius) between v_c and every v_l ∈ V_l. (3) V_p represents the nodes, called pnodes (path nodes), which appear on any path from a cnode v_c ∈ V_c to a knode v_l ∈ V_l if dist(v_c, v_l) ≤ Rmax. Note that a cnode or pnode may also contain some keywords, and a node can be a knode and a cnode at the same time. E(R_i) is the set of edges, (u, v) ∈ E(G_D), for every pair of u and v that appear in V(R_i).

It is worth noting that a community, R_i, is uniquely determined by its knodes, V_l, which we call the core of the community and denote as core(R_i). For simplicity, we use C to represent a core as a list of l nodes, C = [c_1, c_2, ..., c_l], and may use C[i] to denote c_i ∈ C, where c_i contains the keyword k_i.

Example 2.1: Consider the database graph G_D (Fig. 4). Let Rmax = 8. For a 3-keyword query {a, b, c}, 5 communities, R_i, for 1 ≤ i ≤ 5, are shown in Fig. 5. For example, for R_5 (Fig. 5(e)), the knodes (core(R_5)) are V_l = {v_13, v_8, v_11}, the cnodes are V_c = {v_11, v_12}, and the pnodes are V_p = {v_10}.

For a community, R_i, a cost function can be defined, denoted cost(R_i), using edge weights w_e(). In this paper, we define cost(R_i) as the minimum total edge weight from a cnode to every knode on the corresponding shortest path. For example, consider the community R_5 (Fig. 5(e)). There are two centers, v_11 and v_12. The total edge weight over the shortest paths from v_11 to the 3 knodes, v_8, v_11, and v_13, is 11 = (2 + 3) + (3 + 3).
The total edge weight over the shortest paths from v_12 to the 3 knodes, v_8, v_11, and v_13, is 14 = (3 + 2 + 3) + 3 + 3. Therefore, cost(R_5) = 11. Given two communities, R_i and R_j, R_i is ranked higher than R_j, denoted R_i ⪯ R_j, if cost(R_i) ≤ cost(R_j). The highest rank is number 1. The ranking for the 5 communities given in Fig. 5 is shown in Table I. Note that our work does not rely on a specific cost function.

(Footnote 1: For simplicity, we ignore node weights in this paper; our approach can support node weights.)

TABLE I
RANKING

Rank  Knodes (a, b, c)     Graph  Cost  Center
1     v_4,  v_8, v_6       R_3    7     {v_4, v_7}
2     v_13, v_8, v_9       R_4    10    {v_9}
3     v_13, v_8, v_11      R_5    11    {v_11, v_12}
4     v_4,  v_2, v_3       R_1    14    {v_1}
5     v_4,  v_2, v_9       R_2    15    {v_5}

Problem Statement: In this paper we study two interrelated problems for an l-keyword query against a database graph G_D, with a user-given radius Rmax, namely, finding all communities and finding the top-k communities. We denote them as COMM-all and COMM-k, respectively. For both, the resulting communities must be complete and duplication-free. By complete, we mean that we explore all combinations of the keywords to identify communities based on all possible cores. By duplication-free, we mean that for any two communities R and R', core(R) ≠ core(R'). In other words, let C = [c_1, c_2, ..., c_l] and C' = [c'_1, c'_2, ..., c'_l] be the two cores for the two communities, R and R'; then C[i] ≠ C'[i] for some i. We define duplication with the following consideration. A community is uniquely determined by its core, because otherwise checking whether two graphs are the same would require graph isomorphism tests, which is too expensive.

III. AN OVERVIEW

In this section, we discuss several possible solutions and address the efficiency we want to achieve. For processing an l-keyword query, {k_1, k_2, ..., k_l}, with Rmax, against G_D, let V_i be the set of nodes in V(G_D) that contain the keyword k_i, and let |V_i| be the number of nodes in V_i, for 1 ≤ i ≤ l. Because a core C uniquely determines a community, we discuss how to find all cores without duplication in this section, and address the efficiency problems. First, we consider a naive approach using the 3-keyword query in Example 2.1.
For processing COMM-all, it needs a nested loop to check all combinations of nodes that contain keywords, as follows.

    for v_i ∈ V_1 do
        for v_j ∈ V_2 do
            for v_k ∈ V_3 do
                form a core candidate C[v_i, v_j, v_k];
                output C if there is a center which connects every node in C within Rmax;

The three for-loops together check every combination of the three sets, V_1, V_2, and V_3, and compute every possible core, C. The complexity is O(|V_max|^3), where |V_max| = max{|V_1|, |V_2|, |V_3|}. In general, for an l-keyword query, it is exponential in nature, O(n^l), where n = |V(G_D)|. Since it checks every distinct combination of nodes that contain the three keywords, the result is complete and duplication-free.

In order to find communities based on the user-given keywords, an expanding approach can be adopted, which expands from nodes, step by step, until communities can be identified. First, we can expand from all the nodes in V_i that contain the keyword k_i, for 1 ≤ i ≤ l. We call it bottom-up expanding, and outline it below.

    V ← V_1 ∪ V_2 ∪ V_3;
    let each u ∈ G_D maintain l sets where each set, u.V_i, keeps the nodes v ∈ V_i that u can reach within Rmax;
    repeat
        find a new node, w, that is expanded from v_i ∈ V within Rmax;
        add v_i into w.V_i if v_i contains the keyword k_i;
        if w ≠ ∅ and all w.V_i are non-empty then
            output new cores found;
    until w = ∅

During the expansion process, when a keyword node v_i ∈ V_i expands to a node u, it implies that u can reach v_i, and we maintain v_i in a set denoted u.V_i at the node u. In other words, the set u.V_i maintains all the nodes containing keyword k_i that can be reached from u within Rmax. If all u.V_i, for 1 ≤ i ≤ l, are non-empty, there exist cores of communities. The number of core candidates at a node u is O(|u.V_max|^3), where |u.V_max| = max{|u.V_1|, |u.V_2|, |u.V_3|}. When a candidate is to be output, the algorithm first checks whether it is a duplicate. To do so, the algorithm maintains a pool of the cores already output. When a new core is to be output, it checks whether the core is already in the pool. If not, the algorithm outputs it and adds it to the pool.

Second, in a similar fashion, we can expand from any node u ∈ V(G_D) in a top-down fashion up to Rmax, and check whether it can contain cores of communities by maintaining u.V_i in the same way as in the bottom-up expanding approach. Note that both the top-down and the bottom-up expanding approaches can find all communities. In other words, the results are complete and duplication-free.

A. Enumeration Delay

In this paper, we investigate novel enumeration algorithms [9] for supporting COMM-all and COMM-k queries. For our problem, an enumeration algorithm, A, outputs all/top-k communities, O = (R_1, R_2, ..., R_|O|), for an l-keyword query against a database graph G_D, with a radius Rmax. We consider the graph G_D and the l keywords as the input, and denote it as I. The size of the input, I, is |I| = n + m + l, where n and m are the numbers of nodes and edges of G_D, respectively. Note that Rmax is a constraint rather than input data. The size of the output is |O|, where in O, R_i ≠ R_j if core(R_i) ≠ core(R_j) (duplication-free).

First, consider the COMM-all queries. The naive approach (nested loop), in the worst case, takes exp(|I|) time, that is O(n^l), to output all the results of size |O|. The complexity of the naive approach is independent of the output size, |O|, which can be even much larger than the input size |I|. Therefore, even if an enumeration algorithm, A, is not polynomial in the input size |I|, it may be seen as reasonable because, when |O| dominates, all algorithms need to output O. It is therefore necessary to analyze the complexity by taking both the input, I, and the output, O, into consideration.

In the literature [9], [], there are several categories of enumeration algorithms, namely, polynomial total time, incremental polynomial time, and polynomial delay. Here, polynomial total time means that the processing time of the algorithm, A, is polynomial in the combined sizes of input and output, |I| + |O|. Incremental polynomial time implies that the processing time to output the o-th answer, which does not necessarily follow any ranking order, is polynomial in the combined size of the input and the first o answers already output, |I| + o. And polynomial delay implies that the o-th answer is output in time which is polynomial in |I| only. Obviously, the best algorithm is a polynomial delay algorithm. A bottom-up/top-down expanding algorithm is not a polynomial delay algorithm, because it needs to check the already output results in order to ensure that the output is duplication-free.
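The duplicate check that keeps an expanding algorithm from achieving polynomial delay can be sketched as follows. This is only an illustration under assumed inputs: `reach_sets` stands for the sets u.V_i held at one node, `pool` for the set of cores already output, and the function name and node labels are hypothetical.

```python
from itertools import product

def emit_new_cores(reach_sets, pool):
    """Return the cores visible at one node that have not been output before.

    reach_sets: list of l collections; the i-th holds the nodes containing
                keyword k_i that the current node reaches within Rmax (u.V_i).
    pool: set of cores (tuples) already output; it is updated in place.
    """
    fresh = []
    for core in product(*(sorted(s) for s in reach_sets)):
        if core not in pool:   # lookup into everything output so far
            pool.add(core)     # the pool grows with the output size
            fresh.append(core)
    return fresh

pool = set()
first = emit_new_cores([{"a1"}, {"b1", "b2"}], pool)
second = emit_new_cores([{"a1"}, {"b2", "b3"}], pool)
```

In this toy run the second node sees the core ("a1", "b2") again, and only the pool prevents it from being output twice. Because the pool holds every core output so far, the memory and the work per answer depend on the output, not on the input alone.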
A bottom-up/top-down expanding algorithm is an incremental polynomial time algorithm, because it is polynomial in the combined size of the input and the results that have been generated.

Second, consider the COMM-k queries, which are to output the top-k communities in ranking order. In [], Lawler gives a procedure (Lawler's procedure) to compute the k best solutions to discrete optimization problems, and shows that if the number of computational steps required to find an optimal solution to a problem with l (0, 1) variables is c(l), then the amount of computation required to obtain the k best solutions is O(l · c(l)). Kimelfeld et al. propose a polynomial delay algorithm that adopts Lawler's procedure to find top-k Steiner trees for keyword search problems [4]. Based on Lawler's procedure, for COMM-k queries, it is straightforward to obtain an algorithm with time complexity O(l · c(l)), where c(l) is the time complexity to compute the top-1 community. However, in this paper, we propose new algorithms to compute COMM-k queries, which take O(c(l)) instead of O(l · c(l)).

B. New Enumeration Delay Algorithms

We highlight the main ideas of our novel polynomial delay algorithms for processing COMM-all/COMM-k queries, followed by detailed discussions in the following sections.

    find the first best core, C;
    while C ≠ ∅ do
        output the community based on C;
        C ← Next(...);

First, we discuss our algorithm for processing COMM-all queries. As shown above, it first finds the first best core C. In the while loop, it outputs the community based on C, which is duplication-free. Here, Next() is a procedure to determine the next core. The main issue is how to enumerate all cores (completeness). We explain it using the 3-keyword query in Example 2.1. Suppose that the first core determined is C = [v_a, v_b, v_c] for an l-keyword query. Here, v_a, v_b, and v_c contain the keywords a, b, and c, respectively. We need to ensure that such a C will not be enumerated again. In doing so, we divide the entire search space, V_1 × V_2 × V_3, into 4 subspaces (l + 1):

    S_1: {v_a} × {v_b} × {v_c},
    S_2: (V_1 − {v_a}) × V_2 × V_3,
    S_3: {v_a} × (V_2 − {v_b}) × V_3, and
    S_4: {v_a} × {v_b} × (V_3 − {v_c}).

It is important to know the following facts. (a) S_1 is the current core found. (b) V_1 × V_2 × V_3 = S_1 ∪ S_2 ∪ S_3 ∪ S_4, which implies that we can enumerate all cores (complete). (c) S_i ∩ S_j = ∅ for i ≠ j (duplication-free).

In order to enumerate all cores, we propose a depth-first traversal algorithm. Conceptually, there exists a virtual root node, which represents the entire search space and, as shown in this example, has 4 child nodes (S_1, S_2, S_3, S_4) representing 4 subspaces. Suppose that we find the next best core in one of the subspaces, say S_4. With the same procedure, we further divide S_4 into 4 subspaces in a similar way, traversing the virtual tree in a depth-first fashion. The time complexity of our algorithm is O(l · (n log n + m)) using space O(l · n + m). The same idea can be easily extended to support any l-keyword query.

Below, we outline our algorithm for COMM-k queries.

    H ← ∅;
    find the first best core, C;
    H.enheap(C);
    while H ≠ ∅ do
        g ← H.deheap();
        output the community based on g.C;
        Next(...);

For COMM-k queries, we need to output communities following their ranking order. In doing so, we use a Fibonacci heap (H). We explain our main idea using the same example. We find the first best core, C, in the entire space V_1 × V_2 × V_3, and we ensure that the first core found is the core for the top-1 community. We enheap C with other information into the heap H, and enter the while loop. In the while loop, we first deheap the core C with the smallest cost from H. We compute its community and output it. Then, we call Next().
In Next(), we attempt to find the next best core in each of the three subspaces, S_2, S_3, and S_4, individually. If we find the best core, C_i, in S_i, for 1 < i ≤ 4, we enheap C_i to H. With H, the next best core, with the smallest cost, can be selected in the next iteration from all the cores kept in H. We repeatedly deheap one core, C, from H, identify the subspace that C is in, say S, further divide S into l = 3 subspaces, find the best cores in the 3 subspaces, and enheap them into H for finding the next best core. The time complexity of our algorithm is also O(l · (n log n + m)) using space O(l^2 · k + l · n + m).

Algorithm 1 COMM-all(G_D, {k_1, k_2, ..., k_l}, Rmax)
Input: the database graph (G_D), the set of keywords {k_1, k_2, ..., k_l}, radius Rmax.
Output: all communities.
 1: for i = 1 to l do
 2:     V_i ← the set of nodes in G_D containing k_i;
 3:     S_i ← V_i;
 4:     N_i ← Neighbor(G_D, S_i, Rmax);
 5: (C, cost) ← BestCore(N_1, N_2, ..., N_l);
 6: while C ≠ ∅ do
 7:     R ← GetCommunity(G_D, C, Rmax);
 8:     output R;
 9:     C ← Next(G_D, C, Rmax);
10: Procedure Next(G_D, C, Rmax)
11: for i = 1 to l do
12:     N_i ← Neighbor(G_D, {C[i]}, Rmax);
13: for i = l downto 1 do
14:     S_i ← S_i − {C[i]};
15:     N_i ← Neighbor(G_D, S_i, Rmax);
16:     (C', cost') ← BestCore(N_1, N_2, ..., N_l);
17:     if C' ≠ ∅ then
18:         return C';
19:     S_i ← V_i;
20:     N_i ← Neighbor(G_D, S_i, Rmax);
21: return ∅;

IV. FIND ALL COMMUNITIES

The algorithm for processing COMM-all is shown in Algorithm 1. It takes three inputs: the database graph G_D, the set of keywords, {k_1, k_2, ..., k_l}, and the radius Rmax. First, for every keyword k_i, it finds all nodes in V(G_D) that contain k_i, and assigns them to V_i (line 2), which can be done efficiently using the full-text index []. For every V_i, we introduce S_i, which represents the currently available subset of V_i, as candidates, for finding the next community. Initially, S_i is set to be V_i (line 3). For S_i, it finds the subset of V(G_D), called the neighborset of S_i and denoted N_i, by calling the procedure Neighbor() (Algorithm 2). In N_i, every node v_j must have at least one neighbor v_k ∈ S_i such that dist(v_j, v_k) ≤ Rmax. Note that S_i ⊆ V_i ⊆ N_i ⊆ V(G_D).
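The division of the search space around a found core, which the S_i bookkeeping in Algorithm 1 keeps track of, can be sketched as follows. This is a hedged illustration only: the function name, the set-based representation, and the node labels are assumptions, and the sketch covers one level of the division rather than the full depth-first traversal.

```python
from itertools import product

def split_search_space(candidate_sets, core):
    """Divide S_1 x ... x S_l around a found core into l disjoint subspaces.

    The i-th subspace fixes the first i-1 positions to the core's nodes,
    removes core[i] from position i, and leaves later positions unrestricted.
    Together with the single point {core}, the subspaces cover the whole space.
    """
    subspaces = []
    for i, chosen in enumerate(core):
        fixed = [{c} for c in core[:i]]
        reduced = [set(candidate_sets[i]) - {chosen}]
        free = [set(s) for s in candidate_sets[i + 1:]]
        subspaces.append(fixed + reduced + free)
    return subspaces

# Hypothetical keyword-node sets for a 3-keyword query and a first core.
candidate_sets = [{"a1", "a2"}, {"b1", "b2"}, {"c1", "c2", "c3"}]
core = ("a1", "b1", "c1")
subspaces = split_search_space(candidate_sets, core)
cells = [set(product(*sub)) for sub in subspaces]
```

Materializing `cells` is done here only to make the cover and disjointness properties visible on a toy input. An actual enumeration would never expand the subspaces, since their size is exponential in l.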
Then, it attempts to output all communities in the rest of the algorithm. It finds the first core of a community, C, associated with a cost, by calling BestCore() (Algorithm 3) with all neighborsets, N_i, for 1 ≤ i ≤ l (line 5). In the while loop, the unique community, R, for a non-empty core C is determined by GetCommunity() (Algorithm 4). The while loop repeats by calling the Next() procedure, which finds the next core, C.

In Next(), there are two main parts: the preparation phase (lines 11-12), and the search phase to find the next core (lines 13-20). We explain the two main parts below. Recall that C is represented as a list of l nodes, C = [c_1, c_2, ..., c_l], where c_i = C[i] contains the keyword k_i. It attempts to find the next core, C' = [c'_1, c'_2, ..., c'_l], which contains at least one node c'_i ≠ c_i ∈ C. It is important to note that C' ≠ C, because in the last iteration, at least c'_i ≠ c_i since c_i is removed from S_i (line 14). In addition, C' is not only different from the current core, C, but also different from any core found up to this stage (duplication-free). Finally, the procedure Next() searches the entire search space, and does not miss any possible cores. We will explain these issues in detail later, after showing an example.

(Footnote 2: For simplicity, we assume that all variables V_i and S_i, for 1 ≤ i ≤ l, are global variables, so we do not need to pass them to the procedure Next().)

[Fig. 6. Finding cores: (a) initial; (b) after [v_4, v_8, v_6]; (c) [v_4, v_2, v_3]]

Reconsider the database graph G_D in Fig. 4 for the 3-keyword query, k_1 = a, k_2 = b, and k_3 = c, with Rmax = 8. Initially, after line 4, S_1 = V_1 = {v_4, v_13}, S_2 = V_2 = {v_8, v_2}, and S_3 = V_3 = {v_6, v_3, v_9, v_11}. Also, the three neighborsets are: N_1 = {v_1, v_4, v_5, v_7, v_8, v_9, v_11, v_12, v_13}, N_2 = {v_1, v_2, v_4, v_5, v_7, v_8, v_9, v_10, v_11, v_12}, and N_3 = {v_1, v_2, v_3, v_4, v_5, v_6, v_7, v_9, v_11, v_12}. BestCore() will identify a core based on the nodes in the intersection N_1 ∩ N_2 ∩ N_3 = {v_1, v_4, v_5, v_7, v_9, v_11, v_12}, because only a node in the intersection of the neighborsets can possibly serve as a center connecting nodes that contain all three keywords. Suppose that BestCore() identifies a core C = [v_4, v_8, v_6] centered at v_7 with a cost of 7 (line 5) (refer to Fig. 6(a)). The first community based on the core C is uniquely identified as R_3 in Table I (Fig. 5(c)), and is output (lines 7-8). Then, it calls Next() to find the next core (line 9).

In Next(), it attempts to find the next core based on the current core C = [v_4, v_8, v_6]. It computes the three neighborsets, N_1, N_2, and N_3, for the three nodes in C, regarding each of them as the only keyword node. After line 12, N_1 = {v_1, v_4, v_5, v_7}, N_2 = {v_4, v_7, v_8, v_9, v_10, v_11, v_12}, and N_3 = {v_4, v_6, v_7}. Then, in the for loop, initially i = 3. Note that C = [v_4, v_8, v_6], C[3] = v_6, and S_3 = {v_3, v_9, v_11} after removing C[3] = v_6 from S_3 (line 14). It implies that the next core should not contain v_6 in S_3. It recomputes N_3 using S_3, and N_3 is reset to be N_3 = {v_1, v_2, v_3, v_5, v_9, v_11, v_12} (line 15).
Then, it attempts to find the next core by calling BestCore() using the three newly computed neighborsets, N_1, N_2, and N_3. However, as can be seen, the intersection N_1 ∩ N_2 ∩ N_3 = ∅; therefore, it is impossible to find a core. BestCore() will return an empty C' (line 16) (refer to Fig. 6(b)). Because C' = ∅, it will move to the next iteration. Before moving to the next iteration of the for loop, it resets S_3 ← V_3 (line 19), because any new combination to form a new core C' in the next iteration can possibly contain any node in the entire V_3. It also recomputes N_3 = {v_1, v_2, v_3, v_4, v_5, v_6, v_7, v_9, v_11, v_12}. In the next iteration, for i = 2, it repeats a similar procedure starting from S_2 ← S_2 − {v_8} = {v_2}, because C[2] = v_8. The new neighborset N_2 becomes {v_1, v_2, v_5} (line 15), and C' = [v_4, v_2, v_3] (line 16). Since C' ≠ ∅, it returns the new core C' in line 18 (refer to Fig. 6(c)).

Fig. 6 shows the main ideas. The three sets, V_i, for 1 ≤ i ≤ 3, are represented as three rectangles. In a rectangle, the shaded part is the subset of V_i that does not need to be searched in an iteration. The circles represent neighborsets.

Algorithm 2 Neighbor(G_D, V_i, Rmax)
Input: G_D is the database graph, and V_i ⊆ V(G_D).
Output: the neighborset of V_i within Rmax.
1: let G_t(V_t, E_t) be a virtual graph such that V_t = V(G_D) ∪ {t} and E_t = E(G_D) ∪ {(v, t) | v ∈ V_i} where every w_e((v, t)) = 0;
2: run Dijkstra's algorithm to find the shortest paths from all nodes in V_t to t; {consider (u, v) ∈ E_t as (v, u) (reverse order)}
3: N_i ← {u | dist(u, t) ≤ Rmax ∧ u ∈ V(G_D)};
4: return N_i;

A. The Three Subproblems

In this section, we discuss the details of the three procedures used in Algorithm 1, namely, Neighbor(), BestCore(), and GetCommunity().

The algorithm for Neighbor() is shown in Algorithm 2. It takes three inputs: the database graph G_D, the set of nodes V_i where every node v ∈ V_i contains the keyword k_i, and the radius Rmax. Neighbor() will find the neighborset of V_i, denoted N_i, such that every node u ∈ N_i has at least one node v ∈ V_i where dist(u, v) ≤ Rmax. The nodes in N_i have the potential to be a cnode in a community. Obviously, V_i ⊆ N_i.
In Algorithm 2, it constructs a virtual graph G_t(V_t, E_t) where V_t = V(G_D) ∪ {t} and E_t = E(G_D) ∪ {(v, t) | v ∈ V_i} (line 1). In other words, the virtual graph G_t has one additional sink node, t, and additional edges from every v ∈ V_i to t. For a newly added edge (v, t), the weight, w_e((v, t)), is set to zero. Then, it runs Dijkstra's algorithm to find the shortest paths from the newly added sink node t to all nodes in V_t, by considering every edge (u, v) ∈ E_t as (v, u) (reverse order) (line 2). Then, it identifies the neighborset of V_i as N_i = {u | dist(u, t) ≤ Rmax ∧ u ∈ V(G_D)}. It is interesting to note that, because the shortest path from any node u ∈ N_i to the sink node t must pass through a node v ∈ V_i, and the weight from v ∈ V_i to t is zero, if dist(u, t) ≤ Rmax for u ∈ N_i, then the node u must have at least one nearby node v ∈ V_i such that dist(u, v) ≤ Rmax.

The time complexity of Algorithm 2 is the time complexity of Dijkstra's algorithm, O(n log n + m), where n = |V(G_D)| and m = |E(G_D)| for a given database graph G_D. For every node, u, in the computed neighborset, N_i, we store the nearest node v ∈ V_i that contains the keyword k_i, and the shortest distance, and denote them as src(N_i, u) and min(N_i, u), respectively. The space complexity for N_i is O(n).
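A minimal sketch of the Neighbor() idea, assuming an adjacency-dictionary graph and non-negative weights: seeding Dijkstra's algorithm with every keyword node at distance 0 over reversed edges is equivalent to adding the zero-weight sink t. The function name, the returned (distance, nearest node) pairs standing in for min(N_i, u) and src(N_i, u), and the toy graph are illustrative assumptions.

```python
import heapq

def neighbor(graph, sources, rmax):
    """Nodes that can reach some source within rmax, with distance and nearest source.

    graph: dict node -> {successor: weight} (directed, weights assumed >= 0).
    Returns {u: (dist(u, nearest source), nearest source)} for dist <= rmax.
    """
    # Reverse every edge so that a search from the sources finds nodes reaching them.
    reverse = {}
    for u, edges in graph.items():
        for v, w in edges.items():
            reverse.setdefault(v, {})[u] = w
    dist = {s: 0 for s in sources}
    nearest = {s: s for s in sources}
    heap = [(0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for p, w in reverse.get(u, {}).items():
            nd = d + w
            if nd <= rmax and nd < dist.get(p, float("inf")):
                dist[p] = nd
                nearest[p] = nearest[u]
                heapq.heappush(heap, (nd, p))
    return {u: (dist[u], nearest[u]) for u in dist}

# Hypothetical graph: x reaches k1 directly, y reaches k2, z is too far away.
toy_graph = {"x": {"k1": 2, "y": 1}, "y": {"k2": 3}, "z": {"x": 7}, "k1": {}, "k2": {}}
toy_neighbors = neighbor(toy_graph, {"k1", "k2"}, rmax=8)
```

Pruning at `rmax` during the search is safe only because the weights are assumed non-negative, so a path that already exceeds the radius can never come back under it.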

Algorithm 3 BestCore({N_1, N_2, ..., N_l})
Input: all l neighborsets.
Output: the best core and its cost.
1: C ← ∅; best ← +∞;
2: N ← N_1 ∩ N_2 ∩ ... ∩ N_l;
3: for all u ∈ N do
4:     c ← nearestCore(u);
5:     if cost(c) < best then
6:         C ← c; best ← cost(c);
7: return (C, best);

The BestCore() algorithm is shown in Algorithm 3. It takes l neighborsets as input, and finds the best core C = [c_1, c_2, ..., c_l] where c_i contains the keyword k_i. Note that V_i ⊆ N_i, for 1 ≤ i ≤ l. It computes the intersection of all neighborsets N_i, for 1 ≤ i ≤ l, denoted N (line 2). Every node u ∈ N must be able to serve as a center to form a core C = [c_1, c_2, ..., c_l], because dist(u, c_i) ≤ Rmax for every c_i ∈ C. In the for loop (lines 3-6), the core C with the smallest cost(C) is determined. Here, nearestCore(u) identifies the core centered at u, and cost(c) computes the cost of the core.

We achieve O(n) time to find the next best core with some preparation, which is done by sharing the computational cost of Neighbor() using additional data structures. In our implementation, we maintain a data structure with three elements for every node u ∈ V(G_D). The first element is a list of l pairs. The i-th pair maintains the nearest node of u, say v_i, that contains the keyword k_i, as well as the total weight along the path between u and v_i (dist(u, v_i)). For the i-th pair, we record (v_i, dist(u, v_i)) if there exists v_i ∈ V_i with dist(u, v_i) ≤ Rmax; otherwise we record (∅, +∞). The second element maintains the total weight Σ_{i=1..l} dist(u, v_i) over those i with dist(u, v_i) ≤ Rmax. The third element counts how many v_i, for 1 ≤ i ≤ l, satisfy dist(u, v_i) ≤ Rmax. If the counter is l, the corresponding u can be a possible center in a community. The space for the table is O(l · n). We update the data structure while computing neighborsets, without additional cost in terms of time complexity. With such a data structure available, in BestCore(), we only need to scan the data structure once to find the core with the smallest cost.

The algorithm for GetCommunity() is shown in Algorithm 4. It takes three inputs, the database graph G_D, a core C, and Rmax, and uniquely determines a community, R(V, E), based on the core C.
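The BestCore() scan described above can be sketched as follows. The input layout is an assumption made for illustration: each neighborset is given as a dictionary from a node u to the pair (dist(u, v_i), v_i) for its nearest node containing k_i, and the function additionally returns the chosen center for readability. The names and the toy tables are hypothetical.

```python
def best_core(neighbor_tables):
    """Pick the cheapest core over all nodes that lie in every neighborset.

    neighbor_tables: list of l dicts; the i-th maps u -> (dist(u, v_i), v_i),
                     where v_i is the nearest node to u containing keyword k_i.
    Returns (core, cost, center), or (None, inf, None) if no center exists.
    """
    if not neighbor_tables:
        return None, float("inf"), None
    # Only nodes present in all l neighborsets can serve as a center.
    common = set(neighbor_tables[0])
    for table in neighbor_tables[1:]:
        common &= set(table)
    chosen, lowest, center = None, float("inf"), None
    for u in sorted(common):
        cost = sum(table[u][0] for table in neighbor_tables)
        if cost < lowest:
            chosen = [table[u][1] for table in neighbor_tables]
            lowest, center = cost, u
    return chosen, lowest, center

# Hypothetical tables for a 3-keyword query.
toy_tables = [
    {"u": (1, "a1"), "w": (2, "a1")},
    {"u": (4, "b1"), "w": (1, "b2")},
    {"u": (2, "c1"), "w": (3, "c1"), "z": (0, "c2")},
]
toy_core, toy_cost, toy_center = best_core(toy_tables)
```

The sketch recomputes the intersection and the sums on each call. The per-node table in the text avoids this by keeping the running sum and the counter up to date while the neighborsets are computed.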
Note that a community based on a core C is an induced subgraph R(V, E) where V = V_c ∪ V_l ∪ V_p. Here V_l is the set of knodes and V_l = C. We need to determine the set of cnodes V_c and the set of pnodes V_p.

First, we identify the set of cnodes, V_c, where each cnode v_c ∈ V_c can reach every knode c ∈ C such that dist(v_c, c) ≤ Rmax (line 1). In order to find the set of cnodes, V_c, we construct a virtual graph G'(V', E') where V' = V(G_D) and E' = {(u, v) | (v, u) ∈ E(G_D)}, for a given database graph G_D. In our implementation, for every node u ∈ V(G_D) (= V(G')), we keep a pair of numbers, namely, u.sum and u.count. Both are initialized to zero. Then, for each knode, c ∈ C, we compute the shortest paths from c to all the other nodes using Dijkstra's algorithm. For every u ∈ V(G_D), if dist(u, c) ≤ Rmax, we update u.sum ← u.sum + dist(u, c) and u.count ← u.count + 1. There are l knodes in C, and we run Dijkstra's algorithm l times. If u.count = l, for u ∈ V(G_D), it indicates that u can be added into the set of centers, V_c.

Algorithm 4 GetCommunity(G_D, C, Rmax)
Input: the graph G_D, a core C, and the radius Rmax.
Output: a community uniquely determined by the core C.
1: find the set of cnodes, V_c, where each cnode v ∈ V_c can reach every c ∈ C within Rmax;
2: let G_s(V_s, E_s) be a virtual graph such that V_s = V(G_D) ∪ {s} and E_s = E(G_D) ∪ {(s, v) | v ∈ V_c} where every w_e((s, v)) = 0;
3: run Dijkstra's algorithm to find the shortest paths from s to all nodes in V_s;
4: let G_t(V_t, E_t) be a virtual graph such that V_t = V(G_D) ∪ {t} and E_t = E(G_D) ∪ {(u, t) | u ∈ C} where every w_e((u, t)) = 0;
5: let every (u, v) ∈ E_t be (v, u), and run Dijkstra's algorithm to find the shortest paths from t to all nodes in V_t;
6: V ← {u | dist(s, u) + dist(u, t) ≤ Rmax ∧ u ∈ V(G_D)};
7: construct an induced subgraph R in G_D including all nodes in V;
8: return R;

[Fig. 7. Finding the community for a core]

Second, based on V_l = C and the computed V_c, we compute V as follows. (1) We construct a virtual graph G_s(V_s, E_s) where V_s = V(G_D) ∪ {s} and E_s = E(G_D) ∪ {(s, v) | v ∈ V_c}. The weight for each newly added edge (s, v) is set to zero.
Like Neighbor(), it runs Dijkstra's algorithm to compute the shortest paths from $s$ to all the other nodes (lines 2-3). After line 3, every node $u$ is associated with a counter recording the distance $dist(s, u)$, if $dist(s, u) \le Rmax$. (2) We construct another virtual graph $G_t(V_t, E_t)$ where $V_t = V(G_D) \cup \{t\}$ and $E_t = E \cup \{(v, t) \mid v \in C\}$. The weight for each newly added edge $(v, t)$ is set to be zero. We treat the graph $G_t$ as a reversed graph by virtually dealing with $(v, u) \in E_t$ as $(u, v)$. Again, like Neighbor(), we run Dijkstra's algorithm to compute the shortest paths from $t$ to all the others (lines 4-5). Every node $u$ is associated with another counter recording the distance $dist(u, t)$ if $dist(u, t) \le Rmax$. Then $V$ is computed by selecting the nodes $u \in V(G_D)$ if $dist(s, u) + dist(u, t) \le Rmax$ (line 6). Note that $V_c \subseteq V$ and $C \subseteq V$. The total time complexity for Algorithm 4 is $O(l \cdot (n \cdot \log(n) + m))$.

We explain Algorithm 4 using the database graph $G_D$ in Fig. 4 for the 3-keyword query, $k_1 = a$, $k_2 = b$, and $k_3 = c$, with $Rmax = 8$. Let $R(V, E)$ be the community for a core $C = [v_3, v_8, v_{[\ldots]}]$. Here, $V = V_c \cup V_l \cup V_p$. $V_l$ is the given set

of knodes, $C$, $V_c = \{v_{[\ldots]}, v_{[\ldots]}\}$, and $V_p = \{v_{[\ldots]0}\}$. The set of edges, $E$, can be easily identified by scanning $E(G_D)$. Fig. 7 shows the community found, where $s$ and $t$ are two virtual nodes used in GetCommunity().

B. The Time/Space Complexity

Theorem IV.1: Algorithm 1 enumerates communities in polynomial delay time, $O(l \cdot (n \cdot \log(n) + m))$, with the space complexity of $O(l \cdot n + m)$.

Proof Sketch: The time complexity to get a community from a core (line 7) is $O(l \cdot (n \cdot \log(n) + m))$, so we only need to prove that the complexity to get the next core (line 9) is $O(l \cdot (n \cdot \log(n) + m))$. Lines [...] invoke Neighbor() $l$ times, which costs $O(l \cdot (n \cdot \log(n) + m))$. Lines [...]3-[...]0 loop at most $l$ times. In each iteration, we invoke Neighbor() 2 times, which costs $O(n \cdot \log(n) + m)$, and BestCore() 1 time. Note that BestCore() is $O(n)$. So the total time complexity for Next() becomes $O(l \cdot (n \cdot \log(n) + m))$ and the total time complexity for Algorithm 1 to enumerate each answer is also $O(l \cdot (n \cdot \log(n) + m))$. For the space complexity, we need to record $l$ values for each center $v$, the best core centered at $v$, using space $O(n \cdot l)$, and load the database graph, $G_D$, using space $O(n + m)$. All $S_i$ and $N_i$ ($1 \le i \le l$) cost space $O(n \cdot l)$. The total space complexity is $O(n \cdot l + m)$.

V. FIND TOP-K COMMUNITIES

The algorithm for COMM-k is shown in Algorithm 5. It takes four inputs, the database graph $G_D$, the set of keywords $\{k_1, k_2, \ldots, k_l\}$, the radius $Rmax$, and an integer $k > 0$, and outputs the top-$k$ communities. We first compute the set of nodes, $S_i$, that contain the keyword $k_i$, and compute the neighborset for $S_i$, for $1 \le i \le l$ (lines 1-3). Then we compute the first and the best core, $C$ (line 4). In order to find the top-$k$ communities, we use a data structure, called can-list, to maintain a list of core candidates among which the top-$k$ cores and their communities can be identified. The maximum size of the pool is $l \cdot k$ at most, which we will explain later in detail. A candidate core is kept in a 4-element tuple, called can-tuple, in the form of $(C, cost, pos, prev)$. Here, $C$ is the core of a community, and $cost$ is the minimum total weight from the nearest center, denoted $u$, of $C$, to every node $c_i \in C$, for $1 \le i \le l$, that is, $\sum_{i=1}^{l} dist(u, c_i)$.
Algorithm 5 COMM-k($G_D$, $\{k_1, k_2, \ldots, k_l\}$, $k$, $Rmax$)
Input: the database graph ($G_D$), the set of keywords $\{k_1, k_2, \ldots, k_l\}$, radius $Rmax$, and $k > 0$.
Output: the top $k$ communities.
1: for $i = 1$ to $l$ do
2:     $S_i \leftarrow$ the set of nodes in $G_D$ containing $k_i$;
3:     $N_i \leftarrow$ Neighbor($G_D$, $S_i$, $Rmax$);
4: $(C, cost) \leftarrow$ BestCore($N_1, N_2, \ldots, N_l$);
5: $H \leftarrow \emptyset$;
6: $H$.enheap($C$, $cost$, $1$, $\emptyset$);
7: while $H \ne \emptyset$ do
8:     $g \leftarrow H$.deheap(); {$g = (C, cost, pos, prev)$}
9:     $G \leftarrow$ GetCommunity($g.C$);
10:    output $G$;
11:    $k \leftarrow k - 1$;
12:    if $k = 0$ then
13:        return;
14:    Next($g$);
15: Procedure Next($g$)
16: for $i = 1$ to $l$ do
17:     $N_i \leftarrow$ Neighbor($\{g.C[i]\}$);
18:     $S_i \leftarrow$ the set of nodes in $G_D$ containing $k_i$;
19: $h \leftarrow g$;
20: while $h \ne \emptyset$ do
21:     $i \leftarrow h.pos$;
22:     $S_i \leftarrow S_i - \{h.C[i]\}$;
23:     $h \leftarrow h.prev$;
24: for $i = l$ downto $g.pos$ do
25:     $S_i \leftarrow S_i - \{g.C[i]\}$;
26:     $N_i \leftarrow$ Neighbor($S_i$);
27:     $(C', cost') \leftarrow$ BestCore($N_1, N_2, \ldots, N_l$);
28:     if $C' \ne \emptyset$ then
29:         $H$.enheap($C'$, $cost'$, $i$, $g$);
30:     $S_i \leftarrow S_i \cup \{g.C[i]\}$;
31:     $N_i \leftarrow$ Neighbor($S_i$);

The $prev$ points to its previous candidate in the can-list. We explain $pos$ using an example. Consider two can-tuples, $x = (C, cost, 1, \emptyset)$ and $x' = (C', cost', i, x)$, and suppose $C = [c_1, c_2, \ldots, c_i, \ldots, c_l]$ and $C' = [c'_1, c'_2, \ldots, c'_i, \ldots, c'_l]$. The $prev$ in the can-tuple $x'$ points to the can-tuple $x$. The position, $pos = i$, in $x'$ means that, by comparing $C$ and $C'$ kept in the two candidates, $c_j = c'_j$ if $j < i$, $c_i \ne c'_i$, and $c_j$ and $c'_j$ may or may not be the same if $j > i$.

Over the can-list, we use a Fibonacci heap, denoted $H$, which is initialized to be empty (line 5). In the following, when we enheap a can-tuple to $H$, we mean to insert it into the can-list, and then keep a pointer in $H$ pointing to the can-tuple on the can-list. When we deheap a can-tuple from $H$, we simply remove the pointer from $H$, but still maintain the can-tuple on the can-list. The complexity for $H$.enheap() and $H$.deheap() is $O(1)$ and $O(\log n)$, respectively. The algorithm first enheaps a can-tuple for the first found core $C$ with its cost (line 6). Because it is the first can-tuple to be maintained in the can-list, its $prev = \emptyset$, and its $pos = 1$. The following while loop repeats while $H$ is non-empty (lines 7-14). When $H \ne \emptyset$, it deheaps $H$ and assigns the result to a can-tuple, $g$ (line 8). Because $H$ is maintained in ascending order, the can-tuple $g$ has the smallest cost among those in $H$.
It then calls GetCommunity() to output the community for the core $g.C$ (line 9). Then, it decreases $k$ by 1, and checks whether it has already output $k$ communities (lines 11-13). If $k \ne 0$, it adds more can-tuples into $H$ by calling the procedure Next() (line 14). The procedure Next() does three main things. First, in a preparation phase, it computes the neighborset for every node in the core of the deheaped can-tuple $g$, $g.C[i]$, for $1 \le i \le l$, and it also recomputes $S_i$ so as to include all nodes in $G_D$ that contain the keyword $k_i$. Second, it removes those candidates that have been considered before (lines 20-23). This process limits the search space to a specific subspace out of $l$ subspaces. Third, it adopts the similar idea used in Algorithm 1 to find the next $l$ candidates, and enheaps them into $H$.
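The bookkeeping described above (a pool that keeps every can-tuple, a heap that only holds pointers, and the walk over $prev$ in lines 19-23 of Algorithm 5) can be sketched as follows. This is a minimal illustration under stated assumptions: Python's `heapq` binary heap is swapped in for the Fibonacci heap, so `enheap` here is $O(\log n)$ rather than $O(1)$; the class names `CanTuple` and `CanList`, the helper `chain_exclusions`, and the node labels in the usage note are hypothetical.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CanTuple:
    core: List[str]
    cost: float
    pos: int                      # 1-based position, as in the text
    prev: Optional["CanTuple"]    # None plays the role of the empty pointer

class CanList:
    """Pool of can-tuples plus a heap of pointers into the pool."""

    def __init__(self):
        self.pool = []                 # every can-tuple ever created stays here
        self._heap = []                # entries: (cost, tie-breaker, pool index)
        self._ids = itertools.count()  # tie-breaker so tuples are never compared

    def enheap(self, core, cost, pos, prev):
        item = CanTuple(core, cost, pos, prev)
        self.pool.append(item)
        heapq.heappush(self._heap, (cost, next(self._ids), len(self.pool) - 1))
        return item

    def deheap(self):
        # Only the pointer leaves the heap; the can-tuple remains in the pool.
        _, _, index = heapq.heappop(self._heap)
        return self.pool[index]

    def __bool__(self):
        return bool(self._heap)

def chain_exclusions(g, l):
    """Follow prev pointers as in lines 19-23: for each can-tuple h on the
    chain, mark h.core[h.pos] for removal from S_{h.pos}."""
    removed = [set() for _ in range(l)]
    h = g
    while h is not None:
        removed[h.pos - 1].add(h.core[h.pos - 1])
        h = h.prev
    return removed
```

For example, after enheaping a root tuple with cost 7.0 and two children with costs 14.0 and 10.0, two calls to `deheap()` return the root and then the cost-10.0 child, while `pool` still holds all three tuples, so the $prev$ pointers of later candidates remain valid.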

Fig. 8. Finding top-k communities. Panels: (a) initial, (b) 1st iteration, (c) 2nd iteration, (d) 3rd iteration, (e) 4th iteration, (f) 5th iteration. Each panel shows the heap and the pool with columns Core, Cost, Pos, Prev.

Fig. 8 shows the pool (can-list) and heap $H$ when finding the top-5 communities in Example [...]. The 5 communities, $R_i$, for $1 \le i \le 5$, are listed in Fig. 5.

Theorem V.1: The time complexity for Algorithm 5 is $O(l \cdot (n \cdot \log(n) + m))$ using $O(l^2 \cdot k + l \cdot n + m)$ space.

Proof Sketch: For the time complexity, we only need to prove that the heap operations (line 8 and line 29) do not have an impact on the complexity of the algorithm and that lines 20-23 are done in complexity $O(l \cdot n)$. The complexity for the other parts is the same as in Algorithm 1. In Algorithm 5, in every iteration, we deheap a can-tuple from $H$ and enheap at most $l$ can-tuples into $H$ in order to get the next highest ranked community. Suppose we have output $p$ communities already. There are at most $p \cdot l$ can-tuples in $H$. Note that in total $p \le n^l$. Using the Fibonacci heap, the deheap() costs $O(\log(p \cdot l)) \le O(\log(n^l \cdot l)) = O(l \cdot \log(n) + \log(l)) \le O(l \cdot n)$ time. The enheap() only costs $O(1)$ time. Therefore, the heap operations do not affect the time complexity of the algorithm. Lines 20-23 remove some already used nodes from $S_i$, whose time cost is at most $\sum_{i=1}^{l} |S_i| = O(n \cdot l)$. It does not affect the time complexity of Algorithm 5. The time complexity to get each community is $O(l \cdot (n \cdot \log(n) + m))$.
For the space complexity, we need to maintain up to $p \cdot l$ can-tuples in the pool, where $p$ is the number of currently output communities. The space complexity for the other parts is the same as in Algorithm 1. For each generated candidate, we have to record its core using space $O(l)$, so the total space to record all the generated cores is $O(l^2 \cdot p)$, and the total space complexity for Algorithm 5 to generate the $k$-th best answer is $O(l^2 \cdot k + l \cdot n + m)$.

VI. INDEXING AND GRAPH PROJECTION

As stated before, the time complexity for finding all/top-k communities for a user-given $l$-keyword query against a database graph, $G_D$, with $Rmax$, is polynomial delay of $O(l \cdot (n \cdot \log(n) + m))$. However, when $G_D$ is large in size, it is still costly to process COMM-all/COMM-k queries. In order to reduce the search space, in this section, we introduce an index that can be used to project a small graph $G_P \subseteq G_D$ to support $l$-keyword queries with a radius up to $R$, which is the largest $Rmax$ users can use. In brief, the result for a given $l$-keyword query against $G_D$ is the same as the result for the same query against the projected database graph $G_P$. For an $l$-keyword query, the projected graph $G_P$ can be much smaller than $G_D$.

Algorithm 6 GraphProjection($\{k_1, k_2, \ldots, k_l\}$, $Rmax$)
Input: the set of keywords $\{k_1, k_2, \ldots, k_l\}$, radius $Rmax$.
Output: a projected graph $G_P$ ($\subseteq G_D$).
1: $V' \leftarrow \emptyset$; $E' \leftarrow \emptyset$; $V_c \leftarrow \emptyset$; $W \leftarrow \emptyset$;
2: for $i = 1$ to $l$ do
3:     $W_i \leftarrow$ getNode(invertedN, $k_i$);
4:     $E_i \leftarrow$ getEdge(invertedE, $k_i$);
5:     $V_i \leftarrow W_i \cup \{u \mid (u, v) \in E_i \vee (v, u) \in E_i\}$;
6:     $W \leftarrow W \cup W_i$;
7:     $E' \leftarrow E' \cup E_i$;
8:     $V' \leftarrow V' \cup V_i$;
9:     $V_c \leftarrow V_i$ if $i = 1$, otherwise $V_c \leftarrow V_c \cap V_i$;
10: let $G_s(V', E')$ be a virtual graph with a new virtual node $s$, and a set of new edges $(s, v)$, for $v \in V_c$, where $w_e((s, v)) = 0$;
11: compute the shortest paths from $s$ to others over $G_s$;
12: let $G_t(V', E')$ be a virtual graph with a new virtual node $t$, and a set of new edges $(v, t)$, for $v \in W$, where $w_e((v, t)) = 0$;
13: compute the shortest paths from $t$ to others, on $G_t$, by virtually considering $(u, v) \in E'$ as $(v, u)$ (reverse the order);
14: $V_P \leftarrow \{v \mid v \in V' \wedge dist(s, v) + dist(v, t) \le Rmax\}$;
15: $E_P \leftarrow \{(u, v) \mid u \in V_P \wedge v \in V_P \wedge (u, v) \in E'\}$;
16: return $G_P(V_P, E_P)$;
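The node selection rule $dist(s, v) + dist(v, t) \le Rmax$, used in line 6 of Algorithm 4 and line 14 of Algorithm 6, can be illustrated with a short Python sketch. It is a sketch under assumptions, not the authors' code: the virtual nodes $s$ and $t$ with zero-weight edges are modeled by starting Dijkstra from all centers (respectively all keyword nodes) at distance 0, the graph is given as a list of `(u, v, weight)` triples, and the names `dijkstra` and `select_nodes` are hypothetical.

```python
import heapq
import math

def dijkstra(adj, sources, rmax):
    """Multi-source Dijkstra that keeps only nodes within rmax of a source."""
    dist = {s: 0.0 for s in sources}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in adj.get(u, ()):
            nd = d + w
            if nd <= rmax and nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def select_nodes(edges, centers, knodes, rmax):
    """Return the nodes u with dist(s, u) + dist(u, t) <= rmax."""
    fwd, rev = {}, {}
    for u, v, w in edges:
        fwd.setdefault(u, []).append((v, w))
        rev.setdefault(v, []).append((u, w))   # reversed edge for the t-side run
    from_s = dijkstra(fwd, centers, rmax)      # dist(s, u): s joins all centers
    to_t = dijkstra(rev, knodes, rmax)         # dist(u, t): t joins all knodes
    return {u for u in from_s
            if u in to_t and from_s[u] + to_t[u] <= rmax}
```

Only two shortest-path runs are needed regardless of how many centers or keyword nodes there are, which is the point of adding the virtual source and sink.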
We use two inverted indexes, invertedN and invertedE. For each keyword $w$ in the database graph $G_D$, invertedN maintains an inverted list to store the set of nodes, denoted $V_w$, where every node $v \in V_w$ contains the keyword $w$, and invertedE maintains the set of edges, $(u, v)$, such that both nodes $u$ and $v$ are within $R$ from at least one node in $V_w$. The node/edge weights are kept with the nodes and the edges in the two inverted indexes. Next, we show how to use the two inverted indexes to project a database graph for an $l$-keyword query. Note that with the two inverted indexes, we do not need to use the underlying graph $G_D$, or in other words, the entire $G_D$ can be constructed using the two inverted indexes.

The algorithm to project a subgraph of $G_D$ for an $l$-keyword query within $Rmax$ is shown in Algorithm 6. The main idea is the same as finding a community for a given core, as illustrated in Fig. 7 for a given set of cnodes ($v_{[\ldots]}$ and $v_{[\ldots]}$) and a given set of knodes ($v_3$, $v_8$, and $v_{[\ldots]}$). When projecting a graph $G_P$, in Algorithm 6, the set of cnodes becomes $V_c$ and the set of knodes becomes $W$. As shown in Algorithm 6, in a for loop (lines 2-9), for every keyword $k_i$, it obtains the set of nodes that contain $k_i$, $W_i$, using getNode(invertedN, $k_i$) (line 3); and it obtains the set of edges, $E_i$, using getEdge(invertedE, $k_i$),

where both ends of an edge are reachable from a node in $W_i$ (line 4). Note that $V_i$ is the neighborset of $W_i$ (line 5). After the for loop, it can project a subgraph $G'(V', E')$ of $G_D$ to answer the given $l$-keyword query if $Rmax \le R$. But, it is still considered large. Note that in the for loop, we also compute the set of centers, $V_c$ ($\subseteq V'$), where every node $v \in V_c$ can reach at least a node $v_i$ ($\in W_i$) which contains the keyword $k_i$. Based on the set of centers $V_c$, a smaller graph $G_P$ is constructed using the similar ideas given in Algorithm 4. We omit the discussions due to the limit of space.

Parameter   Range                               Default
KWF         .0003, .0006, .0009, .0012, [...]   [...]
l           2, 3, 4, 5, 6                       4
Rmax        4, 5, 6, 7, 8                       6
k           50, 100, 150, 200, [...]            [...]
TABLE II. PARAMETERS FOR DBLP DATASET

KWF     Keywords
.0003   scalable, protocols, distance, discovery
.0006   space, graph, routing, scheme
.0009   environment, database, support, development, optimization, fuzzy
.0012   dynamic, application, modeling, logic
.0015   web, parallel, control, algorithms
TABLE III. KWF AND THE KEYWORDS USED IN DBLP

Parameter   Range                               Default
KWF         .0003, .0006, .0009, .0012, [...]   [...]
l           2, 3, 4, 5, 6                       4
Rmax        9, 10, 11, 12, 13                   [...]
k           50, 100, 150, 200, [...]            [...]
TABLE IV. PARAMETERS FOR IMDB DATASET

KWF     Keywords
.0003   summer, bride, game, dream
.0006   Friday, heaven, street, party
.0009   star, death, [...]ll, girl, lost, blood
.0012   city, American, blue, world
.0015   night, story, king, house
TABLE V. KWF AND THE KEYWORDS USED IN IMDB

VII. PERFORMANCE STUDIES

We implemented two polynomial delay algorithms, Algorithm 1 and Algorithm 5, to find all/top-k communities. We denote them as PDall and PDk below. We also implemented four expanding-based algorithms. Two are based on bottom-up expanding; we denote them as BUall and BUk, for finding all/top-k communities, respectively. Similarly, two are based on top-down expanding; we denote them as TDall and TDk. All the algorithms were written in C++.

We tested the algorithms using two real datasets, DBLP (DBLP 2008) and IMDB. Both are used in the reported studies to test $l$-keyword queries. We use the same edge-weight function $w_e()$ as used in [], [7], [3], $w_e((u, v)) = \log(1 + N_{in}(v))$, where $N_{in}(v)$ is the in-degree of node $v$.
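As a small worked illustration of the edge-weight function $w_e((u, v)) = \log(1 + N_{in}(v))$, the sketch below computes the weights for a list of directed edges. The base of the logarithm is not recoverable from the text; base 2 is used here purely as an assumption, and the function name `edge_weights` is hypothetical.

```python
import math
from collections import Counter

def edge_weights(edges):
    """Weight each directed edge (u, v) by log(1 + in-degree of v).

    Assumption: base-2 logarithm; the text only states "log".
    """
    in_degree = Counter(v for _, v in edges)
    return {(u, v): math.log2(1 + in_degree[v]) for u, v in edges}
```

With the edges `("a", "c")`, `("b", "c")` and `("c", "d")`, node `c` has in-degree 2 and node `d` has in-degree 1, so both edges into `c` get the same weight, and it is larger than the weight of the edge into `d`. Edges pointing at heavily referenced tuples are therefore more expensive to traverse.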
For DBLP, there are 4 tables: Author(Aid, Name), Paper(Pid, Title, Other), Write(Aid, Pid, Remark), and Cite(Pid1, Pid2). The numbers of tuples of the 4 tables are 597K, 986K, 2,426K, and 112K, respectively. The whole DBLP dataset consists of 4,[...],[...]0 tuples and 5,076,826 references. The database graph consists of 4,[...],[...]0 nodes and 10,153,652 directed edges (bi-directed). The parameters with their default values are shown in Table II. The $l$ keywords are selected from the keyword sets shown in Table III, with the associated KWF (keyword frequency).

For IMDB, there are 3 tables: Users(UserID, Gender, Age, Occupation, Zip-code), Movies(MovieID, Title, Genres), and Ratings(UserID, MovieID, Rating, Timestamp). The numbers of tuples of the 3 tables are 6.04K, 3.88K, and 1,000.2K, respectively. The database graph consists of 1,010,132 nodes and 4,000,836 directed edges, which is denser than the DBLP graph. The parameters used and their default values are shown in Table IV. The keyword sets we used are shown in Table V, with their associated KWF.

The characteristics of the two datasets are different. In the DBLP dataset, each author writes 4.06 papers on average while each paper is written by 2.46 authors on average. In the IMDB dataset, each user evaluates [...] movies on average while each movie is evaluated by [...] users on average. This fact explains why we set the default $Rmax$ to be 6 for DBLP and [...] for IMDB.

All experiments were conducted on a PC with a [...].60GHz Intel(R) Xeon(R) CPU and [...]GB memory running Windows Server 2003. For all algorithms to be tested, we first project a database subgraph, for an $l$-keyword query, using the two inverted indexes (invertedN and invertedE), and then test the algorithms. The maximum and average sizes of the projected graphs are [...]% and 0.4% of the DBLP graph, and [...].8% and 0.5% of the IMDB graph, respectively. We significantly reduce the search space using the two inverted indexes. The elapsed times for constructing the inverted indexes for DBLP and IMDB are 355 seconds and 4 seconds, respectively. The sizes of the total inverted indexes for DBLP and IMDB are [...],6 MB and 84 MB respectively, compared with the sizes of the raw datasets, 445 MB and 4 MB.
For testing algorithms PDall, BUall, and TDall, we report the average-delay, which is the total CPU time divided by the number of communities found, as used in [] for the purpose of testing polynomial delay algorithms. For testing PDk, BUk, and TDk, we report the total CPU time for finding all $k$ communities. We also report the maximum memory used in testing.

Exp-1 (IMDB): We compare PDall with BUall and TDall for finding all communities against IMDB. Fig. 9(a) and 9(b) show that the more frequent the keyword is, the longer the average-delay is, and the more memory it consumes. PDall is [...]0 times faster than the BUall and TDall algorithms, and consumes the least memory. [...] consumes more memory than [...] does,

because each center node is associated with keyword node sets, which are the sets of keyword nodes that can be reached from the center. [...] needs to maintain all these sets whereas [...] can free the memory after it outputs the communities found. When increasing $l$ from 2 to 6, as shown in Fig. 9(c) and 9(d), the average-delay for all algorithms decreases, as expected. PDall is also faster than both BUall and TDall. The memory cost increases using BUall and TDall because, when $l$ increases, the number of resulting communities increases, and both BUall and TDall need to maintain all the resulting communities. PDall consumes the least memory and does not vary much, even when $l$ increases. Fig. 9(e) and 9(f) show that, when $Rmax$ increases, both the average-delay and the memory consumption increase for all three algorithms. PDall performs best among all.

Fig. 9. Find-All (IMDB). CPU (ms) in panels (a) Vary KWF, (c) Vary l, (e) Vary Rmax; memory (bytes) in panels (b) Vary KWF, (d) Vary l, (f) Vary Rmax.

Fig. 10. Find-TopK (IMDB). CPU (sec) for TDk, BUk, and PDk in panels (a) Vary KWF, (b) Vary l, (c) Vary Rmax, (d) Vary k.

Fig. 11. Find-All (DBLP). CPU (ms) in panels (a) Vary KWF, (c) Vary l, (e) Vary Rmax; memory (bytes) in panels (b) Vary KWF, (d) Vary l, (f) Vary Rmax.

Fig. 12. Interactive TopK Test. CPU (sec) for TDk, BUk, and PDk in panels (a) Vary k (IMDB), (b) Vary k (DBLP).

We compare PDk with BUk and TDk for finding top-k communities against IMDB. As shown in Fig. 10(a), when KWF increases, the total time to get the top-k communities increases in most cases for all three algorithms, and PDk performs best. Fig. 10(b) shows that, when $l$ increases, the time for BUk and TDk increases, because the number of temporary results generated increases. PDk is consistent. In Fig. 10(c) and 10(d), when $Rmax$ or $k$ increases, the total time to get the top-k communities increases for all three algorithms. PDk performs best. The memory consumption for all tests is not large and does not change much. Due to the space limit, we do not show the memory consumption.
As an indicator, the memory consumption for the default values of the three algorithms is [...] KB (TDk), [...] KB (BUk), and 9.6 KB (PDk).

Exp-2 (DBLP): We compare PDall with BUall and TDall for finding all communities against DBLP. In Fig. 11(a), (b), (e) and (f), for the memory consumption, PDall performs best, but for the average-delay, PDall is slower than both BUall and TDall because, in the DBLP dataset, the probability for a set of keyword nodes to be centered at multiple nodes is very small, and most of the results have only one center. In this situation, the number of duplications generated by BUall and TDall is very small, which makes them faster at enumerating all communities. When KWF or $Rmax$ increases, the average-delay and memory consumption for all three algorithms increase. Fig. 11(c) shows that, when the number of keywords $l$ increases, the average-delay for all three algorithms decreases. [...] decreases faster. In Fig. 11(d), when $l$ increases, the memory consumption for BUall and TDall increases, because they have to maintain all the results generated, and the number of results increases when $l$ increases. PDall consumes less memory when $l$ becomes larger because, when $l$ increases, the size of the projected graph decreases for the DBLP dataset.

We compare PDk with BUk and TDk for finding top-k communities against DBLP. They show similar trends as they do for finding all communities, for the same reason that the number of duplications is small in DBLP.
