Distance-Join: Pattern Match Query In a Large Graph Database

Lei Zou
Huazhong University of Science and Technology, Wuhan, China
zoulei@mail.hust.edu.cn

Lei Chen
Hong Kong University of Science and Technology, Hong Kong
leichen@cse.ust.hk

M. Tamer Özsu
University of Waterloo, Waterloo, Canada
tozsu@cs.uwaterloo.ca

This work was done when the first author was visiting the University of Waterloo as a visiting scholar. The first author was partially supported by the National Natural Science Foundation of China under Grant 777. The second author was supported by Hong Kong RGC GRF 668 and NSFC/RGC Joint Research Scheme N HKUST6 /8. The third author was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM. VLDB '09, August 24-28, 2009, Lyon, France. Copyright 2009 VLDB Endowment, ACM.

ABSTRACT

The growing popularity of graph databases has generated interesting data management problems, such as subgraph search, shortest-path query, reachability verification, and pattern match. Among these, a pattern match query is more flexible compared to a subgraph search and more informative compared to a shortest-path or reachability query. In this paper, we address pattern match problems over a large data graph G. Specifically, given a pattern graph (i.e., query Q), we want to find all matches (in G) that have the similar connections as those in Q. In order to reduce the search space significantly, we first transform the vertices into points in a vector space via graph embedding techniques, converting a pattern match query into a distance-based multi-way join problem over the converted vector space. We also propose several pruning strategies and a join order selection method to perform join processing efficiently. Extensive experiments on both real and synthetic datasets show that our method outperforms existing ones by orders of magnitude.

1. INTRODUCTION

Graphs have been used to model many data types in different domains, such as social networks, biological networks, and the World Wide Web. In order to conduct effective analysis over graphs, various types of queries have been investigated, such as subgraph search [9, 6, 7,,, 8, 8,,, ], shortest-path query [7,, 6], reachability query [7,,, ], and pattern match query [6, ]. Among these interesting queries, a pattern match query is more flexible than a subgraph search and more informative than a simple shortest-path or reachability query. Specifically, pattern match looks for the existence of a pattern graph in a data graph. A pattern match query is different from subgraph search in that it only specifies the vertex labels and connection constraints between vertices. In other words, a pattern match query emphasizes the connectivity between labeled vertices rather than checking subgraph isomorphism as subgraph search does. In this paper, we discuss an effective and efficient method for executing pattern match queries over a large graph database.

We describe a pattern match query as follows: given a data graph G, a query graph Q (with n vertices), and a parameter $\delta$, n vertices in G can form a match to Q if: (1) these n vertices in G have the same labels as the corresponding vertices in Q; and (2) for any two adjacent vertices $v_i$ and $v_j$ in Q (i.e., there is an edge between $v_i$ and $v_j$ in Q and $1 \le i, j \le n$), the distance between the two corresponding vertices in G is no larger than $\delta$. We need to find all matches of Q in G. In this work, we use the shortest-path distance to measure the distance between two vertices, but our approach is not restricted to this distance function; it can be applied to other metric distance functions as well. We discuss two examples to demonstrate the usefulness of pattern match queries.

Example 1. Facebook Network Analysis. Figure 1(a) shows a fictitious graph model (G) of Facebook, where vertices represent active users and the edges indicate the friendship relations between two users. There are job-title attributes associated with vertices.
We treat job titles as vertex labels. Note that the numbers inside vertices are vertex IDs that we introduce to simplify the description of the graph. A pattern match query, Q (in Figure 1(b)), looks for friendship relations between four types of users, i.e., four types of labels: CFO, CEO, Manager and Doctor, and constraints are set on the shortest-path distance (bounded by $\delta$) between any pair of matched labeled vertices in G. Finding such patterns may help social science researchers discover connections between a successful CEO and his/her circle of friends. In Figure 1(a), vertices (,,6,8) match Q, which indicates that these vertices (in G) have relationships similar to those specified in query Q.

[Figure 1: Pattern Match Query in Facebook Network. (a) Graph G; (b) Query Q.]

Example 2. Biological Network Investigation. We can model a biological network as a large graph, such as a protein-protein interaction (PPI) network or a metabolic network, where vertices represent biological entities (proteins, genes and so on) and edges represent the interactions between them. Consider the following scenario: in order to study a certain disease, a scientist has constructed a small portion of a biological network Q based on various experimental data. The scientist is interested in predicting more biological activities about the disease. So, s/he wants to find matches of Q in a large biological network G about another well-studied disease. The matches in G have the same (or similar) pathways (i.e., shortest paths) as those in Q.

As shown in the above examples, pattern match queries are useful; however, it is non-trivial to find all matches in a large graph due to the huge search space. Given a query Q with n vertices, for each vertex $v_i$ in Q, we first find a list of vertices in the data graph G that have the same label as $v_i$. Then, for each pair of adjacent vertices $v_i$ and $v_j$ in Q, we need to find all matching pairs in G whose distances are no larger than $\delta$. This is called an edge query. To answer an edge query, we need to conduct a distance-based join operation between the two lists of matching vertices corresponding to $v_i$ and $v_j$ in G. Therefore, finding the pattern Q in G is a sequence of distance-based join operations, which is very costly for large graphs. For example, if the query Q has 6 vertices and each query vertex has $m$ matching vertices in G, the search space is $m^6$. Therefore, we need efficient pruning strategies to reduce the search space. Although many effective pruning techniques have been proposed for subgraph search (e.g., [9, 6, 7,,, 8, 8,,, ]), they cannot be applied to pattern match queries since these pruning rules are based on the necessary condition of subgraph isomorphism.

We propose a novel and effective method to reduce the search space significantly. Specifically, we transform vertices into points in a vector space via graph embedding methods, converting a pattern match query into a distance-based multi-way join problem over the vector space.
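The edge query described above can be illustrated with a brute-force sketch. It is not the method proposed in this paper; it only shows the baseline whose cost motivates pruning. The sketch assumes an unweighted, undirected graph stored as an adjacency dictionary, so that breadth-first search gives shortest-path distances; the names `bfs_distances` and `naive_edge_query` are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from `source` in an unweighted, undirected graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def naive_edge_query(adj, labels, label_i, label_j, delta):
    """Return all pairs (u_i, u_j) with the given labels and distance <= delta."""
    list_i = [u for u in adj if labels[u] == label_i]
    list_j = [u for u in adj if labels[u] == label_j]
    result = []
    for u_i in list_i:
        dist = bfs_distances(adj, u_i)  # one full traversal per outer vertex
        for u_j in list_j:
            if u_i != u_j and dist.get(u_j, float("inf")) <= delta:
                result.append((u_i, u_j))
    return result


# Toy path graph 1 - 2 - 3 - 4 with made-up job-title labels.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
labels = {1: "CEO", 2: "CFO", 3: "Manager", 4: "CFO"}
print(naive_edge_query(adj, labels, "CEO", "CFO", 2))  # [(1, 2)]
```

Every pair of label-compatible vertices is examined here, which is exactly the quadratic behavior per edge that the rest of the paper tries to avoid.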
In order to reduce the join cost, we propose several pruning rules to reduce the search space further, and propose a cost model to guide the selection of the join order to process the multi-way join efficiently. To summarize, in this work, we make the following contributions:

1) We propose a general framework for handling pattern match queries over a large graph. Specifically, we map vertices into vectors via an embedding method and conduct a distance-based multi-way join over the vector space.
2) We design an efficient distance-based join algorithm for an edge query in the converted vector space, which utilizes block nested loop join and hash join techniques to handle the high-dimensional vector space.
3) We develop an effective cost model to estimate the cost of each join operation, based on which we can select the most efficient join order to reduce the cost of the multi-way join.
4) Finally, we conduct extensive experiments with real and synthetic data to demonstrate the effectiveness of our solutions to answer pattern match queries.

The rest of the paper is organized as follows. We discuss the related work in Section 2. Our framework is presented in Section 3. In Section 4, we propose a neighbor area pruning technique. We propose a distance-based join algorithm for an edge query and its cost model in Section 5. Section 6 presents a distance-based multi-way join algorithm for pattern match queries and a join order selection method. We study our methods by experiments in Section 7. Section 8 concludes this paper.

2. RELATED WORK

Let $G = \langle V, E \rangle$ be a graph where V is the set of vertices and E is the set of edges. Given two vertices $u_1$ and $u_2$ in G, a reachability query verifies if there exists a path from $u_1$ to $u_2$, and a distance query returns the shortest path distance between $u_1$ and $u_2$ [7]. These are well-studied problems, with a number of vertex labeling-based solutions [7]. A family of labeling techniques has been proposed to answer both reachability and distance queries. A 2-hop labeling method over a large graph G assigns to each vertex $u \in V(G)$ a label $L(u) = (L_{in}(u), L_{out}(u))$, where $L_{in}(u), L_{out}(u) \subseteq V(G)$. Vertices in $L_{in}(u)$ and $L_{out}(u)$ are called centers.
There are two kinds of 2-hop labeling: 2-hop reachability labeling (reachability labeling for short) and 2-hop distance labeling (distance labeling for short). For reachability labeling, given any two vertices $u_1, u_2 \in V(G)$, there is a path from $u_1$ to $u_2$ (denoted as $u_1 \rightarrow u_2$) if and only if $L_{out}(u_1) \cap L_{in}(u_2) \neq \emptyset$. For distance labeling, we can compute $Dist_{sp}(u_1, u_2)$ using the following equation:

$$Dist_{sp}(u_1, u_2) = \min\{Dist_{sp}(u_1, w) + Dist_{sp}(u_2, w) \mid w \in (L_{out}(u_1) \cap L_{in}(u_2))\} \quad (1)$$

where $Dist_{sp}(u_1, u_2)$ is the shortest path distance between vertices $u_1$ and $u_2$. The distances between vertices and centers (i.e., $Dist_{sp}(u_1, w)$ and $Dist_{sp}(u_2, w)$) are pre-computed and stored. The size of a 2-hop labeling is defined as $\sum_{u \in V(G)} (|L_{in}(u)| + |L_{out}(u)|)$, while the size of a 2-hop distance labeling is $O(|V(G)| \cdot |E(G)|^{1/2})$ [6]. Thus, according to Equation 1, we need $O(|E(G)|^{1/2})$ time to compute the shortest path distance by distance labeling, because the average vertex distance label size is $O(|E(G)|^{1/2})$.

To the best of our knowledge, there exists little work on pattern match queries over a large data graph, except for [6, ]. In [6], based on the reachability constraint, the authors propose a pattern match problem over a large directed graph G. Specifically, given a query pattern graph Q (that is a directed graph) that has n vertices, n vertices in G can match Q if and only if these corresponding vertices have the same reachability connections as those specified in Q. This is the most related work to ours, although our constraints are on distance instead of reachability. We call our match distance pattern match, and the match in [6] reachability pattern match. We first illustrate the method in [6] using Figure 2, and then discuss how it can be extended to solve our problem and present the shortcomings of the extension. Without loss of generality, we first assume that there is only one directed edge $e = (v_1, v_2)$ in query Q. Figure 2(a) shows a base table to store all vertex distance labels. For each center $w_i$, two clusters $F(w_i)$ and $T(w_i)$ of vertices are defined, where every vertex $u_1$ in $F(w_i)$ can reach every vertex $u_2$ in $T(w_i)$ via $w_i$.
Then, an index structure is built based on these clusters, as shown in Figure 2(c). For each vertex label pair $(l_1, l_2)$, all centers $w_i$ are stored (in table W-Table), where there exists at least one vertex labeled $l_1$ (and $l_2$) in $F(w_i)$ (and $T(w_i)$). Consider a directed edge $e = (v_1, v_2)$ in query Q and assume that the labels of vertices $v_1$ and $v_2$ (in query Q) are a and b, respectively. According to table W-Table in Figure 2(b), we can find centers $w_i$ in which there exists at least one vertex $u_1$ labeled a in $F(w_i)$, and at least one vertex $u_2$ labeled b in $T(w_i)$. For each such center $w_i$, the Cartesian product of vertices labeled a in $F(w_i)$ and vertices labeled b in $T(w_i)$ can form the matches of Q. This operation is called R-join [6]. In this example, there is only one center that corresponds to vertex label pair (a, b), as shown in Figure 2(b). According to the index structure in Figure 2(c), we can find the clusters F and T of that center. When the number of edges in Q is larger than one, a reachability pattern match query can be answered by a sequence of R-joins.

[Figure 2: R-join. (a) Base Table; (b) W-Table; (c) Cluster-based Index.]

Table 1: Meanings of Symbols Used
  G            data graph            Q      query graph
  V(G)/V(Q)    vertex set of G/Q     v_i    a vertex in Q
  E(G)/E(Q)    edge set of G/Q       u_i    a vertex in G

We can extend the method in [6] to distance pattern match using 2-hop distance labeling instead of reachability labeling. Again, we first assume that there is only one edge $e = (v_1, v_2)$ in query Q. The vertex labels are a and b, respectively. In order to find distance pattern matches, following the framework in [6], we also find all centers $w_i$ in which there exists at least one vertex $u_1$ labeled a in $F(w_i)$ and one vertex $u_2$ labeled b in $T(w_i)$. In the last step, for each vertex pair $(u_1, u_2)$ in the Cartesian product, we need to compute $dist = Dist_{sp}(u_1, w_i) + Dist_{sp}(u_2, w_i)$. If $dist \le \delta$, $(u_1, u_2)$ is a match. Note that this step is different from reachability pattern match in [6], in which no distance computation is needed. Assume that there are $n_1$ vertices labeled a and $n_2$ vertices labeled b in graph G. It is clear that the number of distance computations is at least $n_1 \times n_2$, which is exactly the same as naive join processing.
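The cost argument above can be made concrete with a small sketch of a label-based distance lookup (Equation 1) driving a join over two vertex lists. It assumes each label is a dictionary from center to pre-computed distance; the function names and the toy labels are hypothetical and are not taken from [6].

```python
def two_hop_distance(l_out_u1, l_in_u2):
    """Equation 1: minimum over shared centers w of d(u1, w) + d(u2, w).

    Each label maps a center to its pre-computed distance.
    """
    shared = l_out_u1.keys() & l_in_u2.keys()
    if not shared:
        return float("inf")  # no common center: treated as unreachable
    return min(l_out_u1[w] + l_in_u2[w] for w in shared)


def distance_join_with_labels(list_1, list_2, l_out, l_in, delta):
    """Join two vertex lists, counting how many distance lookups are needed."""
    results, computations = [], 0
    for u1 in list_1:
        for u2 in list_2:
            computations += 1  # one label-based distance computation per pair
            if u1 != u2 and two_hop_distance(l_out[u1], l_in[u2]) <= delta:
                results.append((u1, u2))
    return results, computations


# Toy path graph 1 - 2 - 3 with unit weights; vertex 2 is the only center.
l_out = {1: {2: 1}, 2: {2: 0}, 3: {2: 1}}
l_in = {1: {2: 1}, 2: {2: 0}, 3: {2: 1}}
matches, count = distance_join_with_labels([1], [2, 3], l_out, l_in, 1)
print(matches, count)  # [(1, 2)] 2
```

The counter equals the product of the two list sizes regardless of how many pairs survive, which is the inefficiency the text points out.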
Since a vertex u may exist in different clusters $F(w_i)$ and $T(w_i)$, the computational cost of this straightforward extension is far larger than $|R_1| \times |R_2|$. As discussed in Section 1, the challenge in our distance pattern match problem is the huge search space. Simply extending the method proposed in [6] will not resolve the efficiency issue. Thus, the motivation of our work is exactly this: is it possible to avoid unnecessary distance computations to speed up the search? Several efficient and effective pruning techniques are proposed in this paper. Furthermore, our method is independent of 2-hop graph labeling techniques.

The best-effort algorithm [] returns k matches with large scores. Based on some heuristic rules, the algorithm first finds the most promising match vertex u (in data graph G) for one vertex in query Q (called Seed-Finder). Then, it extends the vertex to match other vertices in Q (called Neighbor-Expander). After that, it finds a good path to connect two matched data vertices if they are required to be connected according to query Q (called Bridge). The query can be repeated with another seed node, until the user receives all k matches that are requested. This algorithm cannot guarantee that the k result matches are the k largest over all matches. We cannot extend this method to our problem, since the algorithm cannot guarantee the completeness of results. In [9], the authors propose ranked twig queries over a large graph; however, a twig pattern is a directed graph, not a general graph.

Besides reachability, distance, and pattern match queries, there is a lot of work on subgraph search over graph databases, such as [9, 6, 7,,, 8, 8,,, ], none of which can be applied to pattern match queries, since all these pruning techniques are based on the necessary condition of subgraph isomorphism.

3. FRAMEWORK

In this section, we give the formal definition of pattern match queries over a graph and present the general framework of our proposed solution. As discussed in Section 1, in this work, we study search over a large vertex-labeled and edge-weighted undirected graph.
In the following, unless otherwise specified, all uses of the term graph refer to a vertex-labeled and edge-weighted graph. The common symbols used in this paper are given in Table 1.

Definition 3.1. Match. Consider a data graph G, a connected query graph Q that has n vertices $\{v_1, ..., v_n\}$, and a parameter $\delta$. A set of n distinct vertices $\{u_1, ..., u_n\}$ in G is said to be a match of Q if and only if the following conditions hold:
1) $L(u_i) = L(v_i)$, where $L(u_i)$ ($L(v_i)$) denotes $u_i$'s ($v_i$'s) label; and
2) If there is an edge between $v_i$ and $v_j$ in Q, the shortest path distance between $u_i$ and $u_j$ in G is no larger than $\delta$, that is, $Dist_{sp}(u_i, u_j) \le \delta$.
Given an edge $(v_i, v_j)$ in Q and its match $(u_i, u_j)$, the shortest path between $u_i$ and $u_j$ in G is said to be a match path of the edge $(v_i, v_j)$ in Q.

Definition 3.2. Pattern Match Query. Given a large data graph G, a connected query graph Q with n vertices $\{v_1, ..., v_n\}$, and a parameter $\delta$, a pattern match query reports all matches of Q in G according to Definition 3.1.

According to Definition 3.1, any match is always contained in some connected component of G, since Q is connected. Without loss of generality, we assume that G is connected. If not, we can sequentially perform the pattern match query in each connected component of G to find all matches.

One way of executing the pattern match query (which we call naive join processing) is the following. Given a pattern match query Q that has n vertices, according to the vertex label predicates associated with each vertex $v_i$, we first obtain n lists of vertices, $R_1, ..., R_n$, where each list $R_i$ contains all vertices $u_i$ whose labels are the same as $v_i$'s label. We say list $R_i$ corresponds to vertex $v_i$ in Q. Then, we need to perform a shortest path distance-based multi-way join over these lists. To complete this task, we need to define a join order. In fact, a join order in our problem corresponds to a traversal order in Q. In each traversal step, the subgraph induced by all visited edges (in Q) is denoted as $Q'$. We can find all matches of $Q'$ in each step. Figure 3 shows a join order (i.e., traversal order in Q) of a sample query Q. In the first step, there is only one edge in $Q'$; thus, the pattern match query degrades into an edge query. After the first step, we still need to answer an edge query for each newly encountered edge. It is clear that different join orders will lead to different performance.

[Figure 3: A Join Order (a sample query Q and one traversal order of its edges).]

As in left-deep join processing in relational systems, we always perform a shortest path distance-based two-way join to answer an edge query. We call this two-way join Distance-Join (D-join for short), which is expressed by Equation 2, in which $R_1$ and $R_2$ are two lists of vertices in graph G, and $u_1$ and $u_2$ are two vertices in the two lists, respectively.

$$RS = R_1 \bowtie_{Dist_{sp}(u_1, u_2) \le \delta} R_2 \quad (2)$$

According to Definition 3.1, we have to perform shortest path distance computation online. The straightforward solution to reduce the cost is to pre-compute and store all pairwise shortest path distances (the Pre-compute method). The method is fast but prohibitive in space usage (it needs $O(|V(G)|^2)$ space). The graph labeling technique enables the computation of a shortest path distance in $O(|E(G)|^{1/2})$ time, while the space cost is only $O(|V(G)| \cdot |E(G)|^{1/2})$ []. Thus, we adopt the graph labeling technique instead of the Pre-compute method to perform shortest-path distance computation. The key problem in naive join processing is its large number of distance computations, which is $|R_1| \times |R_2|$.
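Naive join processing as described above can be sketched as a sequence of edge queries that extend partial matches along a join order. This is an illustrative baseline under stated assumptions, not the optimized D-join of the paper: exact distances come from Dijkstra's algorithm rather than graph labeling, and the data layout (`adj` as lists of `(neighbor, weight)` pairs, `query_edges` as an ordered edge list) and function names are hypothetical.

```python
import heapq


def dijkstra(adj, source):
    """Shortest path distances from `source` in a non-negatively weighted graph."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for w, weight in adj[u]:
            nd = d + weight
            if nd < dist.get(w, float("inf")):
                dist[w] = nd
                heapq.heappush(heap, (nd, w))
    return dist


def naive_pattern_match(adj, labels, query_labels, query_edges, delta):
    """Extend partial matches one query edge at a time, in the given join order."""
    cache = {}

    def dist(u1, u2):
        if u1 not in cache:
            cache[u1] = dijkstra(adj, u1)
        return cache[u1].get(u2, float("inf"))

    # One candidate list R_i per query vertex, selected by label.
    lists = {v: [u for u in adj if labels[u] == lab]
             for v, lab in query_labels.items()}
    partial = [{}]  # each partial match maps query vertex -> data vertex
    for v_i, v_j in query_edges:
        extended = []
        for match in partial:
            cands_i = [match[v_i]] if v_i in match else lists[v_i]
            cands_j = [match[v_j]] if v_j in match else lists[v_j]
            for u_i in cands_i:
                for u_j in cands_j:
                    new = {**match, v_i: u_i, v_j: u_j}
                    if len(set(new.values())) < len(new):
                        continue  # matched data vertices must be distinct
                    if dist(u_i, u_j) <= delta:
                        extended.append(new)
        partial = extended
    return partial


# Toy weighted graph: 1 -(1)- 2 -(1)- 3 -(5)- 4, with made-up labels.
adj = {1: [(2, 1.0)], 2: [(1, 1.0), (3, 1.0)],
       3: [(2, 1.0), (4, 5.0)], 4: [(3, 5.0)]}
labels = {1: "a", 2: "b", 3: "c", 4: "b"}
query_labels = {"x": "a", "y": "b", "z": "c"}
query_edges = [("x", "y"), ("y", "z")]  # a join order (traversal of Q)
print(naive_pattern_match(adj, labels, query_labels, query_edges, 2))
```

Any edge order yields the same final matches; the order only changes how many partial matches are carried between steps, which is why join order selection matters.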
In order to speed up the query performance, we need to address two issues: how to reduce the number of distance computations; and how to find all candidate matches with a distance computation that is more efficient than shortest path distance computation. In order to address these issues, we utilize the LLR embedding technique [7, 8] to map all vertices in G into points in a vector space $R^k$, where k is the dimensionality of $R^k$. We then compute the $L_\infty$ distance between the points in $R^k$, since it is much cheaper to compute and it is a lower bound of the shortest path distance between the two corresponding vertices in G (see Theorem 3.1). Thus, we can utilize the $L_\infty$ distance in vector space $R^k$ to find candidate matches. We also propose several pruning techniques based on the properties of the $L_\infty$ distance to reduce the number of distance computations in join processing. Furthermore, we propose a novel cost model to guide the join order selection. Note that we do not propose a general method for distance-join (also termed similarity join) in vector space [, ]; we focus on the $L_\infty$ distance in the converted space simply because we use the $L_\infty$ distance to find candidate matches.

[Figure 4: Framework of Pattern Match Query, with offline steps (LLR embedding, clustering, 2-hop distance labeling) and online steps (neighbor area pruning, cost estimation and join order selection, edge queries by block nested loop join).]

Figure 4 depicts the general framework to answer a pattern match query. We first use LLR embedding to map all vertices into points in vector space $R^k$. We adopt the k-medoids algorithm [] to group all points into different clusters. Then, for each cluster, we map all points u (in this cluster) into a 1-dimensional block. According to the Hilbert curve in $R^k$ space, we can define a total order for all clusters. According to this total order, we link all blocks to form a flat file. We also compute a graph distance label for each vertex to enable fast shortest path distance computation [7, ].
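The idea of using a cheap lower bound to find candidates before exact verification can be sketched as follows. This is a minimal illustration, not the paper's implementation: each coordinate is the distance from a vertex to the closest member of a reference subset, which by the triangle inequality never overestimates the true distance under $L_\infty$. The subsets, the `exact_dist` callback, and all function names are assumptions made for the example.

```python
def embed(vertex, subsets, exact_dist):
    """One coordinate per subset: distance to the closest member of that subset."""
    return [min(exact_dist(vertex, s) for s in subset) for subset in subsets]


def l_inf(p, q):
    """L-infinity (Chebyshev) distance between two equal-length vectors."""
    return max(abs(a - b) for a, b in zip(p, q))


def filter_and_refine(list_1, list_2, points, exact_dist, delta):
    """Filter pairs with the cheap lower bound, then verify with exact distances."""
    candidates = [(u1, u2) for u1 in list_1 for u2 in list_2
                  if u1 != u2 and l_inf(points[u1], points[u2]) <= delta]
    answers = [(u1, u2) for (u1, u2) in candidates
               if exact_dist(u1, u2) <= delta]
    return candidates, answers


# Toy path graph 1 - 2 - 3 - 4 - 5 with unit weights, so d(a, b) = |a - b|.
def exact_dist(a, b):
    return abs(a - b)


subsets = [[3]]  # a single, deliberately weak reference subset
points = {u: embed(u, subsets, exact_dist) for u in range(1, 6)}
candidates, answers = filter_and_refine([1, 2], [4, 5], points, exact_dist, 2)
print(len(candidates), answers)  # 4 [(2, 4)]
```

With only one reference subset the filter lets through false positives, which the exact check removes; no true match is lost because the filter distance is a lower bound. Adding subsets, for example `[[1], [3]]`, tightens the filter.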
When a query Q is received, according to the join order selection algorithm, we find the cheapest query plan (i.e., join order). As discussed above, a join order corresponds to a traversal order in query Q. At each step, we perform an edge query for the newly introduced edge. During edge query processing, we first use the $L_\infty$ distance to obtain all candidate matches (Definition 5.1); then, we compute the shortest path distance for each candidate match to fix the final results. Join processing is iterated until all edges in Q are visited.

According to the LLR embedding technique [7, 8], we have the following embedding process to map all vertices in G into points in a vector space $R^k$, where k is the dimensionality of the vector space:

1) Let $S_{n,m}$ be a subset of randomly selected vertices in V(G). We define

$$D(u, S_{n,m}) = \min_{u' \in S_{n,m}} \{Dist_{sp}(u, u')\} \quad (3)$$

that is, $D(u, S_{n,m})$ is the distance from u to its closest neighbor in $S_{n,m}$.
2) We select $k = O(\log^2 |V(G)|)$ subsets to form the set $R = \{S_{1,1}, ..., S_{1,\kappa}, ..., S_{\beta,1}, ..., S_{\beta,\kappa}\}$, where $\kappa = O(\log |V(G)|)$, $\beta = O(\log |V(G)|)$, and $k = \kappa\beta = O(\log^2 |V(G)|)$. Each subset $S_{n,m}$ ($1 \le n \le \beta$, $1 \le m \le \kappa$) in R has $2^n$ vertices in V(G).
3) The mapping function $E : V(G) \rightarrow R^k$ is defined as follows:

$$E(u) = [D(u, S_{1,1}), ..., D(u, S_{1,\kappa}), ..., D(u, S_{\beta,1}), ..., D(u, S_{\beta,\kappa})] \quad (4)$$

where $\beta\kappa = k$.

In the converted vector space $R^k$, we use the $L_\infty$ metric as the distance function, which is defined as follows:

$$L_\infty(E(u_1), E(u_2)) = \max_{n,m} |D(u_1, S_{n,m}) - D(u_2, S_{n,m})| \quad (5)$$

where $D(u, S_{n,m})$ is defined in Equation 3, and $E(u_1)$ is the corresponding point (in $R^k$ space) with regard to the vertex $u_1$ in graph G. For notational simplicity, we also use $u_1$ to denote the point in $R^k$ space when the context is clear. Theorem 3.1 establishes the $L_\infty$ distance over $R^k$ as a lower bound of the shortest path distance over G.

Theorem 3.1. [8] Given two vertices $u_1$ and $u_2$ in G, the $L_\infty$ distance between the two corresponding points in the converted vector space $R^k$ is a lower bound of the shortest path distance between $u_1$ and $u_2$; that is,

$$L_\infty(E(u_1), E(u_2)) \le Dist_{sp}(u_1, u_2) \quad (6)$$

Note that the shortest path distance and the $L_\infty$ distance are both metric distances []; thus, they satisfy the triangle inequality.

4. NEIGHBOR AREA PRUNING

As a result of LLR embedding, all vertices in G have been mapped into points in $R^k$. We use a relational table $T(ID, I_1, ..., I_k, L)$ to store all points in $R^k$. The first column ID is the vertex ID, columns $I_1, ..., I_k$ are the k dimensions of a mapped point in $R^k$, and the last column L denotes the vertex label. To answer a pattern match query, we conduct the multi-way join over the converted vector space, not the original graph space. Similarly, each D-join step is conducted over the vector space as well. Thus, to reduce the cost of the multi-way join, the first step is to remove all the points that do not qualify for D-join (i.e., they do not satisfy the join condition in Equation 2) as early as possible. In this section, we propose an efficient pruning strategy called neighbor area pruning.

[Figure 5: Neighbor Area Pruning. (a) Shortest path distances in graph G; (b) Query Q.]

We first illustrate the rationale behind neighbor area pruning using Figure 5. Consider the query Q in Figure 5. If a vertex $u_1$ in G (with the same label as $v_1$) can match $v_1$ in Q according to Definition 3.1, there must exist another vertex $u_2$ labeled b in G with $Dist_{sp}(u_1, u_2) \le \delta$, since $v_1$ has a neighbor vertex labeled b in query Q. In Figure 5, one such candidate vertex has no vertex labeled b within distance $\delta$; thus, it can be pruned safely. Vertex $u_6$ is, by its label, a candidate match to a vertex in query Q.
Although there exists vertex u lbeled, where Dist sp(u 6, u ) <, pruning vertex u in the lst step will led to pruning u 6 s well. In other words, neighbor re pruning is n itertive step, until onvergene is rehed (i.e., no verties in eh list n be further pruned). As result of LLR embedding, ll verties in G hve been mpped into points in R k. Therefore, we wnt to ondut neighbor re pruning over the onverted spe. Sine L distne is the lower bound for the shortest pth distne, for vertex u in Figure, if there exists no vertex u lbeled with b where L (u, u ), u n lso be pruned sfely. However, it is ineffiient to hek eh vertex one-by-one. Therefore, we propose the neighbor re pruning to redue the serh spe in R k. v Definition.. Given vertex v i in query Q nd its orresponding list R i in dt grph G, for point u i in R i, we define vertex neighbor re to be Are(u i) = ([(u i.i, u i.i +),..., (u i.i k, u i.i k +)]), where u i is point in R k spe. The list neighbor re of R i is defined s Are(R i) = u i R i Are(u i). Definition.. Given list R i nd vertex u j, u j Are(R i), if nd only if, for ny dimension I n, u j.i n Are(R i).i n, where Are(R i).i n is the nth dimension of Are(R i). Theorem.. Consider vertex v i in query Q nd ssume tht v i hs m neighbor verties v j (i.e. (v i, v j) is n edge), j =,..., m, nd for eh vertex v j, its orresponding list is R j in G. If j, u i / Are(R j), u i n be sfely pruned from the list R i. Proof. (sketh) If j, u i / Are(R j), there is no vertex u j lbeled s the sme s v j, where L (u i, u j). Algorithm Neighbor Are Pruning Require: Input: Query Q tht hs n verties v i ; nd eh v i hs orresponding list R i. Output n lists R i fter pruning. : while numloop < MAXNUM do : for eh list R i do : Sn R i to find Are(R i ). : for eh list R i do : Sn R i to filter out flse positives by Are(R j ), where v j is neighbor vertex w.r.t v i. 
6: if ll list R i hs not been hnge in this loop then 7: Brek Bsed on Theorem., Algorithm lists the steps to perform pruning on eh list R i. Notie tht, s disussed bove, the pruning proess is itertive. Lines - re repeted until either the onvergene is rehed (Lines 6-7), or itertion step exeeds the mximl itertion steps (Line ). The totl time omplexity of Algorithm is O( i Ri ). In the worst se, D-join proessing needs O( i Ri ). Thus, it is desirble to perform neighbor re pruning before join proessing.. EDGE QUERY PROCESSING After neighbor re pruning, we obtin n shrunk lists, R,..., R n, eh orresponding to vertex v i in query Q. Aording to the frmework in Figure, t eh step, we need to nswer n edge query. In this setion, we propose n effiient D-join edge query lgorithm. We first use L distne in the onverted vetor spe R k to find ndidte mth set CL (Definition.): CL = R R L (u,u ). Eh ndidte mth in G is pir of verties (u i, u j) (i j), where L (u i, u j). Then, for eh ndidte (u i, u j), we utilize grph lbeling tehnique to obtin the ext shortest pth distne Dist sp(u i, u j) [7, ]. All pirs (u i, u j) where Dist sp(u i, u j) re olleted to form the finl result RS. Theorem. proves tht the bove proess gurntees no flse negtives. Definition.. Given n edge query Q e = (v, v ) over grph G nd prmeter, vertex pir (u, u ) is ndidte mth of Q e if nd only if: (7)

(1) $L(v_1) = L(u_1)$ and $L(v_2) = L(u_2)$, where $L(u_i)$ ($L(v_i)$) indicates the label of $u_i$ ($v_i$); and
(2) $L_\infty(u_1, u_2) \le \delta$.

Theorem 5.1. Given an edge query $Q_e = (v_1, v_2)$ over a graph $G$ and a parameter $\delta$, let $CL$ denote the set of candidate matches of $Q_e$ computed according to Formula 7, and $RS$ denote the set of all matches of $Q_e$. Then, $RS \subseteq CL$.

Proof. Straightforward, since $L_\infty$ distance is a lower bound for the shortest path distance.

Essentially, D-join is a similarity join over a vector space. Existing similarity join algorithms (such as [] and []) can be utilized to find candidate matches over the vector space $R^k$. However, there are two important issues to be addressed in answering an edge query. First, the converted space $R^k$ is a high-dimensional space, where $k = O(\log^2 |V(G)|)$. In our experiments, we choose - dimensions when $|V(G)|$ = K K. R-tree based similarity join algorithms (such as []) cannot work well due to the dimensionality curse []. Second, although some high-dimensional similarity join algorithms have been proposed, they are not optimized for $L_\infty$ distance, which we use to find candidate matches. To address these key issues, we first propose a novel data structure to reduce both I/O and CPU costs (Section 5.1). Then, we propose triangle inequality pruning and hash join to further reduce CPU cost (Section 5.2).

5.1 Data Structures and D-join Algorithm

Due to the drawbacks of index-based access in high-dimensional space, we adopt a nested loop join strategy for D-join processing. However, a naive nested loop algorithm to join two lists $R_1$ and $R_2$ has serious performance issues:
a) High I/O cost: assuming that table $T$ is stored in $N$ disk pages, the total number of I/Os in join processing is $N^2$;
b) High CPU cost: the number of distance computations is $|R_1| \times |R_2|$.

In order to perform an efficient D-join for an edge query, we propose a cluster-based block nested loop join. The converted high-dimensional space $R^k$ is not uniformly distributed; there exist some clusters in the $R^k$ space. Inspired by iDistance [], which answers NN queries in high-dimensional space, we first utilize existing clustering algorithms to find clusters in $R^k$. In our implementation, we use the K-medoids algorithm [] to find clusters.
Note that the clustering algorithm is orthogonal to our D-join algorithm. How to find an optimal clustering in $R^k$ is beyond the scope of this paper. In the following discussion, we assume that clustering results in $R^k$ are given. For each cluster $C_i$, we take its cluster center $c_i$ as the pivot. Each point $u$ in cluster $C_i$ is mapped, according to its distance $L_\infty(u, c_i)$ to the center, into a 1-dimensional block $B_i$. Clearly, different clusters are mapped into different blocks. We define the cluster radius $r(C_i)$ as the maximal distance between the center $c_i$ and a vertex $u$ in cluster $C_i$. Figure 6 depicts our method, where Euclidean distance is used as the distance function for demonstration; the actual distance function is still $L_\infty$.

We need to perform a sequential scan in the nested loop join. To facilitate sequential scans during join processing, we define a total order of the clusters. According to this order, we link all corresponding blocks $B_i$ to form a flat file. We delay the discussion of the total order until the end of this subsection, since it is related to our D-join algorithm.

Figure 6: Cluster in $R^k$ (clusters ordered along a Hilbert curve, mapped to blocks, and linked into a flat file)

We adopt a block nested loop strategy in the D-join algorithm. Given an edge query $Q_e = (v_1, v_2)$, let $R_1$ and $R_2$ be the lists of candidate vertices (in $G$) that satisfy the vertex label predicates associated with $v_1$ and $v_2$, respectively. Let $R_1$ be the outer and $R_2$ be the inner join operand. The D-join algorithm reads one block $B_1$ from $R_1$ in each step. In the inner loop, it is not necessary to perform join processing between $B_1$ and all blocks in $R_2$. We scan $R_2$ and load only promising blocks $B_2$ into memory. Then, we perform the memory join algorithm between $B_1$ and $B_2$. Theorem 5.2 gives the necessary condition for $B_2$ to be a promising block with regard to $B_1$.

Theorem 5.2. Given two blocks $B_1$ and $B_2$ (the outer and inner join operands, respectively), the necessary condition that D-join between $B_1$ and $B_2$ produces a non-empty result is:

$L_\infty(c_1, c_2) \le r(C_1) + r(C_2) + \delta$

where $C_1$ ($C_2$) is the corresponding cluster of block $B_1$ ($B_2$), $c_1$ ($c_2$) is the cluster center of $C_1$ ($C_2$), and $r(C_1)$ ($r(C_2)$) is the cluster radius of $C_1$ ($C_2$).

Proof. By the triangle inequality: for any $u_1 \in C_1$ and $u_2 \in C_2$ with $L_\infty(u_1, u_2) \le \delta$, we have $L_\infty(c_1, c_2) \le L_\infty(c_1, u_1) + L_\infty(u_1, u_2) + L_\infty(u_2, c_2) \le r(C_1) + \delta + r(C_2)$.

After the nested loop join, we have all candidate matches for the edge query. Then, for each candidate match $(u_1, u_2)$, we use graph labeling to compute the shortest path distance between $u_1$ and $u_2$, that is, $Dist_{sp}(u_1, u_2)$. If $Dist_{sp}(u_1, u_2) \le \delta$, $(u_1, u_2)$ is inserted into the answer set $RS$. The detailed steps are shown in the D-join Algorithm listing.

Now, we discuss the total order for clusters. In the D-join algorithm, in each inner loop, we sequentially scan $R_2$ to load promising blocks into memory with regard to $B_1$ (the outer join operand). Consider two promising blocks $B_2$ and $B_2'$ with regard to $B_1$, with corresponding clusters $C_2$, $C_2'$ and $C_1$, respectively. According to the triangle inequality,

$|L_\infty(c_1, c_2) - L_\infty(c_1, c_2')| \le L_\infty(c_2, c_2') \le L_\infty(c_1, c_2) + L_\infty(c_1, c_2')$.

This means that clusters $C_2$ and $C_2'$ are near each other in $R^k$ space. All clusters that need to be joined with $B_1$ should be near each other in $R^k$ space. If their corresponding blocks are also adjacent to each other in the flat file $F$, we only need to scan a portion of file $F$ (instead of scanning the whole file) in the inner loop. Due to its good locality-preserving behavior, a Hilbert curve is often used in multidimensional databases.
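The promising-block test can be illustrated with a small in-memory sketch. It is only an illustration under stated assumptions: the `Cluster` class, `is_promising`, and `block_nested_loop_candidates` are hypothetical names, clusters are held as Python lists rather than disk blocks, and no Hilbert ordering or I/O is modeled.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Sequence[float]


def l_inf(p: Point, q: Point) -> float:
    """L-infinity distance between two points of the same dimension."""
    return max(abs(a - b) for a, b in zip(p, q))


@dataclass
class Cluster:
    """Hypothetical in-memory stand-in for one block of the flat file."""
    center: Point
    points: List[Point]

    @property
    def radius(self) -> float:
        # Cluster radius: maximal distance from the center to a member.
        return max((l_inf(self.center, u) for u in self.points), default=0.0)


def is_promising(c1: Cluster, c2: Cluster, delta: float) -> bool:
    """Necessary condition for a non-empty join between two blocks."""
    return l_inf(c1.center, c2.center) <= c1.radius + c2.radius + delta


def block_nested_loop_candidates(
    outer: List[Cluster], inner: List[Cluster], delta: float
) -> List[Tuple[Point, Point]]:
    """Candidate pairs (u1, u2) with L-infinity distance <= delta.

    Inner clusters that fail the promising test are skipped without
    any point-to-point distance computation.
    """
    candidates = []
    for c1 in outer:
        for c2 in inner:
            if not is_promising(c1, c2, delta):
                continue
            for u1 in c1.points:
                for u2 in c2.points:
                    if l_inf(u1, u2) <= delta:
                        candidates.append((u1, u2))
    return candidates
```

Because the test is only a necessary condition, skipping non-promising clusters never drops a candidate pair; the candidates still have to be verified against the exact shortest path distance afterwards.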

We define the total order for different clusters according to Hilbert order. Consider two clusters $C_1$ and $C_2$ whose cluster centers are $c_1$ and $c_2$, respectively. Assuming $c_1$ and $c_2$ are in two different cells $S_1$ and $S_2$ (in $R^k$ space), respectively, if cell $S_1$ is ahead of $S_2$ in Hilbert order, cluster $C_1$ is larger than $C_2$. If $c_1$ and $c_2$ are in the same cell, the order of $C_1$ and $C_2$ is arbitrarily defined. According to the total order, we can link all corresponding blocks to form a flat file.

Algorithm: D-join Algorithm
Require: Input: An edge $e = (v_1, v_2)$ in query $Q$, where $L(v_1)$ (and $L(v_2)$) denotes the vertex label of vertex $v_1$ (and $v_2$). The distance constraint is $\delta$. $R_1$, the set of candidate vertices for matching $v_1$ in $e$. $R_2$, the set of candidate vertices for matching $v_2$ in $e$. Output: Answer set $RS = \{(u_1, u_2)\}$, where $L(u_1) = L(v_1)$ AND $L(u_2) = L(v_2)$ AND $Dist_{sp}(u_1, u_2) \le \delta$.
1: Initialize candidate set $CL$ and answer set $RS$.
2: for each cluster $C_1$ in flat file $F$ do
3:   if $C_1 \cap R_1 \ne \phi$ then
4:     Load $C_1$ into memory.
5:     According to Theorem 5.2, find all promising clusters $C_2$ w.r.t. $C_1$ in flat file $F$ to form cluster set $PC$.
6:     Order all clusters $C_2$ in $PC$ according to physical position in flat file $F$.
7:     for each promising cluster $C_2$ in $PC$ do
8:       Load cluster $C_2$ into memory.
9:       Perform the memory-based D-join algorithm on $C_1$ and $C_2$ to find candidate set $CL'$ (call the Memory D-Join Algorithm).
10:      Insert $CL'$ into $CL$.
11: for each candidate match $(u_1, u_2)$ in $CL$ do
12:   Compute $Dist_{sp}(u_1, u_2)$ by graph labeling techniques.
13:   if $Dist_{sp}(u_1, u_2) \le \delta$ then
14:     Insert $(u_1, u_2)$ into answer set $RS$.
15: Report $RS$.

5.2 Memory Join Algorithm

For a pair of blocks $B_1$ and $B_2$ that are loaded in memory, we need to perform the join efficiently. We achieve this by pruning using the triangle inequality and by applying hash join.

5.2.1 Triangle Inequality Pruning

The following theorem specifies how the number of distance computations can be reduced based on the triangle inequality.

Theorem 5.3. Given a point $p$ in block $B_2$ (the inner join operand) and a point $q$ in block $B_1$ (the outer join operand), the distance between $p$ and $q$ needs to be computed only when the following condition holds ($C_1$ ($C_2$) is the cluster corresponding to $B_1$ ($B_2$)):

$Max(L_\infty(p, c_1) - \delta, 0) \le L_\infty(q, c_1) \le Min(L_\infty(p, c_1) + \delta, r(C_1))$

Proof. Directly follows from the triangle inequality, since $L_\infty$ is a metric.

Figure 7 visualizes the search space in cluster $C_1$ with regard to a point $p$ in $C_2$ after pruning according to Theorem 5.3.

Figure 7: Theorem 5.3 (search space in cluster $C_1$ with regard to point $p$; panels (a) and (b))

5.2.2 Hash Join

Hash join is a well-known join algorithm with good performance. The classical hash join does not work for D-join processing, since it can only handle equi-joins. Consider two blocks $B_1$ and $B_2$ (the outer and inner join operands). For purposes of presentation, we first assume that there is only one dimension ($I_1$) in $R^k$, i.e., $k = 1$. The maximal value in $I_1$ is defined as $I_1.Max$. We divide the interval $[0, I_1.Max]$ into $I_1.Max / \delta$ buckets for dimension $I_1$.

Given a point $q$ in block $B_1$ (the outer operand), we define the hash function $H(q) = n = \lfloor q.I_1 / \delta \rfloor$. Then, instead of hashing $q$ into one single bucket, we put $q$ into three buckets: the $(n-1)$th, $n$th, and $(n+1)$th buckets. To save space, we only store $q$'s ID in the different buckets. Based on this revised hashing strategy, we can reduce the search space, as described by the following theorem.

Theorem 5.4. Given a point $p$ in block $B_2$ (the inner join operand), according to the hash function $H(p) = n = \lfloor p.I_1 / \delta \rfloor$, $p$ is located in the $n$th bucket. It is only necessary to perform join processing between $p$ and all points of $B_1$ located in the $n$th bucket. The candidate search space for point $p$ is $Can_1(p) = b_n$, where $b_n$ denotes all points in the $n$th bucket.

Proof. It can be proven using the definition of $L_\infty$ distance: if $|p.I_1 - q.I_1| \le \delta$, the bucket numbers of $p$ and $q$ differ by at most one, so $q$ has been placed in the $n$th bucket.

Figure 8: Hash Join (buckets $b_1, \ldots$ over $[0, I_1.Max]$ on dimension $I_1$; candidate search space $Can_1(p) = b_n$)

Figure 8 demonstrates our proposed hash join method. When $k > 1$ (i.e., higher dimensionality), we build buckets for each dimension $I_i$ ($i = 1, \ldots, k$). Consider a point $p$ (the inner join operand) from block $B_2$ and obtain the candidate search space $Can_i(p)$ in dimension $I_i$, $i = 1, \ldots, k$. Theorem 5.5 establishes the final search space of $p$ using hash join.

Theorem 5.5. The overall search space for vertex $p$ is $Can(p) = Can_1(p) \cap Can_2(p) \cap \ldots \cap Can_k(p)$, where $Can_i(p)$ ($i = 1, \ldots, k$) is defined in Theorem 5.4.
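The three-bucket hashing scheme and the per-dimension intersection can be sketched as follows. This is a minimal sketch, not the paper's implementation: `build_buckets`, `search_space`, and `hash_join` are hypothetical names, points are identified by their list positions, the bucket number is assumed to be the floor of the coordinate divided by $\delta$, and Python dictionaries stand in for the bucket arrays.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple

Point = Sequence[float]


def l_inf(p: Point, q: Point) -> float:
    """L-infinity distance between two points of the same dimension."""
    return max(abs(a - b) for a, b in zip(p, q))


def build_buckets(outer: List[Point], delta: float) -> List[Dict[int, Set[int]]]:
    """One bucket table per dimension; each outer point ID goes into
    buckets n-1, n and n+1, where n = floor(coordinate / delta)."""
    k = len(outer[0])
    buckets: List[Dict[int, Set[int]]] = [defaultdict(set) for _ in range(k)]
    for qid, q in enumerate(outer):
        for i, x in enumerate(q):
            n = math.floor(x / delta)
            for b in (n - 1, n, n + 1):
                buckets[i][b].add(qid)
    return buckets


def search_space(p: Point, buckets: List[Dict[int, Set[int]]], delta: float) -> Set[int]:
    """Intersect, over all dimensions, the bucket that p itself hashes to."""
    can = None
    for i, x in enumerate(p):
        ids = buckets[i].get(math.floor(x / delta), set())
        can = set(ids) if can is None else can & ids
        if not can:
            break  # an empty intersection cannot grow again
    return can or set()


def hash_join(outer: List[Point], inner: List[Point], delta: float) -> List[Tuple[int, int]]:
    """Pairs (outer ID, inner ID) whose L-infinity distance is <= delta."""
    if not outer:
        return []
    buckets = build_buckets(outer, delta)
    result = []
    for pid, p in enumerate(inner):
        for qid in sorted(search_space(p, buckets, delta)):
            # Bucketing only narrows the search space; verify the distance.
            if l_inf(outer[qid], p) <= delta:
                result.append((qid, pid))
    return result
```

Storing each outer point three times per dimension trades memory for a single bucket lookup per inner point, which is why only IDs, not full points, are kept in the buckets.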
Theorem 5.6 shows that, for a join pair $(q, p)$ ($p$ from $B_2$ and $q$ from $B_1$, respectively), if $L_\infty(q, p) > 2\delta$, the join pair $(q, p)$ can be safely pruned by the hash join.

Theorem 5.6. Consider two blocks $B_1$ and $B_2$ (the outer and inner join operands) to be joined in memory. For any point $p$ in $B_2$, the necessary and sufficient condition that a point $q$ is in $p$'s search space (i.e., $q \in Can(p)$) is $L_\infty(p, q) \le 2\delta$.

Proof. It can be proven according to Theorems 5.4 and 5.5.

Based on the two pruning techniques, triangle inequality pruning (Theorem 5.3) and hash join, we propose the Memory D-Join Algorithm.

Algorithm: Memory D-Join Algorithm
Require: Input: An edge $e = (v_1, v_2)$ in query $Q$. Two clusters $C_1$ and $C_2$. The distance constraint is $\delta$. $R_1$ is the set of candidate vertices that match $v_1$; $R_2$ is the set of candidate vertices that match $v_2$. Output: Candidate set $CL = \{(q, p)\}$, where $L_\infty(q, p) \le \delta$.
1: for each vertex $p$ in $C_2$ do
2:   if $p \in R_2$ then
3:     According to Theorem 5.3, find the search space in $C_1$ with regard to $p$, denoted as $SP_1(p)$.
4:     Using hash join (Theorem 5.5), find the search space $Can(p)$.
5:     The final search space with regard to $p$ is $SP(p) = SP_1(p) \cap Can(p)$.
6:     for each point $q$ in the search space $SP(p)$ do
7:       if $L_\infty(q, p) \le \delta$ then
8:         Insert $(q, p)$ into candidate set $CL$.
9: Report $CL$.

6. PATTERN MATCH QUERY

According to the framework presented earlier, a pattern match query is transformed into a shortest path distance-based multi-way join problem, called MD-join. Thus, we first give the detailed steps to answer a multi-way join query in Section 6.1, and then present the cost function (Section 6.2) that drives the join order selection discussed in Section 6.3.

6.1 MD-Join Algorithm

In the following discussion, we assume that the join order is specified. As discussed earlier, a join order of MD-join corresponds to a traversal order in query graph $Q$. According to the given traversal order (in $Q$), we visit one edge $e = (v_i, v_j)$ (in $Q$) from vertex $v_i$ in each step. If vertex $v_j$ is newly encountered (i.e., $v_j$ has not been visited yet), edge $e = (v_i, v_j)$ is called a forward edge; if $v_j$ has been visited before, $e$ is called a backward edge. The processing of a forward edge query differs from that of a backward edge query. Essentially, forward edge processing is performed by the D-join algorithm (as discussed in Section 5), while backward edge processing is a selection operation, which will be discussed shortly. MD-join is similar to the traditional multi-join operation in relational databases and XML databases [].
Thus, following the same conventions, we define the concept of status.

Definition 6.1. Given a query graph $Q$, a subgraph $Q'$ induced by all visited edges in $Q$ is called a status. All matches of $Q'$ (and $Q$) are stored in a relational table $MR(Q')$ (and $MR(Q)$), in which columns correspond to vertices $v_i$ in $Q'$ (and $Q$).

The MD-join algorithm performs a sequential move from the initial status NULL to the final status $Q$, as shown in the framework figure. Consider two adjacent statuses $Q_i$ and $Q_{i+1}$, where $Q_i$ is a subgraph of $Q_{i+1}$ and $|E(Q_{i+1})| - |E(Q_i)| = 1$. Let $e = (Q_{i+1} \setminus Q_i)$ denote the edge in $Q_{i+1}$ but not in $Q_i$. If $e$ is the first edge to be visited in query $Q$, we can get the matches of $e$ (denoted as $MR(e)$) by D-join processing (Line 4 in the MD-join algorithm). Otherwise, there are two cases to be considered.

Forward edge processing: If $e = (v_i, v_j)$ is a forward edge, we can obtain $MR(Q_{i+1})$ as follows:
1) We first project table $MR(Q_i)$ over column $v_i$ to obtain list $R_i$ (Line 9). We can obtain the list $R_j$ that corresponds to vertex $v_j$, according to $v_j$'s label, by scanning the original table $T$ before join processing (Line 1). Note that $R_j$ is a shrunk list after neighbor area pruning (Line 2).
2) According to the D-join algorithm, we find the matches for edge $e$, denoted as $MR(e)$ (Line 10).
3) We perform a traditional natural join over $MR(Q_i)$ and $MR(e)$ on column $v_i$ to obtain $MR(Q_{i+1})$ (Line 11).

Backward edge processing: If $e = (v_i, v_j)$ is a backward edge, we scan the intermediate table $MR(Q_i)$ to filter out all vertex pairs $(u_i, u_j)$, where $u_i$ and $u_j$ correspond to vertices $v_i$ and $v_j$ in query $Q$, and $Dist_{sp}(u_i, u_j) > \delta$ (we can compute $Dist_{sp}(u_i, u_j)$ by the graph labeling technique). After filtering $MR(Q_i)$, we obtain the matches of $Q_{i+1}$, i.e., $MR(Q_{i+1})$. Essentially, it is a selection operation based on the distance constraint (Line 13), defined as follows:

$MR(Q_{i+1}) = \sigma_{Dist_{sp}(r.v_i, r.v_j) \le \delta}(MR(Q_i))$.

The above steps are iterated until the final status $Q$ is reached (Lines 6-13).
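The two kinds of edge processing can be sketched on small in-memory tables. The sketch below makes several simplifying assumptions that are not from the paper: the intermediate table $MR(Q_i)$ is a list of dictionaries keyed by query vertex name, `dist_sp` is a hypothetical callable standing in for the graph-labeling distance computation, and the D-join itself is replaced by a naive nested loop in `edge_matches`.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Match = Dict[str, Hashable]                  # query vertex name -> data vertex
Dist = Callable[[Hashable, Hashable], float]


def edge_matches(r_i: Iterable[Hashable], r_j: Iterable[Hashable],
                 dist_sp: Dist, delta: float) -> List[Tuple[Hashable, Hashable]]:
    """Naive stand-in for D-join: all pairs within shortest-path distance delta."""
    r_j = list(r_j)
    return [(ui, uj) for ui in r_i for uj in r_j if dist_sp(ui, uj) <= delta]


def forward_edge(mr: List[Match], v_i: str, v_j: str, r_j: List[Hashable],
                 dist_sp: Dist, delta: float) -> List[Match]:
    """Extend every partial match with a data vertex for the new query vertex v_j."""
    r_i = {row[v_i] for row in mr}                    # projection on column v_i
    by_ui: Dict[Hashable, List[Hashable]] = {}
    for ui, uj in edge_matches(r_i, r_j, dist_sp, delta):
        by_ui.setdefault(ui, []).append(uj)
    # Natural join of MR(Q_i) and MR(e) on column v_i.
    return [{**row, v_j: uj} for row in mr for uj in by_ui.get(row[v_i], [])]


def backward_edge(mr: List[Match], v_i: str, v_j: str,
                  dist_sp: Dist, delta: float) -> List[Match]:
    """Selection: keep rows whose v_i and v_j images are within delta."""
    return [row for row in mr if dist_sp(row[v_i], row[v_j]) <= delta]
```

A triangle query with edges (A, B), (B, C), (A, C) would then be answered by one call to `edge_matches` for the first edge, one `forward_edge` call for (B, C), and one `backward_edge` call for (A, C).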
Algorithm: Multi-Distance-Join Algorithm (MD-join)
Require: Input: A query graph $Q$ that has $n$ vertices, a parameter $\delta$, a large graph $G$, a table $T$ for the converted vector space $R^k$, and the join order MDJ. Output: $MR(Q)$: all matches of $Q$ in $G$.
1: for each vertex $v_i$ in query $Q$, find its corresponding list $R_i$ according to $v_i$'s label.
2: Obtain shrunk lists $R_i$ ($i = 1, \ldots, n$) by neighbor area pruning.
3: Set $e = (v_1, v_2)$.
4: Obtain $MR(e)$ by the D-join algorithm (call the D-join Algorithm).
5: Set $Q_i = e$.
6: while $Q_i \ne Q$ do
7:   According to join order MDJ, $e$ is the next traversal edge.
8:   if $e$ is a forward edge, denoted as $e = (v_i, v_j)$ then
9:     $R_i = \sigma_{t.ID \in \Pi_{v_i}(MR(Q_i))}(T)$.
10:    $MR(e) = \Pi_{(R_i.ID, R_j.ID)}(R_i \bowtie_{Dist_{sp}(r_i, r_j) \le \delta} R_j)$ (call the D-join Algorithm).
11:    $MR(Q_{i+1}) = MR(Q_i) \bowtie_{v_i} MR(e)$.
12:  else
13:    $MR(Q_{i+1}) = \sigma_{Dist_{sp}(r.v_i, r.v_j) \le \delta}(MR(Q_i))$.
14: Report $MR(Q)$.

6.2 Cost Model

It is well known that different join orders in the MD-join algorithm lead to different performance. Join order selection is based on the cost estimation of edge queries. In this section, we discuss the cost of the D-join algorithm that answers an edge query, which has three components: the cost of the block nested loop join (Lines 2-10 in the D-join Algorithm), the cost of computing the exact shortest path distance (Lines 11-13), and the cost of storing the answer set $RS$ (Line 14). Note that the matches of an edge query are intermediate results for a graph pattern query. Therefore, similar to the cost analysis for structural joins in XML databases [], we also assume that intermediate results are stored in a temporary table on disk.

We use a set of factors to normalize the cost of the D-join algorithm. These factors are $f_R$: the average cost of loading

one block into memory; $f_D$: the average cost of an $L_\infty$ distance computation; $f_S$: the average cost of a shortest path distance computation; $f_{IO}$: the average cost of storing one match on disk.

Given an edge query $Q_e = (v_1, v_2)$ and a parameter $\delta$, $R_1$ ($R_2$) is the list of candidate vertices for matching $v_1$ ($v_2$). All vertices in $R_1$ ($R_2$) are stored in $|B_1|$ ($|B_2|$) blocks in the flat file $F$. The cost of the D-join algorithm can be computed as follows:

$Cost(e) = |B_1| \times |B_2| \times \gamma_1 \times f_R + |R_1| \times |R_2| \times \gamma_2 \times f_D + |CL| \times f_S + |CL| \times \gamma_3 \times f_{IO}$    (8)

where $\gamma_1$, $\gamma_2$, and $\gamma_3$ are defined as follows:

$\gamma_1 = \frac{|AccessedBlocks|}{|B_1| \times |B_2|}$, $\gamma_2 = \frac{|DisComp|}{|R_1| \times |R_2|}$, $\gamma_3 = \frac{|RS|}{|CL|}$    (9)

and where $|AccessedBlocks|$ is the number of accessed blocks in the D-join Algorithm, $|DisComp|$ is the number of $L_\infty$ distance computations, and $|RS|$ (and $|CL|$) is the cardinality of the answer set $RS$ (and the candidate set $CL$).

We use the following methods to estimate $\gamma_1$, $\gamma_2$ and $\gamma_3$.
1) Offline: We pre-compute $\gamma_1$, $\gamma_2$ and $\gamma_3$. Notice that they are related to the vertex labels and the distance constraint $\delta$. Assume that, according to historical query logs, the maximal value of $\delta$ is $\delta_{max}$. We partition $[0, \delta_{max}]$ into $z$ intervals, each with width $d = \delta_{max} / z$. In order to compute the statistics $\gamma_1$, $\gamma_2$ and $\gamma_3$ for a vertex label pair $(l_1, l_2)$ and a distance constraint $\delta$ in the $i$th interval $[(i-1)d, i \cdot d]$ ($1 \le i \le z$), we set $\delta = (i - 1/2)d$ and use a query graph $Q$ with a single edge $e = (v_1, v_2)$, where $L(v_1) = l_1$ and $L(v_2) = l_2$. We perform the D-join algorithm and compute $\gamma_1$, $\gamma_2$ and $\gamma_3$ using Equation 9.
2) Online: Given an edge query $Q_e = (v_1, v_2)$, we look up the estimates for $\gamma_1$, $\gamma_2$ and $\gamma_3$ that were computed offline, using the vertex label pair $(L(v_1), L(v_2))$ and $\delta$.

Next, we discuss how to estimate $|CL|$. Let us first assume that $k = 1$. Given an edge query $Q_e = (v_1, v_2)$, the cardinality of the candidate match set $CL$ can be denoted as $|CL| = |R_1| \times |R_2| \times \theta$, where $\theta$ is the selectivity of D-join based on $L_\infty$ distance. We can regard $R_1.I_1$ and $R_2.I_1$ as two random variables $x$ and $y$. Let $z = |x - y|$ denote the joint random variable. Selectivity $\theta$ equals the probability of $z \le \delta$. Figure 9(a) visualizes the joint random variable $z$ and the area $\Theta$ between the two lines $y = x + \delta$ and $y = x - \delta$. We can use the following equation to compute selectivity $\theta$:

$\theta = Pr(z \le \delta) = \iint_{|x - y| \le \delta} f(x, y)\, d(x, y) = \iint_{(x, y) \in \Theta} f(x, y)\, d(x, y)$

where $f(x, y)$ denotes $z$'s density function. We use a two-dimensional histogram method to estimate $f(x, y)$. Specifically, we use equi-width histograms that partition the $(x, y)$ data space into $t^2$ regular buckets (where $t$ is a constant called the histogram resolution), as shown in Figure 9(b). Similar to other histogram methods, we also assume that the distribution in each bucket is uniform. Then, we use the systematic sampling technique [] to estimate the density function in each bucket.

Figure 9: Selectivity Estimation (panel (a): the shared area $\Theta$ in the $(R_1.I_1, R_2.I_1)$ plane; panel (b): equi-width histogram buckets)

The basic idea of systematic sampling is the following []: given a relation $R$ with $N$ tuples that can be accessed in ascending/descending order on the join attribute of $R$, we select $n$ sample tuples as follows: select a tuple at random from the first $N/n$ tuples of $R$ and every $(N/n)$th tuple thereafter []. The relations here are $R_1$ and $R_2$, and the join attributes are $R_1.I_1$ and $R_2.I_1$. $R_1$ and $R_2$ are both from table $T$. We assume that there exists a B+-tree index on each dimension $I_i$ in table $T$, allowing tuples to be accessed in ascending/descending order. We select $(|R_1| \times \lambda)$ vertices from $R_1$, and all these selected vertices are collected to form a subset $SR_1$, where $\lambda$ is the sampling ratio. The same is done to obtain the subset $SR_2$ from the list $R_2$.

We map $SR_1 \times SR_2$ into the different two-dimensional buckets. For each bucket $A$, we use $|A|$ to denote the number of points (from $SR_1 \times SR_2$) that fall into bucket $A$. The joint density function of points in bucket $A$ is denoted as

$f(A) = \frac{|A|}{|SR_1| \times |SR_2|}$.    (10)

Some buckets are only partially contained in the shared area $\Theta$. The number of points (from $R_1 \times R_2$) that fall into both bucket $A$ and the shared area $\Theta$ (denoted as $|A \cap \Theta|$) can be estimated as:

$|A \cap \Theta| = |R_1| \times |R_2| \times f(A) \times \frac{area(A \cap \Theta)}{area(A)}$

where $area(A \cap \Theta)$ denotes the area of the intersection between $A$ and $\Theta$, and $area(A)$ denotes the area of $A$. We adopt Monte Carlo methods to estimate $\frac{area(A \cap \Theta)}{area(A)}$. Specifically, we first randomly generate a set of points in bucket $A$ (the number of generated records is $a$). The number of these points that fall in $\Theta$ is $b$. Then, we estimate $\frac{area(A \cap \Theta)}{area(A)}$ to be $\frac{b}{a}$. Therefore, we have

$|CL| = \sum_{ij} |A_{ij} \cap \Theta| = |R_1| \times |R_2| \times \sum_{ij} \left( f(A_{ij}) \times \frac{area(A_{ij} \cap \Theta)}{area(A_{ij})} \right)$

The selectivity $\theta$ can be estimated as follows:

$\theta = Pr(z \le \delta) = \frac{\sum_{ij} |A_{ij} \cap \Theta|}{|R_1| \times |R_2|} = \sum_{ij} \left( f(A_{ij}) \times \frac{area(A_{ij} \cap \Theta)}{area(A_{ij})} \right)$    (11)

where $f(A_{ij})$ is estimated by Equation 10.

If $k > 1$, by the definition of $L_\infty$ distance, we have

$CL = R_1 \bowtie_{Max_{1 \le i \le k}(|R_1.I_i - R_2.I_i|) \le \delta} R_2$

The cardinality of $CL$ is $|CL| = |R_1| \times |R_2| \times \theta$, where $\theta$ is the selectivity of D-join based on $L_\infty$ distance. We can regard $R_1.I_i$ and $R_2.I_i$ ($i = 1, \ldots, k$) as random variables

$x_i$ and $y_i$. Let $z_i = |x_i - y_i|$ denote the joint random variable. Then

$\theta = Pr(max(z_1, \ldots, z_k) \le \delta) = Pr((z_1 \le \delta) \wedge \ldots \wedge (z_k \le \delta))$    (12)

Figure 10: Multi-Dimension Selectivity Estimation (a join pair $(r_1, r_2)$ shown in Dimension $I_1$ and Dimension $I_2$)

To compute Equation 12, we propose two techniques: a dimension-independence assumption and a sampling-based method.

1) Dimension-Independence Assumption. We assume that the dimensions $I_i$ in vector space $R^k$ are independent of each other. Thus, we have

$Pr((z_1 \le \delta) \wedge \ldots \wedge (z_k \le \delta)) = Pr(z_1 \le \delta) \times \ldots \times Pr(z_k \le \delta)$    (13)

where $Pr(z_i \le \delta)$ ($i = 1, \ldots, k$) can be computed using Equation 11. Experiments indicate that Equation 13 cannot provide accurate selectivity estimation, since dimensions in $R^k$ space are correlated. In order to obtain a more accurate estimation, we propose sampling.

2) Sampling-based Method. Consider two lists $R_1$ and $R_2$ to be joined. Assume, for simplicity, $k = 2$. In Figure 10, $Pr(max(z_1, z_2) \le \delta)$ is the probability that a vertex pair falls into both shared areas $\Theta_1$ and $\Theta_2$. We adopt sampling-based methods to estimate $Pr(max(z_1, \ldots, z_k) \le \delta)$. For example, suppose we have two sample sets $SR_1$ and $SR_2$ from the two sets $R_1$ and $R_2$, respectively. If there are $M$ join pairs $(u_1, u_2)$ in $SR_1 \times SR_2$ such that $Max_{1 \le i \le k}(|u_1.I_i - u_2.I_i|) \le \delta$, then

$Pr(max(z_1, \ldots, z_k) \le \delta) = \frac{M}{|SR_1| \times |SR_2|}$.

The specific technique for computing the optimal sample in high-dimensional space is beyond the scope of this paper. Without loss of generality, we choose random samples, i.e., each point has an equal probability of being chosen as a sample.

6.3 Join Order Selection

The join order selection can be performed by adopting the traditional dynamic programming algorithm [] using the cost model introduced in the previous section. However, this solution is inefficient due to the very large solution space, especially when $|E(Q)|$ is large. Therefore, we propose a simple yet efficient greedy solution to find a good join order. There are two important heuristic rules in our join order selection.

1) Given a status $Q_i$, if there is a backward edge $e$ attached to $Q_i$, the next status is $Q_{i+1} = Q_i \cup e$, i.e., we perform back edge processing as early as possible.
If there is more than one backward edge attached to $Q_i$, we perform all back edge processing simultaneously, which reduces the I/O cost. The intuition behind this heuristic rule is similar to selection push-down in relational query optimization: performing a back edge query reduces the cardinality of intermediate join results.

2) Given a status $Q_i$, if there is no backward edge attached to $Q_i$, the next status is $Q_{i+1} = Q_i \cup e$, where $e$ is a forward edge and $Cost(e)$ (defined in Equation 8) is the minimum over all forward edges.

7. EXPERIMENTS

We evaluate our methods using both synthetic and real data sets. All of the methods have been implemented using standard C++. The experiments are conducted on a P .GHz machine with G RAM running Windows XP.

Synthetic Datasets

a) Erdos Renyi Model: This is a classical random graph model. It defines a random graph as $N$ vertices connected by $M$ edges, chosen randomly from the $N(N-1)/2$ possible edges. We set $N$ = K and $M$ = K. This graph is connected, and it is denoted as ER Network.

b) Scale-Free Model: We use the graph generator gengraphwin (www.cs.sunysb.edu/~algorith/implement/viger/distrib/). We generate a large graph $G$ with K vertices satisfying a power-law distribution. The default value of parameter $\alpha$ is set to .. There are 8998 vertices and 6 edges in the maximal connected component of $G$. We can sequentially perform our method in each connected component of $G$. This dataset is denoted SF Network.

In the above two datasets, the edge weights in $G$ follow a random distribution in [, ]. Vertex labels are randomly assigned in [, ].

Real Datasets

c) Citeseer: We generate a co-author network $G$ from the citeseer dataset (http://cs.ist.psu.edu/public/oai/) as follows: we treat each author as a vertex $u$ in $G$ and introduce an edge to connect two vertices if and only if there is at least one paper co-authored by the two corresponding authors. We assign vertex labels and edge weights as follows: using text clustering algorithms, we group all author affiliations into clusters. For each author, we assign the cluster ID as its vertex label.
For an edge $e = (u_1, u_2)$ in $G$, its weight is assigned as $\frac{1}{co(u_1, u_2)}$, where $co(u_1, u_2)$ denotes the number of co-authored papers between authors $u_1$ and $u_2$. There are 879 vertices and 9 edges in the generated $G$. There are 78 vertices and 9 edges in the maximal connected component of $G$.

d) Yeast: This is a protein-to-protein interaction network in budding yeast (http://vlado.fmf.uni-lj.si/pub/networks/data/). Each vertex denotes a protein and an edge denotes the interaction between the two corresponding proteins. We delete self-loop edges in the original dataset. The proteins in this dataset are grouped into protein clusters of different types. Therefore, we assign vertex labels based on the corresponding protein clusters. All edges are assigned the same weight. There are 6 vertices and 666 edges in $G$. The maximal connected component of $G$ has 668 edges.

Exp 1. We first evaluate the performance of the LLR embedding technique. In this experiment, we consider the D-join algorithm to answer an edge query. For clustering, we use the k-medoids algorithm. The number of clusters depends on the available memory size for join processing. We choose two alternative methods for performance comparison: the extension of the R-join algorithm [6] and D-join without embedding. In the D-join without embedding method, we conduct distance-based joins directly over the graph, rather than first performing join processing over the converted space and then verifying candidate matches. We use the cluster-based block nested loop join and triangle pruning, but no hash join pruning. We report query response time in Figure , which shows