Incremental Maintenance of XML Structural Indexes


Ke Yi, Dept. Computer Science, Duke University, yike@cs.duke.edu
Hao He, Dept. Computer Science, Duke University, haohe@cs.duke.edu
Ioana Stanoi, IBM T.J. Watson Research Center, irs@us.ibm.com
Jun Yang, Dept. Computer Science, Duke University, junyang@cs.duke.edu

ABSTRACT
Increasing popularity of XML in recent years has generated much interest in query processing over graph-structured data. To support efficient evaluation of path expressions, many structural indexes have been proposed. The most popular ones are the 1-index, based on the notion of graph bisimilarity, and the recently proposed A(k)-index, based on the notion of local similarity to provide a trade-off between index size and query answering power. For these indexes to be practical, we need effective and efficient incremental maintenance algorithms to keep them consistent with the underlying data. However, existing update algorithms for structural indexes essentially provide no guarantees on the quality of the index; the updated index is usually larger than necessary, degrading the performance of subsequent queries. In this paper, we propose update algorithms for the 1-index and the A(k)-index with provable guarantees on the resulting index quality. Our algorithms always maintain a minimal index, i.e., merging any two index nodes would result in an incorrect index. For the 1-index, if the data graph is acyclic, our algorithm further ensures that the index is minimum, i.e., it has the least number of index nodes possible. For the A(k)-index, we show that the minimal index our algorithm maintains is also the unique minimum A(k)-index, for both acyclic and cyclic data graphs. Finally, through experimental evaluation, we demonstrate that our algorithms bring significant improvement over previous methods, in terms of both index size and update time.

Author footnotes:
Part of the work was done while the author was visiting IBM T. J. Watson Research Center.
Supported in part by the National Science Foundation through CAREER grant CCR 9984099 and ITR grant EIA 0112849.
Supported by National Science Foundation CAREER Award under grant IIS-0238386.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD 2004 June 13-18, 2004, Paris, France. Copyright 2004 ACM 1-58113-859-8/04/06... $5.00.

1. INTRODUCTION
Increasing popularity of XML in recent years has generated much interest in query processing over graph-structured data. A number of commercial database vendors are making significant efforts to support XML natively, rather than convert it to the traditional relational model. One of the major challenges of this task is to provide support for efficient query processing over XML. To summarize the structure of such data and to support path expression [4] evaluation, novel structural indexes have been proposed [11, 9, 17, 7]. Among the most popular ones are the 1-index [11], based on the notion of graph bisimilarity, and the recently proposed A(k)-index [9], based on the notion of local similarity to provide a trade-off between index size and query answering power. Some structural indexes have also been used as statistical synopses for estimating selectivities of path expressions [3, 16].

Compared with traditional relational indexes, much less attention has been directed to the problem of maintaining structural indexes for XML, with the exception of recent work in [8]. After an XML document is updated, its structural index must be properly maintained so that subsequent queries have a view of the summarized data that is consistent with the updated document. For structural indexes to be practical, we need efficient index maintenance algorithms that guarantee the accuracy and efficiency of these indexes for querying. There are two basic approaches to index maintenance: reconstruction and incremental maintenance. Reconstruction is simple and usually leads to high-quality indexes, but the overhead of reconstruction makes it unattractive even for databases with moderate update rates. The second approach, incremental maintenance, updates the existing index incrementally as soon as the underlying database changes.
The cost of computing and applying incremental index updates can be potentially much lower than that of reconstruction.

Designing good maintenance algorithms is challenging because of the delicate balance between efficacy and efficiency. Efficacy means preserving the quality of the structural summary. For the same underlying data, there are many correct structural summaries, but they vary greatly in size and hence in query performance. The algorithm should ensure that the updated index is not expanded unnecessarily. On the other hand, efficiency here means that the algorithm itself should be efficient. In some cases, obtaining the smallest possible structural summary is very expensive, so settling on a reasonably small structural summary would be preferable. Finding the right balance in this tradeoff is not trivial. Reconstruction provides perfect efficacy, but severely lacks efficiency. In contrast, although the previously proposed update algorithms in [9] are efficient, we show that their efficacy is lacking: the updated index is usually much larger than necessary, degrading the performance of subsequent query evaluations.

In this paper, we demonstrate that it is possible to achieve high degrees of both efficacy and efficiency in designing incremental maintenance algorithms for structural indexes. We focus on three types of updates: edge insertion, edge deletion, and subgraph addition. Edge insertion and deletion constitute the basic operations upon which other kinds of updates (e.g., node insertion and deletion) can be based. Although subgraph addition can also be processed by inserting edges one at a time, we provide a separate, more efficient algorithm for it since it is such a common operation. We restrict ourselves to the 1-index and the A(k)-index, but we believe our techniques can also be used for other structural indexes based on node partitioning.

We develop efficient update algorithms for the 1-index and the A(k)-index that, in contrast to previous algorithms, provide provable guarantees on the resulting index quality. More precisely, we make the following contributions:
1. Our algorithms always maintain a minimal index, i.e., merging any two index nodes would result in an incorrect index.
2. If the data graph is acyclic, we show that there is a unique minimal 1-index that is also minimum, i.e., it has the least number of index nodes possible. This result further ensures that our algorithm always maintains the minimum 1-index for acyclic data graphs.
3. For cyclic data graphs, where there might be more than one minimal 1-index, we show by experiments that the minimal 1-index maintained by our algorithm is always very close to the minimum, if not the same.
4. For any data graph (cyclic or acyclic), we show that there is a unique minimal A(k)-index that is also minimum, which ensures that our algorithm always maintains the minimum A(k)-index for any data graph.
5. Through an extensive experimental study, we demonstrate that our algorithms are not only effective in preserving index quality, but also very efficient in terms of computation cost.

The rest of the paper is organized as follows. We first survey previous work in Section 2.
In Section 3, we present the data model of XML and its structural indexes, as well as the basic concepts related to the theory and algorithms. We give a general overview of our algorithms in Section 4, followed by the detailed update algorithms for the 1-index and A(k)-index in Sections 5 and 6, respectively. We study their practical performance experimentally in Section 7. Finally, we conclude in Section 8.

2. PREVIOUS WORK
Query optimization for XML has been a popular subject of study [10, 5]. Indexing is essentially used to avoid exhaustive traversal of the documents for query processing. Signature-based techniques have the same goal of reducing the search space. They have been used extensively in information retrieval and have also been adapted for XML data recently in [14, 15]. With this approach, each node of the XML tree is annotated with the bitwise OR of the hash values of its child nodes. Existence of a tag in the subtree of a node can therefore be estimated by comparing the hashed value of the child tag with the signature of the node. Updates may however lead to recomputation of the signatures of all ancestors.

Structural summaries for XML have been used for indexing, query pruning and rewriting, and selectivity estimation. DataGuides [6] was one of the first structural summaries used in XML query processing. The notion of simulation, more commonly used in graph theory, was applied in [5] to schema validation as well as query pruning and rewriting for semistructured data. A number of structural indexes based on simulation followed. The 1-index [11] partitions data nodes into equivalence classes based on bisimilarity. To reduce the size of the 1-index, the A(k)-index was proposed [9]. It uses local similarity for partitioning, thereby compressing the 1-index at the cost of losing some structural information about the underlying data. Very recently, other techniques [17, 7] have been proposed to further improve the flexibility and efficiency of the A(k)-index.

Three important issues need to be considered for any index: construction, query evaluation, and maintenance. Paige and Tarjan [12] gave an iterative splitting algorithm to construct the 1-index in $O(m \log n)$ time, where $m$ is the number of edges and $n$ is the number of nodes in the data graph.
In [9], an algorithm based on similar ideas is used to construct an A(k)-index in time $O(km)$. Many different query evaluation strategies use these structural indexes; see [11, 9] for details. In this paper, we only focus on the maintenance issue.

The only known update algorithm for the 1-index is the propagate algorithm from [8], which uses Paige and Tarjan's construction algorithm [12] to handle edge changes. This algorithm essentially provides no guarantee on the quality of the resulting index. In the experiments of [8], the index was shown to have 3% to 5% more nodes than the minimum index after a relatively small number of edge insertions (500 in a data graph with about 200,000 nodes); no performance results were reported for deletions. A subgraph addition algorithm based on reconstruction was also given in [8].

Intuitively, because of its locality, the A(k)-index should be easier to maintain than the 1-index. However, no good update algorithm for the A(k)-index has been proposed so far, except for some simple algorithms mentioned in [8, 17]. These approaches all suffer from the same problem of generating too many unnecessary nodes, which undermines the compactness advantage of the A(k)-index. Designing efficient incremental maintenance algorithms for the A(k)-index was left as an interesting area for future research in [8].

3. PRELIMINARIES
Data model. In this paper, we model XML or other semistructured data as a directed, labeled graph $G = (V, E, \mathit{root}, \Sigma, \mathit{label}, \mathit{oid}, \mathit{value})$. Each edge in $E$ indicates an object-subobject or IDREF relationship. Each node in $V$ is labeled with a string from $\Sigma$ via the $\mathit{label}$ function and with a unique identifier via the $\mathit{oid}$ function. It may also optionally have a value given by the $\mathit{value}$ function. There is a single root node with the distinguished label ROOT and no incoming edges. An example XML document under this model is shown in Figure 1, where object-subobject relations are shown in solid lines, and IDREF relations are shown in dashed lines. A database with multiple XML documents can be modeled as a single data graph with an artificial root connecting the graphs corresponding to the individual files. We refer to the nodes and edges in $V$ and $E$ as data nodes and data edges, or dnodes and dedges, respectively, to differentiate them from those in an index graph, to be introduced below. We will use $u, v, \ldots$ to denote dnodes, and $\mathit{Succ}(u)$ to denote the set of $u$'s successors, i.e., $\mathit{Succ}(u) = \{v \mid (u, v) \in E\}$.

[Figure 1: An XML database example. Node labels in the figure include root, site, regions, people, auctions, africa, asia, person, auction, item, seller, and bidder.]

Structural indexes. A structural index (or structure summary) for a data graph takes the form of another labeled directed graph $(V_I, E_I)$, which is built by the following general procedure: (1) partition the dnodes into classes according to some equivalence relation, (2) make an index node (or inode) for each equivalence class, with all dnodes in this class being its extent, and (3) add an index edge (or iedge) from inode $I$ to inode $J$ if there is a dedge from some dnode in the extent of $I$ to some dnode in the extent of $J$. We use $\Phi(G)$ to denote a structural index built for data graph $G$, and $I[v]$ to denote the inode whose extent contains dnode $v$. From now on, we will not distinguish between an inode and its extent when there is no confusion. Since a structural index is completely determined by its partition of the dnodes, we also do not distinguish between an index and its partition of the dnodes. We define $\mathit{Succ}(I)$ to be $\bigcup_{u \in I} \mathit{Succ}(u)$, the dnode successors of the dnodes in $I$, and $\mathit{ISucc}(I) = \{J \mid (I, J) \in E_I\}$, the index successors of $I$. We use $\mathcal{I}, \mathcal{J}, \ldots$ to represent sets of inodes, and define $\mathit{Succ}(\mathcal{I}) = \bigcup_{I \in \mathcal{I}} \mathit{Succ}(I)$.

Evaluation of path expressions can often be made faster with a structural index $\Phi(G)$ by executing the path expression $R$ on $\Phi(G)$, which is often much smaller than the original data graph $G$. The result of $R$ is contained in the union of the extents of the inodes that match $R$, because any structural index that is constructed by the procedure above is safe. However, not all structural indexes are precise, i.e., the result of some query $R$ on $\Phi(G)$ may contain false positives.
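The three-step construction procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_structural_index`, the toy graph, and the choice of partitioning dnodes by label are hypothetical and are not taken from the paper; any equivalence relation could be supplied through `class_of`.

```python
from collections import defaultdict

def build_structural_index(dedges, class_of):
    """Build (extents, iedges) from a partition of the dnodes.

    dedges   -- iterable of (u, v) data edges
    class_of -- dict mapping each dnode to the key of its equivalence class
                (step 1 of the procedure; supplied by the caller)
    """
    # Step 2: one inode per equivalence class, whose extent is the class.
    extents = defaultdict(set)
    for dnode, cls in class_of.items():
        extents[cls].add(dnode)
    # Step 3: an iedge (I, J) exists iff some dedge goes from I's extent to J's.
    iedges = {(class_of[u], class_of[v]) for u, v in dedges}
    return dict(extents), iedges

# Hypothetical toy data graph: dnode -> label, plus its dedges.
labels = {0: "ROOT", 1: "person", 2: "person", 3: "item", 4: "item"}
dedges = [(0, 1), (0, 2), (1, 3), (2, 4)]

# Illustrative partition: group dnodes by label only.
extents, iedges = build_structural_index(dedges, labels)
```

Here the index has three inodes for five dnodes. A path expression evaluated on `iedges` returns whole extents, which is why the result is safe but, for a coarse partition, not necessarily precise.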
Different structural indexes can be obtained by choosing different equivalence relations in step (1) above. The 1-index [11] uses bisimilarity [13] to partition the dnodes. For our purpose, we use the following equivalent definition for the 1-index based on the notion of stability [12]:

Definition 1. An inode $I$ is stable with respect to $J$ if either $I \subseteq \mathit{Succ}(J)$ or $I \cap \mathit{Succ}(J) = \emptyset$. For data graph $G$, an index $\Phi(G)$ is stable w.r.t. index $\Phi'(G)$ if for any inodes $I \in \Phi(G)$ and $I' \in \Phi'(G)$, $I$ is stable w.r.t. $I'$.

Definition 2. A structural index $\Phi(G)$ is called a 1-index if (1) all dnodes in any inode of $\Phi(G)$ have the same label, and (2) it is stable with respect to itself. A minimum 1-index is the 1-index with the minimum number of inodes.

Note that if $I$ is not stable w.r.t. $J$, we can make it stable by splitting $I$ into $I \cap \mathit{Succ}(J)$ and $I - \mathit{Succ}(J)$. This is the basic operation for ensuring the correctness of the index in the construction algorithm of [12] and in our algorithms. There may be more than one 1-index for a given data graph, all of which can be used in the same way for query evaluation. Of course they differ in performance: the smaller the index, the better the performance. The best one is the minimum 1-index, while the worst is the data graph itself (also a valid 1-index), where we do not gain anything from using it. In [12], the following result gives the relationship between the minimum 1-index and other 1-indexes.

Definition 3. For data graph $G$, a structural index $\Phi(G)$ is a refinement of another index $\Phi'(G)$ if for any inode $I \in \Phi(G)$, there exists an inode $I' \in \Phi'(G)$ such that $I \subseteq I'$.

Lemma 1. There is a unique minimum 1-index for any given data graph, and any other 1-index is a refinement of the minimum 1-index.

However, even the minimum 1-index can sometimes have too many inodes, especially for highly irregular data graphs, resulting in poor query performance. To alleviate the problem, the A(k)-index [9] was proposed to shrink the index size by using k-bisimilarity to partition dnodes. We use the following equivalent definition for the A(k)-indexes.

Definition 4. Given any data graph $G$, the A(0)-index is the structural index obtained by simply partitioning the dnodes of $G$ by their labels.
For $1 \le i \le k$, a structural index $\Phi(G)$ is called an A(i)-index if there exists an A(i-1)-index $\Phi'(G)$ such that $\Phi(G)$ is a refinement of $\Phi'(G)$ and it is stable with respect to $\Phi'(G)$. A minimum A(k)-index is the A(k)-index with the minimum number of inodes.

Note that the A(k)-index is not precise any more, because it only preserves paths of length up to $k$. For path expressions longer than $k$, it may generate false positives and we need a validation step on the original data graph to eliminate them. Nevertheless, in [9], it was shown by experiments that even with this extra validation step, the total evaluation cost is much less than that of the 1-index, due to the small sizes of the A(k)-indexes, for typical values of $k = 2, \ldots, 5$. A result parallel to Lemma 1 holds for the A(k)-index [9]:

Lemma 2. For any given data graph $G$, there is a unique minimum A(k)-index. Any other A(k)-index is a refinement of the minimum A(k)-index.

Quality of indexes. When there are updates to the data graph $G$, it is sometimes difficult and costly to maintain the minimum index, but as discussed before, there are many correct indexes and any of them can be used for query processing in the same way as the minimum index. However, they range from the minimum index to the data graph itself, and hence differ greatly in performance [9, 17]. Thus, we would like to keep the index size as small as possible when doing maintenance. To measure the effectiveness of our update algorithms, we define the quality of the index to be

$$\frac{\#\,\text{inodes in the index}}{\#\,\text{inodes in the minimum index}} - 1,$$

which we would like to keep as close to zero as possible. Note that this is the same metric used by [8] to measure the quality of the index after a sequence of updates.

4. ALGORITHMS OVERVIEW
The basic idea behind our new update algorithms is to iteratively make local improvements after correctness is first ensured. All our algorithms consist of a split phase and a merge phase. Therefore we will sometimes generally refer to them as split/merge algorithms. The split phase uses ideas from the index construction algorithms to first make the index correct by splitting some inodes, while the merge phase tries to merge nearby inodes together without violating any constraint, one pair at a time, until no more merges can be made. Both the split and merge phases are carried out in an iterative and local manner: we start from the newly inserted (or deleted) edge, and proceed step by step. In each step, we try to split (or merge) the children of some new inode generated by previous splits (or merges). The nice property of our algorithms is that, although these operations are carried out in a local manner, each inode in the resulting index cannot be merged with any other inode without violating the stability constraint. We say that such an index is minimal. The precise definitions of minimal indexes take slightly different forms for the 1-index and the A(k)-index, so we defer them to the respective sections.

From Lemmas 1 and 2, we know that the minimum index is unique. However, there might be more than one minimal index for a given data graph. Nonetheless, in many cases we can prove that there is a unique minimal index, namely for 1-indexes of acyclic data graphs and for A(k)-indexes in general. In these cases, our algorithms can further guarantee that the minimum index is always maintained.

5. UPDATES FOR THE 1-INDEX
5.1 Edge Insertion and Deletion
The algorithms.
We first use a running example to demonstrate how our algorithm updates the 1-index when a dedge is inserted into the data graph. See Figure 2. The data graph is shown in (a), where we use letters to represent labels and numbers to represent dnodes. The new dedge to be inserted is shown with a dashed line. The 1-index before the update is shown in (b), where the inodes' extents are shown in brackets. The split phase first checks if there is an iedge between the two inodes containing the source and sink of the new dedge. In this case there is not, so we split the inode {3, 4} in (b) into an inode that contains dnode 4 and one that contains the rest of the dnodes (Figure 2(c)). Then, this split triggers the split of inode {6, 7} because it now becomes unstable with respect to the two new inodes resulting from the previous split (Figure 2(d)). Now, every inode is stable with respect to every inode and the split phase ends. The merge phase starts by looking for an inode among the siblings of {4}, the inode containing the sink of the new dedge, to see if there is an inode that has the same label and the same set of index parents. We find inode {5} in this case and then merge inodes {4} and {5} together (Figure 2(e)). Next, we iteratively consider the possible merges among the children of newly generated inodes from previous merges. In this example, we will merge inodes {7} and {8} together. The final result of the update is shown in Figure 2(f).

More formally, our algorithm first checks if the new edge $(u, v)$ makes $v$ not bisimilar with the rest of the dnodes in $I[v]$. If yes, we split $I[v]$ into one inode containing $v$ itself and one that contains the rest of the dnodes. A compound block is a set of inodes that are the new inodes resulting from a previous split. The split phase basically uses the Paige-Tarjan 1-index construction algorithm to iteratively split inodes until we get a partition that is stable with respect to itself (hence correct). We start with only one compound block consisting of the two new inodes. In each of the split steps, we take out a compound block $\mathcal{I}$, pick an inode $I \in \mathcal{I}$ such that $|I| \le \frac{1}{2} \sum_{J \in \mathcal{I}} |J|$, and make all other inodes stable with respect to $\mathit{Succ}(I)$ and $\mathit{Succ}(\mathcal{I} - \{I\})$.
This in turn may split other inodes and generate new compound blocks, which are added to the queue of compound blocks. The split phase ends when the queue is empty.

The merge phase starts from $I[v]$ and tries to merge inodes together iteratively until no more merges can be made. We first look for an inode with the same label and index parents as $I[v]$. If one exists, we merge it with $I[v]$, and put the newly merged inode into a queue of merged inodes. In each of the following merge steps, we take out one merged inode $I$ from the queue, and consider the possible merges among the index successors of $I$. We also add newly merged inodes into the queue. The merge phase ends when the queue is empty.

Our complete 1-index edge insertion algorithm is described in Figure 3. The edge deletion algorithm differs only slightly. For simplicity of presentation, we assume that there are no self-cycles in the 1-index (i.e., an inode that points to itself), which is true for virtually all XML databases. Our algorithms can be modified to take care of self-cycles as well, only that some details get a little messy. Note that in the algorithm description we only specify how the partition of dnodes gets updated; we do not bother to state explicitly how iedges are handled, because they are completely determined from the partition by the definition of structural indexes (Section 3). These iedges can also be easily maintained as we update the inode extents, using techniques similar to those in [8].

Efficacy. Now we give the formal definition for minimal 1-indexes.

Definition 5. For data graph $G$, a 1-index $\Phi(G)$ is minimal if for any two inodes $I, J \in \Phi(G)$, either (1) they have different labels, or (2) there exists an inode $K \in \Phi(G)$ such that $I \cup J$ is not stable with respect to $K$.

For example, for the data graph in Figure 2(a) after the dedge insertion, the index in 2(f) is a minimal 1-index (and minimum at the same time), the ones in 2(d) and (e) are not minimal, and the one in 2(c) is not even a valid 1-index. Note that a 1-index is minimal if and only if it has no two inodes that have the same label and the same set of index parents, which follows directly from the definition of stability.

[Figure 2: An example of updating the 1-index after a dedge insertion. Panels: (a) data graph; (b) old 1-index, with inodes {3,4}, {5}, {6,7}, {8}; (c) split phase begins, with inodes {3}, {4}, {5}, {6,7}, {8}; (d) split phase ends, with inodes {3}, {4}, {5}, {6}, {7}, {8}; (e) merge phase begins, with inodes {3}, {4,5}, {6}, {7}, {8}; (f) merge phase ends, with inodes {3}, {4,5}, {6}, {7,8}.]

    procedure insert_1_index_edge(u, v)
    begin
      add a dedge from u to v;
      if there is an iedge from I[u] to I[v] then return;
      /* replace the 2 lines above with the following for deletions:
         delete the dedge from u to v;
         if there exist u' ∈ I[u], v' ∈ I[v] and there is a dedge from u' to v' then return; */
      // split phase
      Q = ∅;
      if |I[v]| > 1 then
        split I[v] into I_1 = {v} and I_2 = I[v] − {v};
        Q = {{I_1, I_2}};
      while Q ≠ ∅ do
        pick any 𝓘 ∈ Q, remove it from Q;
        pick I ∈ 𝓘 s.t. |I| ≤ (1/2) Σ_{J ∈ 𝓘} |J|;
        if |𝓘| ≥ 3 then insert 𝓘 − {I} into Q;
        foreach inode K ∈ ISucc(I) do
          split K into K_1 = K ∩ Succ(I) and K_2 = K − K_1;
          split K_1 into K_11 = K_1 ∩ Succ(𝓘 − {I}) and K_12 = K_1 − K_11;
          let 𝓚 = {K_11, K_12, K_2} − {∅};
          if |𝓚| ≥ 2 then
            if ∃ 𝓙 ∈ Q s.t. K ∈ 𝓙 then replace K in 𝓙 with the inodes in 𝓚;
            else add 𝓚 to Q;
      // merge phase
      Q = ∅;
      look for an inode J with the same label as v among I[v]'s siblings
        that has the same set of index parents as I[v];
      if such an inode J exists then
        merge I[v] and J into K = J ∪ I[v];
        Q = {K};
      while Q ≠ ∅ do
        pick any I ∈ Q, remove I from Q;
        let 𝓘 = ISucc(I);
        partition 𝓘 into equivalence classes according to their labels and index parents;
        foreach equivalence class 𝓙 ⊆ 𝓘 do
          if |𝓙| ≥ 2 then
            merge the inodes in 𝓙 into J = ∪𝓙;
            Q = Q − 𝓙;
            insert J into Q;
    end

Figure 3: Insert an edge into a 1-index.

Minimal 1-indexes might not be unique. For example, the indexes in Figure 4(b) and 4(c) are both minimal 1-indexes for the data graph in 4(a), but only the one in 4(b) is minimum.

[Figure 4: Minimal 1-indexes might not be unique. Panels: (a) data graph; (b) minimum 1-index; (c) minimal (but not minimum) 1-index.]

Lemma 3. If the 1-index before the update is minimal, the new index generated by the split/merge algorithm is also a minimal 1-index.

Proof.
Let $(u, v)$ be the dedge just inserted (or deleted). The algorithm first checks if this edge update changes any index predecessor-successor relations. If not, it simply returns. The resulting index is still a minimal 1-index simply because the index before the update is a minimal 1-index.

Assume now that the update indeed causes some changes to the index. Let us call the data graph before the update $G_0$, and the one after the update $G_2$. Imagine we relabel $v$ of $G_2$ with a new label that is different from all others, and call this data graph $G_1$. We call the 1-index before the update $\Phi_0(G_0)$, the one after the split phase but before the merge phase $\Phi_1(G_1)$ (with relabeled $v$), and the final 1-index $\Phi_2(G_2)$. We will show that if $\Phi_0(G_0)$ is a minimal 1-index, then $\Phi_1(G_1)$ and $\Phi_2(G_2)$ are both minimal 1-indexes, too.

If $v$ is in an inode by itself in $\Phi_0(G_0)$, the split phase does nothing. In this case, the only inode in $\Phi_1(G_1)$ that may have a different set of index parents than in $\Phi_0(G_0)$ is $I[v]$, which by definition has a distinguished label in $\Phi_1(G_1)$; therefore it cannot be merged with any other inode. So $\Phi_1(G_1)$ is minimal in this case.

Suppose otherwise that $v$ shares an inode with some other dnodes in $\Phi_0(G_0)$. After the insertion, $v$ has a different set of index parents than these other dnodes, and is then singled out by the split phase, which afterwards propagates the split using the Paige-Tarjan algorithm. That $\Phi_1(G_1)$ is indeed a correct 1-index follows from the correctness of the Paige-Tarjan algorithm, which always returns the coarsest self-stable refinement of the starting partition. To see that it is also minimal, we define the index parents of a dnode $w$ to be the set of inodes each of which contains at least one of $w$'s parents, i.e., the set $\{I[w'] \mid w \in \mathit{Succ}(w')\}$, and we will show that the split phase maintains the invariant that no two dnodes in different inodes have the same label and the same index parents. Note that if the index is a valid 1-index, the index parents of any dnode $w$ are the same as the index parents of $I[w]$, so this invariant is true in a 1-index if and only if this 1-index is minimal.

The invariant is true before the split phase because $\Phi_0(G_0)$ is a minimal 1-index. It is still maintained when we relabel $v$ and single it out as a separate inode, because $v$'s label is now different from any others. In each of the following split steps, whenever we split an inode into two, the two newly generated inodes must have at least one different index parent; otherwise they would not have been split. Since we already know $\Phi_1(G_1)$ is a 1-index, it is also minimal because the invariant holds.

Next, we need to show that $\Phi_2(G_2)$ is a minimal 1-index, which is the result of labeling $v$ back to its original label and applying the merge phase on $\Phi_1(G_1)$. It is easy to see that it is a 1-index because the merge phase only merges inodes that have the same label and index parents. Since $\Phi_1(G_1)$ is minimal, and the only difference between $G_1$ and $G_2$ is $v$'s label, the only two inodes that may have the same label and the same index parents in $\Phi_1(G_2)$ are $I[v]$ and some other inode. The merge phase starts exactly by looking for this only possible merge. Further notice that after two inodes are merged, this can only trigger new possible merges among the inode successors of the newly merged inode, because the index parents of all other inodes remain unchanged. Therefore, when the merge phase completes, no two inodes in $\Phi_2(G_2)$ can be merged, so $\Phi_2(G_2)$ is minimal.

Keeping the 1-index minimal is probably the best one can do with reasonable cost, since it is much cheaper to check if the 1-index is minimal, as our algorithms do, than to determine if it is minimum.
For example, in order to find out that the 1-index in Figure 4 is not minimum, we need to be able to detect two merges simultaneously, and the number of such simultaneous merges might be as high as $\Theta(n)$. In practice, it is often good enough to keep the 1-index minimal, and in many cases, the minimal 1-index indeed turns out to be the minimum 1-index. Even if we are unlucky enough to get stuck in a minimal 1-index that is not minimum, our experiments show that the difference between the two is often very small.

Many data graphs are acyclic. For example, in a bibliography database, if we want to model the reference relations with IDREF edges, the result is an acyclic graph, as a paper can only reference papers that appear earlier in time. Many other XML databases that model hierarchical relations are naturally acyclic, or even trees. For such databases, our algorithms can provide an even stronger guarantee that the minimum 1-index is always maintained, because the minimal 1-index is unique in this case.

Lemma 4. For any acyclic data graph G, there is a unique minimal 1-index $\Phi(G)$, which is also minimum.

Proof. First we show that any 1-index of G is also acyclic. Suppose there was a cycle in the 1-index. By definition, for any iedge $I \to J$, any dnode in J has at least one parent in I. By following iedges backwards in a cycle, we know that there exists a path of arbitrary length in G, which could only happen if G is cyclic, too.

Suppose that $\Phi(G)$ is the minimum 1-index and $\Phi'(G)$ is a minimal 1-index different from $\Phi(G)$. We order the inodes in $\Phi(G)$ topologically and pick the first inode I that does not appear in $\Phi'(G)$. By Lemma 1, $\Phi'(G)$ is a refinement of $\Phi(G)$, so there exist at least two inodes $I_1, I_2 \in \Phi'(G)$ such that $I_1 \subset I$, $I_2 \subset I$, and $I_1$ and $I_2$ have the same label. For any index parent J of I, J also appears in $\Phi'(G)$ because J is before I in the topological order. Then J is also an index parent of $I_1$ and $I_2$ in $\Phi'(G)$ because each dnode in I has at least one parent in J. For any inode J that is not an index parent of I, J cannot be an index parent of $I_1$ or $I_2$ either, because J does not contain any parent of any dnode in I, and both $I_1$ and $I_2$ are subsets of I. So $I_1$ and $I_2$ have the same set of index parents, which are the same as those of I. This contradicts the fact that $\Phi'(G)$ is minimal.
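The minimality invariant used in the proofs above (no two inodes share both a label and a set of index parents) can be checked directly. The following is a minimal sketch, not the authors' implementation: the names `index_parents` and `is_minimal`, and the representation of an index as a list of dnode sets, are assumptions made for illustration. It also assumes that the given partition is already a valid 1-index.

```python
def index_parents(partition, parents):
    """For each inode (a set of dnodes), return the set of inode positions
    that contain at least one parent of one of its dnodes."""
    inode_of = {w: i for i, block in enumerate(partition) for w in block}
    return [frozenset(inode_of[p] for w in block for p in parents.get(w, ()))
            for block in partition]


def is_minimal(partition, parents, label):
    """True if no two inodes share both label and index parents.
    Assumes `partition` is a valid 1-index, so all dnodes in an inode
    carry the same label."""
    ip = index_parents(partition, parents)
    seen = set()
    for i, block in enumerate(partition):
        key = (label[next(iter(block))], ip[i])
        if key in seen:
            return False
        seen.add(key)
    return True


# Hypothetical data graph: root 0 with children 1 and 2 (label "a"),
# which have children 3 and 4 (label "b") respectively.
parents = {1: [0], 2: [0], 3: [1], 4: [2]}
label = {0: "r", 1: "a", 2: "a", 3: "b", 4: "b"}
fine = [{0}, {1}, {2}, {3}, {4}]     # valid but mergeable
coarse = [{0}, {1, 2}, {3, 4}]       # nothing left to merge
```

Here `is_minimal(fine, parents, label)` is false because inodes {1} and {2} have the same label and the same index parent, while `is_minimal(coarse, parents, label)` is true. The check is a single pass over the index, which illustrates why testing minimality is cheap compared with deciding whether an index is minimum.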
Combining Lemmas 3 and 4, we have:

Theorem 1. For acyclic data graphs, the split/merge algorithm always maintains the minimum 1-index during edge insertions and deletions. For cyclic data graphs it always maintains a minimal 1-index.

Efficiency. Theorem 1 gives a theoretical guarantee on the efficacy of the split/merge algorithm, but how costly is it in terms of computation? We continue to use $\Phi_0(G_0)$, $\Phi_1(G_2)$, $\Phi_2(G_2)$ to denote the index before the update, between the split and merge phases, and after the update, respectively. It is easy to see that the numbers of split and merge operations are $|\Phi_1(G_2)| - |\Phi_0(G_0)|$ and $|\Phi_1(G_2)| - |\Phi_2(G_2)|$, respectively. The first part is essentially the cost of the propagate algorithm, while the second part is the minimum number of merges required to shrink the intermediate result down to minimal.

Unfortunately, in the worst case, the intermediate index $\Phi_1(G_2)$ could have many more nodes than the index before or after the update. See for example Figure 5, where the triangles represent two subtrees with the same structure. By arbitrarily enlarging these subtrees, we can have an intermediate index that has $\Omega(n)$ more nodes than the old or the new index. This is also a problem for the propagate algorithm and was identified in [8]. Nevertheless, the worst-case example of Figure 5 is rather contrived and is rare in practice. As observed by [8], as well as in our own experiments with both real-life and benchmark data, the intermediate index on average has only 0.01% more nodes, which means that the update algorithm is really incremental, operating only on a very small fraction of the whole index.

Since we have an additional merge phase, the cost of the split/merge algorithm is certainly higher than that of the propagate algorithm, but we feel the merge phase is always worth doing, not only because it gives us a nice theoretical guarantee on the quality of the resulting index, but also for the following practical considerations: (1) With the merge step, we can effectively keep the index size small, leading to a much lower reconstruction frequency. (2) The merge phase always makes the index smaller, hence yields higher query performance. Typically we have more queries than updates, so the effort spent in improving the quality of the index can very likely be paid back by the savings from subsequent query evaluations.
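To make the merge phase discussed above more concrete, here is a minimal sketch under stated assumptions. It follows the description in the proof of minimality: first try to merge $I[v]$ with another inode that has the same label and index parents, then repeatedly look for new merges among the inode successors of each newly merged inode. The function name `merge_phase`, the dictionary-based index representation, and the linear scan used to find the first merge partner are all illustrative choices, not the authors' implementation.

```python
def merge_phase(blocks, parents, children, label, v):
    """Merge inodes that share a label and index parents, starting at I[v].

    blocks:   dict inode id -> set of dnodes (modified in place)
    parents:  dict dnode -> iterable of parent dnodes
    children: dict dnode -> iterable of child dnodes
    label:    dict dnode -> label
    """
    inode_of = {w: b for b, ws in blocks.items() for w in ws}

    def key(b):
        # (label, index parents) of inode b
        ip = frozenset(inode_of[p] for w in blocks[b] for p in parents.get(w, ()))
        return label[next(iter(blocks[b]))], ip

    def merge(group):
        # Fold every inode of the group into the one with the smallest id.
        keep, *rest = sorted(group)
        for b in rest:
            for w in blocks.pop(b):
                blocks[keep].add(w)
                inode_of[w] = keep
        return keep

    start = inode_of[v]
    start_key = key(start)
    twins = [b for b in blocks if b != start and key(b) == start_key]
    queue = [merge([start] + twins)] if twins else []
    while queue:
        j = queue.pop()
        if j not in blocks:          # absorbed by a later merge
            continue
        succ = {inode_of[c] for w in blocks[j] for c in children.get(w, ())}
        classes = {}
        for b in succ:
            classes.setdefault(key(b), []).append(b)
        for group in classes.values():
            if len(group) >= 2:
                queue.append(merge(group))
    return blocks


# Hypothetical example: 1 and 2 become mergeable, which then lets 3 and 4 merge.
parents = {1: [0], 2: [0], 3: [1], 4: [2]}
children = {0: [1, 2], 1: [3], 2: [4]}
label = {0: "r", 1: "a", 2: "a", 3: "b", 4: "b"}
blocks = {0: {0}, 1: {1}, 2: {2}, 3: {3}, 4: {4}}
merged = merge_phase(blocks, parents, children, label, v=2)
```

In this example two merges are performed, so the index shrinks from five inodes to three. Only the successors of freshly merged inodes are ever inspected, which is what keeps the merge phase incremental in the typical case.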

Finally, as an implementation note, when we split inodes using $\mathrm{Succ}(I)$ (or $\mathrm{Succ}(\mathcal{I} - \{I\})$), we can in fact split all inodes containing at least one dnode in $\mathrm{Succ}(I)$ at the same time by scanning $\mathrm{Succ}(I)$ once and creating $K \cap \mathrm{Succ}(I)$ for each K. The same technique is used in [12, 8].

Figure 5: Update cost could be high in the worst case. Panels: (a) Data graph, (b) Old 1-index, (c) Intermediate 1-index, (d) Final 1-index.

5.2 Subgraph Addition

We model a subgraph also as a labeled, rooted directed graph, which can certainly be added by inserting its dnodes and their incident dedges one by one using our edge insertion algorithm. But since subgraph addition occurs so frequently, we design a more efficient algorithm that performs the insertions in a batched manner. The basic idea is to build the 1-index first for the new subgraph, and then add all the edges between the new subgraph and the existing data graph using the edge insertion algorithm. Note that the root of the new subgraph must be in an inode by itself in the 1-index of the subgraph. As an optimization, we can insert all the incoming edges to the root of the subgraph and then perform the merge phase just once. The algorithm add_1_index_subgraph is shown in Figure 6.

    procedure add_1_index_subgraph($G'$)
    begin
        build the 1-index $\Phi'(G')$ for the new subgraph $G'$;
        union $\Phi'(G')$ with the current 1-index $\Phi(G)$;
        add all incoming dedges that go into r, the root of $G'$;
        do the merge phase of insert_1_index_edge starting at $I[r]$;
        foreach other dedge (u, v) between $G'$ and $G$ do
            insert_1_index_edge(u, v);
    end

Figure 6: Add a subgraph in 1-index.

The following corollary follows from Theorem 1.

Corollary 1. Algorithm add_1_index_subgraph maintains the minimum 1-index for acyclic data graphs and a minimal 1-index for cyclic data graphs.

Naturally one would like to delete subgraphs efficiently as well. This is easy, too. Have a special node with a distinguished label DELETE, and add a dedge from this node to the root of the subgraph that we want to delete. This new dedge will single out this subgraph from the rest of the index, and then we can just delete it from the index.

6. UPDATES FOR THE A(K)-INDEX

The algorithm. Our ideas and techniques for updating the 1-index can be extended to handle updates for the A(k)-index as well. As identified in [8], the A(k)-index is difficult to maintain by itself because updating it requires information contained in an A(k-1)-index. Thus, the basic idea in our algorithm is to maintain all the A(0), A(1), ..., A(k)-indexes together using our 1-index update algorithms. When maintaining the A(i)-index, we use the A(i-1)-index as a reference to make split and merge decisions. We will first describe the algorithm, and then discuss how to implement it in a space- and time-efficient manner. Note that all these A(i)-indexes can be easily obtained while we build the A(k)-index; in fact, the construction algorithm [9] builds all the A(0), A(1), ..., A(k)-indexes in order. In the following, we only consider edge insertions and deletions; subgraph addition can be done in a way very similar to what we did for the 1-index.

The A(k)-index update algorithm also consists of a split phase to guarantee correctness and a merge phase to ensure minimality. Suppose the new edge to be inserted (or deleted) is (u, v). We first look for the largest i such that the A(i)-index will not be affected by the edge update. The split phase first creates a new inode containing v itself for each of the A(i+1) to A(k)-indexes. These initial splits generate a number of compound blocks (in the 1-index, we have only one compound block at the beginning), and we put them in a queue. Afterwards, we iteratively split other inodes in a way very similar to what we did for the 1-index. The only difference is that, when we stabilize other inodes with respect to a compound block in the A(i)-index, we need to consider all the inodes in the A(i+1) to A(k)-indexes.

The merge phase also proceeds similarly to that of the 1-index. For each of the affected A(i)-indexes, we first try to merge the inode containing v with another inode. Next, we iteratively merge other inodes together. In each step, if I is a new inode in the A(i)-index generated from a previous merge, we consider the possible merges among the inodes in the A(i+1)-index that contain at least one dnode with a parent in I.
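Because the update algorithm works on the whole family A(0), ..., A(k), it may help to see how such a family can be produced level by level. The sketch below is an illustration only, not the construction algorithm of [9]: it assumes the usual definition in which two dnodes are 0-bisimilar when they have the same label, and (i+1)-bisimilar when they are i-bisimilar and their parents cover the same set of A(i) classes. The function name `ak_partitions` and the dictionary representation are hypothetical.

```python
def ak_partitions(nodes, parents, label, k):
    """Return [cls_0, ..., cls_k]; cls_i maps each dnode to its A(i) class id."""
    ids = {}
    cls = {w: ids.setdefault(label[w], len(ids)) for w in nodes}
    levels = [cls]
    for _ in range(k):
        ids = {}
        nxt = {}
        for w in nodes:
            # Signature: own class at level i plus the classes of all parents.
            sig = (cls[w], frozenset(cls[p] for p in parents.get(w, ())))
            nxt[w] = ids.setdefault(sig, len(ids))
        cls = nxt
        levels.append(cls)
    return levels


# Hypothetical data graph: two branches that differ only near the root.
parents = {1: [0], 2: [0], 3: [1], 4: [2], 5: [3], 6: [4]}
label = {0: "r", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
levels = ak_partitions(range(7), parents, label, k=2)
sizes = [len(set(level.values())) for level in levels]
```

In this example dnodes 5 and 6 are still together in A(1) but are separated in A(2), and the number of classes grows from level to level. Since each signature includes the class from the previous level, every A(i+1) class lies inside a single A(i) class, which is the refinement property the maintenance algorithm relies on.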
The detailed edge insertion (and deletion) algorithm for the A(k)-index is shown in Figure 7. We use $\Phi^{(i)}(G)$ to denote the A(i)-index of data graph G; $I^{(i)}$, $J^{(i)}$ are some inodes in the A(i)-index; $I^{(i)}[v]$ denotes the inode in the A(i)-index that contains dnode v. We also use $\mathcal{I}^{(i)}$, $\mathcal{J}^{(i)}$ to denote sets of inodes in the A(i)-index.

    procedure insert_A(k)_index_edge(u, v)
    begin
        find the largest i s.t. $v \in \mathrm{Succ}(I^{(i)}[u])$,
        if no such i exists, set $i = -1$;
        add a dedge from u to v;
        /* replace the three lines above with the following for deletions:
           delete the dedge from u to v;
           find the largest i s.t. $v \in \mathrm{Succ}(I^{(i)}[u])$,
           if no such i exists, set $i = -1$; */
        // split phase begins
        $Q = \emptyset$;
        for $j = i+2$ to k
            if $|I^{(j)}[v]| > 1$ then
                split $I^{(j)}[v]$ into $I^{(j)}_1 = \{v\}$ and $I^{(j)}_2 = I^{(j)}[v] - \{v\}$;
                if $j \le k-1$ then insert $\{I^{(j)}_1, I^{(j)}_2\}$ into Q;
        // iterate to split others
        while $Q \ne \emptyset$ do
            pick any $\mathcal{I}^{(j)} \in Q$ with the smallest j;
            remove $\mathcal{I}^{(j)}$ from Q;
            pick $I^{(j)} \in \mathcal{I}^{(j)}$ s.t. $|I^{(j)}| \le \frac{1}{2} \sum_{J^{(j)} \in \mathcal{I}^{(j)}} |J^{(j)}|$;
            if $|\mathcal{I}^{(j)}| \ge 3$ then insert $\mathcal{I}^{(j)} - \{I^{(j)}\}$ into Q;
            foreach inode $K^{(l)}$, $j+1 \le l \le k$ do
                split $K^{(l)}$ into $K^{(l)}_1 = K^{(l)} \cap \mathrm{Succ}(I^{(j)})$ and $K^{(l)}_2 = K^{(l)} - K^{(l)}_1$;
                split $K^{(l)}_1$ into $K^{(l)}_{11} = K^{(l)}_1 \cap \mathrm{Succ}(\mathcal{I}^{(j)} - \{I^{(j)}\})$ and $K^{(l)}_{12} = K^{(l)}_1 - K^{(l)}_{11}$;
                if $l \le k-1$ then
                    let $\mathcal{K}^{(l)} = \{K^{(l)}_{11}, K^{(l)}_{12}, K^{(l)}_2\} - \{\emptyset\}$;
                    if $|\mathcal{K}^{(l)}| \ge 2$ then
                        if $\exists \mathcal{J}^{(l)} \in Q$ s.t. $K^{(l)} \in \mathcal{J}^{(l)}$ then
                            replace $K^{(l)}$ in $\mathcal{J}^{(l)}$ with the inodes in $\mathcal{K}^{(l)}$;
                        else add $\mathcal{K}^{(l)}$ to Q;
        // merge phase begins
        for $j = i+2$ to k do
            $Q = \emptyset$;
            look for an inode $I^{(j)} \subseteq I^{(j-1)}[v]$ s.t. $I^{(j)}$ has the same index parents in the A(j-1)-index as $I^{(j)}[v]$;
            if such an inode $I^{(j)}$ exists then
                merge $I^{(j)}[v]$ and $I^{(j)}$ into $J^{(j)} = I^{(j)}[v] \cup I^{(j)}$;
                if $j \le k-1$ then insert $J^{(j)}$ into Q;
            // iterate to merge others
            while $Q \ne \emptyset$ do
                pick any $I^{(l)} \in Q$ with the smallest l;
                remove $I^{(l)}$ from Q;
                let $\mathcal{I}^{(l+1)} = \{I^{(l+1)}[w] \mid w \in \mathrm{Succ}(w'), w' \in I^{(l)}\}$;
                partition $\mathcal{I}^{(l+1)}$ into equivalence classes according to their labels and index parents in the A(l)-index;
                foreach equivalence class $\mathcal{J}^{(l+1)} \subseteq \mathcal{I}^{(l+1)}$ do
                    if $|\mathcal{J}^{(l+1)}| \ge 2$ then
                        merge the inodes in $\mathcal{J}^{(l+1)}$ into $J^{(l+1)} = \bigcup \mathcal{J}^{(l+1)}$;
                        if $l \le k-2$ then
                            $Q = Q - \mathcal{J}^{(l+1)}$;
                            insert $J^{(l+1)}$ into Q;
    end

Figure 7: Insert an edge into A(k)-index.

Efficacy. Since we are essentially using our 1-index update algorithm to maintain $\Phi^{(i)}(G)$ with respect to $\Phi^{(i-1)}(G)$ for all $i = 1, \ldots, k$, we can show that our algorithm maintains a minimal set of A(i)-indexes in the following sense.

Definition 6. For any data graph G, the set of A(i)-indexes $\Phi^{(0)}(G), \Phi^{(1)}(G), \ldots, \Phi^{(k)}(G)$ is minimal if for all $1 \le i \le k$, merging any two inodes of $\Phi^{(i)}(G)$ will make it unstable with respect to $\Phi^{(i-1)}(G)$.

Lemma 5. The split/merge algorithm always maintains a minimal set of A(i)-indexes.

Proof. Follow the same lines of reasoning as in the proof of Lemma 3.

Since the set of A(i)-indexes is built in a hierarchical manner, which resembles the nature of acyclic 1-indexes, we have the following result for the A(k)-index for any general data graph.

Lemma 6. For any data graph G, there is a unique minimal set of A(i)-indexes, each of which is also minimum.

Proof. Let $\Phi^{(0)}(G), \ldots, \Phi^{(k)}(G)$ be the minimum A(i)-indexes of G, and $\Psi^{(0)}(G), \ldots, \Psi^{(k)}(G)$ be any minimal set of A(i)-indexes. We will show by induction that $\Phi^{(i)}(G) = \Psi^{(i)}(G)$ for all i. The base case $\Phi^{(0)}(G) = \Psi^{(0)}(G)$ holds by definition. Now suppose $\Phi^{(i)}(G) = \Psi^{(i)}(G)$; then merging any two inodes in $\Psi^{(i+1)}(G)$ will make it unstable with respect to $\Phi^{(i)}(G)$. By Lemma 2, $\Psi^{(i+1)}(G)$ is always a refinement of $\Phi^{(i+1)}(G)$. If $\Phi^{(i+1)}(G) \ne \Psi^{(i+1)}(G)$, we would find at least two inodes in $\Psi^{(i+1)}(G)$ that are contained in the same inode of $\Phi^{(i+1)}(G)$, and merging these two inodes would not cause $\Psi^{(i+1)}(G)$ to be unstable with respect to $\Phi^{(i)}(G)$. So we have $\Phi^{(i+1)}(G) = \Psi^{(i+1)}(G)$.

Combining Lemmas 5 and 6, we have:

Theorem 2. For any data graph G, the split/merge algorithm always maintains the minimum A(k)-index.

Efficiency. At first impression, maintaining all the A(0) to A(k)-indexes would take a lot of space and increase the update cost. Below we describe a structure called the refinement tree, which is designed to exploit the fact that the A(i+1)-index is always a refinement of the A(i)-index. With this tree (a forest in general) we can maintain the A(i)-index on top of the A(i+1)-index, instead of manipulating massive sets of dnodes directly. The refinement tree includes all the nodes in the A(0) to A(k)-indexes. Tree edges are built by linking each inode in the A(i)-index to the inodes in the A(i+1)-index that are contained in this inode (Figure 8).
With this tree structure, there is no longer any need to store the extents of the inodes in all the A(i)-indexes for $0 \le i \le k-1$, as they can be fully recovered from the extents of A(k)-index nodes.

Let us now see how to use the refinement tree to implement the algorithm insert_A(k)_index_edge, or more precisely, the two basic operations split and merge. Merges are easy: If we merge two A(k)-index inodes, we merge their extents as we did for the 1-index. If we merge two A(i)-index inodes for $1 \le i \le k-1$, we simply merge them together without any operation on their extents; all A(i+1)-index inodes that were children of the two old nodes in the refinement tree now become the children of the new node.

Figure 8: Refinement tree: tree edges are shown in dotted lines. Panels: (a) Data graph, (b) A(0), (c) A(1), (d) A(k=2).

Splits need more care. There are two kinds of splits: the initial splits at the beginning of the split phase and the normal splits using $\mathrm{Succ}(I^{(j)})$ or $\mathrm{Succ}(\mathcal{I}^{(j)} - \{I^{(j)}\})$ (ref. Figure 7). All initial splits together create one new inode containing only v for each of the A(j)-indexes, $j = i+2, \ldots, k$, so we just need to split $I^{(k)}[v]$ and then create a new node on each level j of the refinement tree, pointing only to the new tree node on level j+1.

For normal splits using, say, $\mathrm{Succ}(I^{(j)})$, we scan through $\mathrm{Succ}(I^{(j)})$ and split all inodes whose extents intersect that of $\mathrm{Succ}(I^{(j)})$ at the same time. For each dnode $w \in \mathrm{Succ}(I^{(j)})$, there is exactly one inode in the A(l)-index that contains w, for $l = j+1, \ldots, k$. These inodes, denoted by $K^{(j+1)}, \ldots, K^{(k)}$, form a path in the refinement tree. For the A(k)-index inode $K^{(k)}$, we carry out the same procedure as with the 1-index: create an A(k)-index inode $\hat{K}^{(k)}$ for w if necessary (it might have been created already while processing an earlier dnode that is k-bisimilar to w), and then move w from $K^{(k)}$ to $\hat{K}^{(k)}$. For $l = k-1, \ldots, j+1$, we create an A(l)-index inode $\hat{K}^{(l)}$ for w if necessary, and then make $\hat{K}^{(l+1)}$ a child of $\hat{K}^{(l)}$ in the refinement tree. After all dnodes in $\mathrm{Succ}(I^{(j)})$ are scanned, we remove any empty inodes from the A(k)-index, and then any A(l)-index inodes with no children in the refinement tree, for $l = k, k-1, \ldots, j+1$. After all pairs are processed, all splits with respect to $\mathrm{Succ}(I^{(j)})$ are completed. The same procedure applies to $\mathrm{Succ}(\mathcal{I}^{(j)} - \{I^{(j)}\})$. Note that in this way we only deal with the dnodes in the A(k)-index; maintenance of the A(i)-index only involves inodes in the A(i+1)-index, and the cost of doing so decreases rapidly as i gets smaller.
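The refinement tree and its merge operation can be sketched as follows. This is a minimal illustration under assumptions, not the authors' Java implementation: the class name `RNode` and the functions `extent` and `merge` are hypothetical, only the level-k nodes store extents, and splits, intra-iedges and inter-iedges are left out.

```python
class RNode:
    """A node of the refinement tree; `extent` is stored only at level k."""

    def __init__(self, level, parent=None, extent=None):
        self.level = level
        self.parent = parent
        self.children = set()
        self.extent = set(extent) if extent is not None else None
        if parent is not None:
            parent.children.add(self)


def extent(node):
    """Recover the extent of any inode from the A(k)-level extents below it."""
    if node.extent is not None:
        return set(node.extent)
    out = set()
    for child in node.children:
        out |= extent(child)
    return out


def merge(a, b):
    """Merge inode b into inode a of the same level and return a."""
    assert a.level == b.level and a is not b
    if a.extent is not None:
        a.extent |= b.extent            # A(k) level: merge the extents
    else:
        for child in b.children:        # inner level: only re-link children
            child.parent = a
        a.children |= b.children
        b.children = set()
    if b.parent is not None:
        b.parent.children.discard(b)
    b.parent = None
    return a


# Hypothetical three-level tree (k = 2).
root = RNode(0)
x, y = RNode(1, root), RNode(1, root)
leaf1 = RNode(2, x, extent={3})
leaf2 = RNode(2, y, extent={4, 5})
merge(x, y)
```

After the call, `x` has both level-2 nodes as children and `extent(x)` yields {3, 4, 5} without any dnode having been moved. This mirrors the observation above that a merge at an inner level touches only tree edges, so its cost is independent of the size of the extents involved.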
Apart from the refinement tree edges, there are two types of iedges we need to maintain: the normal intra-iedges inside the A(k)-index, used for query processing, and the inter-iedges in the refinement tree that connect inodes in A(i) to their inode successors in A(i+1), which are required in order for the maintenance algorithm to function efficiently. Both types of iedges can be maintained cheaply during the split/merge process. Optionally, one could also maintain the intra-iedges inside the A(i)-indexes for $i = 1, \ldots, k-1$, which will speed up the evaluation of path expressions of length less than k, but we will not explore this option further in this paper.

Although we store more information than the A(k)-index alone, the extra storage overhead is low. We store each dnode only once (in the extent of an A(k)-index inode), and we use only one hash table for the reverse mapping from the dnodes to the A(k)-index inodes. For the A(i)-indexes where i < k, we store only the refinement tree edges and the inter-iedges. Since the number of inodes in the A(i)-index decreases rapidly as i gets smaller, this storage overhead is insignificant compared with the cost of storing extents and the dnode-to-inode mapping, which must be paid by a stand-alone A(k)-index as well.

7. EXPERIMENTS

In this section, we present our experimental study comparing our algorithms with previous methods. All algorithms are implemented in Java. The machine used for experiments is a Dell PowerEdge 2600 with a 2.4GHz Xeon processor and 1GB of RAM, running Linux with JDK 1.4.2. Our machine has enough memory to store everything and no paging is needed during execution. We use the same performance metrics as previous works [8, 17], i.e., we measure efficacy in terms of the quality of the index as defined in Section 3, and efficiency in terms of the wall-clock running time.

We use both benchmark and real-life XML databases in our experiments. The XMark database is generated by the XMark generator from the XML Benchmark Project [2]. It is a highly cyclic and irregular database likely to stress the use of structural indexes. It is 11.7MB in size and consists of 167,865 dnodes and 198,612 dedges, among which 30,747 are IDREF edges.
A sample of this database is shown in Figure 1. Cycles in this database are caused by a large number of person-auction edges. To see how our algorithms handle data graphs with cycles, we intentionally remove a portion of those edges to vary the cyclicity, which we define to be the fraction of such edges remaining. We name these datasets XMark(c), where c is the cyclicity; e.g., XMark(1) is the original XMark database, and XMark(0) contains no person-auction edges and thus no cycles, although they have the same number of dnodes.

The real-life dataset is extracted from the Internet Movie Database (IMDB) [1] in the following way: First we randomly choose a small subset of movies and all people (actors, directors, etc.) associated with these movies. We then extract all other movies associated with these people, and continue this process until the desired database size is reached. For each movie or person, we also extract a substantial amount of other information (e.g. title, year, genre). This dataset consists of 272,567 dnodes and 285,221 dedges, among which 12,654 are IDREF edges. Overall, it is also a highly cyclic and irregular dataset.

7.1 Experiments on the 1-Index

Edge insertions and deletions. For the 1-index edge insertions and deletions, we compare with the propagate algorithm [8]. In this set of experiments we apply a mixed sequence of edge insertions and deletions on both the XMark and IMDB data. For XMark, we select four datasets with cyclicities 1, 0.5, 0.2 and 0 to see how the algorithms perform, since, as suggested by Theorem 1, the performance might be affected by cycles.

In order to generate edge insertions in a meaningful way, we first remove 20% of all the IDREF edges from the data graph. These deleted edges then become a pool of possible insertions. Using the resulting data graph as the starting point, we perform one edge insertion followed by one edge deletion in each step: first a randomly selected edge is removed from the pool and inserted into the data graph, and then another randomly selected edge is deleted from the data graph and put back into the pool. For each dataset, 5000 pairs of edge insertions and deletions are performed.

Since the propagate algorithm does not have any guarantee on the quality of the updated index, the index gets progressively worse over time and it is necessary to reconstruct the index periodically. We used the index reconstruction idea of [8], i.e., run the construction algorithm on top of the index graph (treating it as a data graph), and then blow up each inode of the new index by replacing each inode of the old index with its extent of dnodes. Since we do not know how big the current minimum index is during the course of a sequence of updates, we use the following simple heuristic to trigger index reconstructions: remember the size of the index when it was last reconstructed, and then perform a reconstruction whenever the current index is more than 5% larger than that. Since our split/merge algorithm does not guarantee the minimum 1-index on cyclic data graphs either, we use the same heuristic to trigger reconstruction.

Figure 9: 1-index quality over mixed edge insertions and deletions on IMDB. The plot shows index quality against the number of (insertion, deletion) pairs, from 0 to 5000, for split/merge, propagate, and propagate + reconstruction.

Results for IMDB are shown in Figure 9. The performance of propagate for the first 500 edge updates agrees very well with the previously reported results [8]: around 5% increase in index size. In fact the result reported in [8] was a little better than this, which can be explained by the fact that [8] only did edge insertions, while edge deletions are a little more difficult to handle because the minimum 1-index itself usually shrinks when edges are deleted. After that, we see that its index quality continues to degrade almost linearly with the number of edge updates performed. Thus, a reconstruction is triggered about once every 500 updates.
On the other hand, our split/merge algorithm maintains the index quality very well, never exceeding 3%. This experiment shows that the minimal 1-index maintained by our algorithm is in fact very close to minimum for this dataset.

Results for XMark are shown in Figure 10. An interesting fact is that on these datasets, our split/merge algorithm performs extremely well: its quality curves virtually remain at zero (never exceeding 0.5%). The reason is that the IDREF edges in the XMark datasets are generated more uniformly, while in IMDB they tend to be clustered: related persons are likely to get involved in related movies, creating shorter cycles that make cases similar to Figure 4 more likely than in XMark.

For propagate, we see similar trends for all datasets: its quality curves almost always grow linearly, although the rate varies a lot for different cyclicities. On XMark(1), the index quality is still better than 12% after 10000 edge updates, but on XMark(0) it gets worse very quickly. The reason is that XMark(1) is a highly irregular dataset; even the size of its minimum 1-index is more than 40% of its data graph size. For such a big index, there are very few possible merges during updates, so the propagate algorithm performs relatively well. However, such large 1-indexes usually lead to bad query performance, and we usually turn to other smaller indexes, such as A(k), for these cases. As the cyclicity decreases, the data graph also gets more regular, and the minimum 1-index shrinks. The propagate algorithm then has increasing difficulty in keeping the index fit, and has to perform more frequent reconstructions.

We also measured the average running times over the 10000 edge updates for each dataset. From Figure 11 we can see that the split/merge algorithm is more costly than the propagate algorithm, due to the extra merge phase, but it becomes much faster if we factor in the amortized reconstruction cost (total reconstruction cost divided by 10000). Notice that cyclicity does not seem to affect the performance of the split/merge algorithm, showing that cases like Figure 5 are not common. Finally, note that the index is essentially unusable during the reconstruction, while our split/merge algorithm always responds quickly, thereby making the index more available for queries.

Subgraph additions.
We also conduct experiments on subgraph additions with the XMark data. We extract subgraphs in the following manner. First we randomly select an auction dnode u, and then perform a traversal down starting from u to extract all descendants of u, which form a subgraph. We do not traverse IDREF edges because we want to avoid cycles, and also because the IDREF edges usually represent inter-object relationships that are not integral parts of the entity of interest. In this way we extract 500 subgraphs, with an average size of 50 dnodes. For each dataset, we first delete all these subgraphs, and then insert them one by one. We compare three alternatives: (1) our algorithm of Section 5.2, (2) the same algorithm but using propagate instead of insert_1_index_edge to insert the edges, and (3) the index reconstruction algorithm of [8], which always maintains the minimum 1-index but is extremely costly. We obtain almost the same results again: Our algorithm keeps the quality of the 1-index at 0% almost all the time, while the second alternative keeps increasing the index size and is very sensitive to the structure of the data graph (Figure 12). In terms of running cost, the first two alternatives are both very fast, about 20 msec for each subgraph; the third one is more than 100 times slower because of the costly reconstruction.

Figure 10: 1-index quality over mixed edge insertions and deletions on XMark. Panels: (a) XMark(1), (b) XMark(0.5), (c) XMark(0.2), (d) XMark(0). Each panel plots index quality against the number of (insertion, deletion) pairs, from 0 to 5000, for split/merge, propagate, and propagate + reconstruction.

Figure 11: Running times of 1-index algorithms. The plot shows running time per update (msec) for split/merge, propagate, and amortized reconstruction on XMark(1), XMark(0.5), XMark(0.2), XMark(0) and IMDB.

Figure 12: 1-Index quality during a sequence of subgraph additions with the propagate algorithm. The plot shows index quality against the number of subgraphs added, from 0 to 500, for XMark(1), XMark(0.5), XMark(0.2) and XMark(0).

7.2 Experiments on the A(k)-Index

Since we have a theoretical guarantee that our split/merge algorithm always maintains the minimum A(k)-index, the experiments on the A(k)-index are mainly aimed at efficiency issues, namely the running cost and the additional storage overhead resulting from maintaining all the A(i)-indexes for $0 \le i \le k$. In the experiments, we varied k from 2 to 5, covering the range of k's that give the best performance as reported by [9]. The cyclicities of the datasets are not tweaked here because the performance of our algorithms is not affected by cycles.

We only present the experiments on edge insertions and deletions for the A(k)-index in this paper. We compare with the following simple algorithm, obtained by fixing a minor mistake in the one mentioned at the end of [17]. After a dedge (u, v) is inserted or deleted, we do a breadth-first search to find all the potentially affected dnodes in the data graph. These dnodes are descendants of v up to a maximum depth of k-1. The corresponding inodes containing these dnodes are possibly unstable and need to be partitioned into new inodes according to k-bisimilarity. Since the A(k)-index does not retain enough information to compute k-bisimilarity, we have to go back to the data graph and compute it by definition. Notice that the cost of this simple algorithm is exponential in k.
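The first step of the simple baseline, finding the potentially affected dnodes, is a depth-bounded breadth-first search. The sketch below illustrates only that step and makes several assumptions: the function name `affected_dnodes` and the adjacency-dictionary representation are hypothetical, v itself is counted as affected (depth 0), and k is assumed to be at least 1. The repartitioning by k-bisimilarity is not shown.

```python
from collections import deque


def affected_dnodes(children, v, k):
    """Dnodes reachable from v by following at most k-1 dedges, v included."""
    depth = {v: 0}
    queue = deque([v])
    while queue:
        w = queue.popleft()
        if depth[w] == k - 1:        # do not expand beyond depth k-1
            continue
        for c in children.get(w, ()):
            if c not in depth:       # each dnode is visited once, even on cycles
                depth[c] = depth[w] + 1
                queue.append(c)
    return set(depth)


# Hypothetical data graphs: a chain 1 -> 2 -> 3 -> 4 and a two-node cycle.
chain = {1: [2], 2: [3], 3: [4]}
cycle = {1: [2], 2: [1]}
```

For instance, `affected_dnodes(chain, 1, 3)` returns {1, 2, 3}, and `affected_dnodes(cycle, 1, 5)` returns {1, 2}. The search itself is cheap; the cost noted above comes from recomputing k-bisimilarity on the data graph for the dnodes it returns.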
Since this algorithm does not provide any guarantee on the index quality, we also consider the option of periodic index reconstructions in the experiments, as we did with the 1-index.

For the experiments, we only perform 1000 pairs of insertions and deletions since this is already enough to see a clear trend. The simple algorithm, as expected, blows up the index size rapidly without reconstructions, especially for small k's. The results on the XMark database are shown in Figure 13. The result on IMDB is similar and omitted. When the reconstruction threshold is set to 5%, this simple algorithm triggers frequent reconstructions, as shown in Table 1. Running times of our split/merge algorithm and this sim-