Tree Pattern Aggregation for Scalable XML Data Dissemination

Chee-Yong Chan, Wenfei Fan*, Pascal Felber†, Minos Garofalakis, Rajeev Rastogi
Bell Labs, Lucent Technologies
{cychan,wenfei,minos,rastogi}@research.bell-labs.com, Pascal.Felber@eurecom.fr

* Currently on leave from Temple University and supported in part by NSF Career Award IIS-0093168.
† Current affiliation: Institut EURECOM, Sophia Antipolis, France

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002

Abstract

With the rapid growth of XML-document traffic on the Internet, scalable content-based dissemination of XML documents to a large, dynamic group of consumers has become an important research challenge. To indicate the type of content that they are interested in, data consumers typically specify their subscriptions using some XML pattern specification language (e.g., XPath). Given the large volume of subscribers, system scalability and efficiency mandate the ability to aggregate the set of consumer subscriptions to a smaller set of content specifications, so as to both reduce their storage-space requirements as well as speed up the document-subscription matching process. In this paper, we provide the first systematic study of subscription aggregation where subscriptions are specified with tree patterns (an important subclass of XPath expressions). The main challenge is to aggregate an input set of tree patterns into a smaller set of generalized tree patterns such that: (1) a given space constraint on the total size of the subscriptions is met, and (2) the loss in precision (due to aggregation) during document filtering is minimized. We propose an efficient tree-pattern aggregation algorithm that makes effective use of document-distribution statistics in order to compute a precise set of aggregate tree patterns within the allotted space budget. As part of our solution, we also develop several novel algorithms for tree-pattern containment and minimization, as well as least-upper-bound computation for a set of tree patterns. These results are of interest in their own right, and can prove useful in other domains, such as XML query optimization. Extensive results from a prototype implementation validate our approach.

1 Introduction

XML (eXtensible Markup Language) [16] has become the dominant standard for data encoding and exchange on the Internet, including e-business transactions in both Business-to-Business (B2B) and Business-to-Consumer (B2C) applications. Given the rapid growth of XML traffic on the Internet, the effective and efficient delivery of XML documents has become an important issue. Consequently, there is growing interest in the area of XML content-based filtering and routing (e.g., [4]), which addresses the problem of effectively directing high volumes of XML-document traffic to interested consumers based on document contents. Unlike conventional routing, where packets are routed based on a limited, fixed set of attributes (e.g., source/destination IP addresses and port numbers), content-based routing is based on general patterns of the document contents, which is significantly more flexible and demanding.

Consumers typically specify their subscriptions, indicating the type of XML content that they are interested in, using some XML pattern specification language (e.g., XPath [15]). For each incoming XML document, a content-based router matches the document contents against the set of subscriptions to identify the (sub)set of interested consumers, and then routes the document to them. Thus, in content-based routing, the "destination" of an XML document is generally unknown to the data producer, and is computed dynamically based on the document contents and the active set of subscriptions. Effective support for scalable, content-based XML routing is crucial to enabling efficient and timely delivery of relevant XML documents to a large, dynamic group of consumers.
Given the large volume of potential consumers, system scalability and efficiency mandate the ability to judiciously aggregate the set of consumer subscriptions to a smaller set of content specifications. The goal, of course, is to both reduce the subscriptions' storage space requirements (e.g., so that the routing table fits in main memory), as well as speed up the filtering of incoming XML traffic. For instance, a core router in a B2B application may choose to aggregate subscriptions based on geographical location, affiliation, or domain-specific information (e.g., telecommunications). Subscription aggregation essentially involves aggregating an initial set of subscriptions S into a smaller set A such that any document that matches some subscription in S also matches some subscription in A. However, since there is typically a loss of precision associated with such aggregation, the set of documents matched by the aggregated set A is, in general, a superset of those matched by the original set S. As a result, a document may be routed to consumers who have not subscribed to it, thus resulting in an increase in the amount of unwanted document traffic. In order to avoid such spurious forwarding of documents, it is desirable to minimize the number of such false matches (i.e., minimize the loss in precision) with respect to the given space constraint for the aggregated subscriptions.

[Figure 1: Example Tree Patterns and XML Document Tree. Panels: (a) p_a, (b) p_b, (c) p_c, (d) p_d, (e) T. Node labels include CD, SONY, Bach, *, Classical, Jazz, Pop.]

So far, there has only been limited work on subscription aggregation, mainly for very simple subscription models. For example, in [12], each subscription is a set of attribute-predicate pairs (e.g., {issue = GE; price < 12; volume > 1}), and an aggregated subscription is allowed to contain wildcard values, indicating the entire set of domain values for certain attributes.¹ In this paper, we provide the first systematic study of the subscription aggregation problem where subscriptions are specified using the much more expressive model of tree patterns. Tree patterns represent an important subclass of XPath expressions that offers a natural means for specifying tree-structured constraints in XML and LDAP applications [3]. Compared to earlier work based on attribute/predicate-based subscriptions, effectively aggregating tree patterns poses a much more challenging problem since subscriptions involve both content information (node labels) as well as structure information (parent-child and ancestor-descendant relationships). Briefly, our tree pattern aggregation problem can be stated as follows: Given an input set of tree patterns S and a space constraint, aggregate S into a smaller set of generalized tree patterns that meets the space constraint, and for which the loss in precision due to aggregation is minimized.

Example 1.1 Consider the two similar tree-pattern-based subscriptions p_a and p_b shown in Figure 1, where p_a matches any document with a root element labeled "CD" that has both a sub-element labeled "SONY" as well as a sub-element (with an arbitrary label) that in turn has a sub-element labeled "Bach"; and p_b matches any document that has some element labeled "CD" with a sub-element labeled "Bach". Here the node labeled "*" (wildcard) matches any label, while the node labeled "//" (descendant) matches some (possibly empty) path.
The XML document T shown in Figure 1(e) matches (or satisfies) p_a but not p_b because the sub-element labeled "Bach" in T does not have a parent element labeled "CD". For efficiency reasons, one might want to aggregate the set of tree patterns {p_a, p_b} into a single tree pattern. Two examples of aggregate tree patterns for {p_a, p_b} are p_c and p_d (in Figure 1) since any document that satisfies p_a or p_b also satisfies both p_c and p_d. Although both p_c and p_d have the same number of nodes, p_d is intuitively more precise than p_c with respect to {p_a, p_b} since p_d preserves the ancestor-descendant relationship between the "CD" and "Bach" elements as required by p_a and p_b. Indeed, any XML document that satisfies p_d also satisfies p_c (and thus we say that p_c contains p_d). □

¹ Due to space constraints, a more detailed overview of related work can be found in the appendix.

To the best of our knowledge, our work is the first to address this timely subscription aggregation problem for XML data dissemination. Our main contributions can be summarized as follows.

• We study the properties of tree patterns and develop efficient algorithms for deciding tree pattern containment, minimizing a tree pattern, and computing the most precise aggregate (i.e., the "least upper bound") for a set of patterns. Our results are not only interesting in their own right, but also provide solutions for special cases of our tree pattern aggregation problem.

• We propose a novel, efficient method that exploits coarse statistics on the underlying distribution of XML documents to compute a precise set of aggregate patterns within the allotted space budget. Specifically, our scheme employs the document statistics to estimate the selectivity of a tree pattern, which is also used as a measure of the pattern's preciseness. Thus, our aggregation problem reduces to that of finding a compact set of aggregate patterns with minimal loss in selectivity, for which we present a greedy heuristic.

• We demonstrate experimentally the effectiveness of our approach in computing a space-efficient and precise set of aggregate tree patterns.
The usefulness of our results on tree patterns and their aggregation is not limited to content-based routing, but also extends to other application domains such as the optimization of XML queries involving tree patterns and the processing/dissemination of subscription queries in a multicast environment [9] (where aggregation can be used to reduce server load and network traffic). Further, our work and results are complementary to recent work on efficient indexing structures for XPath expressions [2, 6]. The focus of this earlier research is to speed up document filtering with a given set of XPath subscriptions using appropriate indexing schemes. In contrast, our work focuses on effectively reducing the volume of subscriptions that need to be matched in order to ensure scalability given bounded storage resources for routing. Clearly, our techniques can be used as a pre-processing step for the indexes of [2, 6] when hard constraints on the size of the index must be met.

Due to space limitations, the proofs of all theoretical results can be found in the full version of this paper [5].

2 Problem Formulation

2.1 Definitions

A tree pattern is an unordered node-labeled tree that specifies content and structure conditions on an XML document. More specifically, a tree pattern p has a set of nodes, denoted by Nodes(p), where each node v in Nodes(p) has a label, denoted by label(v), which can either be a tag name, "*" (a wildcard that matches any tag), or "//" (the descendant operator). In particular, the root node has a special label "/.". We use Subtree(v, p) to denote the subtree of p rooted at v, referred to as a sub-pattern of p. Some examples of tree patterns are depicted in Figure 2.

To define the semantics of a tree pattern p, we first give the semantics of a sub-pattern Subtree(v, p), where v is not the root node of p. Recall that XML documents are typically represented as node-labeled trees, referred to as XML trees. Let T be an XML tree and t be a node in T. We say that T satisfies Subtree(v, p) at node t, denoted by (T, t) ⊨ Subtree(v, p), if the following conditions hold: (1) if label(v) is a tag, then t has a child node t' labeled label(v) such that for each child node v' of v, (T, t') ⊨ Subtree(v', p); (2) if label(v) = *, then t has a child node t' labeled with an arbitrary tag such that for each child node v' of v, (T, t') ⊨ Subtree(v', p); and (3) if label(v) = //, then t has a descendant node t' (possibly t' = t) such that for each child v' of v, (T, t') ⊨ Subtree(v', p).

We next define the semantics of tree patterns. Let T be an XML tree with root t_root, and p be a tree pattern with root v_root. We say that T satisfies p, denoted by T ⊨ p, iff for each child node v of v_root: (1) if label(v) is a tag a, then t_root is labeled with a and for each child node v' of v, (T, t_root) ⊨ Subtree(v', p) (here label(v) specifies the tag of t_root); (2) if label(v) = *, then t_root may have any label and for each child node v' of v, (T, t_root) ⊨ Subtree(v', p); (3) if label(v) = //, then t_root has a descendant node t' (possibly t' = t_root) such that T' ⊨ p', where T' is the subtree rooted at t', and p' is identical to Subtree(v, p) except that "/." is the label for the root node v (instead of label(v)). Observe that v_root is treated differently from the rest of the nodes of p.
The motivation behind this is illustrated by p_i in Figure 2, which specifies the following: for any XML tree T satisfying p_i, its root must carry a given tag and, moreover, T must contain two given elements consecutively somewhere. This cannot be expressed without our special root label (as tree patterns do not allow a union operator).

Example 2.1 Consider the tree pattern p in Figure 2. An XML document T satisfies p if its root element satisfies all the following conditions: (1) its label is ; (2) it must have a child element with an arbitrary tag, which in turn has a child element with label ; and (3) it must have a descendant element which has both a -child element and a -child element. Thus, p essentially specifies (existential) conjunctive conditions on XML documents. It should be noted that documents satisfying p may have tags/subtrees not mentioned in p. For instance, the root element of T may have a -child element, and the -elements of T may have -descendant elements. □

A tree pattern p is said to be consistent if and only if there exists an XML document that satisfies p. We only consider consistent tree patterns in our work. Further, the tree patterns defined above can be naturally generalized to accommodate simple conditions and predicates (e.g., issue = GE and price < 1). To simplify the discussion, we do not consider such extensions in this paper.

It is worth mentioning that a tree pattern can be easily converted to an equivalent XPath expression [15] in which each sub-pattern is expressed as a condition/qualifier [5]. Thus, our tree patterns are graph representations of a class of XPath expressions, which are similar to the tree patterns that have been studied for XML queries (e.g., [3, 17]). It is tempting to consider using a larger fragment of XPath to express subscription patterns. However, it turns out that even a mild generalization of our tree patterns (e.g., with the addition of union/disjunction operators) leads to much higher complexity (coNP-hard or beyond) for basic operations such as containment computation (e.g., see [1]).

A tree pattern q is said to be contained in another tree pattern p, denoted by q ⊑ p, if and only if for any XML tree T, if T satisfies q then T also satisfies p.
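The satisfaction semantics defined in this section can be sketched in code. The following is a minimal illustration, not the authors' implementation: the `Node` class, the helper `n`, and the function names are assumptions introduced here, and the document tree (with a hypothetical "composer" element) is invented for the example. The patterns mirror p_a and p_b as described in Example 1.1.

```python
from dataclasses import dataclass, field

ROOT, WILDCARD, DESC = "/.", "*", "//"   # special labels used by tree patterns

@dataclass
class Node:
    """A node of either an XML tree or a tree pattern (hypothetical helper type)."""
    label: str
    children: list = field(default_factory=list)

def n(label, *children):
    """Shorthand constructor, introduced only for this sketch."""
    return Node(label, list(children))

def desc_or_self(t):
    """Yield t and all of its descendants."""
    yield t
    for c in t.children:
        yield from desc_or_self(c)

def sat_sub(t, v):
    """(T, t) satisfies Subtree(v, p) for a non-root pattern node v."""
    if v.label == DESC:
        candidates = desc_or_self(t)          # some descendant, possibly t itself
    elif v.label == WILDCARD:
        candidates = t.children               # some child with an arbitrary tag
    else:
        candidates = [c for c in t.children if c.label == v.label]
    return any(all(sat_sub(c, vc) for vc in v.children) for c in candidates)

def sat_root(t, pattern_children):
    """Root-level rules: each child of the pattern root constrains t itself."""
    def ok(v):
        if v.label == DESC:
            # Re-apply the root rules at some descendant-or-self of t.
            return any(sat_root(d, v.children) for d in desc_or_self(t))
        if v.label != WILDCARD and v.label != t.label:
            return False
        return all(sat_sub(t, vc) for vc in v.children)
    return all(ok(v) for v in pattern_children)

def satisfies(doc_root, pattern_root):
    """T satisfies p, where pattern_root carries the special root label."""
    return sat_root(doc_root, pattern_root.children)

# Patterns following the description of p_a and p_b in Example 1.1.
p_a = n(ROOT, n("CD", n("SONY"), n(WILDCARD, n("Bach"))))
p_b = n(ROOT, n(DESC, n("CD", n("Bach"))))
# Hypothetical document: "Bach" sits below an intermediate element, not directly below "CD".
doc = n("CD", n("SONY"), n("composer", n("Bach")))
```

With these definitions, `satisfies(doc, p_a)` holds while `satisfies(doc, p_b)` does not, which matches the behavior described for the document T in Example 1.1.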
If q ⊑ p, we refer to p as the container pattern and q as the contained pattern. We say that p and q are equivalent, denoted by p ≡ q, if p ⊑ q and q ⊑ p. This definition can be generalized to sets of tree patterns: a set of tree patterns S is contained in another set of tree patterns S', denoted by S ⊑ S', if for each p ∈ S, there exists p' ∈ S' such that p ⊑ p'. Containment for sub-patterns is defined similarly. The size of a tree pattern p, denoted by |p|, is simply the cardinality of its node set. For example, referring to Figure 2, |p | = 7 and |p | = 8.

2.2 Problem Statement

The tree pattern aggregation problem that we investigate in this paper can now be stated as follows. Given a set of tree pattern subscriptions S and a space bound k on the total size of the aggregated subscriptions, compute a set of tree patterns S' that satisfies all of the following three conditions:

(C1) S ⊑ S' (i.e., S' is at least as general as S),
(C2) Σ_{p' ∈ S'} |p'| ≤ k (i.e., S' is "concise"), and
(C3) S' is as precise as possible, in the sense that there does not exist another set of tree patterns S'' that satisfies the first two conditions and S'' ⊑ S'.

Clearly, the tree pattern aggregation problem may not necessarily have a unique solution since it is possible to have two sets S' and S'' that satisfy the first two conditions but S' ⋢ S'' and S'' ⋢ S'. Therefore, we need to devise some measure to quantify the goodness of candidate solutions in terms of both their conciseness as well as preciseness.

With respect to conciseness, we are interested in minimal tree patterns that do not contain any redundant nodes. More precisely, we say that a tree pattern p is minimized if for any tree pattern p' such that p' ≡ p, it is the case that |p| ≤ |p'|.

With respect to preciseness, it can be shown that the containment relationship ⊑ on the universe of tree patterns actually defines a lattice. In particular, the notions of upper bound and least upper bound are of relevance to the aggregation problem and, therefore, we define them formally here. An upper bound of two tree patterns p and q is a tree pattern u such that p ⊑ u and q ⊑ u, i.e., for any XML tree T, if T ⊨ p or T ⊨ q then T ⊨ u. The least upper bound (LUB) of p and q, denoted by p ⊔ q, is an upper bound u of p and q such that, for any upper bound u' of p and q, u ⊑ u'. Once again, we generalize the notion of LUBs to a set S of tree patterns. An upper bound of S is a tree pattern U, denoted by S ⊑ U, such that p ⊑ U for every p ∈ S. The LUB of S, denoted by ⊔S, is an upper bound U of S such that for any upper bound U' of S, U ⊑ U'. Clearly, if p is an aggregate tree pattern for a set of tree patterns S (i.e., S ⊑ p), then p is an upper bound of S. Observe that, if p is the LUB of S, then p is the most precise aggregate tree pattern for S. In fact, it can be shown that ⊔S exists and is unique up to equivalence for any set S of tree patterns [5]; thus, it is meaningful to talk about ⊔S as the most precise aggregate tree pattern.

[Figure 2: Examples of Tree Patterns. Panels: (a) p_a, (b) p_b, (c) p_c, (d) p_d, (e) p_e, (f) p_f, (g) p_g, (h) p_h, (i) p_i.]

Example 2.2 Consider again the tree patterns in Figure 2. Observe that p ≡ p ; and since |p | > |p |, p is not a minimized pattern. In fact, except for p , all the tree patterns in Figure 2 are minimized patterns. Note that p ⋢ p because the root node of p does not have a tag- child node; and p ⋢ p because there exists no node in p that is a parent node of both a tag- -node and a tag- -node. Observe that p ⊑ p and p ⊑ p ; i.e., p is an upper bound of p and p . However, p ≠ p ⊔ p since we have another tree pattern, p_e, which is an upper bound of p and p such that p_e ⊑ p . Indeed, p_e = p ⊔ p with |p_e| < |p | + |p |. Note, however, that the size of an LUB is not necessarily always smaller than the size of its constituent patterns. For example, p_h = p ⊔ p_f but |p_h| > |p | + |p_f|. Note that p is an upper bound of {p , p , p , p_e, p_f, p_g, p_h}. □
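The size measure |p| and the space condition (C2) from the problem statement are easy to make concrete. The sketch below is illustrative only: the `Node` type and the function names `size` and `within_budget` are assumptions introduced here, and the two patterns follow the description of p_a and p_b in Example 1.1 rather than any pattern of Figure 2.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A tree-pattern node (hypothetical helper type for this sketch)."""
    label: str
    children: list = field(default_factory=list)

def size(v):
    """|p|: the number of nodes in the pattern rooted at v (the root is counted)."""
    return 1 + sum(size(c) for c in v.children)

def within_budget(patterns, k):
    """Condition (C2): the total size of the aggregate set is at most k."""
    return sum(size(p) for p in patterns) <= k

# Root "/." with child CD, which has children SONY and *; the * node has child Bach.
p_a = Node("/.", [Node("CD", [Node("SONY"), Node("*", [Node("Bach")])])])
# Root "/." followed by the chain // -> CD -> Bach.
p_b = Node("/.", [Node("//", [Node("CD", [Node("Bach")])])])
```

Here `size(p_a)` is 5 and `size(p_b)` is 4, so the pair fits a budget of k = 9 but not k = 8.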
We conclude this section by presenting some additional notation used in this paper. For a node v in a tree pattern p, we denote the set of child nodes of v in p by Child(v, p). We also define a partial ordering ⪯ on node labels such that if x and x' are tag names, then (1) x ⪯ * ⪯ // and (2) x ⪯ x' iff x = x'. Given two nodes v and w, MaxLabel(v, w) is defined to be the least upper bound of their labels label(v) and label(w) as follows:

    MaxLabel(v, w) =  label(v)   if label(v) = label(w);
                      //         if (label(v) = //) or (label(w) = //);
                      *          otherwise.

For example, MaxLabel of two different tag names is *, and MaxLabel(*, //) = //. For notational convenience, we refer to a node v in a tree pattern as an ℓ-node if label(v) = ℓ, and refer to v as a tag-node if label(v) ∉ {/., *, //}.

3 Computing the Most Precise Aggregate

In this section, we consider a special case of our tree pattern aggregation problem, namely, when the aggregate set S' consists of a single tree pattern and there is no space constraint. For this case, we provide an algorithm to compute the most precise aggregate tree pattern (i.e., LUB) for a set of tree patterns. Some of the algorithms given in this section are also key components of our solution for the general problem, which is presented in the next section.

Given two input tree patterns p and q, Algorithm LUB in Figure 3 computes the most precise aggregate tree pattern for {p, q} (i.e., the LUB of p and q). It traverses p and q top-down and computes the tightest container sub-patterns for each pair of sub-patterns p' = Subtree(v, p) and q' = Subtree(w, q) encountered, where v and w are nodes in p and q, respectively. The tightest container sub-patterns of p' and q' are a set R of sub-patterns such that: (1) R consists of container sub-patterns² of p' and q', i.e., for any XML document T and any element t in T, if (T, t) ⊨ p' or (T, t) ⊨ q' then (T, t) ⊨ r for each r ∈ R; and (2) R is tightest in the sense that for any other set of container sub-patterns R' of p' and q' that satisfies condition (1), any XML document T and any element t in T, if (T, t) ⊨ r for each r ∈ R then (T, t) ⊨ r' for all r' ∈ R'. Intuitively, R is a collection of conditions imposed by both p' and q' such that if T satisfies p' or q' at t, then T also satisfies the conjunction of these conditions at t.

² Note that a container sub-pattern of tree patterns p' and q' is an upper bound of p' and q', and we use these two terms interchangeably.

Algorithm LUB (p, q)
Input: p and q are tree patterns.
Output: A tree pattern representing the LUB of p and q.
 1) if (q ⊑ p) then return p;
 2) if (p ⊑ q) then return q;
 3) Initialize TCSubPat[v, w] = ∅, ∀ v ∈ Nodes(p), ∀ w ∈ Nodes(q);
 4) Let v_root and w_root denote the root nodes of p and q, resp.;
 5) for each v ∈ Child(v_root, p) do
 6)     for each w ∈ Child(w_root, q) do
 7)         TCSubPat[v, w] = LUB_SUB (v, w, TCSubPat);
 8) Create a tree pattern x with root node label /. and the set of child sub-patterns
        ∪ TCSubPat[v, w], taken over v ∈ Child(v_root, p), w ∈ Child(w_root, q);
 9) return MINIMIZE (x);

Algorithm LUB_SUB (v, w, TCSubPat)
Input: v, w are nodes in tree patterns p, q (respectively); TCSubPat is a 2-dimensional array such that TCSubPat[v, w] is the set of tightest container sub-patterns of Subtree(v, p) and Subtree(w, q).
Output: TCSubPat[v, w].
 1) if (TCSubPat[v, w] ≠ ∅) then
 2)     return TCSubPat[v, w];
 3) else if (Subtree(w, q) ⊑ Subtree(v, p)) then
 4)     return {Subtree(v, p)};
 5) else if (Subtree(v, p) ⊑ Subtree(w, q)) then
 6)     return {Subtree(w, q)};
 7) else
 8)     Initialize R = ∅; R' = ∅; R'' = ∅;
 9)     for each v' ∈ Child(v, p) do
10)         for each w' ∈ Child(w, q) do
11)             R = R ∪ LUB_SUB (v', w', TCSubPat);
12)     for each v' ∈ Child(v, p) do
13)         R' = R' ∪ LUB_SUB (v', w, TCSubPat);
14)     for each w' ∈ Child(w, q) do
15)         R'' = R'' ∪ LUB_SUB (v, w', TCSubPat);
16)     Let x be the pattern with root node label MaxLabel(v, w) and set of child subtree patterns R;
17)     Let x' be the pattern with root node label // and set of child subtree patterns R';
18)     Let x'' be the pattern with root node label // and set of child subtree patterns R'';
19)     return TCSubPat[v, w] = {x, x', x''};

Figure 3: Least-Upper-Bound Computation Algorithm.

We now show how the LUB for p and q can be computed from the tightest container sub-patterns.
Let v_root and w_root be the roots of patterns p and q, respectively. Note that a document T that satisfies p also satisfies, for each v ∈ Child(v_root, p), the restriction of p to the root node and only Subtree(v, p). Consequently, a document T that satisfies p or q must also satisfy the pattern x consisting of a root node (with label /.) whose children are the tightest container sub-patterns for each pair Subtree(v, p) and Subtree(w, q), where v ∈ Child(v_root, p) and w ∈ Child(w_root, q). This pattern x is thus an LUB of p and q.

The main subroutine in our LUB computation (Algorithm LUB_SUB) computes the tightest container sub-patterns of p' and q' as follows. If q' ⊑ p' (resp. p' ⊑ q'), then p' (resp. q') is the tightest container sub-pattern; otherwise, the tightest container sub-patterns are a set {x, x', x''} of sub-patterns, which are defined in the following manner. The root node of x is labeled with MaxLabel(v, w) and the child subtrees of x are the tightest container sub-patterns of each child subtree of p' and each child subtree of q'. Intuitively, the root of x corresponds to the roots of p' and q' (with label equal to the least upper bound of that of p' and q'). In other words, x preserves the positions of the corresponding nodes in p' and q'.

However, this position-preserving generalization is not sufficient since p' and q' may have common sub-patterns at different positions relative to their roots. For example, p and p_f in Figure 2 have a common sub-pattern rooted at an -node that has both a -child and a -child, but this pattern is located at different positions relative to the roots of p and p_f. To capture these off-position common sub-patterns, we need to compute x' and x''. The child subtrees of x' are the tightest container sub-patterns of q' itself and each child subtree of p'; and the label of the root node of x' is // to accommodate common sub-patterns at different positions relative to the roots of p' and q'. Similarly, the root node of x'' has label //, and the child subtrees of x'' are the tightest container sub-patterns of p' itself and each child subtree of q'. By computing the tightest container sub-patterns recursively, the algorithm computes the LUB of the input tree patterns p and q.
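The label generalization used for the root of x (MaxLabel), together with the partial ordering on labels, can be sketched as follows. This is a small illustration under stated assumptions: the function names `max_label` and `label_leq` and the string constants are introduced here, and the functions operate directly on labels rather than on nodes.

```python
WILDCARD, DESC = "*", "//"   # special labels: wildcard and descendant operator

def label_leq(x, y):
    """Partial order on labels: tag <= * <= //, and two tags compare only if equal."""
    if x == y or y == DESC:
        return True
    return y == WILDCARD and x != DESC

def max_label(a, b):
    """Least upper bound of two labels, following the case definition of MaxLabel."""
    if a == b:
        return a
    if DESC in (a, b):
        return DESC
    return WILDCARD
```

For instance, `max_label("a", "b")` yields `"*"` and `max_label("*", "//")` yields `"//"`; in each case both inputs are below the result under `label_leq`.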
By induction on the structures of p and q, we can show the following result [5].

Proposition 3.1: Given two tree patterns p and q, Algorithm LUB (p, q) computes p ⊔ q. □

Example 3.1 Given p and p_f in Figure 2, Algorithm LUB returns p_h, which is indeed p ⊔ p_f. To help explain the computation of p_h, we use the notation x_n to refer to the n-th node (in some tree pattern) that is labeled x, where the nodes sharing the same label are ordered based on their pre-order sequence; for example, in p_h, we use //_1 and //_3 to refer to the leftmost and rightmost //-nodes, respectively. Algorithm LUB_SUB (invoked by Algorithm LUB) first extracts the position-preserving tightest container sub-patterns for Subtree( _1, p ) and Subtree( , p_f), which yields the sub-pattern Subtree( _1, p_h) (in Steps 9-11). Note that the root node of Subtree( _1, p_h) is labeled because both the root nodes of Subtree( _1, p ) and Subtree( , p_f) are labeled . The sub-patterns Subtree( _2, p ) and Subtree( , p_f), however, have quite different structures and thus a position-preserving attempt to extract their common sub-patterns only yields Subtree(*_1, p_h). In particular, the common sub-pattern consisting of an -node with both a -child-node and a -child-node is not captured by the above process because they occur at different positions relative to the root nodes of Subtree( _2, p ) and Subtree( , p_f). To extract such off-position common sub-patterns, Algorithm LUB_SUB compares Subtree( _1, p ) with Subtree( , p_f) and Subtree( , p_f), as well as compares Subtree( , p_f) with Subtree( _2, p ) (in Steps 12-15). Indeed, this yields Subtree(//_3, p_h) which has a //-root since this common sub-pattern occurs at different positions relative to the root nodes of Subtree( _1, p ) and Subtree( , p_f). It should be mentioned that both Subtree(//_1, p_h) and Subtree(//_2, p_h) are also produced by the off-position processing, as Algorithm LUB_SUB recursively processes the sub-pattern Subtree( _2, p ) with Subtree( , p_f) and Subtree( , p_f), respectively. Finally, the algorithm removes the redundant nodes in the result tree pattern by using a minimization algorithm (which will be explained shortly) to generate the LUB p_h. □

It is straightforward to show that our LUB operator ⊔, considered as a binary operator, is commutative and associative, i.e., p_1 ⊔ p_2 = p_2 ⊔ p_1 and p_1 ⊔ (p_2 ⊔ p_3) = (p_1 ⊔ p_2) ⊔ p_3. As a result, Algorithm LUB can be naturally extended to compute the LUB of any set of tree patterns.

We next explain the details of the two auxiliary algorithms used in Algorithm LUB. Algorithm LUB needs to check the containment of tree patterns, which is implemented by Algorithm CONTAINS in Figure 4. Given two input tree patterns p and q, the algorithm determines if q ⊑ p. It maintains a two-dimensional array Status, which is initialized with Status[v, w] = null to indicate that v ∈ Nodes(p) and w ∈ Nodes(q) have not been compared; otherwise, Status[v, w] ∈ {true, false} such that Status[v, w] = true if and only if Subtree(w, q) ⊑ Subtree(v, p). Clearly, q ⊑ p if and only if Status[v_root, w_root] = true, where v_root and w_root denote the root nodes of p and q, respectively. The main subroutine in our containment algorithm is Algorithm CONTAINS_SUB.
Abstractly, CONTAINS SUB traverses p and q top-down and updates Status[v, w] for each pair of nodes v ∈ Nodes(p) and w ∈ Nodes(q) visited as follows. Let p' and q' denote Subtree(v, p) and Subtree(w, q), respectively. If Status[v, w] has already been computed (i.e., Status[v, w] ≠ null), then its value is returned. Otherwise, our algorithm determines whether q' ⊑ p', as follows. If label(v) ≠ //, then Status[v, w] = true iff label(w) ≼ label(v) and each child subtree of v contains some child subtree of w. Otherwise, if label(v) = //, two additional conditions need to be taken into account. This is because, unlike a *-node or a tag-name-node, a //-node in a container tree pattern can also be mapped to a (possibly empty) chain of nodes in a contained tree pattern. For example, consider the tree patterns p_  and p_f in Figure 2. Note that p_f ⊑ p_ , and the //-node in p_  is not mapped to any node in p_f in the sense that p_f would still be contained in p_  if the //-node in p_  is deleted.

Algorithm CONTAINS (p, q)
Input: p and q are two tree patterns.
Output: Returns true if q ⊑ p; false otherwise.
1) Initialize Status[v, w] = null, ∀ v ∈ Nodes(p), ∀ w ∈ Nodes(q);
2) Let v_root and w_root denote the root nodes of p and q, resp.;
3) if (Child(v_root, p) = ∅) then
4)     return true;
5) else
6)     return CONTAINS SUB (v_root, w_root, Status);

Algorithm CONTAINS SUB (v, w, Status)
Input: v, w are nodes in tree patterns p, q (respectively), Status is a 2-dimensional array such that each Status[v, w] ∈ {null, false, true}.
Output: Status[v, w].
1) if (Status[v, w] ≠ null) then
2)     return Status[v, w];
3) if (v is a leaf node in p) then
4)     Status[v, w] = (label(w) ≼ label(v));
5) else if (label(w) ⋠ label(v)) then
6)     Status[v, w] = false;
7) else
8)     Status[v, w] = ⋀_{v' ∈ Child(v, p)} ( ⋁_{w' ∈ Child(w, q)} CONTAINS SUB (v', w', Status) );
9) if (Status[v, w] = false) and (label(v) = //) then
10)    Status[v, w] = ⋀_{v' ∈ Child(v, p)} CONTAINS SUB (v', w, Status);
11) if (Status[v, w] = false) and (label(v) = //) then
12)    Status[v, w] = ⋁_{w' ∈ Child(w, q)} CONTAINS SUB (v, w', Status);
13) return Status[v, w];

Figure 4: Tree-Pattern Containment Algorithm.
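The containment test of Figure 4 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `Node` class, the helper `n`, and the function names are hypothetical, and the label order `label_le` (tag ≼ * ≼ //, every label ≼ itself, root label `/.` comparable only to itself) is an assumption about the order ≼ defined earlier in the paper. The canonical-form pre-processing for chains of //- and *-nodes is not included.

```python
from dataclasses import dataclass, field

ROOT, STAR, DESC = "/.", "*", "//"

@dataclass(eq=False)
class Node:
    """Hypothetical tree-pattern node: a label and a list of child nodes."""
    label: str
    children: list = field(default_factory=list)

def n(label, *children):
    """Convenience constructor used only in the examples."""
    return Node(label, list(children))

def label_le(lw, lv):
    """Assumed label order: tag <= * <= //, and every label <= itself."""
    if lw == lv:
        return True
    if ROOT in (lw, lv):
        return False
    return lv == DESC or (lv == STAR and lw != DESC)

def contains(p, q):
    """Return True if q is contained in p, following Figure 4."""
    status = {}  # memo table playing the role of the Status array

    def sub(v, w):
        key = (id(v), id(w))
        if key in status:                       # Steps 1-2
            return status[key]
        if not v.children:                      # Steps 3-4
            res = label_le(w.label, v.label)
        elif not label_le(w.label, v.label):    # Steps 5-6
            res = False
        else:                                   # Steps 7-8
            res = all(any(sub(vc, wc) for wc in w.children)
                      for vc in v.children)
        if not res and v.label == DESC:         # Steps 9-10: // maps to an empty chain
            res = all(sub(vc, w) for vc in v.children)
        if not res and v.label == DESC:         # Steps 11-12: // maps to a nonempty chain
            res = any(sub(v, wc) for wc in w.children)
        status[key] = res
        return res

    if not p.children:
        return True
    return sub(p, q)

# Hypothetical patterns: p = /.//b, q1 = /./a/b, q2 = /./b, q3 = /./a/c/b
p = n(ROOT, n(DESC, n("b")))
q1 = n(ROOT, n("a", n("b")))
q2 = n(ROOT, n("b"))
q3 = n(ROOT, n("a", n("c", n("b"))))
print(contains(p, q1), contains(p, q2), contains(p, q3), contains(q1, p))
```

Because each (v, w) pair is stored in `status` after its first evaluation, every pair of sub-patterns is examined at most once, which is the property the Status array provides in the original algorithm.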
On the other hand, for the tree patterns p_  and p_g in Figure 2, p_g ⊑ p_  and the //-node in p_  is mapped to both the *- and  -nodes in p_g in the sense that Subtree(*, p_g) ⊑ Subtree(//, p_ ) and Subtree( , p_g) ⊑ Subtree(//, p_ ). These two additional scenarios are handled by Steps 10 and 12 in Algorithm CONTAINS SUB: Step 10 accounts for the case where a //-node (v itself) is mapped to an empty chain of nodes, and Step 12 for the case where a //-node (v itself) is mapped to a nonempty chain. Note that in Steps 8 and 12, the expression ⋁_{w' ∈ Child(w, q)} CONTAINS SUB (x, w', Status) returns false if Child(w, q) = ∅.

By induction on the structures of p and q, we can show the following result.

Proposition 3.2: Given two tree patterns p and q, Algorithm CONTAINS (p, q) determines if q ⊑ p in O(|p| · |q|) time.

The quadratic time complexity of our tree-pattern containment algorithm is due to, among other things, the fact that each pair of sub-patterns in p and q is checked at most once, because of the use of the Status array. To simplify the discussion, we have omitted from Algorithm CONTAINS certain subtle details that involve tree patterns with

chains of //- and *-nodes. Such cases require some additional pre-processing to convert the tree pattern to some canonical form, but this does not increase our algorithm's time complexity.

To ensure that our tree patterns are concise, we need to identify and eliminate redundant nodes in them. Given a tree pattern p, a minimized tree pattern p' equivalent to p can be computed using a recursive algorithm MINIMIZE. Starting with the root of p, our minimization algorithm performs the following two steps to minimize the sub-pattern Subtree(v, p) rooted at node v in p: (1) for any v', v'' ∈ Child(v, p), if Subtree(v', p) ⊑ Subtree(v'', p), then delete Subtree(v'', p) from Subtree(v, p); and (2) for each v' ∈ Child(v, p) (that was not deleted in the first step), recursively minimize Subtree(v', p). The complete details can be found in [5].

Proposition 3.3: Algorithm MINIMIZE minimizes any tree pattern p in O(|p|^2) time.

Proposition 3.4: For any minimized tree patterns p and p', p ≡ p' iff p = p' (i.e., they are syntactically equal).

Given the low computational complexities of CONTAINS and MINIMIZE, one might expect that this would also be the case for Algorithm LUB. Unfortunately, in the worst case, the size of the (minimized) LUB of two tree patterns can be exponentially large (see [5] for a detailed analysis). Our implementation results, however, demonstrate that our LUB algorithm exhibits reasonably low average-case complexity in practice.

4 Selectivity-based Aggregation Algorithm

While the LUB algorithm presented in the previous section can be used to compute a single, most precise aggregate tree pattern for a given set S of patterns, the size of the LUB may be too large and, therefore, may violate the specified space constraint k on the total size of the aggregated subscriptions (Section 2.2). Thus, in order to fit our aggregates within the allotted space budget, we relax the requirement of a single precise aggregate by permitting our solution to be a set S' = {p_1, p_2, ..., p_m} (instead of a single pattern), such that each pattern q ∈ S is contained in some pattern p_i ∈ S'.
Of course, we also require that S' provide the tightest containment for patterns in S for the given space constraint (Section 2.2); that is, the number of XML documents that satisfy some tree pattern in S' but not in S is small. A simple measure of the preciseness of S' is its selectivity, which is essentially the fraction of filtered XML documents that satisfy some pattern in S'. Thus, our objective is to compute a set S' of aggregate patterns whose selectivity is very close to that of S.

Clearly, the selectivity of our tree patterns is highly dependent on the distribution of the underlying collection of XML documents (denoted by D). It is, however, infeasible to maintain the detailed distribution D of streaming XML documents for our aggregation, since the space requirements would be enormous. Instead, our approach is based on building a concise synopsis of D on-line (i.e., as documents are streaming by), and using that synopsis to estimate (approximate) tree-pattern selectivities.

At a high level, our aggregation algorithm iteratively computes a set S' that is both selective and satisfies the space constraint, starting with S' = S (i.e., the original set S of patterns), and performing the following sequence of steps in each iteration:

1. Generate a candidate set of aggregate tree patterns C consisting of patterns in S' and LUBs of similar pattern pairs in S'.
2. Prune each pattern p in C by deleting/merging nodes in p in order to reduce its size.
3. Choose a candidate p ∈ C to replace all patterns in S' that are contained in p. Our candidate-selection strategy is based on marginal gains [14]: the selected candidate p is the one that results in the minimum loss in selectivity per unit reduction in the size of S' (due to the replacement of patterns in S' by p).

Note that our pruning step (Step 2) above makes candidate aggregate patterns less selective (in addition to decreasing their size). Thus, by replacing patterns in S' by patterns in C, we are effectively trying to reduce the size of S' by giving up some of its selectivity. In the following subsections, we describe in more detail our algorithm for computing S'.
We begin by presenting our approach for estimating the selectivity of tree patterns over the underlying document distribution, which is critical to choosing a good replacement candidate in Step 3 above.

4.1 Selectivity Estimation for Tree Patterns

The Document Tree Synopsis. As mentioned above, it is simply impossible to maintain the accurate document distribution D (i.e., the full set of streaming documents) in order to obtain accurate selectivity estimates for our tree patterns. Instead, our approach is to approximate D by a concise synopsis structure, which we refer to as the document tree. Our document tree synopsis for D, denoted by DT, captures path statistics for documents in D, and is built on-line as XML documents stream by. The document tree essentially has the same structure as an XML tree, except for two differences. First, the root node of DT has the special label /. . Second, each non-root node t in DT has a frequency associated with it, which we denote by freq(t). Intuitively, if l_1/l_2/.../l_n is the sequence of tag names on nodes along the path from the root to t (excluding the label for the root), then freq(t) represents the number of documents T in D that contain a path with tag sequence l_1/l_2/.../l_n originating at the root of T. The frequency for the root node of DT is set to N, the number of documents in D.

As XML documents stream by, DT is incrementally maintained as follows. For each arriving document T, we first construct the skeleton tree T_s for document T. In the skeleton tree T_s, each node has at most one child with a given tag. T_s is built from T by simply coalescing two children of a node in T if they share a common tag. Clearly, by traversing nodes in T in a top-down fashion, and coalescing

child nodes with common tags, we can construct T_s from T in a single pass (using an event-based XML parser). As an example, Figure 5(d) depicts the skeleton tree for the XML-document tree in Figure 5(a). Next, we use T_s to update the statistics maintained in our document tree synopsis DT as follows. For each path in T_s, with tag sequence say l_1/l_2/.../l_n, let t be the last node on the corresponding (unique) path in DT. We increment freq(t) by 1. Figure 5(e) shows the document tree (with node frequencies) for the XML trees T_1, T_2, and T_3 in Figure 5(a) to (c).

[Figure 5: Example Documents, Skeleton Tree, Document Tree, and Patterns. Panels: (a) T1, (b) T2, (c) T3, (d) Skeleton tree for T1, (e) Document Tree, (f) Compressed Document Tree, (g) p1, (h) p2, (i) p3.]

Note that it is possible to further compress DT by using techniques similar in spirit to the methods employed by Aboulnaga et al. [1] for summarizing path trees. The key idea is to merge nodes with the lowest frequencies and store, with each merged node, the average of the original frequencies for nodes in DT that were merged. This is illustrated in Figure 5(f) for the document tree in Figure 5(e), with the label   used to indicate merged nodes. Due to space constraints, in the remainder of this subsection, we only present solutions to the selectivity estimation problem using the uncompressed tree DT. However, our proposed methods can be easily extended to work even when DT is compressed [5].

We should note here that our selectivity estimation problem for tree patterns differs from the work of Aboulnaga et al. [1] in two important respects. First, in [1], the authors consider the problem of estimating selectivity for only simple paths that consist of a  -node followed by tag nodes. In contrast, we estimate selectivities of general tree patterns with branches, and *- or //-nodes arbitrarily distributed in the tree. Second, we are interested in selectivity at the granularity of documents, so our goal is to estimate the number of XML documents that match a tree pattern; instead, [1] addresses the selectivity problem at the granularity of individual document elements that are discovered by a path.
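The incremental maintenance of the document tree described above (skeleton construction followed by frequency updates) can be sketched as follows. This is only an illustrative sketch under stated assumptions: documents are given as in-memory `(tag, children)` tuples rather than parsed in a single pass by an event-based parser, and the names `DTNode`, `coalesce`, `add_document`, and `freq` are hypothetical. The three sample documents are made up for the example and are not the trees of Figure 5.

```python
class DTNode:
    """Document-tree node: a frequency plus children keyed by tag name."""
    def __init__(self):
        self.freq = 0
        self.children = {}

def coalesce(elements):
    """Skeleton of a list of sibling elements, each a (tag, children) pair.

    Siblings sharing a tag are merged, so every node of the result has at
    most one child per tag. The result maps tag -> skeleton of its children.
    """
    merged = {}
    for tag, kids in elements:
        merged.setdefault(tag, []).extend(kids)
    return {tag: coalesce(kids) for tag, kids in merged.items()}

def add_document(dt_root, doc):
    """Update the document tree with one arriving document."""
    dt_root.freq += 1  # the root frequency is N, the number of documents

    def update(t, skel):
        for tag, sub in skel.items():
            child = t.children.setdefault(tag, DTNode())
            child.freq += 1  # one increment per skeleton path, i.e. per document
            update(child, sub)

    update(dt_root, coalesce([doc]))

def freq(dt_root, path):
    """Frequency stored for a tag sequence such as 'x/a/b' (0 if absent)."""
    t = dt_root
    for tag in path.split("/"):
        t = t.children.get(tag)
        if t is None:
            return 0
    return t.freq

# Hypothetical documents (not those of Figure 5).
docs = [
    ("x", [("a", [("b", [])]), ("a", [("c", [])])]),
    ("x", [("a", [("b", [])])]),
    ("y", []),
]
dt = DTNode()
for d in docs:
    add_document(dt, d)
print(dt.freq, freq(dt, "x/a"), freq(dt, "x/a/b"), freq(dt, "x/a/c"))
```

Coalescing before counting is what keeps `freq` a count of documents rather than of elements: the first sample document has two `a` children under `x`, yet it contributes only 1 to the frequency of `x/a`.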
It is easy to see that these are two very different estimation problems.

Selectivity Estimation Procedure. Recall that the selectivity of a tree pattern p is the fraction of documents T in D that satisfy p. By construction, our DT synopsis gives accurate selectivity estimates for tree patterns comprising a single chain of tag-nodes (i.e., with no * or //). However, obtaining accurate selectivity estimates for arbitrary tree patterns with branches, *, and // is, in general, not possible with DT summaries. This is because, while DT captures the number of documents containing a single path, it does not store document identities. As a result, for a pair of arbitrary paths in a tree pattern, it is impossible to determine the exact number of documents that contain both paths, or documents that contain one path but not the other.

Our estimation procedure solves this problem by making the following simplifying assumption: the distribution of each path in a tree pattern is independent of other paths. Thus, we estimate the selectivity of a tree pattern containing no // or * labels simply as the product of the selectivities of each root-to-leaf path in the pattern. For patterns containing // or *, we consider all possible instantiations for // and * with element tags, and then choose as our pattern selectivity the maximum selectivity value over all instantiations. (This is similar to the definition of the fuzzy OR operator in fuzzy logic [13].) We illustrate our selectivity estimation methodology in the following example.

Example 4.1 Consider the problem of estimating the selectivities of the tree patterns shown in Figures 5(g) to (i) using the document tree shown in Figure 5(e). The total number of documents, N, is 3. Clearly, the number of documents satisfying pattern p_1, which consists of a single path, can be estimated accurately by following the path in DT and returning the frequency for the  -node (at the end of the path) in DT. Thus, the selectivity of p_1 is 2/3, which is accurate since only documents T_2 and T_3 satisfy p_1. Estimating the number of documents containing pattern p_2, however, is somewhat more tricky.
This is because there are two paths with tag sequences x/ / and x/ / / in DT that match p_2 (corresponding to instantiating // with x and x/ ). Summing the frequencies for the two  -nodes at the end of these paths gives us an answer of 4, which over-estimates the number of documents satisfying p_2 (only documents T_2 and T_3 satisfy p_2). To avoid double-counting frequencies, we estimate the number of documents satisfying p_2 to be the maximum (and not the sum) of frequencies over all paths in DT that match p_2. Thus, the selectivity of p_2 is estimated as 2/3. Finally, the selectivity of p_3 is computed by considering all possible instantiations for // and *, and choosing the one with the maximum selectivity. The two possible instantiations for // that result in non-zero selectivities are x and x/ , and * can be instantiated with either   or   for // = x, and   or   for // = x/ . Choosing // = x and * =   results in the maximum selectivity, since the product of the selectivities of paths x/ / and x/ / is maximum, and is equal to (3/3) · (2/3) = 2/3.

Algorithm SEL (depicted in Figure 6), invoked with input parameters v = v_root (root of pattern p) and t = t_root (root of DT), computes the selectivity for an arbitrary tree

pattern p in O(|DT| · |p|) time.

Algorithm SEL (v, t)
Input: v is a node in tree pattern p, t is a node in DT.
Output: SelSubPat[v, t].
1) if (SelSubPat[v, t] is already computed) then
2)     return SelSubPat[v, t];
3) else if (label(t) ⋠ label(v)) then
4)     return SelSubPat[v, t] = 0;
5) else if (v is a leaf) then
6)     return freq(t)/N;
7) for each child v' ∈ Child(v, p) do
8)     Sel_v' = max_{t' ∈ Child(t, DT)} {SEL(v', t')};
9) Sel = ∏_{v' ∈ Child(v, p)} Sel_v';
10) if (label(v) = //) then
11)    Sel_v = ∏_{v' ∈ Child(v, p)} SEL(v', t);
12)    Sel = max{Sel, Sel_v};
13)    Sel_v = max_{t' ∈ Child(t, DT)} {SEL(v, t')};
14)    Sel = max{Sel, Sel_v};
15) return SelSubPat[v, t] = Sel

Figure 6: Tree Pattern Selectivity Estimation Algorithm.

In the algorithm, for nodes v ∈ p and t ∈ DT, SelSubPat[v, t] stores the selectivity of the sub-pattern Subtree(v, p) with respect to the subtree of DT rooted at node t. This selectivity is estimated similarly to the selectivity for pattern p, except that we now consider all instantiations of Subtree(v, p) (obtained by instantiating // and * with element tags), and the selectivity of each instantiation is computed with respect to t as the root instead of the root of DT. For instance, suppose that v is the  -node in p_3 (in Figure 5(i)), and t is the child  -node of the x-node in DT (in Figure 5(e)). Then, the selectivity of Subtree(v, p_3) with respect to t is essentially the product of the selectivity of paths  /* and  /  with respect to node t, which is 1 · (2/3). Thus, SelSubPat[v, t] = 2/3. Our goal is to compute SelSubPat[v_root, t_root].

For a pair of nodes v and t, Algorithm SEL computes SelSubPat[v, t] from SelSubPat[ ] values for the children of v and t. Clearly, if label(t) ⋠ label(v) (Steps 3-4 of the algorithm), then every path in Subtree(v, p) begins with a label different from label(t), and thus the selectivity of each of the paths is 0. If label(t) ≼ label(v) and v is a leaf (Steps 5-6), then we simply instantiate label(v) (if label(v) = // or *) with label(t), giving a selectivity of freq(t)/N.
On the other hand, if v is an internal node of p, then in addition to instantiating label(v) with label(t), we also need to compute, for every child v' of v, the instantiation for Subtree(v', p) that has the maximum selectivity with respect to some child t' of t. Since SelSubPat[v', t'] is the selectivity of Subtree(v', p) with respect to t', the product of max_{t' ∈ Child(t, DT)} SelSubPat[v', t'] for the children v' of v gives the selectivity of Subtree(v, p) with respect to t. Finally, if label(v) = //, then // can be simply null, in which case the selectivity of Subtree(v, p) with respect to t is computed as described in Step 11, or // is instantiated to a sequence consisting of label(t) followed by label(t'), where t' is the child of t such that the selectivity of Subtree(v, p) with respect to t' is maximized (Step 13). Observe that, in Steps 8 and 13, if t has no children, then max_{t' ∈ Child(t, DT)} {...} evaluates to 0.

4.2 Tree Pattern Aggregation Algorithm

We are now ready to present our greedy heuristic algorithm for the tree pattern aggregation problem defined in Section 2.2 (which is, in general, an NP-hard clustering problem [5]). As described earlier, to aggregate an input set of tree patterns S into a space-efficient and precise set, our algorithm (Algorithm AGGREGATE in Figure 7) iteratively prunes the tree patterns in S by replacing a small subset of tree patterns with a more concise upper-bound aggregate pattern, until S satisfies the given space constraint. During each iteration, our algorithm first generates a small set of potential candidate aggregate patterns C, and selects from these the (locally) best candidate pattern, i.e., the candidate that maximizes the gain in space while minimizing the expected loss in selectivity.

Algorithm AGGREGATE (S, k)
Input: S is a set of tree patterns, k is a space constraint.
Output: A set of tree patterns S' such that S ⊑ S' and Σ_{p ∈ S'} |p| ≤ k.
1) Initialize S' = S;
2) while (Σ_{p ∈ S'} |p| > k) do
3)     C_1 = {x | x = PRUNE(p, |p| − 1), p ∈ S'};
4)     C_2 = {x | x = PRUNE(p ⊔ q, |p| + |q| − 1), p, q ∈ S'};
5)     C = C_1 ∪ C_2;
6)     Select x ∈ C such that Benefit(x) is maximum;
7)     S' = S' − {p | p ⊑ x, p ∈ S'} ∪ {x};
8) return S';

Figure 7: Tree Pattern Aggregation Algorithm.
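To make the recursion of Algorithm SEL (Figure 6, Section 4.1) concrete, the following Python sketch mirrors its steps with a memo table in place of SelSubPat. It is a sketch under assumptions rather than the authors' code: a single hypothetical `Node` class is used for both pattern nodes and document-tree nodes, `label_le` assumes that any document tag ≼ * and ≼ //, exact fractions are used instead of floats, and the small document tree and patterns below are invented for the example (they are not those of Figure 5).

```python
from fractions import Fraction
from math import prod

class Node:
    """Hypothetical node type shared by patterns and the document tree.

    freq is only meaningful for document-tree nodes.
    """
    def __init__(self, label, children=(), freq=0):
        self.label = label
        self.children = list(children)
        self.freq = freq

def label_le(lt, lv):
    """Assumed order: a document-tree label lt matches pattern label lv."""
    return lt == lv or (lv in ("*", "//") and lt != "/.")

def selectivity(p_root, dt_root):
    """Estimate the selectivity of pattern p over document tree DT."""
    n_docs = dt_root.freq          # N, stored at the root of DT
    memo = {}                      # plays the role of SelSubPat

    def sel(v, t):
        key = (id(v), id(t))
        if key in memo:                                    # Steps 1-2
            return memo[key]
        if not label_le(t.label, v.label):                 # Steps 3-4
            s = Fraction(0)
        elif not v.children:                               # Steps 5-6
            s = Fraction(t.freq, n_docs)
        else:
            # Steps 7-9: best child of t for each child of v, then product
            s = prod(max((sel(vc, tc) for tc in t.children),
                         default=Fraction(0))
                     for vc in v.children)
            if v.label == "//":
                # Steps 11-12: // instantiated to the empty sequence
                s = max(s, prod(sel(vc, t) for vc in v.children))
                # Steps 13-14: // extended through some child of t
                s = max(s, max((sel(v, tc) for tc in t.children),
                               default=Fraction(0)))
        memo[key] = s
        return s

    return sel(p_root, dt_root)

# Hypothetical document tree over N = 3 documents:
# /. -> x(2) -> a(2) -> {b(2), c(1)},  /. -> y(1)
a_t = Node("a", [Node("b", freq=2), Node("c", freq=1)], freq=2)
dt = Node("/.", [Node("x", [a_t], freq=2), Node("y", freq=1)], freq=3)

p1 = Node("/.", [Node("x", [Node("a", [Node("b")])])])   # /./x/a/b
p2 = Node("/.", [Node("//", [Node("b")])])                # /.//b
p3 = Node("/.", [Node("//", [Node("b"), Node("c")])])     # // with branches b and c
print(selectivity(p1, dt), selectivity(p2, dt), selectivity(p3, dt))
```

In this sketch the branching pattern `p3` gets 2/3 · 1/3 = 2/9, which illustrates the path-independence assumption: the two branch selectivities are multiplied even though the synopsis cannot tell whether the same documents contain both paths.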
Candidate Generation. We now explain the process for generating the candidate set C in Steps 3-5 of Algorithm AGGREGATE. To reduce the size of individual candidate patterns of the form p or p ⊔ q, each candidate is pruned by invoking Algorithm PRUNE (details in [5]). Given an input pattern p and a space constraint n, Algorithm PRUNE prunes p to a smaller tree pattern p' such that p ⊑ p' and |p'| ≤ n. The algorithm treats tag-nodes as more selective than *- and //-nodes, and therefore tries to prune away *- and //-nodes before the tag-nodes. Specifically, the algorithm first prunes the *- and //-nodes in p by (1) replacing each adjacent pair of non-tag-nodes v, w with a single //-node, if w is the only child of v, and (2) eliminating subtrees that consist of only non-tag-nodes. If the tree pattern is still not small enough after the pruning of the non-tag-nodes, we start pruning the tag-nodes. There are two ways to reduce the size of a tree pattern p by one node. The first is to delete some leaf node in p, and the second is to collapse two nodes v and w into a single //-node, where label(v) ≠ /. and Child(v, p) = {w}. To help select a good leaf node to delete (or pair of nodes to collapse), we make use of the selectivity of the tag names. More specifically, we use our document tree synopsis DT to estimate the total number of occurrences of a tag name in the document collection D, and then choose the tags with higher total frequencies (which are less selective) as candidates for pruning.

Candidate Selection. Once the set of candidate aggregate patterns has been generated, we need some criterion for selecting the best candidate to insert into S'. For this purpose, we associate a benefit value with each candidate aggregate pattern x ∈ C, denoted by Benefit(x), based on its marginal gain [14]; that is, we define Benefit(x) as the ratio of the savings in space to the loss in selectivity of using x over {p | p ⊑ x, p ∈ S'}. More formally, if v_xroot, t_root, and v_proot represent the root nodes of x, DT, and p ∈ S', then Benefit(x) is equal to:

\[
\mathrm{Benefit}(x) \;=\; \frac{\sum_{p \sqsubseteq x,\ p \in S'} |p| \;-\; |x|}{\mathrm{SEL}(v_{xroot}, t_{root}) \;-\; \max_{p \sqsubseteq x,\ p \in S'} \mathrm{SEL}(v_{proot}, t_{root})}
\]

Note that we compute the selectivity loss by comparing the selectivity of the candidate aggregate pattern x with that of the least selective pattern contained in it. This gives a good approximation of the selectivity loss in cases when the patterns p, q ∈ S' used to generate x are similar and overlap in the document tree DT. The candidate aggregate pattern with the highest benefit value is chosen to replace the patterns contained in it in S' (Steps 6-7).

5 Experimental Study

To verify the effectiveness of our tree pattern aggregation algorithms, we have conducted an extensive performance study using real-life DTDs and large numbers of tree patterns. Our results indicate that our proposed aggregation techniques achieve significant reductions in the number as well as the total size of tree patterns, with minimal loss in selectivity.

5.1 Experimental Testbed and Methodology

Our general methodology for evaluating the effectiveness of a pattern aggregation algorithm A is as follows. Given a large input set of tree patterns S and a space constraint k, we use A to compute a set of aggregate patterns S' for S, where S ⊑ S' and Σ_{p ∈ S'} |p| ≤ k (our space constraint is expressed in terms of number of nodes, since patterns can be arbitrarily large). We then measure the loss in precision when using S' instead of S to filter XML documents. Observe that when k = 1, S' contains a single container pattern ( // ).

To measure the loss in precision of the aggregated set S', we use a subset D' of a representative set of XML documents, such that no document in D' matches any tree pattern in our initial pattern set S.
The reason, of course, is that XML documents that match S are also guaranteed to match S', so they are unlikely to affect our precision-loss measurements. As S' becomes less precise, some documents in D' will be erroneously reported as matches. Let Matches(D', S') be the number of documents in D' that match S'; the loss in precision of S' over S can be estimated as SelLoss(S', S) = Matches(D', S')/|D'|. An aggregation algorithm is obviously more effective if SelLoss(S', S) remains small as Σ_{p ∈ S'} |p| decreases.

XML Documents. We used two real-life DTDs to generate our XML document data set. The first one, the Extensible Hypertext Markup Language (XHTML) DTD [7], is a reformulation of HTML as an XML application and is arguably the document type most widely used over the Internet. The XHTML DTD (version 1.0) contains 77 elements with 1377 attributes. The second DTD, the News Industry Text Format (NITF) DTD [8], is supported by most of the world's major news agencies. The NITF DTD (version 2.5) contains 123 elements with 513 attributes.

We generated our data set of XML documents using IBM's XML Generator tool [11]. Both the XHTML and NITF DTDs contain recursive structures, which can be nested to produce XML documents with an arbitrary number of levels. We added the option of generating documents skewed according to Zipf's distribution [18], where some tag names appear more frequently than others, as is generally the case with real-life data. For each DTD and each skew value θ_D ∈ {0, 1, 2}, we generated two disjoint sets of 5 XML documents with approximately 1 nodes and 1 levels on average. The first set corresponds to the collection of XML documents used to construct the document tree DT for selectivity estimation; the second set is used to measure the loss in precision of the aggregation algorithms. Both sets were generated with the same parameters, and thus can be expected to have similar distributions. In each experiment, we used the combined XML documents for both the XHTML and NITF DTDs, i.e., we used a total of 1 documents for the document tree DT, and (different) 1 documents for measuring the loss in precision.

XPath Expressions.
To generate the set of tree patterns S, we implemented an XPath expression generator that takes a DTD as input and creates a set of valid XPath expressions based on a set of parameters that control: (1) the maximum height h of the tree patterns; (2) the probabilities p_* and p_// of having a wildcard * or descendant // operator at a node of a tree pattern; (3) the probability p_h of having more than one child at a given node; and (4) the skew θ_S of the Zipf distribution used for selecting element tag names. For each DTD and each skew value θ_S ∈ {0, 1, 2}, we generated a set of 5 tree patterns with h = 1 and p_* = p_// = p_h = 0.1. Each experiment was run with tree patterns from both the XHTML and NITF DTDs, i.e., 1 tree patterns which amounted to more than 1 nodes.

Algorithms. We compared two different aggregation algorithms in our experiments. The first ("naive") algorithm, PRUNE, is based on simple node pruning and works as follows. At each iteration, it selects a tree pattern p_max from S' with the largest number of tag-nodes, collapses multiple *- and //-nodes, and deletes a prunable node (i.e., a leaf node or a node located next to //-nodes) with the highest frequency (i.e., least selective) in the document tree DT. If there is already a tree pattern identical to the pruned pattern, then the duplicate is removed from S'. The algorithm iterates until the space constraint is satisfied. The second algorithm, AGGR, is our greedy tree pattern aggregation algorithm (from Figure 7) with both candidate generation and selection (based on maximizing the benefit).

Our experiments were conducted on an 866 MHz Intel Pentium III

machine with 512 MB of main memory running Linux. Both algorithms completed the aggregation of 1 tree patterns in approximately 1 minutes.

[Figure 8: Evaluation of the Aggregation Algorithms. Panels: (a) Varying θ_D (θ_S = 0), (b) Varying θ_S (θ_D = 0), (c) Varying θ_S and θ_D. x-axis: Number of nodes (x1,000); y-axis: Selectivity Loss (%); curves: Prune and Aggr for skew values 0, 1, and 2.]

5.2 Experimental Results

We first compared the performance of the two aggregation algorithms by varying the skew for element tags in the XML documents and in the XPath expressions. We ran the experiments with no skew, with skewed XML documents, with skewed XPath expressions, and with skew in both the XML documents and XPath expressions. In the last case, we skewed the distribution for element names in the opposite direction (applying the same skew to both the XML documents and XPath expressions would yield similar results as with no skew). The experimental results are shown in Figures 8(a), 8(b), and 8(c), where the space constraint, expressed in terms of the number of nodes, is varied along the x-axis, and the y-axis indicates the observed loss in selectivity for a given space constraint, i.e., the percentage of XML documents that are erroneously reported as matches.

We also measured the benefits of aggregation in terms of filtering performance, using the XTrie matching algorithm described in [6]. Since the cost of filtering in XTrie grows linearly with the number of XPath expressions, we expect to observe a significant improvement in filtering speed as the cardinality of S' decreases.

Non-skewed workload.
When neither the XML data nor the tree patterns contain skew (i.e., θ_D = θ_S = 0), the AGGR algorithm can aggregate tree patterns up to 15% of their original size with only 25% loss in precision (the results for non-skewed data are reported in all graphs of Figure 8). In contrast, the precision of the PRUNE algorithm starts to degrade much sooner, and the loss in precision reaches almost 100% at 25% of the initial space. The better performance of AGGR can be attributed to three main factors: (1) the upper bound computation generates good candidates with few nodes and little loss in precision, (2) the selectivity-based heuristics help to detect and discard candidates that correspond to patterns with low selectivity (i.e., frequently occurring for a given DTD), and (3) the covering computation enables redundant tree patterns to be eliminated early.

Skewed XML documents. Real-world XML documents are generally not uniformly distributed among the valid XML data for a given DTD. When XML documents are skewed (Figure 8(a)), we observe that the effectiveness of the AGGR algorithm increases. The reason for this is that, as data becomes more skewed, the XML documents tend to form clusters, with documents within a cluster being more similar than those in different clusters; this, in turn, improves the accuracy of selectivity estimation. The PRUNE algorithm also benefits from the skew (although to a lesser extent) because of its frequency-based pruning heuristic.

Skewed tree patterns. We also observe a significant improvement in our aggregation algorithm when the element names of tree patterns are skewed (Figure 8(b)). Indeed, the skew induces a clustering of patterns such that similar tree patterns are grouped into the same cluster, which consequently increases the proportion of patterns that develop containment relationships. This permits the aggregation algorithm to reduce the size of S' with minimal loss of selectivity, by computing tighter upper bound patterns and discarding covered patterns.

Skewed workload. The two aggregation algorithms perform best when both the XML data and the tree patterns are skewed in different directions (Figure 8(c)).
With high skew values, there is little overlap between the element names of the XML documents and the tree patterns, and AGGR remains highly selective with only a few hundred nodes. The PRUNE algorithm also exhibits significant improvements and maintains 5% selectivity even after the original number of nodes is reduced to less than a third.

Filtering speed. As mentioned previously, the cost of matching tree patterns against incoming XML documents is proportional to the number of tree patterns. Since AGGR generates candidates by computing upper bounds, the candidates cover more patterns, and as a result, the number of patterns in S' shrinks faster with AGGR. Figure 9 shows that the average filtering time per document decreases faster (as space is increased) for AGGR than for the PRUNE algorithm. Our aggregation algorithm is therefore more effec-