Automata for Analyzing and Querying Compressed Documents Barbara FILA, LIFO, Orléans (Fr.) Siva ANANTHARAMAN, LIFO, Orléans (Fr.) Rapport No

Automata for Analyzing and Querying Compressed Documents. Barbara FILA, LIFO, Orléans (Fr.); Siva ANANTHARAMAN, LIFO, Orléans (Fr.). Rapport No 2006-03

Automata for Analyzing and Querying Compressed Documents

Barbara Fila, Siva Anantharaman
LIFO - Université d'Orléans (France), e-mail: {fila, siva}@univ-orleans.fr

Abstract. In a first part of this work, tree/dag automata are defined as extensions of (unranked) tree automata which can run indifferently on trees or dags; they can thus serve as tools for analyzing or querying any semi-structured document, whether or not given in a compressed format. In a second part of the work, we present a method for evaluating positive unary queries, expressed in terms of Core XPath axes, on any dag $t$ representing an XML document possibly given in a compressed form; the evaluation is done directly on $t$, without unfolding it into a tree. To each Core XPath query of a certain basic type, we associate a word automaton; these automata run on the graph of dependency between the non-terminals of the minimal straightline regular tree grammar associated to the given dag $t$, or along complete sibling chains in this grammar. Any given positive Core XPath query can be decomposed into queries of the basic type, and the answer to the query, on the dag $t$, can then be expressed as a sub-dag of $t$ whose nodes are suitably labeled under the runs of such automata.

Keywords: Tree automata, Tree grammars, Dags, XML, Core XPath.

1 Introduction

Several algorithms have been optimized in the past, by using structures over dags instead of over trees. Tree automata are widely used for querying XML documents (e.g., [8, 9, 15, 16]); on the other hand, the notion of a compressed XML document has been introduced in [2, 7, 12], and a possible advantage of using dag structures for the manipulation of such documents has been brought out in [12]. It is legitimate then to investigate the possibility of using automata over dags instead of over trees, for querying compressed XML documents.

Dag automata (DA) were first introduced and studied in [5]; a DA was defined there as a natural extension of a tree automaton, i.e. as a bottom-up tree automaton running on dags; and the language of a DA was defined as the set of dags that get accepted under (bottom-up) runs, defined in the usual sense; the emptiness problem for DAs was shown there to be NP-complete, and the membership problem proved to be in NP; but the problem of stability under complementation of the class of dag automata, closely linked with that of determinization, was left open. These two issues have since been settled negatively in [1]: the reason is that the set of all terms (trees) represented by the set of dags accepted by a non-deterministic DA is not necessarily a regular tree language; a consequence is that the class of tree languages recognized by DAs (as sets of accepted dags) is a strict superclass of the class of regular tree languages.

It is well-known however, that answers to MSO-definable queries on (semi-)structured trees form regular tree languages ([18]); it is thus necessary to define the languages of DAs in a manner different from that of [5, 1], if they are to serve as tools for analyzing and querying a document, independently of whether it is given in a (partially or fully) compressed format, or as a tree. Our first aim in this work is therefore to redefine the notion of the language of a DA suitably, with such an objective.

For achieving that, we first present (in Section 2) the notion of a compressed document as a tree/dag (trdag, for short), designating a directed acyclic graph that may be partially or fully compressed. The terminology trdag has been chosen to distinguish it from that of tdag employed in [1]; this latter term will be employed in this paper when referring to a fully compressed dag. A Tree/Dag automaton (TDA, for short) is then defined as an automaton which runs on trdags. The essential differences with the DAs of [1] are the following: (i) our TDAs can be unranked, and (ii) although the transition rules of a TDA look quite like those of the DAs in [1], or those of TAs, a run of a TDA on any given trdag $t$ will carry with it not only assignments of states to the nodes of $t$, but also to the edges of $t$; runs will be so defined that a TDA accepts any given trdag $t$ if and only if it accepts the tree $\hat{t}$ obtained by uncompressing $t$, as a tree automaton running on the tree $\hat{t}$, in the usual sense.

In the second part of the paper, we present an approach based on word automata for evaluating queries on trdags that represent XML documents in a partially or fully compressed format; the terms trdag and document will therefore be considered synonymous in the sequel. Any given trdag $t$ is first seen as equivalent to a minimal straightline regular tree grammar $L_t$, that one can naturally associate with $t$, cf. e.g., [3, 4]. From the grammar $L_t$, we construct the graph of dependency $D_t$ between its non-terminals, and also the chiblings (linear graphs formed of complete chains of sibling non-terminals) of $L_t$. The word automata that we build below will run on $D_t$ or the chiblings of $L_t$, rather than on the document $t$ itself. We shall only consider positive unary queries expressed in terms of Core XPath axes. (The view we adopt allows us to define the various axes of Core XPath on compressed documents, in a manner which does not modify their semantics on trees.)

For evaluating any such query on any document (trdag) $t$, we proceed as follows. We first break up the given query into basic sub-queries of the form $Q$ = //*[axis::$\sigma$], where axis is a Core XPath axis of a certain type. To each such basic query $Q$, we associate a word automaton $A_Q$. The automaton $A_Q$ runs on the graph $D_t$ when axis is non-sibling, and on the chiblings of $L_t$ when axis is a sibling axis. An essential point in our method is that the runs of $A_Q$ are guided by some well-defined semantics for the nodes traversed, indicating whether the current node answers $Q$, or is on a path leading to some other node answering $Q$. The automaton, though not deterministic, is made effectively unambiguous by defining a suitable priority relation between its transitions, based on the semantics. A basic query $Q$ can then be evaluated in one single top-down pass of $A_Q$, under such an unambiguous run. An arbitrary positive unary Core XPath query can be evaluated on $t$ by combining the answers to its various basic sub-queries, and its answer set is expressed as a sub-trdag of $t$, whose nodes get labeled in conformity with the semantics. It is important to note that the evaluation is performed on the given trdag $t$; as such, on two different trdags corresponding to two different compressions of a same XML tree, the answers obtained may not be the same, in general.

The paper is structured as follows: Section 2 presents the notions of trdags, and of Tree/Dag automata. In Section 3, we construct from any trdag $t$ its normalized straightline regular tree grammar $L_t$, as well as the dependency graph $D_t$ and the chiblings of $L_t$; these will be seen as rooted labeled acyclic graphs (rlags, for short); the basic notions of Core XPath are also recalled. Section 4 is devoted to the construction of the word automata for any basic Core XPath query, based on the semantics, and an illustrative example. In Section 5 we prove that the runs of these automata, uniquely and effectively determined under a maximal priority condition, generate the answers to the queries. Section 6 shows how a non basic (composite, or imbricated) Core XPath query can be evaluated in a stepwise fashion. In Section 7, we show how to refine our approach, so as to derive, from the answer for any given Core XPath query $Q$ on a trdag $t$, the answer set for the same query $Q$ on the tree-equivalent $\hat{t}$ of $t$. In the appendices, we show how to translate the usual Core XPath queries into one in standard form on which our approach is applicable (the translation is done in linear time on the size of the given query); we also present a polynomial time algorithm for constructing the maximal priority run, for any basic query automaton over any given document (trdag), with a complexity bound of $O(n^3)$, where $n$ is the number of nodes of the trdag; the bound reduces to $O(n^2)$ on trees where the relation Parents is trivial; a complete illustrative example, on a composite imbricated query, is given in the last appendix.

2 Tree/Dag Automata

Definition 1. A tree/dag (trdag for short) over a not necessarily ranked alphabet $\Sigma$ is a rooted dag (directed acyclic graph) $t = (Nodes(t), Edges(t))$, where, for any node $u \in Nodes(t)$:
- $u$ has a name $name_t(u) = name(u) \in \Sigma$;
- the edges going out of any node are ordered;
- and if $name(u)$ is ranked, then the number of outgoing edges at $u$ is the rank of $name(u)$.

Given any node $u$ on a trdag $t$, the notion of the sub-trdag of $t$ rooted at $u$ is defined as usual, and denoted as $t|_u$. If $v$ is any node, $\gamma(v) = u_1 \ldots u_n$ will denote the string of all its not necessarily distinct children nodes; for every $1 \le i \le n$, the $i$-th outgoing edge from $v$ to its $i$-th child node $u_i \in \gamma(v)$ will be denoted as $e(v, i)$; we shall also write then $v \xrightarrow{i} u_i$; the set of all outgoing (resp. incoming) edges at any node $v$ will be denoted as $Out_v(t)$, or $Out_v$ (resp. $In_v(t)$, or $In_v$); and for any node $u$, we set: $Parents(u) = \{v \in Nodes(t) \mid u \text{ is a child of } v\}$. A trdag $t$ will be said to be a tree iff for every node $u$ on $t$ other than the root, $Parents(u)$ is a singleton.
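To make Definition 1 concrete, here is a minimal Python sketch of one possible in-memory representation of a trdag. The class name `Trdag`, its fields, the encoding of an edge $e(v, i)$ as the pair `(v, i)`, and the sample node identifiers are all illustrative assumptions, not notation from the paper. Note also that `is_tree` counts incoming edges rather than parents, which is a slightly stricter reading of "Parents(u) is a singleton": a node occurring twice in $\gamma(v)$ is treated as shared.

```python
from dataclasses import dataclass


@dataclass
class Trdag:
    """A rooted dag with named nodes and ordered outgoing edges (hypothetical encoding)."""
    root: str
    name: dict      # node id -> symbol of the alphabet
    children: dict  # node id -> ordered list gamma(v) of child ids (repeats allowed)

    def out_edges(self, v):
        """Out_v: the ordered edges e(v, i), encoded here as pairs (v, i)."""
        return [(v, i) for i in range(1, len(self.children[v]) + 1)]

    def in_edges(self, u):
        """In_u: all edges (v, i) whose target is u."""
        return [(v, i)
                for v, kids in self.children.items()
                for i, w in enumerate(kids, start=1) if w == u]

    def parents(self, u):
        """Parents(u): the set of nodes having u as a child."""
        return {v for v, _ in self.in_edges(u)}

    def is_tree(self):
        """True iff every non-root node has exactly one incoming edge."""
        return all(len(self.in_edges(u)) == 1
                   for u in self.children if u != self.root)


# f(g(a), g(a)) with the node named g shared by the two edges leaving the root
shared = Trdag(root="n1",
               name={"n1": "f", "n2": "g", "n3": "a"},
               children={"n1": ["n2", "n2"], "n2": ["n3"], "n3": []})

# the same term stored without any sharing
unshared = Trdag(root="m1",
                 name={"m1": "f", "m2": "g", "m3": "g", "m4": "a", "m5": "a"},
                 children={"m1": ["m2", "m3"], "m2": ["m4"], "m3": ["m5"],
                           "m4": [], "m5": []})
```

In this encoding the shared node `n2` has a single parent but two incoming edges, which is why the sketch looks at edges when deciding whether a trdag is a tree.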
For any trdag $t$, we define the set $Pos(t)$ as the set of all the positions $pos_t(u)$ of all its nodes $u$, these being defined recursively, as follows: if $u$ is the root node on $t$, then $pos_t(u) = \epsilon$; otherwise, $pos_t(u) = \{\alpha.i \mid \alpha \in pos_t(v),\ v \text{ is a parent of } u,\ u \text{ is an } i\text{-th child of } v\}$. The set $Pos(t)$ consists of (some of the) words over natural integers. To any edge $e : u \xrightarrow{i} v$ on a trdag $t$, is naturally associated the subset $pos_t(e) = pos_t(u).i$ of $Pos(t)$. The function $name_t$ is extended naturally to the positions in $Pos(t)$ as follows: for every $u \in Nodes(t)$ and $\alpha \in pos_t(u)$, we set $name_t(\alpha) = name_t(u)$.

Given a trdag $t$, we define its tree-equivalent as a tree $\hat{t}$ such that: $Pos(\hat{t}) = Pos(t)$, and for every $\alpha \in Pos(t)$ we have $name_t(\alpha) = name_{\hat{t}}(\alpha)$. It is immediate that $\hat{t}$ is uniquely determined, up to a tree isomorphism; it can actually be constructed canonically (cf. [7]), by taking for nodes the set $Pos(t)$, and for directed edges the set $\{(\alpha, \alpha.i) \mid \alpha, \alpha.i \in Pos(t)\}$, each node $\alpha$ being named with $name_t(\alpha)$. There is then a natural, name preserving, surjective map from $Nodes(\hat{t})$ onto $Nodes(t)$; it will be referred to in the sequel as the compression map, and denoted as $c$.

A trdag is said to be a tdag, or fully compressed, iff for any two different nodes $u, u'$ on $t$, the two sub-dags $t|_u$ and $t|_{u'}$ have non-isomorphic tree-equivalents; otherwise, the trdag is said to be partially compressed when it is not a tree. For example, the tree to the left of Figure 1 is the tree-equivalent of the partially compressed trdag to the right, and also of the fully compressed tdag in the middle.

We define now the notion of a Tree/Dag automaton, first over a ranked alphabet $\Sigma$, to facilitate understanding. The definition is then easily extended to the unranked case.
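The notions of position, tree-equivalent and full compression can be sketched in a few lines of Python. This is only an illustration under assumed conventions: a trdag is given by two plain dictionaries (`name` and `children`, both hypothetical names), positions are encoded as tuples of integers with `()` standing for $\epsilon$, and the tree-equivalent is returned as a mapping from positions to names, following the canonical construction recalled above. The enumeration of positions is exponential in the worst case, since it effectively unfolds the dag.

```python
def positions(children, root):
    """Return pos_t(u) for every node u, positions being tuples of integers."""
    pos = {}
    stack = [(root, ())]
    while stack:
        u, alpha = stack.pop()
        pos.setdefault(u, set()).add(alpha)
        for i, v in enumerate(children[u], start=1):
            stack.append((v, alpha + (i,)))
    return pos


def tree_equivalent(name, children, root):
    """Canonical tree-equivalent: its nodes are the positions, named by name_t(alpha)."""
    pos = positions(children, root)
    return {alpha: name[u] for u, ps in pos.items() for alpha in ps}


def is_fully_compressed(name, children):
    """True iff no two distinct nodes root sub-dags with the same unfolding."""
    memo = {}

    def unfold(u):
        # ordered trees are isomorphic iff their nested-tuple encodings are equal
        if u not in memo:
            memo[u] = (name[u], tuple(unfold(v) for v in children[u]))
        return memo[u]

    terms = [unfold(u) for u in children]
    return len(set(terms)) == len(terms)


# Two compressions of the tree f(g(a), g(a)); node identifiers are made up.
name_full = {"n1": "f", "n2": "g", "n3": "a"}
kids_full = {"n1": ["n2", "n2"], "n2": ["n3"], "n3": []}

name_part = {"n1": "f", "n2": "g", "n3": "g", "n4": "a"}
kids_part = {"n1": ["n2", "n3"], "n2": ["n4"], "n3": ["n4"], "n4": []}
```

Both sample trdags have the same set of positions and the same names at each position, so they share one tree-equivalent; only the first is a tdag.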

Fig. 1. tree, tdag, and trdag (panels, from left to right: Tree, Fully Compressed, Partially Compressed)

Definition 2. A Tree/Dag automaton (TDA, for short) over a ranked alphabet $\Sigma$ is a tuple $(\Sigma, Q, F, \Delta)$, where $Q$ is a finite non-empty set of states, $F \subseteq Q$ is the set of final (or accepting) states, and $\Delta$ is a set of transition rules of the form: $f(q_1, \ldots, q_k) \to q$, where $f \in \Sigma$ is of rank $k$, and $q_1, \ldots, q_k, q \in Q$.

It will be convenient to write the transition rules of a TDA in a different (but equivalent) form: a transition of the form $f(q_1, \ldots, q_k) \to q$ is also written as $(f, q_1 \ldots q_k) \to q$, where $q_1 \ldots q_k$ is seen as a word in $Q^*$, of length $= rank(f)$ in the ranked case. The notion of a TDA is then extended easily to the unranked case, i.e., where the signature symbols naming the nodes are not assumed to be of fixed rank: it suffices to define the transitions to be of the form $(f, \omega) \to q$, where $\omega \subseteq Q^*$; we may assume wlog that $\omega$ is a regular expression on $Q$ not involving $+$, by replacing a rule $(f, \omega + \omega') \to q$ by the two rules $(f, \omega) \to q$, $(f, \omega') \to q$. A TDA is said to be bottom-up deterministic iff whenever there are two transition rules of the form $(f, \omega) \to q$, $(f, \omega') \to q'$, with $q \neq q'$, we have necessarily $\omega \cap \omega' = \emptyset$; otherwise it is said to be non-deterministic. We also agree to denote the transitions of the form $(f, \epsilon) \to q$ simply as $f \to q$, and refer to them as initial transitions.

For defining the notion of runs of TDAs on a trdag in a bottom-up style, we need some preliminaries. Let $A$ be a TDA with state set $Q$ and transition set $\Delta$. Suppose $t$ is a trdag and assume given a map $M : Edges(t) \to Q$. If $u$ is any node on $t$ with $u_1 \ldots u_n$ as the string of all its (not necessarily distinct) children, the string $M(e(u, 1)) \ldots M(e(u, n))$, formed of states assigned by $M$ to the outgoing edges at $u$, will be denoted as $M(Out_u)$. We then define, recursively in a bottom-up style, a binary relation at $u$ on the states of $Q$, with respect to (w.r.t. or wrt, for short) the given map $M$; this relation, denoted as $\preceq^M_u$, or simply $\preceq_u$, is defined as follows:

Definition 3. Let $A$, $t$, $M$ be as above, and $u$ any given node on the trdag $t$.
If $u$ is a leaf with $name(u) = a$, then $q \preceq_u q'$ iff whenever $a \to q \in \Delta$ we also have $a \to q' \in \Delta$; otherwise $q \preceq_u q'$ iff:
(i) $(name(u), M(Out_u)) \to q$ is an instance of a transition rule in $\Delta$; i.e., $\Delta$ has a rule $(name(u), \omega) \to q$ such that $M(Out_u)$ is in $\omega$;
(ii) there exists a map $\varphi_{q'} : Q \to Q$, such that:
- $\varphi_{q'}(q) = q'$, and the rule $(name(u), \varphi_{q'}(M(Out_u))) \to q'$ is also an instance of a transition rule in $\Delta$;
- for any edge $e : u \xrightarrow{i} u' \in Out_u$, we have: $M(e) \preceq_{u'} \varphi_{q'}(M(e))$.

Definition 4. Let $A = (\Sigma, Q, F, \Delta)$ be any given TDA, and $t$ any given trdag. A run of $A$ on $t$ is a pair $(r, M)$, where $r : Nodes(t) \to Q$ and $M : Edges(t) \to Q$ are maps such that the following conditions hold, at any node $u$ on $t$:

(1) if $name(u) = f$, then the rule $(f, M(Out_u)) \to r(u)$ is an instance of a transition rule in $\Delta$;
(2) there is an incoming edge $e \in In_u$ with $M(e) = r(u)$; and for every $e' \in In_u$ such that $M(e') = q' \neq q = r(u)$, we have $q \preceq^M_u q'$.

A run $(r, M)$ is accepting on a trdag $t$ iff $r(\epsilon) \in F$, i.e., $r$ maps the root-node of $t$ to an accepting state. A trdag $t$ is accepted by a TDA iff there is an accepting run on $t$. The language of a TDA is the set of all trdags that it accepts.

Remark 1. i) Note that if $t$ is a tree, then $In_u$ is a singleton at every non-root node $u$ on $t$, so a run $(r, M)$ of any TDA on $t$ can be identified with its first component $r$; we get then the usual notion of runs of tree automata on trees.

Example 1. Over the unranked signature {, f, g} consider a TDA $A$, with the following transitions: → p, → q, → p, → q, (, p) → q, (, q) → p, (, q ) → q, (g, q Q*) → q, (g, pq) → p, (f, q p q) → q_fin, (f, p Q*) → q_fin, with $Q = \{p, q, q', q_{fin}\}$, and $q_{fin}$ as the unique accepting state. An accepting bottom-up run of $A$ on a tdag is depicted on the left of Figure 2, and on its right, the same run as seen on the tree equivalent of the tdag.

Fig. 2. A bottom-up accepting run of the TDA of Example 1 on a trdag, and the same seen on its tree equivalent.

A few comments on the above run may be of help: we start with assigning state $q$ to the leaf node, under $r$; the assignment of state $q$ under $M$ to all the incoming edges at this node poses no problem; we can then assign state $p$ to the next node, and subsequently also $p$ to the node $g$, under $r$, via the transition rule $(g, pq) \to p$; we then assign $p$ under $M$ to the first incoming edge at $g$; to assign state $q$ under $M$ to the second incoming edge at $g$, we just need to check that:
- for a map $\varphi : Q \to Q$ such that $\varphi(p) = q$, $\varphi(q) = p$, the rule $(g, \varphi(p)\varphi(q)) \to q$ is an instance of a transition rule of the TDA;
- for the outgoing edge at $g$ labeled with $p$ by $M$, we have $p \preceq q = \varphi(p)$ at its target node;
- for the outgoing edge at $g$ labeled with $q$ by $M$, we do have $q \preceq p = \varphi(q)$ at its target node;
reaching $q_{fin}$ at the root-node is trivial via the last transition rule. (Note that we could have as well assigned $p$ under $M$ to the second incoming edge at $g$, with no conditions to check, then reach $q_{fin}$.)

Remark 1 (contd.). ii) Unlike the DAs of [5] or [1], the following bottom-up non-deterministic TDA: $a \to q_1$, $a \to q_2$, $f(q_1, q_2) \to q$, with $q_1, q_2, q$ as states where $q$ is accepting, has a non-empty language: as a TDA it accepts $f(a, a)$.

For deterministic TDA, we have the following result (as expected):

Proposition 1. Let $A$ be a bottom-up deterministic TDA, and $t$ any given trdag; then there is at most one run of $A$ on $t$.

Proof. Let $Q$ be the set of states of $A$, and $M : Edges(t) \to Q$ any given map assigning states to the edges on $t$. We shall show by induction that the hypothesis of determinism on $A$ implies that, at any node $u$ on $t$, the binary relation $\preceq^M_u = \preceq_u$ defined above (Definition 3), w.r.t. the map $M$, is the identity relation on the set $Q$. The proposition will then follow from conditions (1) and (2) on runs, cf. Definition 4; we will get, in particular, that for every incoming edge $e$ at $u$, $M(e)$ must be the same as $r(u)$; so the run can be identified with its first component $r$ (as on a tree). The induction will be on a non-negative integer $d_u$, that we define at any node $u$ of $t$, and refer to as its height on $t$, as the maximal number of arcs on $t$ from $u$ to the leaf nodes. If $d_u = 0$, then $u$ is a leaf node; that $\preceq_u$ is the identity relation on $Q$ in this case is immediate, from the determinism of $A$, and the definition of $\preceq_u$. So, assume that $d_u > 0$, and let $v_1 \ldots v_n$ be the string of all the children nodes of $u$ on $t$. By the inductive hypothesis, for every $i$, $1 \le i \le n$, the relation $\preceq_{v_i}$ is the identity relation on $Q$; it follows then, from the conditions (i) and (ii) on the relation $\preceq_u$ (Definition 3), that this latter must also be the identity relation on $Q$.

We may now formulate the principal result of the first part of this paper:

Proposition 2. i) A TDA accepts a trdag $t$ if and only if it accepts the tree equivalent of $t$. ii) The emptiness problem for a TDA is decidable in time P w.r.t. its number of states. iii) The uniform membership problem for a TDA is decidable in time NP (resp.
time P) w.r.t. its number of states, and the number of edges (resp. and the number of positions) on the given trdag.

Proof. Let $\hat{t}$ be the tree equivalent of the trdag $t$, and $c$ the natural surjective compression map from $Nodes(\hat{t})$ onto $Nodes(t)$.

Property i): For proving the "only if" part, one uses the following reasoning, coupled with induction on the height function at the nodes of $t$ (defined in the proof of the previous proposition): Let $(r, M)$ be an accepting run of the given TDA on the trdag $t$; consider a node $s$ on the tree equivalent $\hat{t}$, of which the node $u$ on $t$ is the image under the compression map $c$; let $r(u) = q$ under the given run of the TDA on $t$; then, for every state $q'$ of the TDA such that $q \preceq^M_u q'$, one can construct a partial run of the TDA, seen as a usual tree automaton on the tree $\hat{t}$, climbing up from a leaf below $s$ on $\hat{t}$ to the node $s$, and assigning the state $q'$ to this node (for an illustrative example, see the tree to the right of Figure 2).

Proving the "if" part of Property i) is a little more complex. We start with a given accepting run $\hat{\rho}$ of the given TDA, as a bottom-up tree automaton running in the usual sense on the tree $\hat{t}$; from this run $\hat{\rho}$, we shall construct a run $(r, M)$ of the TDA on the trdag $t$, by an inductive, top-down traversal of the tdag $t$; for this top-down traversal, we will be using an integer valued function defined at any node $u$ of $t$, and referred to as its depth on $t$, as the maximal number of arcs on $t$ from the root node on $t$ to the node $u$. We shall also use the fact that the nodes of $\hat{t}$ are in natural bijection with the set $Pos(t)$ of positions on $t$. The top-down construction of the run $(r, M)$ is done by the following pseudo-algorithm, where $d$ stands for the maximal depth on $t$ at its leaf nodes.

BEGIN
  /* define first r at the root node on t, and M on its outgoing edges */
  $r(\epsilon_t) = \hat{\rho}(\epsilon_{\hat{t}})$;
  For every outgoing edge $e_j$, $1 \le j \le k$, at $\epsilon_t$, set $M(e_j) = \hat{\rho}(\epsilon.j)$;
  $i = 1$;
  /* Now go down */
  while ($i \le d$) do {
    For every node $u$ at depth $i$ do {
      choose $e \in In_u(t)$, and $\alpha \in pos_t(e)$ such that $M(e) = \hat{\rho}(\alpha)$;
      set $r(u) = M(e)$;
      For every $e_j \in Out_u(t)$, $1 \le j \le m$, outgoing from $u$, set $M(e_j) = \hat{\rho}(\alpha.j)$;
    }
    $i = i + 1$;
  }
END.

It is not difficult to check then, that by construction, the pair of maps $(r, M)$ gives an accepting run of the TDA on the trdag $t$. (The reasoning is illustrated below.)

Properties ii) and iii) follow, in the ranked case, from the proof of i) and the results of TATA ([6]), Chapter 1; in the unranked case, one can either employ a reasoning based on reduction to the ranked case as in [10], or appeal directly to the results of [13]. (Note: the number of positions on a trdag is the same as the size of its tree equivalent.)

We illustrate here the reasoning employed in the proof of the "if" part of assertion i) of the above proposition, with the tdag $t$ of Example 1. We start with the run $\hat{\rho}$ on its tree-equivalent $\hat{t}$, as depicted to the right of Figure 2.
At start, to the root node on $t$ (at depth 0) is assigned the state $q_{fin}$, and to its three outgoing edges are assigned the three states $p, q, q$ respectively; at $g$, which is the only node on $t$ at depth 1, we choose the first incoming edge (of position 1, and labeled with $p$ by $M$), and set $r(g) = \hat{\rho}(1) = p$; the two outgoing edges at $g$ on $t$ have as positions the sets $\{11, 21\}$, $\{12, 22\}$ respectively; to these two outgoing edges at $g$ on $t$, we assign the states that $\hat{\rho}$ assigns to the two sons of the node $g$ at position 1 on $\hat{t}$, namely $p, q$ respectively (this means in essence that we have selected the positions 11 and 12 on the two outgoing edges at $g$ on $t$); next, we go to depth 2 on $t$, where there is a unique node, to which we then have to assign the state $\hat{\rho}(11)$ that $M$ has already assigned to its incoming edge; the rest of the reasoning is obvious, so left out.

Remark 2. Let $t$, $t'$ be two given trdags such that $Pos(t') = Pos(t)$, and there is a name preserving surjective map $c$ from $Nodes(t')$ onto $Nodes(t)$. We can then define $t$ to be a compression, or a compressed form, of $t'$; and refer to $t'$ as an uncompressed equivalent of $t$, and to the surjective map $c$ on $Nodes(t')$ as a compression map. It is easily checked that $t$ and $t'$ have then the same tree-equivalent; and it follows from Proposition 2 above that any given TDA $A$ accepts $t$ if and only if it accepts $t'$. This means that it is legitimate to define the language of a TDA as the set of all tdags that it accepts (or trees that it accepts), or as the set of all trdags accepted, up to tree-equivalence.
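The top-down construction used in the proof of Proposition 2 i) can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: the trdag is a dictionary `children` (node identifier to ordered list of children), the run on the tree-equivalent is a dictionary `rho_hat` from positions (tuples of integers, `()` for the root) to states, an edge $e(v, j)$ is encoded as the pair `(v, j)`, and the arbitrary "choose" step is resolved by taking the smallest suitable position. The sketch only builds the pair $(r, M)$; it does not re-check the conditions of Definition 4.

```python
def build_run(children, root, rho_hat):
    """From a run rho_hat on the tree-equivalent, build maps (r, M) on the trdag."""
    # Enumerate pos_t(u) for every node, remembering the node found at each position.
    pos, node_at = {}, {}
    stack = [(root, ())]
    while stack:
        u, alpha = stack.pop()
        pos.setdefault(u, set()).add(alpha)
        node_at[alpha] = u
        for j, v in enumerate(children[u], start=1):
            stack.append((v, alpha + (j,)))

    # depth(u): maximal number of arcs from the root, i.e. the longest position of u
    depth = {u: max(len(a) for a in ps) for u, ps in pos.items()}

    # define first r at the root, and M on its outgoing edges
    r = {root: rho_hat[()]}
    M = {(root, j): rho_hat[(j,)] for j in range(1, len(children[root]) + 1)}

    # now go down, by increasing depth (every parent is handled before its children)
    for u in sorted((n for n in pos if n != root), key=lambda n: depth[n]):
        # choose an incoming edge e and a position alpha of e with M(e) = rho_hat(alpha)
        alpha = next(a for a in sorted(pos[u])
                     if M.get((node_at[a[:-1]], a[-1])) == rho_hat[a])
        r[u] = rho_hat[alpha]
        for j in range(1, len(children[u]) + 1):
            M[(u, j)] = rho_hat[alpha + (j,)]
    return r, M


# The tdag f(a, a) with a single shared leaf, and the run of the automaton
# a -> q1, a -> q2, f(q1, q2) -> q of Remark 1 on its tree-equivalent.
children = {"n1": ["n2", "n2"], "n2": []}
rho_hat = {(): "q", (1,): "q1", (2,): "q2"}
r, M = build_run(children, "n1", rho_hat)
```

On this example the shared leaf receives the state of its first incoming edge, while the second incoming edge keeps a different state; this is exactly the situation that condition (2) of Definition 4 is designed to allow.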

3 Querying Compressed Documents: Preliminaries

Given a trdag $t$, one can naturally construct a regular tree grammar associated with $t$, which is straightline (cf. [4]), in the sense that there are no cycles on the dependency relations between its non-terminals, and each non-terminal produces exactly one sub-trdag of $t$. Such a grammar will be denoted as $L_t$, if it is normalized in the following sense:
(i) for every non-terminal $A_i$ of $L_t$, there is exactly one production of the form $A_i \to f(A_{j_1}, \ldots, A_{j_k})$, where $i < j_r$ for every $1 \le r \le k$; we shall then set $Sons(A_i) = \{A_{j_1}, \ldots, A_{j_k}\}$, and $sym_{L_t}(A_i) = f$;
(ii) the number of non-terminals is the number of nodes on $t$.

Such a normalized grammar $L_t$ is uniquely defined up to renaming of the non-terminals. For instance, for the trdag $t$ to the left of Figure 3 we get the following normalized grammar: $A_1 \to f(A_2, A_3, A_4, A_5, A_2)$, $A_2 \to c$, $A_3 \to$ $(A_5)$, $A_4 \to$ , $A_5 \to$ . Such a grammar is easily constructed from $t$, for instance by using a standard algorithm which computes the depth of any node (as the maximal distance from the root), to number the non-terminals so as to satisfy condition (i) above.

Fig. 3. trdag $t$, associated rlag $D_t$, and chiblings of $L_t$ (panels: $t$, $D_t$, $F_0$, $F_1$, $F_3$)

The dependency graph of the normalized grammar $L_t$ associated with $t$, and denoted as $D_t$, consists of nodes named with the non-terminals $A_i$, $1 \le i \le n$, and one single directed arc from any node $A_i$ to a node $A_j$ whenever $A_j$ is a son of $A_i$. The root of $D_t$ is by definition the node named $A_1$. The notion of Sons of the nodes on $D_t$ is derived in the obvious way from that defined above on $L_t$. Furthermore, to any production $A_i \to f(A_{j_1}, \ldots, A_{j_k})$ of $L_t$, we associate a rooted linear graph composed of $k$ nodes respectively named $A_{j_1}, \ldots, A_{j_k}$, with root at $A_{j_1}$ and such that for all $l \in \{2, \ldots, k\}$ the node named $A_{j_l}$ is the son of the node named $A_{j_{l-1}}$.
This graph will be called the chibling of $L_t$ associated with the (unique) $A_i$-production; it is denoted as $F_i$. We also define a further chibling denoted $F_0$, as the linear graph with a single node named $A_1$, where $A_1$ is the axiom of $L_t$. In the sequel, we designate by $G$ either $D_t$ or any of the chiblings $F$ of $L_t$. We complete any of these acyclic graphs $G$ into a rooted labeled acyclic graph (rlag, for short), by attaching to each node $u$ on $G$, with $name(u) = A_i$, a label denoted $label(u)$, and defined as $label(u) = (sym_{L_t}(A_i), -)$; cf. Figure 3.

3.1 Positive Core XPath Queries on trdags

In this paper we restrict our study to positive Core XPath queries on trdags. Recall that Core XPath is the navigational segment of XPath, and is based on the following axes of XPath (cf. [10, 19]): self, child, parent, ancestor, descendant, following-sibling, preceding-sibling. A location expression is defined as a predicate of the form [axis::$\sigma$], where axis is one of the above axes, and $\sigma$ is a symbol of $\Sigma$. Given any trdag $t$ over $\Sigma$, a context node $u$ on $t$ and $\sigma \in \Sigma$, the semantics for axis is defined by evaluating this predicate at $u$. The semantics for the axes self, child, descendant are easily defined, exactly as on trees (cf. [19]). For defining the semantics of the remaining axes, we first recall that $Parents(u) = \{v \in Nodes(t) \mid u \text{ is a child of } v\}$.

Definition 5. Given a context node $u$ on a trdag $t$, and $\sigma \in \Sigma$:
i) [parent::$\sigma$] evaluates to true at $u$, if and only if there exists a $\sigma$-named node in $Parents(u)$;
ii) [ancestor::$\sigma$] evaluates to true at $u$, iff either [parent::$\sigma$] evaluates to true at $u$, or there exists a node $v \in Parents(u)$ such that [ancestor::$\sigma$] evaluates to true at $v$;
iii) [following-sibling::$\sigma$] evaluates to true at $u$, iff there exists a $\sigma$-named node $u'$, and a node $v$ on $t$ such that $\gamma(v)$ is of the form $\ldots u \ldots u' \ldots$;
iv) [preceding-sibling::$\sigma$] evaluates to true at $u$, iff there exists a $\sigma$-named node $u'$, and a node $v$ on $t$ such that $\gamma(v)$ is of the form $\ldots u' \ldots u \ldots$.

For the composite axes descendant-or-self and ancestor-or-self, the semantics are then deduced in an obvious manner. We shall also need position predicates of the form [position() = $i$]; their semantics is that the expression [child::$\sigma$[position() = $i$]] evaluates to true at a context node $u$, iff: [child::$\sigma$] evaluates to true at $u$, and $u$ is an $i$-th child of some parent.

Positive Core XPath query expressions are usually defined in the literature (cf. e.g., [7]), as those generated by the following grammar:

A ::= self | child | descendant | parent | ancestor | preceding-sibling | following-sibling
S_can ::= A::$\sigma$ | position() = $i$ | S_can and S_can | S_can or S_can
E_can ::= A::$\sigma$[S_can] | E_can[E_can]
Q_can ::= /S_can | /E_can | Q_can/Q_can

We shall refer to the query expressions generated by this grammar as canonical; they can be shown to be of the type /C_1/C_2/.../C_n, where each C_i is of the form A::$\sigma$[X_can], or of the form A::$\sigma$[X_can] conn A'::$\sigma'$[X'_can], with conn $\in$ {and, or}, and X_can, X'_can $\in$ {S_can, E_can, true}; we agree here to identify A::$\sigma$[true] with A::$\sigma$.
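Definition 5 can be read operationally. The Python sketch below evaluates the four axes directly on a trdag stored as two dictionaries; the function names, the dictionary encoding, the use of `"*"` as a wildcard node test and the sample document are assumptions made for illustration only. It is a naive evaluator, not the automaton-based method developed in the paper.

```python
def _matches(symbol, sigma):
    """Node test: sigma is a node name, or '*' for any name."""
    return sigma == "*" or symbol == sigma


def parents(children, u):
    """Parents(u) = {v | u is a child of v}."""
    return {v for v, kids in children.items() if u in kids}


def parent_axis(name, children, u, sigma):
    """[parent::sigma] at u: some node in Parents(u) is sigma-named."""
    return any(_matches(name[v], sigma) for v in parents(children, u))


def ancestor_axis(name, children, u, sigma):
    """[ancestor::sigma] at u: a sigma-named node is reachable through Parents."""
    seen, todo = set(), list(parents(children, u))
    while todo:
        v = todo.pop()
        if v in seen:
            continue
        seen.add(v)
        if _matches(name[v], sigma):
            return True
        todo.extend(parents(children, v))
    return False


def following_sibling_axis(name, children, u, sigma):
    """[following-sibling::sigma] at u: some gamma(v) reads ... u ... u' ..."""
    return any(w == u and any(_matches(name[x], sigma) for x in kids[i + 1:])
               for kids in children.values() for i, w in enumerate(kids))


def preceding_sibling_axis(name, children, u, sigma):
    """[preceding-sibling::sigma] at u: some gamma(v) reads ... u' ... u ..."""
    return any(w == u and any(_matches(name[x], sigma) for x in kids[:i])
               for kids in children.values() for i, w in enumerate(kids))


# Hypothetical document f(c, a(b), b, c): the nodes named c and b are shared.
name = {"n1": "f", "n2": "c", "n3": "a", "n4": "b"}
children = {"n1": ["n2", "n3", "n4", "n2"], "n2": [], "n3": ["n4"], "n4": []}
```

On this sample, the shared node named `c` has a `b` among both its following and its preceding siblings, something that cannot happen for a single node of a tree whose siblings carry pairwise distinct names.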
Any such positive Core XPath query expression can be translated into one that is in standard form, i.e., where the format of the sub-queries is of the type axis::β; we formalize this idea now. We shall refer to the axes self, child, descendant, parent, ancestor, preceding-sibling, following-sibling as basic. A basic Core XPath query is a query of the form //*[axis::β], where axis is a basic axis. More generally, the queries we propose to evaluate on trdags are defined formally as the expressions Q_std generated by the following grammar, where β stands for any node name on the documents, or for * (meaning "any"):

A ::= self | child | descendant | parent | ancestor | preceding-sibling | following-sibling
S ::= A::β | position()= i | S and S | S or S | Root
E ::= A::β[S] | E[E]
Q_std ::= //* | //*[S] | //*[E]

Core XPath queries Q_std of the format generated by this grammar are said to be in standard form; to be able to handle any positive Core XPath query with such a grammar, we have introduced a special predicate called Root, deemed true only at the root node of the trdag considered. By the evaluation of a given query expression $Q$ on any trdag $t$, we mean the assignment: $t \mapsto$ the set of all context nodes on $t$ where the expression $Q$ evaluates to true (following the conventions of Definition 2); this latter set is also called the answer for $Q$ on $t$. Two given queries $Q_1$, $Q_2$ are said to be equivalent

iff, on any trdag $t$, the answer sets for $Q_1$ and $Q_2$ are the same. Any positive Core XPath query Q_can can be translated into an equivalent one in standard form; e.g., /c[following-sibling::g]/d is equivalent to //*[self::d and parent::*[Root and self::c[following-sibling::g]]] in standard form. An inductive procedure performing such a translation in the general case (of linear complexity w.r.t. the number of location steps in Q_can) is given in Appendix I. The following proposition results from Definition 5.

Proposition 3
(1) For any set of nodes $X$ on a trdag $t$, and any axis $A$, we have:
$$A(X) = \bigcup_{\substack{x \in X,\ \alpha \in pos_t(x) \\ \alpha = i_1 \ldots i_k}} \{\, \text{/child::*[position()=}\, i_1 \text{]/.../child::*[position()=}\, i_k \text{]/}A\text{::*} \,\}$$
(2) For any trdag $t$, and any node with name β on $t$, we have:
(i) //*[preceding::β] = {descendant-or-self(following-sibling($\bigcup_u$ //*[self::u and (descendant::β or self::β)]))}
(ii) //*[following::β] = {descendant-or-self(preceding-sibling($\bigcup_u$ //*[self::u and (descendant::β or self::β)]))}

Finally, following [2], for any set $S$ of nodes on $t$, the sets of nodes following($S$) and preceding($S$) can now be defined formally, as follows:
following($S$) = descendant-or-self(following-sibling(ancestor-or-self($S$))),
preceding($S$) = descendant-or-self(preceding-sibling(ancestor-or-self($S$))).

Note: Unlike on a tree, the ancestor, descendant, following, self and preceding axes do not partition the set of nodes on a trdag $t$, in general.

4 Automata for the Basic Core XPath Queries

4.1 The Semantics of the Approach

We first consider basic Core XPath queries. Composite or imbricated queries will subsequently be evaluated in a stepwise fashion; see Section 6. To any basic query $Q$ = //*[axis::β], we shall associate a word automaton (actually a transducer), referred to as $A_Q$. It will run top-down, on the rlag $D_t$ if axis is non-sibling, and on each of the chiblings $F$ of $L_t$ otherwise.
In either case, a run will attach, to any node traversed, a pair of the form $(l, x)$, where the component $l$ of the pair has the intended semantics of selection or not, by $Q$, of the corresponding node on $t$, and the component $x$ will be 1 or 0, with the intended semantics that $x = 1$ iff the corresponding node on $t$ has a descendant answering $Q$. At the end of the run, $label(u)$, at any node $u$ of $D_t$, will be replaced by a new label derived from the llab-pairs attached to $u$ by the run.

To formalize these ideas, we introduce a set of new symbols $L = \{s, \eta, \top, \bar{\top}\}$ referred to as llabels (the term llabel is used so as to avoid confusion with the term label). We define llab-pairs as elements of the set $L \times \{0, 1\}$, and the states of $A_Q$ as elements of the set $\{init\} \cup (L \times \{0, 1\})$. For any $Q$, the automaton $A_Q$ is over the alphabet $\Sigma \cup \{s, \eta\}$, has $init$ as its initial state, and has no final state. The set $\Delta_Q$ of transitions of $A_Q$ will consist of rules of the form $(q, \tau) \to q'$ where $q \in \{init\} \cup (L \times \{0, 1\})$, $q' \in (L \times \{0, 1\})$, and $\tau \in \Sigma \cup \{s, \eta\}$. For any rlag $G$, we define a function $llab: Nodes(G) \to \Sigma \cup \{s, \eta\}$, by setting $llab(u) = \pi_1(label(u))$, the first component of $label(u)$. The automaton $A_Q$ associated to a basic query $Q$ = //*[axis::β] will run top-down on the rlag $G$,

where $G$ is $D_t$ if axis is a basic non-sibling axis, and $G$ is any chibling $F$ of $L_t$ if axis is a basic sibling axis. A run of $A_Q$ on $G$ is a map $r: Nodes(G) \to L \times \{0, 1\}$, such that, for every $u \in Nodes(G)$, the following holds:
- if $u$ is $root_G$, then the rule $(init, llab(u)) \to r(u)$ is in $\Delta_Q$;
- otherwise, for every $v \in \gamma(u)$ the rules $(r(u), llab(v)) \to r(v)$ are all in $\Delta_Q$.
(Note: when axis is non-sibling, this amounts to requiring that, for any node $v$, the state $r(v)$ must be in conformity with the states $r(u)$ for every parent node $u$ of $v$, with respect to the rules in $\Delta_Q$.)

From the run of the automaton $A_Q$ and from the states it attaches to the nodes of $D_t$, we will deduce, at every node $u$ of $t$, a well-determined llab-pair as a (new) label at $u$, via the natural bijection between $Nodes(t)$ and $Nodes(D_t)$. The llab-pairs thus attached to the nodes of $t$ will have the following semantics (where $x$ stands for the name of the node $u$ on $t$, corresponding to the current node on $D_t$):
- $(\bar{\top}, 1)$: $x = \beta$, current node on $t$ is selected by (i.e., is an answer for) $Q$;
- $(\top, 1)$: $x = \beta$, current node is not selected, but has a selected descendant;
- $(\top, 0)$: $x = \beta$, current node is not selected, and has no selected descendant;
- $(s, 1)$: $x \neq \beta$, current node is selected;
- $(\eta, 1)$: $x \neq \beta$, current node is not selected, but has a selected descendant;
- $(\eta, 0)$: $x \neq \beta$, current node is not selected, and has no selected descendant.

Only the nodes on $D_t$, to which the run of $A_Q$ associates the labels $(s, 1)$ or $(\bar{\top}, 1)$, correspond to the nodes of $t$ that will get selected by the query $Q$. The llab-pairs with boolean component 1 will label the nodes of $D_t$ corresponding to the nodes of $t$ which are on a path to an answer for the query $Q$; thus the automata $A_Q$ will have no transitions from any state with boolean component 0 to a state with boolean component 1. Moreover, with a view to defining runs of such automata which are unique (or unambiguous in a sense that will be presently made clear), we define the following priority relations between the llab-pairs: $(\eta, 0) > (\eta, 1) > (s, 1)$, and $(\top, 0) > (\top, 1) > (\bar{\top}, 1)$.
A run of the automaton $A_Q$ will label any node $u$ on $G$ with an llab-pair either from the group $\{(\top, 0), (\top, 1), (\bar{\top}, 1)\}$ or from the group $\{(\eta, 0), (\eta, 1), (s, 1)\}$; and this group is determined by $llab(u)$. For ease of presentation, we agree to set $\bar{\eta} := s$, and often denote either of the above two groups of llab-pairs under the uniform notation $\{(l, 0), (l, 1), (\bar{l}, 1)\}$, where $l \in \{\eta, \top\}$, with the ordering $(l, 0) > (l, 1) > (\bar{l}, 1)$. We shall construct a run $r$ of $A_Q$ on $G$ that will be uniquely determined by the following maximal priority condition:

(MP): at any node $v$ on $G$, $r(v)$ is the maximal llab-pair $(l', x)$ for the ordering $>$ in the group $\{(l, 0), (l, 1), (\bar{l}, 1)\}$ determined by $llab(v)$, such that $A_Q$ contains a transition rule of the form $(r(u), llab(v)) \to (l', x)$, for every parent $u$ of $v$.

Such a run will assign a label with boolean component 1 only to the nodes corresponding to those of the minimal sub-trdag of $t$ containing the root of $t$ and all the answers to $Q$ on $t$.

4.2 Re-labeling of $D_t$ by the Runs of $A_Q$

We first consider a non-sibling basic query $Q$ on a given document $t$, and a given run $r$ of the automaton $A_Q$ on $D_t$; at the end of the run, the nodes on $D_t$ will get re-labeled with new llab-pairs, computed as below for every $u \in Nodes(D_t)$:
$l_r(u) = (s, 1)$ iff $r(u) \in \{(s, 1), (\bar{\top}, 1)\}$,
$l_r(u) = (\eta, 1)$ iff $r(u) \in \{(\eta, 1), (\top, 1)\}$,
$l_r(u) = (\eta, 0)$ iff $r(u) \in \{(\eta, 0), (\top, 0)\}$.
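The priority ordering used in (MP) and the re-labeling function $l_r$ are both small finite maps, and can be written down as such. The sketch below assumes the string encodings `"T"` for $\top$ and `"T_bar"` for $\bar{\top}$, and the helper names `max_priority` and `relabel` are hypothetical; `max_priority` expects candidates drawn from a single group, as in the statement of (MP).

```python
# Rank inside each group: (l, 0) > (l, 1) > (l_bar, 1).
PRIORITY = {
    ("eta", 0): 2, ("eta", 1): 1, ("s", 1): 0,
    ("T", 0): 2, ("T", 1): 1, ("T_bar", 1): 0,
}

def max_priority(candidates):
    """Pick the maximal llab-pair among candidates of one group."""
    return max(candidates, key=PRIORITY.__getitem__)

def relabel(pair):
    """Re-labeling l_r: selected pairs become (s, 1); every other pair
    becomes (eta, x), keeping its boolean component x."""
    if pair in {("s", 1), ("T_bar", 1)}:
        return ("s", 1)
    return ("eta", pair[1])
```

After re-labeling, only the three pairs `("s", 1)`, `("eta", 1)` and `("eta", 0)` remain, so the name of the queried symbol no longer matters for later steps.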

The rlag obtained in this manner from $D_t$, following the run $r$ and the associated re-labeling function $l_r$, will be denoted as $r(D_t)$.

For a basic query $Q$ over a sibling axis, the situation is a little more complex, because several different nodes on one chibling of $L_t$ can have the same name (non-terminal), or several different chiblings can have nodes named by the same non-terminal, or both. Thus, to any node of $D_t$, named with a non-terminal $A$, will correspond in general a set of llab-pairs, assigned by the various runs of $A_Q$ to the $A$-named nodes on the various chiblings of $L_t$. We therefore proceed as follows: for every complete set $\bar{r}$ of runs of $A_Q$, formed of one run $r_F$ on each chibling $F$, we will define $\bar{r}(D_t)$ as the re-labeled rlag derived from $D_t$, under $\bar{r}$. With that purpose we associate to $\bar{r}$ and any $u \in Nodes(D_t)$, a set of llab-pairs:
$$llab_{\bar{r}}(u) = \bigcup_{r_F \in \bar{r}} \{\, r_F(v) \mid v \in Nodes(F), \text{ and } name(v) = name(u) \,\}.$$
We then derive, at each node of $D_t$, a unique llab-pair in conformity with the semantics of our approach, by using the following function:
$\lambda_{\bar{r}}(u) = s$ iff $llab_{\bar{r}}(u) \cap \{(s, 1), (\bar{\top}, 1)\} \neq \emptyset$,
$\lambda_{\bar{r}}(u) = \eta$ iff $llab_{\bar{r}}(u) \cap \{(s, 1), (\bar{\top}, 1)\} = \emptyset$.
From $D_t$ and this function $\lambda_{\bar{r}}$, we next derive an rlag $\lambda_{\bar{r}}(D_t)$ by re-labeling each node $u$ on $D_t$ with the pair $(\lambda_{\bar{r}}(u), \_)$. And finally we define $\bar{r}(D_t)$ as the rlag obtained from $\lambda_{\bar{r}}(D_t)$, by running on it the automaton for the basic non-sibling query //*[self::s], as indicated at the beginning of this subsection. In practical terms, such a run amounts in essence to setting, as the second component of $label(u)$ at any node $u$, the boolean 1 iff $u$ is on a path to some node with llabel $s$, and 0 otherwise. All these details are illustrated with an example in the following subsection.

4.3 The Automata

We first present the automata for the basic queries //*[self::β] and for //*[following-sibling::β], and give an illustrative example using the former for β = s, and the latter for β = b. The automata for the other basic queries are given after the example.
Automata for //*[self::β] and for //*[following-sibling::β]
[Transition diagrams of the two automata, with transitions labeled β and γ ≠ β]

Figure 4 below illustrates the evaluation of $Q$ = //*[following-sibling::b], on the trdag $t$ of Figure 3. We first use the automaton for the basic query //*[following-sibling::β] with β = b, and then the automaton for //*[self::β] with β = s. The sub-trdag of $t$, formed of nodes corresponding to those of $\bar{r}(D_t)$ with labels having boolean component 1, contains all the answers to $Q$ on $t$.

[Figure 4 panels: run $r_0$ on $F_0$; run $r_1$ on $F_1$; run $r_3$ on $F_3$; labels induced by $r_0, r_1, r_3$ on $D_t$; the rlag $\lambda_{\bar{r}}(D_t)$; run of the automaton for //*[self::s] on $\lambda_{\bar{r}}(D_t)$; final re-labeled rlag $\bar{r}(D_t)$]

Fig. 4.

Automaton for the query //*[parent::β]
[Transition diagram]

Automaton for the query //*[ancestor::β]
[Transition diagram]

Automaton for the query //*[child::β]
[Transition diagram]

Automaton for the query //*[preceding-sibling::β]
[Transition diagram]

Automaton for the query //*[descendant::β]
[Transition diagram]

A few words on some of the automata by way of explanation. First, the reason why the automaton for self does not have the states $(\top, 0)$, $(\top, 1)$, $(s, 1)$: for $(\top, 0)$, $(\top, 1)$, by the semantics of subsection 4.1 we must have $x = \beta$, where $x$ is the name of the current node on $t$, but then the query //*[self::β] should select the current node, so one cannot be at such a state; as for $(s, 1)$, the reasoning is just the opposite. Next, the reason why the automaton for descendant does not have the states $(\eta, 1)$, $(\top, 1)$: if the semantics attributed one of these pairs to any node $u$, that would mean the node $u$ has a selected descendant $u'$; which means that $u'$ has some β-descendant node, which would then be a β-descendant for $u$ too, so $Q$ should select $u$.

5 Maximal Priority Runs of Basic Query Automata

Note that the following properties, required by our semantics of subsection 4.1, hold on the automata $A_Q$ constructed above, for any basic Core XPath query $Q$ = //*[axis::β]:
i) There are no transitions from any state with boolean component 0 to a state with boolean component 1;
ii) The β-transitions have all their target states in $\{(\top, 0), (\top, 1), (\bar{\top}, 1)\}$; and for any $\gamma \neq \beta$, the target states of γ-transitions are all in $\{(\eta, 0), (\eta, 1), (s, 1)\}$.

Theorem 1 Let $Q$ be any basic Core XPath query, $t$ any given trdag, and let $G$ denote either the rlag $D_t$, or any given chibling $F$ of $L_t$. Assume given a labeling function $L$ from $Nodes(G)$ into the set of llab-pairs, which is correct with respect to $Q$, i.e., in conformity with the semantics of subsection 4.1. Then there is a run $r$ of the automaton $A_Q$ on $G$, such that:
i) $r$ is compatible with $L$; i.e., $r(u) = L(u)$ for every node $u$ on $G$;
ii) $r$ satisfies the maximal priority condition (MP) of subsection 4.1.

Proof. We first construct, by induction, a complete run (i.e., defined at all the nodes of $G$) satisfying property i). For that, we shall employ reasonings that will be specific to the axis of the basic query $Q$. We give here the details only for the axis parent; they are similar for the other axes.

$Q$ = //*[parent::β]: (The axis considered is non-sibling so $G = D_t$ here.) At the root node $u$ of $D_t$, we set $r(u) = L(u)$; we have to show that there is a transition rule in $A_Q$ of the form $(init, llab(u)) \to L(u)$. Obviously, for the axis parent, the root node $u$ cannot correspond to a node on $t$ selected by $Q$, so the only llab-pairs possible for $L(u)$ are $(l, 0)$, $(l, 1)$, with $l \in \{\eta, \top\}$; for each of these choices, we do have a transition rule of the needed form, on $A_Q$. Consider then a node $v$ on $D_t$ such that, at each of its ancestor nodes $u$ on $D_t$, the part of the run $r$ of $A_Q$ has been constructed such that $r(u) = L(u)$; assume that the run cannot be extended at the node $v$ by setting $r(v) = L(v)$. This means that there exists a parent node $w$ of $v$, such that $(L(w), llab(v)) \to L(v)$ is not a transition rule of $A_Q$; we shall then derive a contradiction.
We only have to consider the cases where the boolean component of $L(w)$ is greater than or equal to that of $L(v)$. The possible couples $L(w)$, $L(v)$ are then respectively:

$L(w)$: $(\top, 0)$ $(\top, 1)$ $(\top, 1)$ $(\bar{\top}, 1)$ $(\bar{\top}, 1)$
$L(v)$: $(\eta, 0)$ $(\top, 1)$ $(\eta, 1)$ $(\top, 1)$ $(\eta, 1)$

In all cases, we have $llab(w) = \beta$ because of the semantics, so the node (on $t$ corresponding to the node) $v$ has a β-parent, so must be selected; thus the above choices for $L(v)$ are not in conformity with the semantics; contradiction.

We now prove that the complete run $r$ thus constructed satisfies property ii). For this part of the proof, the reasoning does not need to be specific for each $Q$; so, write $Q$ more generally, as //*[axis::β] for some given β. Suppose the run $r$ does not satisfy the maximal priority condition at some node $v$ on $G$; assume, for instance, that the run $r$ made the choice, say, of the llab-pair $(l, 1)$, although the maximal labeling of the node $v$, in a manner compatible with the llab-pairs of all its parents, was the llab-pair $(l, 0)$. Since $L$ is assumed correct, and $r$ is compatible with $L$, the maximal possible labeling $(l, 0)$ would mean that the node (on $t$ corresponding to the node) $v$ has no descendant selected by $Q$; whereas the choice that $r$ is assumed to have made at $v$, namely the llab-pair $(l, 1)$, has the opposite semantics whether or not $llab(v) = \beta$; in other words, the labeling $L$ would not be correct with respect to $Q$; contradiction. The other possibilities for the bad labelings under $r$ also get eliminated in a similar manner.

Theorem 2 Let $Q$, $t$, $D_t$, $F$, $G$ be as above. Let $r$ be a (complete) run of the automaton $A_Q$ on $G$, which satisfies the maximal priority condition (MP) of

subsection 4.1. Then the labeling function $L$ on $Nodes(G)$, defined as $L(u) = r(u)$ for any node $u$, is correct with respect to the semantics of subsection 4.1.

Proof. Let us suppose that the labeling $L$ deduced from $r$ is not correct with respect to $Q$; we shall then derive a contradiction. The reasoning will be by case analysis, which will be specific to the axis of the basic query $Q$ considered. We give the details here for $Q$ = //*[descendant::β]. The axis is non-sibling, so we have $G = D_t$ here. The sets $Nodes(t)$, $Nodes(D_t)$ are in natural bijection, so for any node $u$ on $D_t$ we shall also denote by $u$ the corresponding node on $t$, in our reasonings below.

We saw that the automaton $A_Q$ for the descendant axis does not have the states $(\eta, 1)$, $(\top, 1)$. Consider then a node $u$ on $D_t$ such that: for all ancestor nodes $w$ of $u$, the llabel $r(w)$ is in conformity with the semantics, but the llab-pair $r(u)$ is not in conformity. Now, $A_Q$ has only 5 states: $(init)$, $(\bar{\top}, 1)$, $(s, 1)$, $(\top, 0)$, $(\eta, 0)$, of which only the last four can label the nodes. So the possible bad choices that $r$ is assumed to have made at our node $u$ are as follows:
(a) $r(u) = (\bar{\top}, 1)$, but the node $u$ is not an answer to the query $Q$. Here $name(u)$ must be β, so the choice of $r$ ought to have been $(\top, 0)$;
(b) $r(u) = (s, 1)$, but the node $u$ is not an answer to the query $Q$. Here $name(u) \neq \beta$, so the choice of $r$ ought to have been $(\eta, 0)$;
(c) $r(u) = (\eta, 0)$, but the node $u$ is an answer to the query $Q$. Here $name(u) \neq \beta$, so the choice of $r$ ought to have been $(s, 1)$;
(d) $r(u) = (\top, 0)$, but the node $u$ is an answer to the query $Q$. Here $name(u)$ must be β, so the choice of $r$ ought to have been $(\bar{\top}, 1)$.

In all the four cases, we have to show: i) that the ought-to-have-been choice llab-pair is reachable from all the parent nodes of $u$; ii) and that, with such a new and correct choice made at $u$, $r$ can be completed from $u$ into a run on the entire dag $D_t$. The reasoning will be similar for cases (a), (b), and for the cases (c), (d). Here are the details for case (a): That $u$ is not an answer to $Q$ means that $u$ has no β-descendant node, so for all nodes $v$ below $u$ on $D_t$, we have $llab(v) \neq \beta$.
Therefore, assertions i) and ii) above follow from the following observations on the automaton for $Q$ = //*[descendant::β]: i) if $r$ could reach the state $(\bar{\top}, 1)$ at node $u$ (via a β-transition) from any parent node of $u$, then $(\top, 0)$ is also reachable thus at $u$, from any of them; ii) if, from the state $(\bar{\top}, 1)$, $r$ could reach all the nodes on $D_t$ below $u$ (with state $(\eta, 0)$), via transitions over $\gamma \neq \beta$, then it can do exactly the same now, with the correct choice llab-pair $(\top, 0)$ at $u$.

As for case (c): Node $u$ is an answer to $Q$ here, so $u$ has a β-descendant; let $v$ be a β-node below $u$ on $D_t$; the llab-pair $r(v)$ that $r$ assigns to $v$ must then be either $(\bar{\top}, 1)$ or $(\top, 0)$; this implies that $r$ passed from the state $(\eta, 0)$ supposedly assigned by $r$ to $u$ to $(\bar{\top}, 1)$ or $(\top, 0)$ somewhere between $u$ and $v$; which is impossible, as is easily seen on the automaton $A_Q$ for the axis descendant considered. The reasoning for case (d) is even easier: from state $(\top, 0)$, no state with an outgoing β-transition is reachable.

6 Evaluating Composite Queries

A composite query is a query in standard form that is not basic. We propose to evaluate such a query incrementally. For this, it suffices to consider queries that are of the form //*[A::x conn A'::x'], where conn $\in$ {and, or}, or of the form //*[A_1::*[A_2::β]]. For those of the former type, we observe first that the components in disjunction (resp. conjunction) under * can be evaluated separately. Indeed, the answer for $Q$ = //*[A::x conn A'::x'] can

be obtained as a union (resp. intersection) of the answers for the two component queries //*[A::x], and //*[A'::x'], when conn is an or (resp. an and). We apply the method described earlier, separately for $Q_1$ = //*[A::x] and for $Q_2$ = //*[A'::x'], thus getting two respective evaluating runs $r_1$, $r_2$. Any node $u$ of the dag $D_t$ will then be re-labeled, by the composite query $Q$, with llab-pairs computed by a function AND when conn = and (resp. OR when conn = or), in conformity with the semantics presented in Section 4.1:

AND($u$) = $(s, 1)$ iff $r_1(u) = (\bar{l}, 1) = r_2(u)$;
AND($u$) = $(\eta, 0)$ iff $r_1(u) = (l, 0)$ or $r_2(u) = (l, 0)$;
AND($u$) = $(\eta, 1)$ otherwise.

OR($u$) = $(s, 1)$ iff $r_1(u) = (\bar{l}, 1)$ or $r_2(u) = (\bar{l}, 1)$;
OR($u$) = $(\eta, 0)$ iff $r_1(u) = (l, 0) = r_2(u)$;
OR($u$) = $(\eta, 1)$ otherwise.

Figure 5 below illustrates the above reasoning, for the evaluation of the composite query $Q$ = //*[self::b and parent::a], on the trdag $t$ of Figure 3:

[Figure 5 panels: labeling of $D_t$ for //*[self::b]; labeling of $D_t$ for //*[parent::a]; combined labeling and($D_t$)]

Fig. 5.

We next consider the queries of the form $Q$ = //*[A_1::*[A_2::β]], with imbricated predicates. For their evaluation, we first consider a maximal priority evaluating run $r_2$ (resp. set of runs $\bar{r}_2$) of the automaton associated to the inner query //*[A_2::β], on $D_t$ (resp. the set of all chiblings of $L_t$). This run (resp. set of runs) will output the rlag $r_2(D_t)$ (resp. $\bar{r}_2(D_t)$), as described in Section 4.2. Evaluating the imbricated query $Q$ on the dag $t$ is then done by running the automaton for the basic outer query //*[A_1::s] on $r_2(D_t)$ (resp. $\bar{r}_2(D_t)$).

Finally, the answer for a query of the type $Q$ = //*[child::x[position()= k]] is the subset of the nodes answering //*[child::x] which correspond to a $k$-th node on some chibling.
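The AND and OR combination rules of this section can be sketched as two small functions. The sketch assumes that both runs have already been re-labeled as in Section 4.2, so that each node carries one of the three pairs `("s", 1)`, `("eta", 1)`, `("eta", 0)`; this normalization, the string encodings and the function names are assumptions of the example.

```python
SELECTED = ("s", 1)

def and_pair(p1, p2):
    """AND(u): selected iff selected by both runs; boolean 0 as soon as
    one of the runs has boolean component 0; (eta, 1) otherwise."""
    if p1 == SELECTED and p2 == SELECTED:
        return ("s", 1)
    if p1[1] == 0 or p2[1] == 0:
        return ("eta", 0)
    return ("eta", 1)

def or_pair(p1, p2):
    """OR(u): selected iff selected by at least one run; boolean 0 only if
    both runs have boolean component 0; (eta, 1) otherwise."""
    if p1 == SELECTED or p2 == SELECTED:
        return ("s", 1)
    if p1[1] == 0 and p2[1] == 0:
        return ("eta", 0)
    return ("eta", 1)

def combine(run1, run2, op):
    """Apply op node by node to two labelings given as dicts over the same nodes."""
    return {u: op(run1[u], run2[u]) for u in run1}
```

For instance, a node selected by the first component query but labeled `("eta", 0)` by the second is mapped to `("eta", 0)` by `and_pair` and to `("s", 1)` by `or_pair`.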
7 Deriving the Answer on the Tree-equivalent

Given a Core XPath query $Q$ and its answer set on a trdag $t$, we show here how to derive the answer for the same query $Q$ on the tree-equivalent $\hat{t}$ of $t$; this is of importance, since the standard model for an XML document (even when given in a compressed form) is generally considered as the tree representation of the document. We observe, to start with, that the answer set for $Q$ on $t$ is in general a superset of the answer set for $Q$ on the tree-equivalent $\hat{t}$. This can be so for the following two reasons:

(i) If a certain node $u$ on $t$ is selected by $Q$, not all of the nodes $u'$ on $\hat{t}$, that are lifts of $u$ under the compression map $c$ on $Nodes(\hat{t})$, may answer the

query $Q$ on the tree $\hat{t}$, even when $Q$ is a basic query. For instance, consider the basic query //*[parent::a]; on the fully compressed tdag f(a(c), b(c)), the (unique) node named c is an answer; it has two c-named nodes as lifts on the tree-equivalent $\hat{t}$, of which only one is an answer for the query.

(ii) A node $u$ on a trdag $t$ may answer a composite query $Q$, but none among the lifts of $u$ on $\hat{t}$ may answer the same query $Q$ on the tree $\hat{t}$. For instance, the unique c-named node on the compressed tdag f(a(c), b(c)) answers the query //*[parent::a and parent::b], but there is no node on the tree-equivalent answering this query.

Actually, such situations arise only for queries involving the upward axes parent, ancestor, which define relations that are less trivial on trdags than on trees. We can formulate this observation more precisely, as follows:

Lemma 1. Let $A$ be one of the axes self, child, descendant, $Q$ the basic query //*[A::x], $t$ any given trdag, $\hat{t}$ its tree-equivalent, $u$ any given node on $t$, and $u' \in c^{-1}(u)$ any node lift of $u$ on $\hat{t}$. Then: wrt the maximal priority runs of the automaton for the axis $A$, respectively on $D_t$ and $D_{\hat{t}}$, the nodes $u$ on $t$, and $u'$ on $\hat{t}$, get labeled by the same llab-pair; in particular, the node $u$ answers $Q$ on $t$ if and only if the node $u'$ answers the same query $Q$ on the tree $\hat{t}$.

Proof. Follows by observing that the semantics of Section 4.1 have been defined in a manner which is top-down, and that the compression map $c: Nodes(\hat{t}) \to Nodes(t)$ maps the set $Nodes(\hat{t}_{u'})$, of nodes below $u'$ on $\hat{t}$, onto the set of nodes of the sub-trdag $t_u$.

The above lemma is a first step towards the objective of this section. As a second step, we propose to distinguish, on the automata constructed above for the two queries //*[parent::β], //*[ancestor::β], the transitions that will never be fired on a tree; such as, e.g., the one from state $(\eta, 1)$ to state $(\bar{\top}, 1)$.
(Note: for the transition from $(\eta, 1)$ to $(\bar{\top}, 1)$ to be firable, we have to reach a node corresponding to a β-named node on the trdag, which must then also have as (unique) parent on the tree, i.e., the node from which the transition is to be fired, a β-named node; this parent node cannot correspond then to a node labeled with $(\eta, 1)$.) Such transitions (that are not firable on a tree) will be depicted with dotted arrows on the automaton; the transitions with full arrows are then the ones that are firable both on trdags and on trees. The two automata thus revised are as follows:

Automaton for the query //*[parent::β] -revised
[Transition diagram]

Automaton for the query //*[ancestor::β] -revised
[Transition diagram]

The next step towards our objective of this section consists in completing a maximal priority run $r$ of the automaton for any given basic query, by associating to a current node $u$ on $D_t$ a subset of $Pos_t(u)$ (remember: $Nodes(t)$ and $Nodes(D_t)$ are in natural bijection), denoted as $P_r(u)$, and defined as follows:
- case $u$ selected (i.e. the label of $u$ under $r$ is $(s, 1)$ or $(\bar{\top}, 1)$): we set $P_r(u) = Pos_t(u) \setminus \bigcup_{\alpha, i} \{ \alpha.i \mid \alpha.i \in Pos_t(u) \}$, the union being taken over the positions $\alpha$ of parent nodes $v$ on $t$ such that the transition from $v$ to $u$ is dotted;
- case $u$ not selected (i.e. the label of $u$ under $r$ is neither $(s, 1)$ nor $(\bar{\top}, 1)$): we set here $P_r(u) = Pos_t(u)$.

A run $r$ completed in this manner will be denoted in boldface type as $\mathbf{r}$, giving thus a map $\mathbf{r}: Nodes(t) \to L \times \{0, 1\} \times Pos(t)$, defined by $u \mapsto (r(u), P_r(u))$.

In order to derive the answer to a composite query $Q$ on the tree-equivalent of $t$, from the answer for $Q$ on $t$, we need naturally to complete the functions AND and OR of Section 6, by adding a component giving the selected positions for a query $Q$ which is a conjunction or disjunction of two sub-queries $Q_1$, $Q_2$. These completed functions, again denoted in boldface as $\mathbf{AND}$ and $\mathbf{OR}$, are defined below in a rather obvious manner (the indices 1, 2 correspond to the runs wrt the two queries, and $l_1$, $l_2$ stand for $s$ or $\eta$; recall that $\bar{\eta}$ stands for the llabel $s$):

$\mathbf{AND}(u) = ((s, 1), P_{r_1}(u) \cap P_{r_2}(u))$ iff $\mathbf{r}_1(u) = ((l_1, 1), P_{r_1}(u))$ and $\mathbf{r}_2(u) = ((l_2, 1), P_{r_2}(u))$;
$\mathbf{AND}(u) = ((\eta, 0), Pos_t(u))$ iff $r_1(u) = (l_1, 0)$ and $r_2(u) = (l_2, 0)$;
$\mathbf{AND}(u) = ((\eta, 1), Pos_t(u))$, otherwise.

$\mathbf{OR}(u) = ((s, 1), P_{r_1}(u) \cup P_{r_2}(u))$ iff $\mathbf{r}_1(u) = ((l_1, 1), P_{r_1}(u))$ and $\mathbf{r}_2(u) = ((l_2, 1), P_{r_2}(u))$;
$\mathbf{OR}(u) = ((s, 1), P_{r_1}(u))$ iff $r_1(u) = (l_1, 1)$ and $r_2(u) \neq (l_2, 1)$;
$\mathbf{OR}(u) = ((s, 1), P_{r_2}(u))$ iff $r_2(u) = (l_2, 1)$ and $r_1(u) \neq (l_1, 1)$;
$\mathbf{OR}(u) = ((\eta, 0), Pos_t(u))$ iff $r_1(u) = (l_1, 0)$ and $r_2(u) = (l_2, 0)$;
$\mathbf{OR}(u) = ((\eta, 1), Pos_t(u))$, otherwise.

Example 2.
We evaluate the query $Q$ = //*[ancestor:: [parent::c]] on the trdag $t$ presented to the left of Figure 6; the standard form of this query is $Q$ = //*[ancestor::*[self:: and parent::c]]; and its answer consists of all the nodes having an ancestor with parent c. But we want here to obtain the same answer for $Q$ on $t$ and on its tree-equivalent $\hat{t}$ presented to the right of Figure 6. To find such an answer, it is necessary to use the revised automata for the parent and ancestor axes. For ease of comprehension, we illustrate the evaluation of $Q$ directly on the trdag $t$ (and not on the rlag $D_t$); this is possible because, for this document, the trdag $t$ and its rlag $D_t$ are isomorphic. Note that each node $u$ of $t$ is represented by its name and the set of positions of the nodes on $\hat{t}$ that are lifts of $u$. First, look at Figure 7 where we have presented the evaluation of $Q$ using the non-revised automata. We obtain then an answer on $t$ selecting nodes and g (which are the nodes of $t$ having an ancestor with parent c); but, if we unfold this answer, we obtain the tree with as selected nodes: at positions 11, 211, and