Efficient Reasoning about a Robust XML Key Fragment


SVEN HARTMANN
Clausthal University of Technology
and
SEBASTIAN LINK
Victoria University of Wellington

We review key constraints in the context of XML as introduced by Buneman et al. We demonstrate that (1) one of the proposed inference rules is not sound in general, and (2) the inference rules are incomplete for XML key implication, even for non-empty sets of simple key paths. This shows, in contrast to earlier statements, that the axiomatisability of XML keys is still open, and that efficient algorithms for deciding their implication still need to be developed. Solutions to these problems have a wide range of applications, including consistency validation, XML schema design, data exchange and integration, consistent query answering, XML query optimisation and rewriting, and indexing. In this paper, we investigate the axiomatisability and implication problem for XML keys with non-empty sets of simple key paths. In particular, we propose a set of inference rules that is indeed sound and complete for the implication of such XML keys. We demonstrate that this fragment is robust by showing the duality of XML key implication to the reachability problem of fixed nodes in a suitable digraph. This enables us to develop a quadratic-time algorithm for deciding implication, and shows that reasoning about this XML key fragment is practically efficient. Therefore, XML applications can be unlocked effectively since they benefit not only from those XML keys specified explicitly by the data designer but also from those that are specified implicitly.

Categories and Subject Descriptors: H.2.1 [Database Management]: Languages - Data description languages (DDL); F.4.3 [Mathematical Logic and Formal Languages]: Formal Languages - Decision problems; G.2.2 [Discrete Mathematics]: Graph Theory - Graph algorithms, Trees

General Terms: Algorithms; Design; Management; Theory

Additional Key Words and Phrases: XML data, XML key, Axiomatisation, Implication, Reachability

This is a preliminary release of an article accepted by ACM Transactions on Database Systems. The definitive version is currently in production at ACM and, when released, will supersede this version.

This research is supported by the Marsden fund council from Government funding, administered by the Royal Society of New Zealand. S. Hartmann is supported by a research grant of the Alfried Krupp von Bohlen and Halbach Foundation, administered by the German Scholars Organization.

Authors' addresses: S. Hartmann, Department of Informatics, Julius-Albert-Str. 4, Clausthal-Zellerfeld, Germany, sven.hartmann@tu-clausthal.de; S. Link, School of Information Management, Victoria University of Wellington, PO Box 600, Wellington 6015, New Zealand, sebastian.link@vuw.ac.nz.

Copyright 200x by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) , or permissions@acm.org. ACM /20YY/ $5.00 ACM Transactions on Database Systems, Vol. V, No. N, Month 20YY, Pages 1-33.

1. INTRODUCTION

The eXtensible Markup Language (XML) [Bray et al. 2006] has evolved into the standard for data exchange on the Web, and also represents a uniform model for data integration. It provides a high degree of syntactic flexibility but has little to offer for specifying the semantics of its data. Consequently, the study of integrity constraints has been recognised as one of the most important yet challenging areas of XML research [Fan 2005; Suciu 2001; Vianu 2003; Widom 1999]. Several classes of integrity constraints have been defined for XML, including keys [Buneman et al. 2002], path constraints [Buneman et al. 2001; Buneman et al. 2000], inclusion constraints [Fan and Libkin 2002; Fan and Siméon 2003] and functional dependencies [Arenas and Libkin 2004; Hartmann and Trinh 2006; Vincent et al. 2004]. However, for almost all classes of constraints the complex structure of XML data results in decision problems that are intractable. It is therefore a major challenge to find natural and useful classes of XML constraints that can be reasoned about efficiently [Fan 2005; Fan and Libkin 2002; Fan and Siméon 2003; Suciu 2001; Vianu 2003].

Prime candidates for such classes are absolute and relative keys [Buneman et al. 2002; 2003], which are defined on the basis of a tree model for XML as proposed by DOM [Apparao et al. 1998] and XPath [Clark and DeRose 1999], but independently of schema specifications such as DTDs or XSDs [Thompson et al. 2004]. Figure 1 shows such a representation in which nodes are annotated by their type: E for element, A for attribute, and S for string (PCDATA).
Keys are defined in terms of path expressions, and determine nodes either relative to a set of context nodes or relative to the root. Nodes are determined by (complex) values on some selected descendant nodes. In Figure 1, an example of a reasonable absolute key is that the isbn values identify the book node. That is, the isbn descendant nodes of different book nodes must have different values. In contrast, an author cannot be identified in the entire tree by its first.S and last.S descendant nodes since the same author may have written more than one book. However, the author can indeed be identified by its first.S and last.S descendant nodes relative to the book node. That is, for each individual book node, different author descendant nodes must differ on their first.S or last.S descendant nodes.

1.1 Some Applications

XML keys have a variety of applications, e.g., in schema design, query rewriting and optimisation, efficient storing and updating, data exchange and integration, consistent query answering and data cleaning [Fan 2005]. The full potential of these applications will only be unlocked if algorithms are established that can reason about XML keys efficiently. For instance, if the implication of

XML keys can be decided efficiently, then one can capitalise not only on those keys that were specified explicitly by the database designer but also on those that were specified implicitly.

[Fig. 1. XML data fragment and its tree representation. The fragment contains two book elements under the root db, each with an isbn attribute: "Foundations of Databases" with author Victor Vianu, and "ICDT 2001" with authors Victor Vianu and Jan van den Bussche.]

Assume, for example, that the most frequent type of XPath query asks for the titles of books co-authored by a particular author, e.g., //book[/author/text() = "Victor Vianu"]/title. The layout of the XML data in Figure 1 is not well-suited for processing these queries since book-nodes cannot be identified by their author-children in the entire document (this key is not specified implicitly). In particular, if author-nodes are encrypted, then the evaluation of the query requires us to decrypt the author-nodes of every single book-node. Similarly, updates of author-nodes are time-consuming. In this situation the layout illustrated in Figure 2 represents a better choice. Books are now listed by their authors, who need to be stored only once. Thus, redundancies with respect to authors are eliminated. Instead, we have redundancies with respect to books, but these result in an efficient processing of the frequent type of queries. Moreover, updates of author information will not cause any processing difficulties. Finally, to process our original query, author-nodes only need to be decrypted until Victor Vianu is found.
Notice that these improvements are due to the new absolute key stating that author-nodes can be identified in the entire document by their first.S- and last.S-values. The original query is rewritten into //author[/text() = "Victor Vianu"]/book/title. The example illustrates that facilities for reasoning about XML keys present us with opportunities for designing XML databases that permit a more efficient processing of frequent queries and updates [Arenas and Libkin 2004; 2005; Vincent et al. 2004].
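The effect of the two layouts can be sketched with Python's standard xml.etree.ElementTree module. This is only an illustration under stated assumptions: the isbn values are placeholders (the figures' values are not reproduced here), the helper names are hypothetical, and the lookup is written in plain Python rather than in the XPath notation used above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Layout of Figure 1 (books list their authors). isbn values are placeholders.
LAYOUT_FIG1 = """
<db>
  <book isbn="isbn-1">
    <title>Foundations of Databases</title>
    <author><first>Victor</first><last>Vianu</last></author>
  </book>
  <book isbn="isbn-2">
    <title>ICDT 2001</title>
    <author><first>Victor</first><last>Vianu</last></author>
    <author><first>Jan</first><last>van den Bussche</last></author>
  </book>
</db>
"""

# Layout of Figure 2 (authors list their books).
LAYOUT_FIG2 = """
<db>
  <author><first>Victor</first><last>Vianu</last>
    <book isbn="isbn-1"><title>Foundations of Databases</title></book>
    <book isbn="isbn-2"><title>ICDT 2001</title></book>
  </author>
  <author><first>Jan</first><last>van den Bussche</last>
    <book isbn="isbn-2"><title>ICDT 2001</title></book>
  </author>
</db>
"""


def names_identify_authors(xml_text):
    """True if no two author elements in the document share first and last name."""
    root = ET.fromstring(xml_text.strip())
    counts = Counter((a.findtext("first"), a.findtext("last"))
                     for a in root.iter("author"))
    return all(n == 1 for n in counts.values())


def titles_by_author(xml_text, first, last):
    """Titles nested under the first matching author (Figure 2 layout only)."""
    root = ET.fromstring(xml_text.strip())
    for author in root.iter("author"):
        if (author.findtext("first"), author.findtext("last")) == (first, last):
            # With the absolute key in place we can stop at the first match.
            return [book.findtext("title") for book in author.findall("book")]
    return []


print(names_identify_authors(LAYOUT_FIG1))
print(names_identify_authors(LAYOUT_FIG2))
print(titles_by_author(LAYOUT_FIG2, "Victor", "Vianu"))
```

In the first layout the name pair occurs under two different book nodes, so it cannot serve as an absolute key for author nodes; in the second layout it can, which is what allows the lookup to stop at the first matching author.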

[Fig. 2. XML data with different layout. Under the root db, the author Victor Vianu has two book children ("Foundations of Databases" and "ICDT 2001") and the author Jan van den Bussche has one ("ICDT 2001"); each book carries an isbn attribute and a title.]

1.2 Previous and Related Work

Constraints have been studied extensively in the context of the relational model of data; excellent surveys include [Fagin and Vardi 1984; Thalheim 1991]. Dependencies have been investigated in other data models, e.g., in nested data models [Hara and Davidson 1999; Hartmann and Link 2008; Paredaens et al. 1989], temporal data models [Chomicki and Niwinski 1995; Jensen et al. 1996], conceptual data models [Hartmann 2001; Lenzerini and Nobili 1990; Liddle et al. 1993; Weddell 1992; Thalheim 2000] and object-oriented data models [Biskup and Polle 2001; Ito and Weddell 1994; Tari et al. 1997]. Recent work on XML constraints includes [Arenas and Libkin 2004; 2005; Buneman et al. 2002; 2003; Buneman et al. 2001; Buneman et al. 2000; Fan and Libkin 2002; Fan and Siméon 2003; Hartmann and Trinh 2006; Hartmann and Link 2007b; 2007a; Vincent et al. 2004; Vincent et al. 2007]; for brief surveys see [Fan 2005; Hartmann et al. 2007].

We continue to study the notion of XML keys introduced and discussed in [Buneman et al. 2002]. In order to unlock the wide range of application domains effectively, it is crucial to define expressive notions of XML keys whose associated decision problems are still tractable. The expressiveness of XML keys is influenced by the choice of a navigational path language, and by the choice of a notion of value equality. In [Buneman et al. 2003] XML keys are defined using (i) path expressions that are formed from node labels by recursive applications of the child and descendant-or-self operators, and (ii) isomorphic subtrees (identity on strings) as a notion of value equality.
Moreover, the set of key path expressions can be empty, i.e., nodes may also be identified (in subtrees) without any data. For instance, in every subtree rooted at a book node there is at most one title-descendant of that book node. The notion of XML keys from [Buneman et al. 2003] is more expressive than the notion originally proposed in [Buneman et al. 2002], where the descendant-or-self operator is not allowed to occur in any key path expression. In [Buneman et al. 2003] an axiomatisation for the general class of XML keys is proposed and, based on this set of inference rules, a heptic-time algorithm for deciding XML key implication is developed. In [Davidson et al. 2007; 2008] an orthogonal fragment of the XML keys from

[Buneman et al. 2002; 2003] was considered. In this fragment, the set of key path expressions is either empty or consists of attributes only. This means that value equality is restricted to the equality of strings on attribute nodes. Although a much more general notion of value equality is utilised in the present paper, our results do not extend those of [Davidson et al. 2007; 2008]: the class of XML keys studied there assigns a semantics to attribute nodes that deviates from the one of previous papers [Buneman et al. 2002; 2003] and of the present paper.

This article extends earlier work [Hartmann and Link 2007b] by providing details on the formal framework, new examples that illustrate the concepts and tools that we introduce and apply, and the proofs of our results (Sections 2, 3 and 4). The proofs provide new insight into the equivalence between the implication problem of XML keys and the reachability problem of fixed nodes in a suitable digraph. We have further motivated the importance of reasoning techniques for XML keys by pointing the reader to potential applications (Sections 1 and 5). Finally, we comment on possible directions for extending the findings of this article (Section 5).

1.3 Contributions

Keys are fundamental to any data model. If XML is ever to carry more of the semantics of its data, then the notion of a key must be studied in detail. This paper highlights the difficulties in developing appropriate notions. In particular, the structural, tree-like relationships between the various elements of an XML data source make it challenging to provide algorithms that can reason efficiently about XML keys. In this paper, we review XML key constraints. We show that one of the inference rules for key implication [Buneman et al. 2003] is only sound for keys with simple key paths [Buneman et al. 2002], but not sound in general as stated in [Buneman et al. 2003].
The incorrectness is not just a minor detail but shows that the choice of a path language for defining XML keys is crucial. We demonstrate that the axiomatisation proposed in [Buneman et al. 2003] is not only incomplete in the general case but already incomplete in the case of non-empty sets of simple key paths, and thus in particular for XML keys as defined in [Buneman et al. 2002]. Since keys have had a significant impact on XML [Buneman et al. 2002; 2003] we believe that it is important to provide an axiomatisation and show that automated reasoning about XML keys is practically efficient. First, we establish an axiomatisation of XML keys with non-empty sets of simple key paths. This axiomatisation can be applied by database designers to enumerate all implied XML keys. In practice, such an enumeration is often desirable, e.g., to validate the correct specification of explicit knowledge, to design and fine-tune XML databases or to optimise XML queries. In particular, the completeness of the inference rules ensures that all opportunities of utilising implicit knowledge for these purposes have been exploited. Our completeness proof is based on a characterisation of key implication in terms of the reachability problem for fixed nodes in a suitable digraph. This duality result demonstrates the robustness of this class of XML keys. Furthermore, our completeness argument showcases the significance of an axiomatisation for finding algorithms that efficiently decide implication. Indeed, our duality result together with the efficient evaluation of Core XPath [Gottlob et al. 2005] enable us to establish a compact algorithm which decides XML key implication in time

quadratic in the size of the input keys. Notice that the original technique resulted in a heptic-time ($n^7$) algorithm, cf. [Buneman et al. 2003]. Our decision algorithm complements the enumeration algorithm by a further reasoning capability that can make efficient, but only partial, decisions about implicit knowledge. These decisions are partial in the sense that the input to this algorithm must also contain a candidate for an implied XML key. In contrast, the enumeration algorithm simply lists all implied keys. We believe that the results of this paper are of great practical significance as the implication problem forms the core of many XML applications.

1.4 Organisation

We use Section 2 to formalise the underlying XML tree model, the navigational path languages, the notion of value equality and the notion of XML keys [Buneman et al. 2002; 2003]. First, we show that the subnodes rule is not sound for the implication of XML keys [Buneman et al. 2003]. Subsequently, we prove that the inference system for XML keys [Buneman et al. 2003] is incomplete, even for XML keys with non-empty sets of simple key paths. For the remainder of the article we study the axiomatisability and tractability of XML keys with non-empty sets of simple key paths. In Section 3 we establish a finite axiomatisation for this fragment of XML keys. In particular, we characterise XML key implication in terms of the reachability problem for fixed nodes in a suitable digraph. This technique is used in Section 4 to develop a provably correct algorithm that decides implication in time quadratic in the size of the given keys. Finally, we comment on future work in Section 5 and conclude in the final section.

2. PREREQUISITES

We review the definition of keys and their properties [Buneman et al. 2002; 2003]. Throughout the paper we assume familiarity with basic concepts from graph theory [Jungnickel 1999].
2.1 The XML Tree Model

It is common to represent XML data by ordered, node-labelled trees. We assume that there is a countably infinite set $E$ denoting element tags, a countably infinite set $A$ denoting attribute names, and a singleton set $\{S\}$ denoting text (PCDATA). We further assume that these sets are pairwise disjoint, and put $L = E \cup A \cup \{S\}$. We refer to the elements of $L$ as labels.

An XML tree is a 6-tuple $T = (V, lab, ele, att, val, r)$ where $V$ denotes a set of nodes, and $lab$ is a mapping $V \to L$ assigning a label to every node in $V$. A node $v \in V$ is called an element node if $lab(v) \in E$, an attribute node if $lab(v) \in A$, and a text node if $lab(v) = S$. Moreover, $ele$ and $att$ are partial mappings defining the edge relation of $T$: for any node $v \in V$, if $v$ is an element node, then $ele(v)$ is a list of element and text nodes in $V$ and $att(v)$ is a set of attribute nodes in $V$. If $v$ is an attribute or text node, then $ele(v)$ and $att(v)$ are undefined. The partial mapping $val$ assigns a string to each attribute and text node: for each node $v \in V$, $val(v)$ is a string if $v$ is an attribute or text node, while $val(v)$ is undefined otherwise. Finally, $r$ is the unique and distinguished root node. $T$ is said to be finite if $V$ is finite, and is said to be empty if $V$ consists of the root node only.
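The tree model can be mirrored by a small data structure. The following is a minimal sketch, not part of the formal development: the class name Node, the field names and the helper functions are hypothetical, att(v) is stored as a list although it is a set in the definition, and the isbn value is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # eq=False keeps identity semantics: distinct nodes never compare equal
class Node:
    kind: str                     # "E" element, "A" attribute, "S" text
    label: str                    # lab(v)
    value: Optional[str] = None   # val(v), defined for attribute and text nodes only
    ele: List["Node"] = field(default_factory=list)  # ele(v): ordered element/text children
    att: List["Node"] = field(default_factory=list)  # att(v): attribute children (a set in the text)


def element(label, *children, **attributes):
    """Build an element node; keyword arguments become attribute nodes."""
    atts = [Node("A", name, value) for name, value in attributes.items()]
    return Node("E", label, None, list(children), atts)


def text(value):
    """Build a text node; its label is always S."""
    return Node("S", "S", value)


def children(v):
    """All nodes w such that (v, w) is an edge of the tree."""
    return v.att + v.ele


def count_nodes(v):
    """Number of nodes in the subtree rooted at v."""
    return 1 + sum(count_nodes(c) for c in children(v))


# One book of the running example; "isbn-1" is a placeholder value.
book = element(
    "book",
    element("title", text("Foundations of Databases")),
    element("author", element("first", text("Victor")), element("last", text("Vianu"))),
    isbn="isbn-1",
)
root = element("db", book)

print([c.label for c in children(book)])  # ['isbn', 'title', 'author']
print(count_nodes(root))                  # 10
```

Comparing nodes by identity rather than by content is a deliberate choice here: the node set V may contain distinct nodes whose subtrees look exactly alike, and keys are precisely about telling such nodes apart.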

For a node $v \in V$, each node $w$ in $ele(v)$ or $att(v)$ is called a child of $v$, and we say that there is an edge $(v, w)$ from $v$ to $w$ in $T$. A path $p$ of $T$ is a finite sequence of nodes $v_0, \ldots, v_m$ in $V$ such that $(v_{i-1}, v_i)$ is an edge of $T$ for $i = 1, \ldots, m$. We call $p$ a path from $v_0$ to $v_m$, and say that $v_m$ is reachable from $v_0$ following the path $p$. The path $p$ determines a word $lab(v_1).\cdots.lab(v_m)$ over the alphabet $L$, denoted by $lab(p)$. For a node $v \in V$, each node $w$ reachable from $v$ is called a descendant of $v$. Note that an XML tree has a tree structure: for each node $v \in V$, there is a unique path from the root node $r$ to $v$.

2.2 Value Equality of Nodes in XML Trees

We can now define value equality for pairs of nodes in an XML tree. Informally, two nodes $u$ and $v$ of an XML tree $T$ are value equal if they have the same label and, in addition, either they have the same string value if they are text or attribute nodes, or their children are pairwise value equal if they are element nodes. More formally, two nodes $u, v \in V$ are value equal, denoted by $u =_v v$, if and only if the subtrees rooted at $u$ and $v$ are isomorphic by an isomorphism that is the identity on string values. That is, two nodes $u$ and $v$ are value equal when the following conditions are satisfied:

(a) $lab(u) = lab(v)$,
(b) if $u, v$ are attribute or text nodes, then $val(u) = val(v)$,
(c) if $u, v$ are element nodes, then
  (i) if $att(u) = \{a_1, \ldots, a_m\}$, then $att(v) = \{a'_1, \ldots, a'_m\}$ and there is a permutation $\pi$ on $\{1, \ldots, m\}$ such that $a_i =_v a'_{\pi(i)}$ for $i = 1, \ldots, m$, and
  (ii) if $ele(u) = [u_1, \ldots, u_k]$, then $ele(v) = [v_1, \ldots, v_k]$ and $u_i =_v v_i$ for $i = 1, \ldots, k$.

Note that the notion of value equality takes the document order of the XML tree into account. For example, the first and second author node (according to document order) in Figure 1 are value equal. We remark that $=_v$ is an equivalence relation on the node set $V$ of the XML tree.
This is easy to observe as value equality between nodes corresponds to isomorphism of the subtrees rooted at these nodes.

2.3 Path Expressions for Node Selection in XML Trees

In order to define keys we need a mechanism for selecting nodes in an XML tree. Path expressions have been widely used for node selection in XML theory and practice, cf. [Clark and DeRose 1999; Suciu 2001]. We are interested in path languages that are expressive enough to be practical, yet sufficiently simple to be reasoned about efficiently. This is the case for the languages $PL$ and $PL_s$ that have been used in [Buneman et al. 2002; 2003] for the definition of XML keys. For the sake of completeness we briefly introduce these languages here.

Let $\_^{*}$ be a distinguished symbol not in $L$. It will serve as a variable-length don't-care wildcard, that is, as a combination of a single-symbol wildcard (denoted by $\_$) and the Kleene star (denoted by $*$). Let $PL$ denote the set of all words over the alphabet $L \cup \{\_^{*}\}$. Further, let $PL_s$ be the subset of $PL$ containing all words over the alphabet $L$. Both $PL$ and $PL_s$ form free monoids with the binary operation of concatenation (denoted by a dot) and the empty word (denoted by $\varepsilon$) as identity element.

Let $P, Q$ be words from $PL$. $P$ is a refinement of $Q$, denoted by $P \precsim Q$, if $P$ is obtained from $Q$ by replacing wildcards in $Q$ by words from $PL$. For example,

$book.author.first$ is a refinement of $\_^{*}.first$. Note that $\precsim$ is a pre-order on $PL$. Let $\sim$ denote the congruence induced by the identity $\_^{*}.\_^{*} = \_^{*}$ on $PL$. Observe that $P \sim Q$ holds if and only if $P$ and $Q$ are refinements of each other.

We now define the semantics of words from $PL$ in the context of XML. Let $Q$ be a word from $PL$. A path $p$ in the XML tree $T$ is called a $Q$-path if $lab(p)$ is a refinement of $Q$. For nodes $v, w \in V$, we write $T \models Q(v, w)$ if $w$ is reachable from $v$ following a $Q$-path in $T$. For example, in the XML tree in Figure 1, all first-nodes are reachable from the root node following a $book.author.first$-path. Obviously, they are also reachable from the root node following a $\_^{*}.first$-path. For a node $v \in V$, let $v[\![Q]\!]$ denote the set of nodes in $T$ that are reachable from $v$ following any $Q$-path, that is, $v[\![Q]\!] = \{w \mid T \models Q(v, w)\}$. As an example, consider the second book-node $v$ in the XML tree in Figure 1. Then $v[\![\_^{*}.first]\!]$ is the set of all first-nodes that are descendants of the second book node. We use $[\![Q]\!]$ as an abbreviation for $r[\![Q]\!]$ where $r$ is the root node of $T$. Thus, $[\![\_^{*}.author]\!]$ is the set of all author-nodes in the entire XML tree.

Recall that each attribute or text node in an XML tree $T$ is a leaf. Therefore, a word $Q$ from $PL$ is said to be valid if it does not have labels $l \in A$ or $l = S$ in a position other than the last one. Note that each prefix of a valid $Q$ is valid, too.

Let $P, Q$ be words from $PL$. $P$ is contained in $Q$, denoted by $P \subseteq Q$, if for every XML tree $T$ and every node $v$ of $T$ we have $v[\![P]\!] \subseteq v[\![Q]\!]$. It follows immediately from the definition that $P \sim Q$ implies $P \subseteq Q$. The containment problem of $PL$ is to decide, given valid $P$ and $Q$ from $PL$, whether $P \subseteq Q$ holds. In [Buneman et al. 2003] it is shown that valid $P, Q$ from $PL$ satisfy $P \subseteq Q$ if and only if $P$ is a refinement of $Q$, and that the containment problem of $PL$ can be decided in quadratic time.

In accordance with [Buneman et al. 2002] we will work with the quotient set $PL/{\sim}$ rather than with $PL$ directly. A word from $PL$ is in normal form if it has no consecutive wildcards. Each congruence class contains a unique word in normal form. Each word from $PL$ can be transformed into normal form in linear time, just by removing superfluous wildcards. In particular, each word from $PL_s$ is in normal form. The length $|Q|$ of a $PL$ expression $Q$ is the number of labels in $Q$ plus the number of wildcards $\_^{*}$ in the normal form of $Q$, cf. [Buneman et al. 2003]. The empty path expression $\varepsilon$ has length 0. The natural homomorphism from $PL$ to $PL/{\sim}$ is an isomorphism when restricted to words in normal form. By abuse of notation we will use the words from $PL$ to denote their respective congruence classes, cf. [Buneman et al. 2002]. It is a straightforward exercise to extend the terminology introduced above for $PL$ to $PL/{\sim}$. We will call members of $PL/{\sim}$ (and $PL_s/{\sim}$) $PL$ expressions (or $PL_s$ expressions, respectively) in order to emphasise their use for node selection in XML. Note that there is an easy conversion of $PL$ expressions to XPath expressions [Clark and DeRose 1999], just by replacing "$\_^{*}$" with ".//." and "." with "/".

The choice of a path language for selecting nodes in XML trees is directly influenced by the complexity of its containment problem. Buneman et al. [Buneman et al. 2002; 2003] argue that $PL$ is simple yet expressive enough to be adopted by data designers and maintained by systems for XML applications. Note that Buneman et al. have included the wildcard $\_$ in their definition of XML keys in

[Buneman et al. 2002], but not in their investigations on axiomatisability and the complexity of the implication problems [Buneman et al. 2003]. Since we want to establish reasoning facilities for XML keys, we utilise the same path languages as defined in [Buneman et al. 2003].

To conclude this section we repeat the notion of value intersection from [Buneman et al. 2003]: for nodes $v$ and $v'$ of an XML tree $T$, the value intersection of $v[\![Q]\!]$ and $v'[\![Q]\!]$ is given by $v[\![Q]\!] \cap_v v'[\![Q]\!] = \{(w, w') \mid w \in v[\![Q]\!], w' \in v'[\![Q]\!], w =_v w'\}$. That is, $v[\![Q]\!] \cap_v v'[\![Q]\!]$ consists of all those node pairs in $T$ that are value equal and are reachable from $v$ and $v'$, respectively, by following $Q$-paths.

2.4 Keys for XML

In [Buneman et al. 2003], Buneman et al. introduce the class $\mathcal{K}$ of XML keys. A key $\varphi$ in $\mathcal{K}$ is defined as an expression $(Q, (Q', \{Q_1, \ldots, Q_k\}))$ where $Q, Q', Q_i$ are $PL$ expressions such that $Q.Q'.Q_i$ is a valid $PL$ expression for all $i = 1, \ldots, k$. Herein, $Q$ is called the context path, $Q'$ is called the target path, and $Q_1, \ldots, Q_k$ are called the key paths of $\varphi$. In this paper, we use $\mathcal{K}^{PL}_{PL,PL^*}$ to refer to the class $\mathcal{K}$ and to distinguish $\mathcal{K}$ from other classes of XML keys. The superscript $PL$ of $\mathcal{K}^{PL}_{PL,PL^*}$ indicates that arbitrary $PL$ expressions may be chosen for the context path. The subscripts $PL$ and $PL^*$ indicate that arbitrary $PL$ expressions may be chosen for the target path and all key paths. The symbol $*$ in the second subscript $PL^*$ indicates that the finite set of key path expressions may be empty.

An XML tree $T$ satisfies the key $(Q, (Q', \{Q_1, \ldots, Q_k\}))$ if and only if, for every node $q \in [\![Q]\!]$ and all nodes $q_1, q_2 \in q[\![Q']\!]$, the following holds: if there are nodes $x_i \in q_1[\![Q_i]\!]$ and $y_i \in q_2[\![Q_i]\!]$ with $x_i =_v y_i$ for all $i = 1, \ldots, k$, then $q_1 = q_2$ [Buneman et al. 2003]. More formally,

$$\forall q \in [\![Q]\!] \;\; \forall q_1, q_2 \in q[\![Q']\!] \;\; \Big( \bigwedge_{1 \le i \le k} q_1[\![Q_i]\!] \cap_v q_2[\![Q_i]\!] \neq \emptyset \Big) \Rightarrow q_1 = q_2 .$$

Let $\Sigma \cup \{\varphi\}$ be a finite set of XML keys in class $\mathcal{C}$ (e.g., $\mathcal{C}$ may be $\mathcal{K}^{PL}_{PL,PL^*}$).
We say that $\Sigma$ (finitely) implies $\varphi$, denoted by $\Sigma \models_{(f)} \varphi$, if and only if every (finite) XML tree $T$ that satisfies all $\sigma \in \Sigma$ also satisfies $\varphi$. The (finite) implication problem for $\mathcal{C}$ is to decide, given any finite set $\Sigma \cup \{\varphi\}$ of keys in $\mathcal{C}$, whether $\Sigma \models_{(f)} \varphi$. The finite and the unrestricted implication problems coincide for the class $\mathcal{K}^{PL}_{PL,PL^*}$ [Buneman et al. 2003]. Hence, we will commonly speak of the implication problem for XML keys in $\mathcal{K}^{PL}_{PL,PL^*}$. For a set $\Sigma$ of keys in $\mathcal{C}$, let $\Sigma^*_{(f)} = \{\varphi \in \mathcal{C} \mid \Sigma \models_{(f)} \varphi\}$ be its (finite) semantic closure, i.e., the set of all keys (finitely) implied by $\Sigma$.

The notion of syntactical inference ($\vdash_{\mathfrak{R}}$) with respect to a set $\mathfrak{R}$ of inference rules can be defined analogously to the notion in the relational data model [Abiteboul et al. 1995, pp ]. That is, a finite sequence $\gamma = [\gamma_1, \ldots, \gamma_l]$ of XML keys is called an inference from $\Sigma$ by $\mathfrak{R}$ if every $\gamma_i$ is either an element of $\Sigma$ or is obtained by applying one of the rules of $\mathfrak{R}$ to appropriate elements of $\{\gamma_1, \ldots, \gamma_{i-1}\}$. We say that the inference $\gamma$ infers $\gamma_l$, i.e., the last element of the sequence $\gamma$, and write $\Sigma \vdash_{\mathfrak{R}} \gamma_l$. For a finite set $\Sigma$ of

keys in $\mathcal{C}$, let $\Sigma^+_{\mathfrak{R}} = \{\varphi \mid \Sigma \vdash_{\mathfrak{R}} \varphi\}$ be its syntactic closure under inferences by $\mathfrak{R}$. A set $\mathfrak{R}$ of inference rules is said to be sound (complete) for the (finite) implication of keys in $\mathcal{C}$ if for every finite set $\Sigma$ of XML keys in $\mathcal{C}$ we have $\Sigma^+_{\mathfrak{R}} \subseteq \Sigma^*_{(f)}$ ($\Sigma^*_{(f)} \subseteq \Sigma^+_{\mathfrak{R}}$). The set $\mathfrak{R}$ is said to be an axiomatisation for the (finite) implication of keys in $\mathcal{C}$ if $\mathfrak{R}$ is both sound and complete for the (finite) implication of keys in $\mathcal{C}$. Finally, an axiomatisation $\mathfrak{R}$ is said to be finite if the set $\mathfrak{R}$ is finite.

Buneman et al. [Buneman et al. 2003] present a finite set of inference rules which they state is an axiomatisation for the implication of keys in $\mathcal{K}^{PL}_{PL,PL^*}$, as well as an algorithm that decides the implication problem for $\mathcal{K}^{PL}_{PL,PL^*}$ and that is based on the axiomatisation. However, the inference rules are neither sound for the implication of keys in $\mathcal{K}^{PL}_{PL,PL^*}$, nor complete even for fragments of $\mathcal{K}^{PL}_{PL,PL^*}$ for which all rules are still sound. Before we show the incorrectness and incompleteness we introduce the key fragment for which we will establish a finite axiomatisation and an efficient algorithm for deciding the implication problem. In fact, we restrict our attention to XML keys with a finite, non-empty set of simple key paths, i.e., the fragment $\mathcal{K}^{PL}_{PL,PL_s^+}$. Note that the symbol $+$ in $PL_s^+$ indicates that the finite set of simple key paths must not be empty.

Definition 2.1. A key constraint $\varphi$ for XML (or, for short, XML key) in $\mathcal{K}^{PL}_{PL,PL_s^+}$ is an expression $(Q, (Q', S))$ where $Q, Q'$ are $PL$ expressions and $S$ is a finite, non-empty set of $PL_s$ expressions such that $Q.Q'.P$ is a valid $PL$ expression for all $P$ in $S$. Herein, $Q$ is called the context path, $Q'$ is called the target path, and the elements of $S$ are called the key paths of $\varphi$. If $Q = \varepsilon$, we call $\varphi$ an absolute key; otherwise $\varphi$ is called a relative key.
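The semantics of keys can be made concrete with a brute-force satisfaction check. The sketch below is only an illustration and not the decision procedure of this paper: it tests whether one given tree satisfies one given key, says nothing about implication, and ignores the validity condition on paths. All names (Node, reach, value_equal, satisfies, in_fragment) are hypothetical, the string "_*" stands in for the variable-length wildcard, a key is written as a triple of dot-separated strings with the empty string for the empty path, and the isbn values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional

WILDCARD = "_*"  # plain-text stand-in for the variable-length wildcard


@dataclass(eq=False)  # identity comparison: q1 = q2 means "the same node"
class Node:
    kind: str                     # "E", "A" or "S"
    label: str
    value: Optional[str] = None
    ele: List["Node"] = field(default_factory=list)
    att: List["Node"] = field(default_factory=list)


def E(label, *ele, **att):
    return Node("E", label, None, list(ele), [Node("A", k, v) for k, v in att.items()])


def S(value):
    return Node("S", "S", value)


def parse(expr):
    """Turn 'a._*.b' into ['a', '_*', 'b']; the empty string is the empty path."""
    return [step for step in expr.split(".") if step]


def descendants(v):
    out = [v]  # the empty path reaches v itself
    for c in v.att + v.ele:
        out.extend(descendants(c))
    return out


def reach(v, path):
    """All nodes reachable from v along a path matching the expression."""
    current = [v]
    for step in path:
        nxt, seen = [], set()
        for n in current:
            cands = descendants(n) if step == WILDCARD else [
                c for c in n.att + n.ele if c.label == step]
            for c in cands:
                if id(c) not in seen:  # a wildcard step may reach a node twice
                    seen.add(id(c))
                    nxt.append(c)
        current = nxt
    return current


def value_equal(u, v):
    if (u.kind, u.label) != (v.kind, v.label):
        return False
    if u.kind != "E":
        return u.value == v.value
    leaf = lambda a: (a.label, a.value)  # attributes are leaves: compare as a multiset
    return (sorted(map(leaf, u.att)) == sorted(map(leaf, v.att))
            and len(u.ele) == len(v.ele)
            and all(value_equal(x, y) for x, y in zip(u.ele, v.ele)))


def satisfies(root, key):
    context, target = parse(key[0]), parse(key[1])
    key_paths = [parse(p) for p in key[2]]
    for q in reach(root, context):
        targets = reach(q, target)
        for i, q1 in enumerate(targets):
            for q2 in targets[i + 1:]:
                if all(any(value_equal(x, y) for x in reach(q1, p) for y in reach(q2, p))
                       for p in key_paths):
                    return False  # two distinct targets agree on every key path
    return True


def in_fragment(key):
    """Non-empty set of simple (wildcard-free) key paths; validity is not checked."""
    paths = [parse(p) for p in key[2]]
    return bool(paths) and all(WILDCARD not in p for p in paths)


def vianu():
    return E("author", E("first", S("Victor")), E("last", S("Vianu")))


db = E("db",
       E("book", E("title", S("Foundations of Databases")), vianu(), isbn="isbn-1"),
       E("book", E("title", S("ICDT 2001")), vianu(),
         E("author", E("first", S("Jan")), E("last", S("van den Bussche"))),
         isbn="isbn-2"))

print(satisfies(db, ("", "_*.book", ["isbn"])))                      # True
print(satisfies(db, ("", "_*.author", ["first.S", "last.S"])))       # False
print(satisfies(db, ("_*.book", "author", ["first.S", "last.S"])))   # True
```

The check enumerates all pairs of target nodes and is therefore quadratic in the number of targets per context node; it is meant to mirror the definition, not to be efficient.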
For an XML key $\varphi$, we use $Q_\varphi$ to denote its context path, $Q'_\varphi$ to denote its target path, and $P^\varphi_1, \ldots, P^\varphi_{k_\varphi}$ to denote its key paths, where $k_\varphi$ is the number of its key paths. The size $|\varphi|$ of a key $\varphi$ is defined as the sum of the lengths of all path expressions in $\varphi$, i.e., $|\varphi| = |Q_\varphi| + |Q'_\varphi| + \sum_{i=1}^{k_\varphi} |P^\varphi_i|$.

Example 2.2. We formalise the examples from the introduction. In an XML tree that satisfies the absolute key $(\varepsilon, (\_^{*}.book, \{isbn\}))$ one will never find two different book nodes that have value equal isbn descendant nodes. Furthermore, under a book node in an XML tree that satisfies the relative key $(\_^{*}.book, (author, \{first.S, last.S\}))$ one will never find two different author descendant nodes that are value equal on their first.S and last.S descendant nodes. Finally, the relative key $(\_^{*}.book, (author, \{\_^{*}.S\}))$ is in $\mathcal{K}^{PL}_{PL,PL^*}$ but not in $\mathcal{K}^{PL}_{PL,PL_s^+}$.

We will first demonstrate that the finite axiomatisation proposed for the class $\mathcal{K}^{PL}_{PL,PL^*}$ [Buneman et al. 2003] is not sound. Indeed, the axiomatisation contains the so-called subnodes rule

$$\frac{(Q, (Q'.Q'', \{P\}))}{(Q, (Q', \{Q''.P\}))}$$

where $Q, Q', Q''$ and $P$ are all $PL$ expressions.

Lemma 2.3. The subnodes rule is not sound for the implication of XML keys in $\mathcal{K}^{PL}_{PL,PL^+}$ and, therefore, also not sound for the implication of XML keys in $\mathcal{K}^{PL}_{PL,PL^*}$.

[Fig. 3. In general, the subnodes rule is not sound. The tree consists of a root and nodes $v_1, \ldots, v_7$ labelled a, b, c, b, c, d, e, respectively.]

Proof. A simple counter-example is the XML tree $T$ illustrated in Figure 3. $T$ satisfies the absolute key $\sigma = (\varepsilon, (a.\_^{*}.b.c.\_^{*}.d, \{e\}))$, but violates the absolute key $\varphi = (\varepsilon, (a.\_^{*}.b, \{c.\_^{*}.d.e\}))$ since $v_2, v_4 \in [\![a.\_^{*}.b]\!]$, $v_2 \neq v_4$ and $v_2[\![c.\_^{*}.d.e]\!] \cap_v v_4[\![c.\_^{*}.d.e]\!] = \{(v_7, v_7)\}$, i.e., $\varphi$ is not implied by $\sigma$. However, $\varphi$ can be inferred from $\sigma$ using the subnodes rule.

Lemma 2.3 shows that the inference rules proposed in [Buneman et al. 2003] are not sound for the implication of keys as defined in [Buneman et al. 2003]. The conference paper [Buneman et al. 2001] contains a much more restrictive definition of value intersection which leaves the subnodes rule sound in the presence of arbitrary $PL$ expressions for the key paths, but this does not affect the incorrectness of the results for the more general notion of value equality [Buneman et al. 2003]. It is stated in [Buneman et al. 2002] that allowing arbitrary path expressions for the $P_i$ [key paths] merely complicates the definition of key but does not change much in the way of the theory. This is not true since the subnodes rule is sound for the implication of XML keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$ but it is not sound for the implication of XML keys in $\mathcal{K}^{PL}_{PL,PL^*}$ by virtue of Lemma 2.3.

Lemma 2.4. The subnodes rule is sound for the implication of XML keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$.

Proof. Suppose some XML tree $T$ violates $(Q, (Q', \{P.P'\}))$, i.e., there is some $q \in [\![Q]\!]$ and there are some $q_1, q_2 \in q[\![Q']\!]$ such that $q_1 \neq q_2$ and such that there exist $p_1 \in q_1[\![P.P']\!]$ and $p_2 \in q_2[\![P.P']\!]$ where $p_1 =_v p_2$ holds. By definition of concatenation, there exist $p'_1 \in q_1[\![P]\!]$ and $p'_2 \in q_2[\![P]\!]$ such that $p_1 \in p'_1[\![P']\!]$ and $p_2 \in p'_2[\![P']\!]$ hold. Since $T$ is a tree and $P$ is a $PL_s$ expression we conclude that $p'_1 \neq p'_2$ (since otherwise $q_1 = q_2$). This, however, means that $T$ also violates $(Q, (Q'.P, \{P'\}))$.
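The counter-example in the proof of Lemma 2.3 can be checked mechanically. The sketch below assumes that the tree of Figure 3 is a single chain from the root through the nodes labelled a, b, c, b, c, d, e, which is the shape the proof relies on; this shape is an assumption read off the proof, and all function names are hypothetical. The string "_*" stands in for the variable-length wildcard, and keys are written as triples of dot-separated strings.

```python
WILDCARD = "_*"  # plain-text stand-in for the variable-length wildcard


class Node:
    """Element-only node; instances are compared by identity."""

    def __init__(self, label, *children):
        self.label = label
        self.children = list(children)


def parse(expr):
    return [step for step in expr.split(".") if step]


def descendants(v):
    out = [v]  # the empty path reaches v itself
    for c in v.children:
        out.extend(descendants(c))
    return out


def reach(v, path):
    """All nodes reachable from v along a path matching the expression."""
    current = [v]
    for step in path:
        nxt, seen = [], set()
        for n in current:
            cands = descendants(n) if step == WILDCARD else [
                c for c in n.children if c.label == step]
            for c in cands:
                if id(c) not in seen:
                    seen.add(id(c))
                    nxt.append(c)
        current = nxt
    return current


def value_equal(u, v):
    return (u.label == v.label and len(u.children) == len(v.children)
            and all(value_equal(x, y) for x, y in zip(u.children, v.children)))


def satisfies(root, key):
    context, target = parse(key[0]), parse(key[1])
    key_paths = [parse(p) for p in key[2]]
    for q in reach(root, context):
        targets = reach(q, target)
        for i, q1 in enumerate(targets):
            for q2 in targets[i + 1:]:
                if all(any(value_equal(x, y) for x in reach(q1, p) for y in reach(q2, p))
                       for p in key_paths):
                    return False
    return True


# Assumed shape of Figure 3: root - a - b - c - b - c - d - e.
v7 = Node("e")
v6 = Node("d", v7)
v5 = Node("c", v6)
v4 = Node("b", v5)
v3 = Node("c", v4)
v2 = Node("b", v3)
v1 = Node("a", v2)
root = Node("root", v1)

sigma = ("", "a._*.b.c._*.d", ["e"])   # premise of the subnodes rule
phi = ("", "a._*.b", ["c._*.d.e"])     # its conclusion

print(satisfies(root, sigma))  # True: v6 is the only target node
print(satisfies(root, phi))    # False: v2 and v4 both reach v7
```

The wildcard inside the key path is what lets the two distinct target nodes v2 and v4 share the descendant v7; with a simple key path the shared descendant would force the two targets to coincide, which is the argument of Lemma 2.4.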
Lemma 2.4 suggests that the set of inference rules stated sound and complete for $\mathcal{K}^{PL}_{PL,PL}$ [Buneman et al. 2003] is at least a finite axiomatisation for the implication of keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$. Let $\mathfrak{S}$ denote the set of inference rules in Table I without the subnodes-epsilon rule. This is the axiomatisation proposed in [Buneman et al. 2003] when only $PL_s$ expressions are allowed for the key paths. Unfortunately, $\mathfrak{S}$ turns out to be incomplete even for the implication of keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$ since the subnodes-epsilon rule is sound for the implication of XML keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$ and independent from $\mathfrak{S}$.

Table I. An axiomatisation of XML keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$.

(epsilon) $\dfrac{}{(Q, (\varepsilon, S))}$

(prefix-epsilon) $\dfrac{(Q, (Q', S \cup \{\varepsilon, P\}))}{(Q, (Q', S \cup \{\varepsilon, P.P'\}))}$

(superkey) $\dfrac{(Q, (Q', S))}{(Q, (Q', S \cup \{P\}))}$

(subnodes) $\dfrac{(Q, (Q'.P, \{P'\}))}{(Q, (Q', \{P.P'\}))}$

(context-path-containment) $\dfrac{(Q, (Q', S)) \quad Q'' \subseteq Q}{(Q'', (Q', S))}$

(target-path-containment) $\dfrac{(Q, (Q', S)) \quad Q'' \subseteq Q'}{(Q, (Q'', S))}$

(context-target) $\dfrac{(Q, (Q'.Q'', S))}{(Q.Q', (Q'', S))}$

(subnodes-epsilon) $\dfrac{(Q, (Q'.P, \{\varepsilon, P'\}))}{(Q, (Q', \{\varepsilon, P.P'\}))}$

(interaction) $\dfrac{(Q, (Q', \{P.P_1, \ldots, P.P_k\})) \quad (Q.Q', (P, \{P_1, \ldots, P_k\}))}{(Q, (Q'.P, \{P_1, \ldots, P_k\}))}$

Lemma 2.5. The subnodes-epsilon rule is sound for the implication of XML keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$.

Proof. Suppose an XML tree $T$ violates $(Q, (Q', \{\varepsilon, P.P'\}))$. Then there is some node $q \in [\![Q]\!]$ and some nodes $q'_1, q'_2 \in q[\![Q']\!]$ such that $q'_1 \neq q'_2$, $q'_1 =_v q'_2$, and there exist $p'_1 \in q'_1[\![P.P']\!]$ and $p'_2 \in q'_2[\![P.P']\!]$ such that $p'_1 =_v p'_2$. By definition, there exists some $p_1 \in q'_1[\![P]\!]$ such that $p'_1 \in p_1[\![P']\!]$. Since $q'_1 =_v q'_2$ it is easy to see that there exists some node $p_2 \in q'_2[\![P]\!]$ such that $p_1 \neq p_2$, $p_1 =_v p_2$, and $p_2[\![P']\!]$ contains a node that is value equal to $p'_1$. But then $p_1 \in q[\![Q'.P]\!]$ and $p_2 \in q[\![Q'.P]\!]$. Hence, $T$ also violates $(Q, (Q'.P, \{\varepsilon, P'\}))$.

Lemma 2.6. The subnodes-epsilon rule is independent from $\mathfrak{S}$ for the implication of XML keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$.

Proof. We need to show that there is a finite set $\Sigma \cup \{\varphi\}$ of XML keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$ such that $\varphi$ cannot be inferred from $\Sigma$ by $\mathfrak{S}$, but $\varphi$ can be inferred from $\Sigma$ using $\mathfrak{S}$ and the subnodes-epsilon rule. A simple example is given by $\Sigma = \{(\varepsilon, (A.B, \{\varepsilon, C\}))\}$ and $\varphi = (\varepsilon, (A, \{\varepsilon, B.C\}))$. On the one hand, $\varphi$ can be inferred from $\Sigma$ by a single application of the subnodes-epsilon rule. On the other hand, however, we will show by induction on the length $l$ of a finite inference $\gamma = [\gamma_1, \ldots, \gamma_l]$ from $\Sigma$ by $\mathfrak{S}$ that $\gamma$ does not infer $\varphi$. Notice that for every finite set $\Sigma$ of keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$ it is impossible to infer an XML key $\varphi$ from $\Sigma$ by $\mathfrak{S}$ whenever $\varphi$ is not implied by $\Sigma$. This follows immediately from the soundness of $\mathfrak{S}$ for the implication of keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$.

If $l = 1$, then $\gamma_1 \in \Sigma$ or $\gamma_1$ was inferred by an application of the epsilon rule. In both cases $\gamma_1 \neq \varphi$. Let $l > 1$ and $[\gamma_1, \ldots, \gamma_l]$ be an inference from $\Sigma$ by $\mathfrak{S}$ that has length $l$.
We distinguish between 8 different cases according to how $\gamma_l$ was obtained from $[\gamma_1, \ldots, \gamma_{l-1}]$.

Case 1. $\gamma_l$ is an element of $\Sigma$ or $\gamma_l$ was inferred by an application of the epsilon rule. In both cases $\gamma_l \neq \varphi$.

Case 2. We obtain $\gamma_l$ by an application of the prefix-epsilon rule to the premise $\gamma_i$ with $i < l$. Assume that $\gamma_l = \varphi$, i.e., assume that $\varphi$ can be inferred from $\Sigma$ by $\mathfrak{S}$. We conclude that $\gamma_i$ equals $(\varepsilon, (A, \{\varepsilon, B\}))$, $(\varepsilon, (A, \{\varepsilon\}))$ or $\varphi$. For the first two cases, however, $\gamma_i$ is not implied by $\Sigma$, as tree $T_1$ in Figure 4 shows. We conclude that $\gamma_i$ cannot be inferred from $\Sigma$ by $\mathfrak{S}$. This, however, is a contradiction to the inference $[\gamma_1, \ldots, \gamma_i]$. Consequently, $\gamma_l \neq \varphi$. The last case is a contradiction to the induction hypothesis for $i < l$.

Case 3. We obtain $\gamma_l$ by an application of the superkey rule to the premise $\gamma_i$ with $i < l$. Assume that $\gamma_l = \varphi$, i.e., assume that $\varphi$ can be inferred from $\Sigma$ by $\mathfrak{S}$. We conclude that $\gamma_i = (\varepsilon, (A, \{B.C\}))$ or $\gamma_i = (\varepsilon, (A, \{\varepsilon\}))$. In both cases, however, $\gamma_i$ is not implied by $\Sigma$, as trees $T_2$ and $T_1$ in Figure 4 show, respectively. We conclude, by the soundness of the inference rules in $\mathfrak{S}$, that $\gamma_i$ cannot be inferred from $\Sigma$ by $\mathfrak{S}$. This, however, is a contradiction to the inference $[\gamma_1, \ldots, \gamma_i]$. Consequently, $\gamma_l \neq \varphi$.

Case 4. It is immediate that $\gamma_l = \varphi$ cannot be inferred by an application of the subnodes rule to any of the premises $\gamma_i$ with $i < l$. This is because $\varphi$ has two distinct key paths, but the subnodes rule can only generate conclusions where the key path set is a singleton.

Case 5. We obtain $\gamma_l$ by an application of the context-path-containment rule to the premise $\gamma_i$ with $i < l$. Assume that $\gamma_l = \varphi$, i.e., assume that $\varphi$ can be inferred from $\Sigma$ by $\mathfrak{S}$. We conclude that $\gamma_i$ equals $(\_^{*}, (A, \{\varepsilon, B.C\}))$ or $\varphi$. For the first case, however, $\gamma_i$ is not implied by $\Sigma$, as tree $T_3$ in Figure 4 shows. We conclude that $\gamma_i$ cannot be inferred from $\Sigma$ by $\mathfrak{S}$. This, however, is a contradiction to the inference $[\gamma_1, \ldots, \gamma_i]$. Consequently, $\gamma_l \neq \varphi$. The second case is a contradiction to the induction hypothesis for $i < l$.

Case 6. We obtain $\gamma_l$ by an application of the target-path-containment rule to the premise $\gamma_i$ with $i < l$. Assume that $\gamma_l = \varphi$, i.e., assume that $\varphi$ can be inferred from $\Sigma$ by $\mathfrak{S}$. We conclude that $\gamma_i$ equals $(\varepsilon, (\_^{*}.A, \{\varepsilon, B.C\}))$, $(\varepsilon, (A.\_^{*}, \{\varepsilon, B.C\}))$, $(\varepsilon, (\_^{*}.A.\_^{*}, \{\varepsilon, B.C\}))$, $(\varepsilon, (\_^{*}, \{\varepsilon, B.C\}))$, or $\varphi$. In none of the first four cases is $\gamma_i$ implied by $\Sigma$, as tree $T_4$ in Figure 4 shows for $(\varepsilon, (A.\_^{*}, \{\varepsilon, B.C\}))$, and tree $T_3$ in Figure 4 shows for the remaining cases. We conclude, by the soundness of the inference rules in $\mathfrak{S}$, that $\gamma_i$ cannot be inferred from $\Sigma$ by $\mathfrak{S}$.
This, however, is a contradiction to the inference $[\gamma_1, \ldots, \gamma_i]$. Consequently, $\gamma_l \neq \varphi$. The remaining case is a contradiction to the induction hypothesis for $i < l$.

Case 7. We obtain $\gamma_l$ by an application of the context-target rule to the premise $\gamma_i$ with $i < l$. Assume that $\gamma_l = \varphi$, i.e., assume that $\varphi$ can be inferred from $\Sigma$ by $\mathfrak{S}$. We conclude that $\gamma_i = \varphi$. This, however, is a contradiction to the induction hypothesis for $i < l$.

Case 8. We obtain $\gamma_l$ by an application of the interaction rule to the premises $\gamma_i$ and $\gamma_j$ with $i, j < l$. Assume that $\gamma_l = \varphi$, i.e., assume that $\varphi$ can be inferred from $\Sigma$ by $\mathfrak{S}$. We conclude that $\gamma_i = \varphi$ or $\gamma_j = \varphi$. This, however, is a contradiction to the induction hypothesis for $i < l$ or $j < l$, respectively.

Corollary 2.7. The set $\mathfrak{S}$ of inference rules is incomplete for the implication of XML keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$.

Proof. Lemma 2.5 shows that $\varphi = (\varepsilon, (A, \{\varepsilon, B.C\}))$ is implied by $\Sigma = \{(\varepsilon, (A.B, \{\varepsilon, C\}))\}$. However, $\varphi$ cannot be inferred from $\Sigma$ by $\mathfrak{S}$ according to Lemma 2.6. Consequently, $\mathfrak{S}$ is incomplete for the implication of XML keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$.

In Section 3 we will prove that the inference rules from [Buneman et al. 2003] together with the subnodes-epsilon rule are sound and complete for the implication of XML keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$, i.e., we will prove the following result.

Fig. 4. Counterexamples for the implication of keys in the proof of Lemma 2.6. (Figure: four XML trees $T_1$, $T_2$, $T_3$, $T_4$.)

Theorem 2.8. The inference rules in Table I form a finite axiomatisation for the implication of XML keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$.

The reader may wonder about the class $\mathcal{K}^{PL}_{PL,PL_s}$ which also contains XML keys where the finite set of simple key paths may be empty. For this fragment the inference rules in Table I are incomplete, and we will briefly comment on this class in a later section.

3. AN AXIOMATISATION FOR THE XML KEY FRAGMENT $\mathcal{K}^{PL}_{PL,PL_s^+}$

The aim of this section is to prove Theorem 2.8, i.e., we will show that the finite set $\mathfrak{R}$ of inference rules from Table I is sound and complete for the implication of XML keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$. Based on Lemma 2.4 and previous soundness proofs [Buneman et al. 2003] it remains to prove the completeness of $\mathfrak{R}$.

3.1 Outline of the Proof Strategy

For an arbitrary finite set $\Sigma \cup \{\varphi\}$ of keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$ we need to show the following: if $\varphi \notin \Sigma^+_{\mathfrak{R}}$, then there is some XML tree $T$ which is a counter-example for the implication of $\varphi$ by $\Sigma$. Next we outline our general proof strategy.

In a first step, we represent $\varphi$ in terms of a finite node-labelled tree $T_{\Sigma,\varphi}$, which we will call the mini-tree. Certain nodes of $T_{\Sigma,\varphi}$ will be duplicated when generating the counter-example $T$ in order to violate $\varphi$. The tree $T_{\Sigma,\varphi}$ also has two distinguished unique nodes, $q_\varphi$ and $q'_\varphi$, which are reachable from the root of $T_{\Sigma,\varphi}$ by following a $Q_\varphi$-path and a $Q_\varphi.Q'_\varphi$-path, respectively. In order to generate the counter-example $T$ successfully all the keys $\sigma \in \Sigma$ must be satisfied by $T$. For this purpose we introduce additional upward-directed edges into $T_{\Sigma,\varphi}$. This results in a digraph $G_{\Sigma,\varphi}$, which we will call the witness graph.
In a digraph $G$, a node $v$ is reachable from a node $u$ when there exists a path from $u$ to $v$ in $G$, that is, a sequence $u = v_0, \ldots, v_m = v$ of mutually distinct nodes with an edge $(v_{i-1}, v_i)$ for each $i = 1, \ldots, m$. We will show that if $\varphi \notin \Sigma^+_{\mathfrak{R}}$, then $q_\varphi$ is not reachable from $q'_\varphi$ in $G_{\Sigma,\varphi}$. In this case, we can always create the counter-example tree $T$ by duplicating certain nodes in $T_{\Sigma,\varphi}$.

In the first subsection we will formally define the notions of a mini-tree and witness graph, respectively. Subsequently, we will state the main lemma that reachability of $q_\varphi$ from $q'_\varphi$ in the witness graph $G_{\Sigma,\varphi}$ implies that $\varphi$ can be inferred from $\Sigma$ by $\mathfrak{R}$. Next, we will apply this lemma to prove the completeness of $\mathfrak{R}$ for the implication of XML keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$. The final subsection is devoted to proving the main lemma.

Our technique is not just limited to proving completeness. In Section 4 we will show that $\Sigma$ implies $\varphi$ if and only if $q_\varphi$ is reachable from $q'_\varphi$ in $G_{\Sigma,\varphi}$. Hence, we can characterise XML key implication in $\mathcal{K}^{PL}_{PL,PL_s^+}$ by the reachability problem for fixed nodes in a suitable digraph. Moreover, this characterisation results in a surprisingly efficient and elegant decision procedure for the implication problem of XML keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$. In fact, this technique allows us to establish an algorithm that decides XML key implication in $\mathcal{K}^{PL}_{PL,PL_s^+}$ in time quadratic in the size of the input keys. The previous, flawed, algorithm for deciding XML key implication in $\mathcal{K}^{PL}_{PL,PL_s^+}$ runs in heptic time [Buneman et al. 2003].

Our completeness argument is different from the strategy applied in [Buneman et al. 2003] to the class $\mathcal{K}^{PL}_{PL,PL}$: for every $\varphi \notin \Sigma^+$ they construct a finite XML tree $T$ that violates $\varphi$. Subsequently, they chase those keys in $\Sigma$ which are violated by $T$ and maintain at the same time the violation of $\varphi$. Eventually, this results in a chased version of $T$ which finitely satisfies all keys in $\Sigma$ and violates $\varphi$. This would show that $\varphi$ is not implied by $\Sigma$. However, the completeness proof is flawed for the class $\mathcal{K}^{PL}_{PL,PL}$ since it requires the soundness of the subnodes rule, but the subnodes rule is not sound for $\mathcal{K}^{PL}_{PL,PL}$, cf. Lemma 2.3. Moreover, the completeness argument in [Buneman et al. 2003] is also incomplete for the class $\mathcal{K}^{PL}_{PL,PL_s^+}$ since it does not apply the subnodes-epsilon rule which is sound for the implication of keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$ and independent from the inference rules of [Buneman et al. 2003], cf.
Lemmata 2.5 and 2.6.

3.2 Tools to be Used: Mini-Trees and Witness Graphs

Let $\Sigma \cup \{\varphi\}$ be a finite set of keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$. Let $\mathbf{L}_{\Sigma,\varphi}$ denote the set of all labels $\ell \in \mathbf{L}$ that occur in path expressions of keys in $\Sigma \cup \{\varphi\}$, and fix a label $\ell_0 \in \mathbf{L} - \mathbf{L}_{\Sigma,\varphi}$. Further, let $O_\varphi$ and $O'_\varphi$ be the $PL_s$ expressions obtained from the PL expressions $Q_\varphi$ and $Q'_\varphi$, respectively, by replacing each $\_^{*}$ by $\ell_0$. Let $p$ be an $O_\varphi$-path from a node $r_\varphi$ to a node $q_\varphi$, let $p'$ be an $O'_\varphi$-path from a node $r'_\varphi$ to a node $q'_\varphi$ and, for each $i = 1, \ldots, k_\varphi$, let $p_i$ be a $P^\varphi_i$-path from a node $r^\varphi_i$ to a node $x^\varphi_i$, such that the paths $p, p', p_1, \ldots, p_{k_\varphi}$ are mutually node-disjoint. From the paths $p, p', p_1, \ldots, p_{k_\varphi}$ we obtain the mini-tree $T_{\Sigma,\varphi}$ by identifying the node $r'_\varphi$ with $q_\varphi$, and by identifying each of the nodes $r^\varphi_i$ with $q'_\varphi$. Note that $q_\varphi$ is the unique node in $T_{\Sigma,\varphi}$ that satisfies $q_\varphi \in [\![O_\varphi]\!]$, and $q'_\varphi$ is the unique node in $T_{\Sigma,\varphi}$ that satisfies $q'_\varphi \in q_\varphi[\![O'_\varphi]\!]$.

In the sequel, we will discuss how to construct an XML tree from $T_{\Sigma,\varphi}$ that could serve as a counter-example for the implication of $\varphi$ by $\Sigma$. A major step in this construction is the duplication of certain nodes of $T_{\Sigma,\varphi}$. To begin with, we determine those nodes of $T_{\Sigma,\varphi}$ for which we will generate two value-equal copies in a possible counter-example tree. The marking of the mini-tree $T_{\Sigma,\varphi}$ is a subset $M$ of the node set of $T_{\Sigma,\varphi}$: if for all $i = 1, \ldots, k_\varphi$ we have $P^\varphi_i \neq \varepsilon$, then $M$ consists of the leaves of $T_{\Sigma,\varphi}$, and otherwise $M$ consists of all descendant nodes of $q'_\varphi$ in $T_{\Sigma,\varphi}$. The nodes in $M$ are said to be marked.
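The construction of the mini-tree and its marking can be sketched in a few lines. The code below is an illustrative sketch under stated assumptions, not the authors' implementation: keys are encoded as triples of label tuples, `_*` stands for the wildcard, the fresh label and the root label are passed in by the caller, and "descendant nodes of $q'_\varphi$" is read reflexively (so that $q'_\varphi$ itself is marked when some key path is $\varepsilon$). The names `Node`, `extend`, `mini_tree`, `descendants_or_self` and `marking` are hypothetical.

```python
WILDCARD = "_*"  # assumed textual form of the PL wildcard


class Node:
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def extend(start, path, fresh_label):
    """Hang a fresh path below `start`, one new node per step of `path`.

    Every wildcard step is replaced by `fresh_label` (the label l_0).
    Returns the last node of the new path (or `start` for the empty path).
    """
    node = start
    for step in path:
        node = Node(fresh_label if step == WILDCARD else step, node)
    return node


def mini_tree(key, fresh_label, root_label="root"):
    """Build the mini-tree of key = (context, target, key_paths)."""
    context, target, key_paths = key
    root = Node(root_label)
    q = extend(root, context, fresh_label)         # q_phi
    q_prime = extend(q, target, fresh_label)       # q'_phi
    # One node-disjoint path per key path, all starting at q'_phi.
    ends = [extend(q_prime, p, fresh_label) for p in key_paths]
    return root, q, q_prime, ends


def descendants_or_self(node):
    """Pre-order traversal of the subtree rooted at `node`."""
    yield node
    for child in node.children:
        yield from descendants_or_self(child)


def marking(root, q_prime, key_paths):
    """Leaves if no key path is empty, else the subtree rooted at q'_phi."""
    if all(len(p) > 0 for p in key_paths):
        return [n for n in descendants_or_self(root) if not n.children]
    return list(descendants_or_self(q_prime))  # assumption: q'_phi included


# The relative key (_*.book, (author, {first.S, last.S})) from Example 2.2,
# with "library" as the fresh label and "db" as the (assumed) root label.
phi = ((WILDCARD, "book"), ("author",), [("first", "S"), ("last", "S")])
root, q, q_prime, ends = mini_tree(phi, fresh_label="library", root_label="db")
M = marking(root, q_prime, phi[2])

print([n.label for n in descendants_or_self(root)])
print("marked:", [n.label for n in M])
```

For this key the sketch yields the chain db, library, book, author with the two branches first.S and last.S below author, and the marking consists of the two S leaves. Keeping one separate branch per key path mirrors the requirement that the paths be mutually node-disjoint.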

Fig. 5. The mini-tree from Example 3.1 and the two witness graphs from Examples 3.4 and 3.5. (Figure: three graphs over nodes labelled db, library, book, author, first, last and S, with the nodes $q_\varphi$ and $q'_\varphi$ indicated and the marked leaves emphasised by ×.)

Example 3.1. The left of Figure 5 shows the mini-tree $T_{\Sigma,\varphi}$ for the key $\varphi = (\_^{*}.\mathit{book}, (\mathit{author}, \{\mathit{first}.S, \mathit{last}.S\}))$ and some $\Sigma$, where library is the fixed label chosen from $\mathbf{L} - \mathbf{L}_{\Sigma,\varphi}$. The marking of the mini-tree consists of its leaves (emphasised by ×).

We use mini-trees to calculate the impact of a key in $\Sigma$ on a possible counter-example tree for the implication of $\varphi$ by $\Sigma$. To distinguish keys that have an impact from those that do not, we introduce the notion of applicability.

Definition 3.2. Let $T_{\Sigma,\varphi}$ be the mini-tree of the key $\varphi$ with respect to $\Sigma$, and let $M$ be its marking. A key $\sigma$ is said to be applicable to $\varphi$ if and only if there are nodes $w_\sigma \in [\![Q_\sigma]\!]$ and $w'_\sigma \in w_\sigma[\![Q'_\sigma]\!]$ in $T_{\Sigma,\varphi}$ such that $w'_\sigma[\![P^\sigma_i]\!] \cap M \neq \emptyset$ for all $i = 1, \ldots, k_\sigma$. We say that $w_\sigma$ and $w'_\sigma$ witness the applicability of $\sigma$ to $\varphi$.

Example 3.3. Let $\Sigma$ consist of the two keys $\sigma_1 = (\varepsilon, (\_^{*}.\mathit{book}, \{\mathit{isbn}\}))$ and $\sigma_2 = (\_^{*}.\mathit{book}, (\mathit{author}, \{\mathit{first}.S, \mathit{last}.S\}))$, and let $\varphi = (\varepsilon, (\_^{*}.\mathit{book}.\mathit{author}, \{\mathit{first}.S, \mathit{last}.S\}))$. We find that $\sigma_1$ is not applicable to $\varphi$, while $\sigma_2$ is indeed applicable to $\varphi$.

We define the witness graph $G_{\Sigma,\varphi}$ as the node-labelled digraph obtained from $T_{\Sigma,\varphi}$ by inserting additional edges: for each key $\sigma \in \Sigma$ that is applicable to $\varphi$ and for each pair of nodes $w_\sigma \in [\![Q_\sigma]\!]$ and $w'_\sigma \in w_\sigma[\![Q'_\sigma]\!]$ that witness the applicability of $\sigma$ to $\varphi$, $G_{\Sigma,\varphi}$ contains the directed edge $(w'_\sigma, w_\sigma)$ from $w'_\sigma$ to $w_\sigma$. Subsequently, we refer to these additional edges as witness edges, while the original edges from $T_{\Sigma,\varphi}$ are referred to as downward edges of $G_{\Sigma,\varphi}$. This is motivated by the fact that for every witness $w_\sigma$ and $w'_\sigma$, the node $w'_\sigma$ is a descendant node of $w_\sigma$ in $T_{\Sigma,\varphi}$, and thus the witness edge $(w'_\sigma, w_\sigma)$ is an upward edge or loop in $G_{\Sigma,\varphi}$.

Example 3.4.
Let $\Sigma = \{\sigma_1, \sigma_2\}$ be as in Example 3.3, and let $\varphi$ be the key $(\varepsilon, (\_^{*}.\mathit{book}.\mathit{author}, \{\mathit{first}.S, \mathit{last}.S\}))$. The witness graph $G_{\Sigma,\varphi}$ is illustrated in the middle of Figure 5. It contains a witness edge arising from $\sigma_2$.

Example 3.5. Let $\Sigma$ consist of the single key $\sigma = (\_^{*}.\mathit{book}, (\mathit{author}, \{\varepsilon\}))$, and let $\varphi = (\_^{*}.\mathit{book}, (\mathit{author}, \{\mathit{first}.S, \mathit{last}.S\}))$. The witness graph $G_{\Sigma,\varphi}$ is illustrated in the right of Figure 5. It does not contain any witness edges since $\sigma$ is not applicable to $\varphi$ due to $q'_\varphi[\![\varepsilon]\!] \cap M = \emptyset$.

3.3 Completeness of the Inference Rules

In this subsection we will first state the main lemma that reachability of $q_\varphi$ from $q'_\varphi$ in the witness graph $G_{\Sigma,\varphi}$ implies that $\varphi$ can be inferred from $\Sigma$ by $\mathfrak{R}$. Subsequently, we will apply this result to prove the completeness of $\mathfrak{R}$ for the implication of XML keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$. The main lemma itself will be proven in the next subsection.

Lemma 3.6. Let $\Sigma \cup \{\varphi\}$ be a finite set of keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$. If $q_\varphi$ is reachable from $q'_\varphi$ in the witness graph $G_{\Sigma,\varphi}$, then $(Q_\varphi, (Q'_\varphi, \{P^\varphi_1, \ldots, P^\varphi_{k_\varphi}\})) \in \Sigma^+_{\mathfrak{R}}$.

Fig. 6. Mini tree, witness graph and counter-example tree of case 2 in the completeness proof.

Lemma 3.6 enables us to prove the completeness of $\mathfrak{R}$ for the implication of XML keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$.

Proof of Theorem 2.8. It remains to prove the completeness of $\mathfrak{R}$. Let $\Sigma \cup \{\varphi\}$ be an arbitrary finite set of keys in $\mathcal{K}^{PL}_{PL,PL_s^+}$ such that $\varphi \notin \Sigma^+_{\mathfrak{R}}$. We will show that $\varphi \notin \Sigma^{*}$, i.e., that $\varphi$ is not implied by $\Sigma$, by constructing a finite XML tree $T$ which satisfies all keys in $\Sigma$ but does not satisfy $\varphi$. Let $T_{\Sigma,\varphi}$ and $G_{\Sigma,\varphi}$ denote the mini-tree and witness graph of $\varphi$ with respect to $\Sigma$, respectively. Let $\varphi = (Q, (Q', \{P_1, \ldots, P_k\}))$, and let $O, O'$ be the $PL_s$ expressions that result from $Q, Q'$, respectively, by replacing each $\_^{*}$ by a fixed label $\ell_0 \in \mathbf{L} - \mathbf{L}_{\Sigma,\varphi}$. Since $\varphi \notin \Sigma^+_{\mathfrak{R}}$ we know by Lemma 3.6 that there is no path from $q'_\varphi$ to $q_\varphi$ in $G_{\Sigma,\varphi}$. Let $u$ denote the bottom-most descendant node of $q_\varphi$ in $T_{\Sigma,\varphi}$ such that $q_\varphi$ is still reachable from $u$ in $G_{\Sigma,\varphi}$. Consequently, $u$ is a proper ancestor of $q'_\varphi$ because otherwise $u$ and thus $q_\varphi$ were reachable from $q'_\varphi$ in $G_{\Sigma,\varphi}$ according to the downward edges.
Let $T_0$ denote a copy of the path from $r$ to $u$, and let $T_1, T_2$ denote two node-disjoint copies of the subtree of $T_{\Sigma,\varphi}$ rooted at $u$. We want a node of $T_1$ and a node of $T_2$ to become value equal precisely when they are copies of the same marked node in $T_{\Sigma,\varphi}$. For attribute and text nodes this is achieved by choosing string values
