FluXQuery: An Optimizing XQuery Processor for Streaming XML Data

Report by Cătălin Hriţcu (catalin.hritcu@gmail.com)
International Max Planck Research School for Computer Science

1 Introduction

While languages like XPath can be processed with techniques that use very little memory, XQuery processing engines often require huge amounts of memory, which leads to serious scalability problems. Processing XQuery on XML streams is thus a significant research challenge, and the paper addresses it by using order constraints extracted from Document Type Definitions (DTDs).

This report presents FluX, an intermediate query representation language that allows event-based query processing on XML streams. The main goal of FluX is to minimize the amount of buffering by using order constraints. The authors have developed and implemented an efficient algorithm for rewriting XQueries into equivalent FluX queries that have low memory consumption. Finally, the report contains additional experiments which validate the performance claims made by the authors in the original paper.

2 XQuery

XQuery [14] is a flexible standard query language for XML. It is applicable to a broad spectrum of XML data sources, such as XML documents and relational databases. XQuery is a functional language consisting entirely of expressions. It uses XPath expressions to address specific parts of an XML document, and extends XPath with powerful SQL-like expressions that support explicit iteration and the binding of variables to intermediate results. These are called FLWOR (pronounced "flower") expressions because of their five clauses: for, let, where, order by and return. FLWOR expressions are often useful for computing joins between two or more documents and for transforming data (similar to XSLT).

The for and let clauses in a FLWOR expression generate a sequence of tuples of variable bindings, called a tuple stream. The where clause can be used to filter the tuple stream (similar to the where in SQL).
The order by clause can be used to reorder the tuple stream (similar to the order by in SQL). Finally, the return clause is evaluated for each tuple in the filtered and reordered tuple stream, using the variable bindings in the respective tuples. The result of each evaluation of the return clause is one element in the sequence constituting the result of the whole FLWOR expression.

(This report is based on "Schema-based Scheduling of Event Processors and Buffer Minimization for Queries on Structured Data Streams" by Christoph Koch, Stefanie Scherzinger, Nicole Schweikardt and Bernhard Stegmaier, VLDB 2004 [6]. More information about FluX can be found on the project's web site [7].)

Figure 1: The XML document from Listing 1 represented as a tree

Suppose we have an XML document containing authors and their publications, like the one given in Listing 1 and shown again as a tree in Figure 1. If we want to list all the books written by an author with the last name Knuth, in alphabetical order of their titles, we could use the query given in Listing 2.

    <authors>
      <author>
        <first>Donald</first>
        <last>Knuth</last>
        <middle>Ervin</middle>
        <field>computer science</field>
        <field>mathematics</field>
        <book>The Art of Computer Programming</book>
        <paper>Computer programming as an art</paper>
        <paper>An imaginary number system</paper>
        <book>Concrete Mathematics</book>
        <book>Literate Programming</book>
        <comment>The creator of the TeX</comment>
      </author>
    </authors>

Listing 1: An example XML document

    <knuth> {
      for $a in /authors/author
      let $n := $a/last
      for $b in $a/book
      where $n = "Knuth"
      order by $b
      return <book>{data($b)}</book>
    } </knuth>

Listing 2: An example XQuery

Despite its flexibility, XQuery is hardly applicable to streams, because it was designed with the implicit assumption of efficient random access to the XML documents (e.g. the documents are stored in main memory or in a database). Queries like the one in Listing 2 would require large parts of the XML stream to be buffered. In the case of an infinite (or large enough) XML stream this is impossible, and the problem arises even for simple queries without the order by clause.
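The tuple-stream reading of the FLWOR expression in Listing 2 can be mimicked in a few lines of Python. This is only an illustrative sketch: the function name `knuth_books` and the abridged document are assumptions made for the example, and the standard-library parser loads the whole document into memory, which is exactly the random-access assumption that makes XQuery hard to apply to streams.

```python
import xml.etree.ElementTree as ET

# The document from Listing 1, abridged to the elements the query touches.
DOC = """
<authors>
  <author>
    <first>Donald</first>
    <last>Knuth</last>
    <middle>Ervin</middle>
    <book>The Art of Computer Programming</book>
    <paper>Computer programming as an art</paper>
    <book>Concrete Mathematics</book>
    <book>Literate Programming</book>
  </author>
</authors>
"""


def knuth_books(xml_text):
    """Evaluate the query of Listing 2 clause by clause (hypothetical helper)."""
    root = ET.fromstring(xml_text)  # root is the <authors> element
    # for / let: build the tuple stream of variable bindings ($a, $n, $b).
    tuples = [
        {"a": a, "n": a.findtext("last"), "b": b.text}
        for a in root.findall("author")
        for b in a.findall("book")
    ]
    # where: filter the tuple stream.
    tuples = [t for t in tuples if t["n"] == "Knuth"]
    # order by: reorder the tuple stream by book title.
    tuples.sort(key=lambda t: t["b"])
    # return: evaluated once for each remaining tuple.
    return [t["b"] for t in tuples]


print(knuth_books(DOC))
```

For the abridged document the three book titles come back in alphabetical order, starting with "Concrete Mathematics". Note that the sort cannot start before the last book has been read, which is why the order by clause forces buffering.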

Figure 2: The result of applying the query in Listing 3

Figure 3: Processing XQuery (all dark red nodes need to be buffered while the green ones can be processed on-the-fly)

    <authors>
    { for $a in /authors/author return
        <author>
          <name> {$a/first} {$a/middle} {$a/last} </name>
          <fields> {$a/field} </fields>
          <books> {$a/book} </books>
          <papers> {$a/paper} </papers>
        </author> }
    </authors>

Listing 3: A simple XQuery

Take for example the simple query in Listing 3, which will be used as an ongoing example. It transforms the XML document in Figure 1 into the one in Figure 2. All children of the author node that are not comments are wrapped in groups of names, fields, books and papers. The order of the elements barely changes (only the middle and last elements are switched), so intuitively most of the processing should occur on the fly. However, if the XQuery processor has no information about the order in which the elements occur in the stream, it must assume that at any given time any element can still occur. This means that, while processing the stream, the only elements that are not buffered are the first name, because it is the first element to be emitted, and the comments, because they are not emitted at all. All other elements need to be buffered until the XQuery processor is sure that no more first names can occur in the input stream, and this is known only when the end tag of the author element is encountered (Figure 3). We will see later that this situation can be greatly improved by using order constraints.
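The buffering behaviour described above can be made concrete with a small Python sketch. It is a simplified model rather than the behaviour of any particular XQuery engine: the names `OUTPUT_ORDER` and `buffered_without_schema` are made up for the example, and the model only records which children of one author element would have to wait for the closing tag.

```python
# Order in which Listing 3 writes the children of each <author>.
OUTPUT_ORDER = ["first", "middle", "last", "field", "book", "paper"]


def buffered_without_schema(children):
    """Return the child tags that must be held in memory until </author>.

    Without order information only the element that is written first
    ("first" here) can be streamed; every other element that appears in
    the output has to wait, and elements that are never written (the
    comments) are simply dropped.
    """
    streamed = OUTPUT_ORDER[0]
    return [tag for tag in children if tag in OUTPUT_ORDER and tag != streamed]


# Children of the <author> element in Listing 1, in stream order.
AUTHOR = ["first", "last", "middle", "field", "field", "book",
          "paper", "paper", "book", "book", "comment"]

print(buffered_without_schema(AUTHOR))
```

For the author element of Listing 1 this reports nine of the eleven children as buffered, which corresponds to the situation shown in Figure 3.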
3 The FluX Query Language

3.1 XQuery⁻

FluX is based on XQuery⁻, a fragment of XQuery that contains:

- Fixed-path XPath expressions, which only use the child axis (i.e. sequences \(a_1/a_2/\dots/a_n\), where the \(a_i\) are symbols from the DTD and \(n \ge 1\))
- Atomic conditions (i.e. exists $x/π, $x/π RelOp s, or $x/π RelOp $y/π′, where $x is a variable that ranges over XML trees, s is a string, π and π′ are fixed paths, and \(\mathit{RelOp} \in \{=, <, \le, >, \ge\}\))
- Arbitrarily nested conditional expressions (ifs) and expressions containing the for, where and return clauses.

This fragment, XQuery⁻, does not contain the let or order by clauses, nor does it allow more complex XPath expressions that use, for example, the descendant axis or the wildcard operator (e.g. a//b or a/*/b).

3.2 FluX

FluX expressions are either process-stream expressions { ps $y: handlers } or simple XQuery⁻ expressions that require no buffering. For example, <a> {$x} </a> is a simple expression while {$x} {$y} is not, because it would require buffering the nodes used by $y. Every process-stream expression has a non-empty list of handlers separated by semicolons (;):

- On handlers: "on a as $x return FluX-expression". They fire immediately when an instance of an element a is encountered as a child of the currently processed node, so no buffering is required.
- On-first handlers: "on-first past(S) return XQuery⁻-expression". They fire as soon as no element in the set S can be encountered in the input stream any more. All the elements on which the given expression depends are buffered until the moment the on-first handler fires.

Notice that while on handlers can fire zero, one or more times, on-first handlers fire exactly once, in the worst case when the end tag of the currently processed node is encountered.

For example, if we want to use FluX to process our example stream and output first the books and then the papers, we can output the books on the fly (on handler) and buffer all the papers. We can only output the papers when we are sure that no more books can appear in the input stream (on-first handler).
This moment can be computed exactly from a schema, as we will see later. The complete FluX query is given in Listing 4. Notice that FluX provides a strong intuition of how memory buffers are used.

    { ps $author:
        on book as $b return {$b};
        on-first past(book, paper) return
          { for $p in $author/paper return {$p} }; }

Listing 4: We output first the books, then the papers

Returning to our ongoing example, we can use the order constraints extracted from a DTD (Listing 5) to drastically reduce the amount of buffering (Figure 4). What we still need to buffer are the papers, because there is no order relation between books and papers and we want the books to appear before the papers in the output stream, and additionally the last name, because in the output stream it has to appear after the middle name while in the input stream it appears before it. Listing 6 shows the FluX query that achieves this low memory consumption.

Figure 4: FluX reduces the amount of buffering (only the dark red nodes need to be buffered while the green ones can be processed on-the-fly)

    <!ELEMENT authors (author)*>
    <!ELEMENT author  (comment?, first, last, middle, field+,
                       (book | paper)*, comment?)>
    <!ELEMENT comment (#PCDATA)>
    <!ELEMENT first   (#PCDATA)>
    <!ELEMENT middle  (#PCDATA)>
    <!ELEMENT last    (#PCDATA)>
    <!ELEMENT field   (#PCDATA)>
    <!ELEMENT book    (#PCDATA)>
    <!ELEMENT paper   (#PCDATA)>

Listing 5: A DTD we can use to reduce buffering

    { ps $ROOT:
        on-first past() return <authors>;
        on authors as $authors return
          { ps $authors:
              on author as $a return
                { ps $a:
                    on-first past() return <author>;
                    on-first past() return <name>;
                    on first as $f return {$f};
                    on middle as $m return {$m};
                    on-first past(last, middle) return
                      { for $l in $a/last return {$l} };
                    on-first past(first, last, middle) return </name>;
                    on-first past(first, last, middle) return <fields>;
                    on field as $f return {$f};
                    on-first past(first, last, middle, field) return </fields>;
                    on-first past(first, last, middle, field) return <books>;
                    on book as $b return {$b};
                    on-first past(first, last, middle, field, book) return </books>;
                    on-first past(first, last, middle, field, book) return <papers>;
                    on-first past(book, paper) return
                      { for $p in $a/paper return {$p} };
                    on-first past(first, last, middle, field, book, paper) return </papers>;
                    on-first past(first, last, middle, field, book, paper) return </author>;
                }; };
        on-first past(authors) return </authors>; }

Listing 6: The FluX query equivalent to the XQuery in Listing 3
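How the two handlers of Listing 4 interact with the buffer can be sketched in Python. This is a hand-written simulation, not the FluX engine: `GROUP`, `past` and `run_listing4` are hypothetical names, and `past` is a deliberately conservative stand-in for the Past predicate that only compares positions in the content model of Listing 5. A conservative answer can make the on-first handler fire later than necessary, but never too early.

```python
# Position of each tag in the content model of <author> (Listing 5).
# book and paper share a position because they may be interleaved.
GROUP = {"first": 0, "last": 1, "middle": 2, "field": 3,
         "book": 4, "paper": 4, "comment": 5}


def past(symbols, tag):
    """Conservative check: every symbol lies strictly before `tag`."""
    return all(GROUP[s] < GROUP[tag] for s in symbols)


def run_listing4(children):
    """Simulate Listing 4 on the (tag, text) children of one <author>.

    Returns the output sequence and the peak number of buffered papers.
    """
    out, buffer, peak = [], [], 0
    fired, started = False, False
    for tag, text in children:
        if tag == "first":
            started = True
        if not started:
            continue  # optional leading comment: nothing is past yet
        # on-first past(book, paper): fires once, then flushes the buffer.
        if not fired and past({"book", "paper"}, tag):
            out.extend(buffer)
            buffer.clear()
            fired = True
        if tag == "book":      # on book as $b return {$b}: streamed
            out.append(text)
        elif tag == "paper":   # needed by the on-first handler: buffered
            buffer.append(text)
            peak = max(peak, len(buffer))
    if not fired:              # worst case: fire at the closing tag
        out.extend(buffer)
    return out, peak


AUTHOR = [("first", "Donald"), ("last", "Knuth"), ("middle", "Ervin"),
          ("field", "computer science"), ("field", "mathematics"),
          ("book", "The Art of Computer Programming"),
          ("paper", "Computer programming as an art"),
          ("paper", "An imaginary number system"),
          ("book", "Concrete Mathematics"),
          ("book", "Literate Programming"),
          ("comment", "The creator of the TeX")]

print(run_listing4(AUTHOR))
```

On the author element of Listing 1 the three books are emitted as they arrive, the two papers are held back, and the buffer is flushed when the trailing comment shows that no book or paper can follow.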

\[ \mathit{comment}_1?\;\; \mathit{first}_1\;\; \mathit{last}_1\;\; \mathit{middle}_1\;\; \mathit{field}_1^{+}\;\; (\mathit{book}_1 \mid \mathit{paper}_1)^{*}\;\; \mathit{comment}_2? \]

Figure 5: The regular expression representing the content model of the author elements for the DTD in Listing 5

Figure 6: The Glushkov automaton equivalent to the expression in Figure 5

4 Order Constraints

Given a regular expression \(\rho\) over \(\Sigma\), symbols \(a, b \in \mathit{symb}(\rho)\), a set \(S \subseteq \Sigma\) and a word \(u \in \mathit{symb}(\rho)^{*}\), we are interested in the following three order constraints:

- \(\mathit{Ord}_\rho\): a binary relation over \(\Sigma\) such that \((a, b) \in \mathit{Ord}_\rho\) when all occurrences of a appear before all occurrences of b in a valid XML stream (e.g. for \(\rho\) in Figure 5, \(\{(\mathit{field}, \mathit{book}), (\mathit{field}, \mathit{paper})\} \subseteq \mathit{Ord}_\rho\), but \((\mathit{book}, \mathit{paper}) \notin \mathit{Ord}_\rho\) and \((\mathit{paper}, \mathit{book}) \notin \mathit{Ord}_\rho\)).
- \(\mathit{Past}_{\rho,S}(u)\): a predicate indicating that none of the symbols in S can be encountered after reading the word u from the input stream.
- \(\text{first-past}_{\rho,S}(u)\): an event generator which fires the first possible time \(\mathit{Past}_{\rho,S}(u)\) holds.

These constraints can be efficiently computed from the Glushkov automaton [1], which is a DFA in the case of 1-unambiguous regular expressions (such as the content models of DTDs). The Glushkov automaton itself can be constructed in \(O(|\rho|^2)\) time and space (by computing the first, last and follow relations). The Past predicate is computed in \(O(|Q|^2)\) by a simple DFS traversal of the Glushkov automaton: for each state \(q_i \in Q\) we compute the set of reachable states. Writing \(q^{\#}\) for the symbol that labels state q:

\[ \mathit{Past}_\rho(q_i, a) \iff \neg\exists\, q_j :\; q_j^{\#} = a \,\wedge\, q_j \text{ reachable from } q_i \]

The Ord relation can be computed from the Past predicate in time \(O(|Q| \cdot |\mathit{symb}(\rho)|)\) by a simple iteration over all states q (see Figure 8):

\[ \mathit{Ord}_\rho(a, b) \iff \forall q :\; q^{\#} = b \Rightarrow \mathit{Past}_\rho(q, a) \]

For each set of symbols \(S \subseteq \Sigma\) found in an on-first past handler of the FluX query, the value of the Past predicate is precomputed and stored in a table (PastTable):

\[ \mathit{PastTable}_{\rho,S}(q) \iff \forall a \in S :\; \mathit{Past}_\rho(q, a) \]

For the empty word \(\epsilon\), the first-past event is read directly from the table entry of the initial state \(q_0\):

\[ \text{first-past}_{\rho,S}(\epsilon) := \mathit{PastTable}_{\rho,S}(q_0) \]
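The computation of Past, Ord and PastTable can be sketched in Python on the automaton for the author content model. The transition table below was written out by hand from the regular expression in Figure 5 and is an assumption of this example, as are all function names. Reachability is taken to mean one or more transitions, which is also an assumption: a different convention would change the entries in which a state is asked about its own symbol.

```python
# Glushkov automaton for the content model of <author>, written by hand:
# each state maps to the set of its successor states.
FOLLOW = {
    "q0":       {"comment1", "first1"},
    "comment1": {"first1"},
    "first1":   {"last1"},
    "last1":    {"middle1"},
    "middle1":  {"field1"},
    "field1":   {"field1", "book1", "paper1", "comment2"},
    "book1":    {"book1", "paper1", "comment2"},
    "paper1":   {"book1", "paper1", "comment2"},
    "comment2": set(),
}


def symbol(state):
    """The symbol that labels a state (the state name minus its index)."""
    return state.rstrip("0123456789")


def reachable(state):
    """States reachable from `state` in one or more transitions (DFS)."""
    seen, stack = set(), list(FOLLOW[state])
    while stack:
        q = stack.pop()
        if q not in seen:
            seen.add(q)
            stack.extend(FOLLOW[q])
    return seen


def past(state, a):
    """Past(q, a): no state labelled `a` is reachable from q."""
    return all(symbol(q) != a for q in reachable(state))


def ord_rel(a, b):
    """Ord(a, b): `a` is past in every state labelled `b`."""
    return all(past(q, a) for q in FOLLOW if q != "q0" and symbol(q) == b)


def past_table(state, symbols):
    """PastTable_S(q): every symbol of S is past in state q."""
    return all(past(state, a) for a in symbols)


print(ord_rel("field", "book"), ord_rel("book", "paper"))
print(past_table("comment2", {"book", "paper"}))
```

The sketch agrees with the examples given for Ord: field precedes both book and paper, while book and paper are unordered with respect to each other. A real implementation would store the results of `past_table` once per handler instead of recomputing reachability on every call.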

    Past_ρ      comment  first  last   middle  field  book   paper
    comment_1   false    false  false  false   false  false  false
    first_1     false    false  false  false   false  false  false
    last_1      false    true   false  false   false  false  false
    middle_1    false    true   true   false   false  false  false
    field_1     false    true   true   true    false  false  false
    book_1      false    true   true   true    true   false  false
    paper_1     false    true   true   true    true   false  false
    comment_2   true     true   true   true    true   true   true

Figure 7: The Past predicate corresponding to the automaton in Figure 6

Figure 8: The Ord relation for the automaton in Figure 6 (edges that can be obtained by transitivity have been omitted)

For a non-empty word \(u_1 \dots u_n\), the first-past event fires exactly when the table entry switches from false to true, where q denotes the state reached after reading \(u_1 \dots u_{n-1}\):

\[ \text{first-past}_{\rho,S}(u_1 \dots u_n) := \mathit{PastTable}_{\rho,S}(\delta(q, u_n)) \,\wedge\, \neg\mathit{PastTable}_{\rho,S}(q) \]

We can thus evaluate all three order constraints at run time in \(O(1)\) time (a few table look-ups), using \(O(|\rho|^2)\) time and \(O(|\rho|^2)\) space for the precomputations.

5 From XQuery to FluX

XQuery⁻ queries (queries in the XQuery fragment of Section 3.1) are transformed into FluX queries in two steps. First the query is normalized by applying the rewriting rules in Figure 9 until no rule can be applied. The first rule converts every conditional for-loop into a for containing an if that corresponds to the condition. The second rule turns the iteration that could be implied by an XPath expression into an equivalent for-loop. The third rule converts multiple-step paths into single-step ones. The fourth pushes conditionals inside the innermost for-loops. Finally, the last two rules ensure that for each subexpression of the form {if ξ then α}, α is either a fixed string or of the form {$x} for some variable $x. This step takes linear time with respect to the query size (\(O(|Q|)\)).
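The third normalization rule, which splits multiple-step paths, can be illustrated with a short Python sketch that works on query text. It follows only the prose description of the rule, so its exact form is an assumption, and the function name `split_path` and the way intermediate variables are named (after the step they iterate over) are invented for the example.

```python
def split_path(var, root, steps, body):
    """Rewrite `for var in root/s1/.../sn return body` into nested
    single-step for-loops, returned as query text.

    Intermediate variables are named after the step they range over,
    a naming scheme chosen only for this illustration.
    """
    if len(steps) == 1:
        return "for %s in %s/%s return %s" % (var, root, steps[0], body)
    tmp = "$" + steps[0]  # hypothetical fresh variable for the first step
    inner = split_path(var, tmp, steps[1:], body)
    return "for %s in %s/%s return { %s }" % (tmp, root, steps[0], inner)


print(split_path("$a", "$ROOT", ["authors", "author"], "{$a}"))
```

Applied to the path /authors/author of the ongoing example, the sketch produces one loop over $ROOT/authors that encloses a second loop over $authors/author, so that every for-loop navigates exactly one child step.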
    <authors>
    { for $authors in $ROOT/authors return
      { for $a in $authors/author return
        <author>
          <name>
            { for $f in $a/first return {$f} }
            { for $m in $a/middle return {$m} }
            { for $l in $a/last return {$l} }
          </name>
          <fields> { for $f in $a/field return {$f} } </fields>
          <books> { for $b in $a/book return {$b} } </books>
          <papers> { for $p in $a/paper return {$p} } </papers>
        </author> } }
    </authors>

Listing 7: The normalized equivalent to the XQuery in Listing 3

For example, the query in Listing 3 can be normalized by breaking the multiple-step path /authors/author into two single steps, and converting all other XPath expressions into for-loops (Listing 7).

Figure 9: The rewriting rules used for normalization.

The normalized XQuery⁻ expression can be rewritten into a FluX query in time \(O(|D|^3 + |Q|^2)\), which is dominated by the need to compute dependencies in order to ensure safety. A FluX query is called safe for a given DTD if its subexpressions do not refer to paths that might still be encountered in an input stream that is valid with respect to the given DTD. The query rewriting algorithm converts a normalized query into an equivalent FluX query that is safe for the given DTD and has low memory consumption [7]. However, no strong guarantees about memory consumption can be provided in the context of FluX. The authors have also developed a system using XML Stream Attribute Grammars (XSAGs) in which this kind of guarantee can be provided [8], although at a high cost in query expressivity.

6 Experiments

We performed new experiments under conditions similar to those presented in the original paper¹. The two XQuery processors used for comparison are Galax 0.5 and SAXON 8.6.1. The test system was an Intel Pentium 4 at 3.00GHz with 1024MB of memory, running Debian GNU/Linux with kernel version 2.6.13 and Java 1.5.0_05. The XML data was generated by XMark [12], from 11MB to 100MB (212MB for Q13) in increments of approximately 11MB. The queries used are Q1, Q8 and Q13 from the XML Query Use Cases [15], slightly modified in order to work with FluX.

¹ Note: The results in this report are not supposed to be directly compared with the ones presented in the original paper. The tests used a more powerful machine and a more recent version of Galax. We also compare to SAXON, a high-performance open source XQuery processor, instead of the anonymous commercial XQuery processor used in the paper.

In all three cases the memory consumption of FluX is much lower than the memory consumption of both Galax and SAXON. In two of the cases (Q1 and Q13) FluX did not buffer any data, while Galax and SAXON used large amounts of memory and actually ran out of it for Q13 when the stream was larger than 190MB. The time needed to process the queries (even including preprocessing of the DTD) is also very competitive, and better than that of SAXON for Q1 and Q13. For Q8 the time performance of FluX is worse than that of SAXON; however, the authors argue that techniques for algebraic query optimization could be used in the future to further improve the performance of FluX.

7 Acknowledgements

Stefanie Scherzinger and Prof. Christoph Koch provided helpful insights on the FluX internals and the working implementation of FluX used in the experiments. Special thanks go to Dan Olteanu for his constructive comments on an earlier revision of this report.

8 Related Work

There was little related work at the time the paper was published. The only comparable approach was based on transducer networks, and it lacked a working implementation [9]. Other efforts focused on applying XQuery algebras [13] in the streaming context [3], optimizing XQueries using a set of constraints [2], or doing stream pre-filtering [10]. Finally, XML Stream Attribute Grammars (XSAGs) are a query language that allows for scalable data transformations rather than just document filtering. They run strictly in linear time with bounded memory consumption independent of the size of the stream, but are at the same time less expressive than FluX [8].

References

[1] A. Brüggemann-Klein, D. Wood: One-Unambiguous Regular Languages. Information and Computation, Volume 142, Number 2, May 1998, pp. 182-206.
[2] A. Deutsch and V. Tannen. Reformulation of XML Queries and Constraints. In Proc. ICDT 03, 2003.
[3] L. Fegaras, D. Levine, S. Bose, and V. Chaluvadi. Query Processing of Streamed XML Data. In Proc. CIKM 2002, pages 126-133, 2002.
[4] D. Florescu, C. Hillery, D. Kossmann, P. Lucas, F. Riccardi, T. Westmann, M. J. Carey, A. Sundararajan, and G. Agrawal. The BEA/XQRL Streaming XQuery Processor. In Proc. VLDB 2003, pages 997-1008, 2003.
[5] Galax: An Implementation of XQuery. Available: http://www.galaxquery.org/

[6] C. Koch, S. Scherzinger, N. Schweikardt, B. Stegmaier: Schema-based Scheduling of Event Processors and Buffer Minimization for Queries on Structured Data Streams. VLDB 2004.
[7] C. Koch, S. Scherzinger, N. Schweikardt, B. Stegmaier: The FluXQuery Engine. Last updated October 4, 2005. Available: http://www.infosys.uni-sb.de/~scherzin/fluxquery.html
[8] C. Koch, S. Scherzinger: Attribute Grammars for Scalable Query Processing on XML Streams. DBPL 2003: 233-256.
[9] B. Ludäscher, P. Mukhopadhyay, and Y. Papakonstantinou. A Transducer-Based XML Query Processor. In Proc. VLDB 2002, pages 227-238, 2002.
[10] A. Marian and J. Simeon. Projecting XML Documents. In Proc. VLDB 2003, pages 213-224, 2003.
[11] M. Kay. SAXON: The XSLT and XQuery Processor. Available: http://saxon.sourceforge.net/
[12] XMark: An XML Benchmark Project. Available: http://monetdb.cwi.nl/xml/
[13] World Wide Web Consortium. XQuery 1.0 and XPath 2.0 Formal Semantics. W3C Candidate Recommendation (3 November 2005), 2005. Available: http://www.w3.org/tr/xquery-semantics/
[14] World Wide Web Consortium. XQuery 1.0: An XML Query Language. W3C Candidate Recommendation (3 November 2005), 2005. Available: http://www.w3.org/tr/xquery/
[15] World Wide Web Consortium. XML Query Use Cases. W3C Working Draft (02 May 2003), 2003. Available: http://www.w3.org/tr/2003/wd-xquery-use-cases-20030502/
