Data Exchange with Arithmetic Comparisons

Size: px

Start display at page:

Download "Data Exchange with Arithmetic Comparisons"

Berenice Heath
6 years ago
Views:

1 Data Exchange with Arithmetic Comparisons Foto Afrati 1 Chen Li 2 Vassia Pavlaki 1 1 National Technical University of Athens (NTUA), Greece {afrati,vpavlaki}@softlab.ntua.gr 2 University of California, Irvine, USA chenli@ics.uci.edu Abstract In this paper, the problem of data exchange in the presence of arithmetic comparisons is studied. For this purpose, tuple generating dependencies with arithmetic comparisons (tgd-ac) and arithmetic comparison generating dependencies (acgd) are introduced to capture the need to use arithmetic comparisons in source-totarget and in target constraints (both in their antecedent and in their consequent). Arithmetic comparisons are interpreted over densely totally ordered domains. A chase procedure called AC-chase is introduced to handle such constraints. AC-chase is a tree rather than a sequence and its depth is polynomial on the size of the source instance if the set of target constraints is the union of a set of target acgds with a weakly acyclic set of target tgd-acs. It is shown that AC-chase computes a universal solution which can be used to compute certain answers for unions of conjunctive queries with arithmetic comparisons (UCQAC). The complexity of existence of a solution is shown to be in NP. The complexity of computing certain answers of UCQAC queries is shown to be conp-complete. In order to identify polynomial cases, we introduce the succinct AC-chase which however does not always produce a solution. We identify cases where its result is a universal solution and we investigate the syntactic conditions of the query under which we can use the result of succinct AC-chase to compute the certain answers in polynomial time. We show that the latter is also feasible in cases where the result of succinct chase is not a universal solution. 1 Introduction Data exchange has been recognized as a fundamental problem decades ago and has recently attracted the interest of many researchers (e.g., [AL05, Ber03, FKMP03, FKP03, Got05, GN06, KPT06, Lib06, VR07]). In the data exchange problem, we want to take data structured under a schema, called the source schema, and transform it into data structured under another schema, called the target schema, for the purpose of being able to answer queries posed over the target schema. In this paper we investigate the data exchange problem in the presence of arithmetic comparisons which is admittedly a much harder problem. The following is an example of such a problem. A real estate company, called A, has a database S that stores information about apartments and owners in Orange County, California. Fig. 1 shows its two tables, namely Home and Owner. Due to business reasons, the company needs to send some of its data to another real estate company, called B, which has a database T with two tables: Apartment(aID, apartaddr, price, zip, name); Person(name, Addr). The table Apartment contains information about apartments, including the ID (aid), address (apartaddr), price (price), zip code (zip), and owner s name (name). The table Person stores information about the names (name) and addresses (Addr) of owners. Thus, in order to facilitate the data transfer from the database of company A to the database of company B, we need to transform automatically the data from the source schema to the Work partially done when all authors were visiting Stanford University. The project is co - funded by the European Social Fund (75%) and National Resources (25%) - Operational Program for Educational and Vocational Training II (EPEAEK II) and particularly the Program HERAKLEITOS.

2 target schema. Fagin et al. [FKMP03] showed that tuple-generating dependencies (tgds) and equality-generating dependencies (egds) are very powerful to define many kinds of semantic correspondences between the schemas and, hence, they can be used to transform data in a data exchange problem. Intuitively, a tuple generating dependency expresses that if some facts exist in a database, then some other facts should also exist. Thus, in our example, the following two tgds could facilitate the data transformation: d st1 : Home(I, A, P) Apartment(I, A, P, Z, N) and d st2 : Owner(I, N, A) Person(N, A). Home: aid apartaddr price Main St Alton Ave Alton Ave 800 Owner: aid name Addr 13 Tom Smith Highland Ave 54 John Parker Franklin Ave 43 Ken Johnson Culver Dr Figure 1: Source database at company A. However, suppose we want to transfer data only about apartments with price 500 thousand dollars. Suppose also that we know that the zip code of apartments in company A is less than (e.g., because we know that all apartments are in Orange County), and we should store this information in the database of company B. Tgds and egds cannot handle either of these two cases. In this paper, we introduce tuple generating dependencies that use also arithmetic comparisons, called tuple-generating dependencies with arithmetic comparisons (tgd-acs in short), by extending the class of tgds to include comparison predicates. We also extend the class of egds to allow for any arithmetic comparison on the consequent and we call them arithmetic-comparison-generating dependencies (acgds in short). The need for including arithmetic comparisons to data dependencies has been observed in earlier work [BCW99, IO93, MS96, Mah97]. In our example, the following two tgd-acs will transform the data while satisfying also the two conditions above: d st1 : Home(I, A, P) P 500 Apartment(I, A, P, Z, N) Z > and d st2 : Owner(I, N, A) Person(N, A). Finally, besides using tgd-acs and acgds to declare how we want the data to be transformed from the source schema to the target schema (source-to-target constraints), we may also use them to declare that some constraints are required to be satisfied on the target data (target constraints). The data exchange problem that we investigate is the following: We are given a source schema S, a target schema T and a set of source-to-target constraints and target constraints. For any source instance, we want to find a target instance which satisfies the constraints (existence of solution problem). Moreover, for a given query q, we want to find the certain answers of q (query answering problem). These problems and their complexity are investigated in [FKMP03] for the case of constraints that are tgds and egds and for conjunctive queries and conjunctive queries with inequalities ( ). In the present paper, we investigate the complexity of these problems for the case of constraints that are tgd-acs and acgds and for conjunctive queries with arithmetic comparisons (<,, >,,, =). Summary of Results: It is known that reasoning on logical formulas with arithmetic comparisons is more complicated than logical formulas without them. More specifically, many of the technical tools used to solve data exchange problems are closely related to query containment tests. Conjunctive query containment is recognized as a much simpler problem (NP-complete) than containment of conjunctive queries in the presence of arithmetic comparisons (Π P 2 -complete) [CM77, Klu88, Mey92]. Thus, in our setting we have to deal with all additional complications that are introduced by the presence of arithmetic comparisons. The organization of the paper and our contributions are as follows: In Section 2, we define the concepts that we use. Since arithmetic comparisons are interpreted over totally ordered domains, the concept of a tgd-ac being satisfied on a database instance which includes labeled nulls is not straightforward. Hence, new notions are needed. The notion of universal solution (which is used to compute certain answers of queries) is extended to be a set of solutions rather than a single solution. Interestingly, also in our setting universal solutions have the property that any two of them are equivalent, if viewed as unions of conjunctive queries with arithmetic comparisons, (Proposition 2.1) which is an extension of the property noticed in [FKMP03]. In Section 3, we define a novel chase procedure called AC-chase which is a tree rather than a sequence and can handle tgd-acs and acgds. We show that the result of AC-chase is a universal solution (Theorem 3.1). We

3 Constraints Query Language Complexity Reference tgds and egds UCQ with two per CQ conp-hard [FKMP03] tgds and egds CQ with seven conp-hard [AD98] tgds and egds CQ with two conp-hard [Mad05] tgds and egds UCQ with inequalities ( ) conp [FKMP03] tgd-acs and acgds UCQAC conp Section 4 tgds and egds UCQ, at most one per CQ PTIME [FKMP03] full tgd-acs UCQAC, each CQAC has the PTIME Section 5 and acgds homomorphism property tgd-acs and acgds UCQAC, each CQAC has the PTIME Section 5 no target constraints homomorphism property tgds and egds UCQAC, each CQAC uses open PTIME Section 5 (closed) LSI (RSI) comparisons only Table 1: Complexity results on query answering in data exchange for weakly acyclic dependencies. CQ = conjunctive query; UCQ = union of conjunctive queries; (U)CQAC = (union of) conjunctive query (queries) with arithmetic comparisons. extend the definition of weakly acyclic set of tgds to define weakly acyclic set of tgd-acs. We prove that for a set of weakly acyclic set of tgd-acs and acgds, the AC-chase always produces a tree of polynomial length (on the size of the input source instance). From that, we derive the complexity results, which show that the problem of existence of solution is in NP (Theorem 3.2). In Section 4, we show that we can compute certain answers of unions of conjunctive queries with arithmetic comparisons using the result of the AC-chase (Theorem 4.1(1)). Again, since arithmetic comparisons are interpreted over totally ordered domains, computing such queries on a database instance with labeled nulls is a challenge and the definition of some new concepts is needed. In particular, we first define the concept of computing certain answers on an instance. We prove that the data complexity of the problem of computing certain answers for the data exchange problem with arithmetic comparisons is co-np complete (Theorem 4.2). Interestingly, we have been able to extend another observation made in [FKMP03] about properties of universal solution (Theorem 4.1(2)). In Section 5, we identify polynomial cases of the problem. For that, we define another chase procedure called succinct AC-chase which is a sequence and whenever it uses acgds and weakly acyclic tgd-acs, its length is polynomial (on the size of the source instance) (Theorem 5.1). However, it does not produce always a universal solution. We show that universal solutions are produced by the succinct AC-chase in two cases: (a) when there are no target constraints and (b) when we use full tgd-acs (i.e., without existentially quantified variables)(theorem 5.2). On query answering we show that whenever the query has the homomorphism property then we can compute certain answers in polynomial time (Lemma 5.1) on a given instance. A consequence of that is Theorem 5.4 which settles the problem of computing certain answers in polynomial time for the cases where succinct AC-chase produces a universal solution. However, we identify an additional case where certain answers of queries with the homomorphism property can be computed in polynomial time on the result of the succinct AC-chase, although the latter is not a universal solution it is when we use only tgds and egds and the query has a certain property (Theorem 5.5) 1. A summary of all our polynomial results can be found in Section 5.5. Table 1 summarizes existing complexity results on computing certain answers. The usefulness of the homomorphism property has been observed for reducing the complexity of the query containment problem [ALM04, Klu88] in the presence of arithmetic comparisons. It contains large classes of queries such as queries which use only comparisons of the form X < c where X is a variable and c is a constant. 1 Note our notion of solution does not coincide with that in [FKMP03]. Hence, although when we use tgds and egds the succinct AC-chase reduces to the chase, its result does not qualify as solution. This remark is implicitly made in [FKMP03] by noticing that this result cannot be used to compute certain answers of queries with and hence the disjunctive chase was introduced there.

4 2 Data Exchange with Arithmetic Comparisons. In this section we define a) dependencies with arithmetic comparisons, b) data exchange setting with arithmetic comparisons, and c) the concepts of solution and universal solution. We prove the first important result which says that universal solutions as defined here have good properties in that any pair of universal solutions are equivalent if viewed as queries (Proposition 2.1). For that, we review the query containment problem for conjunctive queries with arithmetic comparisons. 2.1 Dependencies with Arithmetic Comparisons. We first define the class of arithmetic comparison generating dependencies, which extends equality-generating dependencies (egds). We then define the class of tuple-generating dependencies with arithmetic comparisons, which extends tuple-generating dependencies (tgds). We consider arithmetic comparison predicates of the form AθB, where A is a variable, B is a variable or a constant, and θ is in {,, <, >,, =}. They are interpreted over densely totally ordered domains. We call an arithmetic comparison open if its operator is < or >, and closed if its operator is or. We call an arithmetic comparison AθB semi-interval (SI in short) if either A or B is a constant. If B is a constant we call it left semi-interval and if A is a constant we call it right semi-interval. definition 2.1. (acgd & tgd-ac) Let D be a database schema. An arithmetic-comparison-generating dependency (acgd in short) is a first-order formula of the form x ( φ(x) β φ (x 1 θ x 2 )). x 1 is a variable from x and x 2 is either a variable from x or constant. A tuple-generating dependency with arithmetic comparisons (tgd-ac in short) is a first-order formula of the form x ( φ(x) β φ yψ(x, y) β ψ ). In the formulas, φ(x) is a conjunction of atomic relational formulas over D, with variables in x. Each variable in x occurs in at least one formula in φ(x). In addition, ψ(x, y) is a conjunction of atomic relational formulas with variables in x and y, and each variable in y occurs in at least one formula in ψ(x, y). Also β φ and β ψ are each a conjunction of comparison predicates with variables from φ(x) and ψ(x, y), respectively. In general, a conjunctive formula with arithmetic comparisons is denoted as F = F 0 + β F, where F 0 contains the relational atoms and β F the comparisons of F. A conjunctive formula is in compact form if the conjunction of its comparisons β F do not imply equalities. 2.2 Preliminary Definitions. Let S ={S 1,..., S n } and T = {T 1,..., T m } be two disjoint schemas, where each element in the sets is a relation. A source-to-target dependency in Σ st is a tgd-ac of the form: x ( ) φ S (x) β φ yψ T (x, y) β ψ. A target dependency is a dependency over the target schema T. Every target dependency in Σ t is either a tgd-ac of the form x ( ) φ T (x) β φ yψ T (x, y) β ψ, or an acgd of the form x ( φ T (x) β φ (x 1 θx 2 )). In source-to-target and target dependencies, φ S (x), ψ T (x, y), and φ T (x) are each a conjunction of atomic relational formulas over S or T respectively, and β φ and β ψ are conjunctions of comparison predicates. A data exchange setting with arithmetic comparisons, denoted M = (S, T, Σ st, Σ t ), consists of a source schema S, a target schema T, a set Σ st of source-to-target dependencies, and a set Σ t of target dependencies. We use Const to denote the set of constants that may occur in the database. We assume constants are from an infinite totally ordered dense domain such as the rationals or reals. We assume an infinite set Var of values, called labeled nulls, such that Const Var =. We define instances as conjunctive formulas with arithmetic comparisons using constants and labeled nulls. For an instance K, we use Var(K) and Const(K) to denote the set of labeled nulls and the set of constants in K, respectively. An instance K where Var(K) is empty, is called ground instance. We use Const(Σ st Σ t ) to denote the set of constants appearing in source-to-target and target dependencies. If M = (S, T, Σ st, Σ t ) is a data exchange setting with arithmetic comparisons, and I is a ground instance of S, then an instance K is called a t-instance with respect to M and I, or simply a t-instance (if M and I are understood from the context), if the arithmetic comparisons of K define a total order with respect to C = Const(K) Var(K) Const(Σ st Σ t ) Const(I). A t-instance K is called a t-instance induced by an instance K if K results from K by adding to K additional arithmetic comparisons so that a total order is defined among the labeled nulls and constants in C, possibly with some labeled nulls or constants equated. We define the

5 concept of t-homomorphism as a homomorphism which a) satisfies the arithmetic comparisons and b) it maps a conjunctive formula on a t-instance. The formal definition follows. definition 2.2. (t-homomorphism) Let F = F 0 + β F be a conjunctive formula and K = K 0 + β K be a t-instance in compact form over the same schema. A t-homomorphism from F to K is a mapping h from Const(F ) Var(F ) to Const(K) Var (K) with the following properties: Every constant in F is mapped to itself. For every fact R(x 1,..., x k ) in F we have R(h(x 1 ),..., h(x k )) in K. β K logically implies h(β F ), denoted by β K h(β F ). We say F t-homomorphically maps to K if there is a t-homomorphism from F to K. Note that a t-homomorphism h from F to K directly implies a homomorphism h 0 from F 0 to K 0. We call h 0 the underlined homomorphism of h. There might not be a ground instance corresponding to an instance. We say an instance K is consistent if there is a non-empty ground instance K such that there is a t-homomorphism from K to K. We check whether an instance K is consistent by constructing the inequality graph using the arithmetic comparisons and finding cycles in this graph [Klu88]. Using the cycles in the inequality graph, we can also convert a t-instance to a compact t-instance. It can be done in polynomial time. Example 2.1 illustrates the concepts defined here such as instance, t-instance, ground instance. example 2.1. Let M = (S, T, Σ st, Σ t ) be a data exchange setting with arithmetic comparisons in which S has a relation E(A, B), T a relation H(C, D), Σ t =, and Σ st = {E(X, Y), X < Y Z(H(X, Z), H(Z, Y), Z < X)}. Let I = {E(2, 5), E(4, 7)} be a ground instance of the source S. Consider the following instances. J 1 = {H(2, z 1), H(z 1, 5), H(4, z 2), H(z 2, 7), z 2 < z 1 < 2}; J 2 = {H(2, z 1 ), H(z 1, 5), H(4, z 2 ), H(z 2, 7), z 1 = z 2 < 2}; J 3 = {H(2, z 1 ), H(z 1, 5), H(4, z 2 ), H(z 2, 7), z 1 < z 2 < 2}; J 4 = {H(2, z 1), H(z 1, 5), H(4, z 2), H(z 2, 7), z 1 < 2 = z 2}; J 5 = {H(2, z 1 ), H(z 1, 5), H(4, z 2 ), H(z 2, 7), z 1 < 2 < z 2 < 4}; J 6 = {H(2, z 1), H(z 1, 5), H(4, z 2), H(z 2, 7), z 1 < 2, z 2 > 7}; J 7 = {H(2, z 1 ), H(z 1, 5), H(4, z 2 ), H(z 2, 7), z 2 < z 1 < 3}; J 8 = {H(2, z 1), H(z 1, 5), H(4, z 2), H(z 2, 7), z 1 < 2, z 2 2}; J 9 = {H(2, z 1 ), H(z 1, 5), H(4, z 2 ), H(z 2, 7), z 1 < 2, z 2 < 4}; J 10 = {H(2, 1), H(1, 5), H(4, 0), H(0, 7)}; J 11 = {H(2, 2), H(2, 5), H(4, 0), H(0, 7), H(8, 7)}. J 1 is a t-instance, since its arithmetic comparisons define a total order of its labeled nulls (z 1 and z 2 ), its constants (2, 4, 5, and 7), and the constants in I (there is no constant in the tgd-ac). Similarly, J 2, J 3, J 4, J 5, and J 6 are also t-instances. J 7, J 8, and J 9 are instances, but not t-instances, since the arithmetic comparisons of each of them do not define a total order of its values (constants and labeled nulls). Both J 10 and J 11 are ground instances over T. There is a t-homomorphism h from J 7 to J 1, where h is the identity mapping from Const Var(J 7 ) to Const Var(J 1 ). In particular, the mapping turns z 2 < z 1 < 3 to z 2 < z 1 < 3, which is implied by z 2 < z 1 < 2. Similarly, J 1 t-homomorphically maps to the ground instance J 10. There is no t-homomorphism from J 6 to J 1. Finally, J 1, J 2, J 3, and J 4 are all the t-instances induced by J 8. J 1 is a compact t-instance, since its comparisons do not imply equalities. J 2 is not, since it has an equality z 1 = z 2. A compact t-instance of J 2 is J 2 = {H(2, z 1 ), H(z 1, 5), H(4, z 1 ), H(z 1, 7), z 1 < 2}, which is obtained from J 2 by replacing z 2 with z 1. Let d be a tgd-ac or an acgd, and K be a t-instance of the corresponding schema. K satisfies d if whenever there is a t-homomorphism h from the left-hand side (lhs in short) of d to K, there exists an extension h of h, where h is a t-homomorphism from the conjunction of the lhs and the right-hand side (rhs in short) of d to K. Notice that for an egd or a tgd, there is always a non-empty database on which the dependency is satisfied. This property does not hold for tgd-acs and acgds. We call a tgd-ac or an acgd consistent if there exists a non-empty database instance on which this dependency is satisfied.

6 2.3 T-solutions, universal t-solutions, universal solutions. Let M = (S, T, Σ st, Σ t ) be a data exchange setting with arithmetic comparisons, and I a ground instance of the source schema S. We say an instance J over T is a t-solution for I under M (or, simply a t-solution if I and M are clear from the context), if J is a t-instance, and the instance I, J = I J satisfies Σ st Σ t. A t-solution that is a ground instance is called a ground solution. We say an instance J of T is a solution for I under M (or, simply a solution if I and M are clear from the context), if every t-instance induced by J is a t-solution. A universal solution for I under M, or simply a universal solution is a set J = {J 1,..., J m } of solutions for I, such that for every ground solution K, there is a solution J i J that t-homomorphically maps to K. A universal t-solution is a universal solution which contains only t-solutions. example 2.2. In Example 2.1, the set J 1 = {J 1,..., J 5 } is a universal t-solution. J 2 = {J 5, J 8 } and J 3 = {J 9 } are two universal solutions. It is easy to check that the source-to-target constraint is satisfied on the union of I and every t-instance in J 1. Also, since J 1,..., J 5 are all the induced instances of J 8, the constraint is also satisfied on J 8. How to prove that J 1 is a universal solution is explained in the next section (Example 3.3). How to prove that J 3 is a universal solution is explained in Section 5 (Example 5.2). If J 3 is a universal solution, then J 2 is also a universal solution because if there is a t-homomorphism from an instance F to a t-instance K, then there is some induced t-instance of F which t-homomorphically maps on K (the proof can be found in Lemma 3.2). To every instance K we associate a boolean conjunctive query with arithmetic comparisons q K, whose body contains exactly all the facts in K as subgoals (labeled nulls are replaced by variables), and the (possibly partial) order is added to q K as arithmetic comparison predicates. The predicate in the head uses a fresh predicate name. We call q K the corresponding boolean query of K. To a set of instances we associate a corresponding boolean query by creating a boolean query for each instance with the same head predicate and taking the union. Proposition 2.1 shows that the universal solution has good properties. In what follows we use SOL(M, I) to denote the set of all solutions for I. Proposition 2.1. Let M = (S, T, Σ st, Σ t ) be a data exchange setting with arithmetic comparisons. The following hold. 1. If J and J are two universal solutions for a source instance I, then the corresponding boolean queries q J and q J are equivalent as queries. 2. Let I and I be two source instances, J be a universal solution for I, and J be a universal solution for I under the same data exchange setting M. Then, SOL(M, I) SOL(M, I ) iff q J is contained in q J. Moreover, SOL(M, I) = SOL(M, I ) iff q J and q J are equivalent as queries. The proof of this proposition uses techniques from query containment tests. Subsections 2.4 and 2.5 review the techniques for checking containment between two conjunctive query with arithmetic comparisons (and between unions of them) and in Subsection 2.6 we prove Proposition 2.1. We consider query containment over densely totally ordered domains. The following is an example which illustrates why containment is different if interpreted over discrete domains. example 2.3. Consider queries Q 1 () : r(x), 3 < X < 4 and Q 2 () : r(x), 13 < X < 14. If the comparisons are interpreted over the integers then they are both empty on all databases, hence they are equivalent. However if comparisons are interpreted over the reals then on database {r(3.5)}, query Q 1 computes to true whereas query Q 2 computes to false. 2.4 Containment of Conjunctive Queries with Arithmetic Comparisons (CQACs). In this section we review the problem of checking containment between two conjunctive queries with arithmetic comparisons (CQACs). Kolaitis [Kol05] observed that tgds can be seen as containment between two conjunctive queries. In that sense the intuition behind the check for containment between two CQACs appears in many proofs in the present paper. A conjunctive query with arithmetic comparisons (CQAC) is of the form h(x) : e 1 (X 1 ),..., e k (X k ), C 1,..., C m where each C i is an arithmetic comparison. In general, testing containment between two CQACs is more complex than checking containment between two conjunctive queries. The main reason is that when checking containment of two conjunctive queries with

7 arithmetic comparisons (CQACs), a single containment mapping in general does not suffice, as it is the case for conjunctive queries. The following example illustrates the complexities that arise when checking containment of two CQACs. example 2.4. Consider the following two queries, which are graphically illustrated in Figure 2. Q 1 () : r(x 1, X 2 ), r(x 2, X 3 ), r(x 3, X 4 ), r(x 4, X 5 ), r(x 5, X 1 ), X 1 < X 2. Q 2 () : r(x 1, X 2 ), r(x 2, X 3 ), r(x 3, X 4 ), r(x 4, X 5 ), r(x 5, X 1 ), X 1 < X 3. ½ ½ Ö Ö ¾ Ö Ö ¾ Ö Ö Ö Ö Ö Ö É ½ ½ ¾ É ¾ ½ Figure 2: Graph representation of Q 1 and Q 2. Although the two queries have different comparisons, surprisingly, Q 1 Q 2. To show Q 1 Q 2, we consider the five mappings from the five ordinary subgoals of Q 2 to the five of Q 1. Each mapping corresponds to a rotation of the variables. Under these mappings, the ACs of Q 2 become X 1 < X 3, X 2 < X 4, X 3 < X 5, X 4 < X 1, and X 5 < X 2, respectively. According to the containment test between CQACs of Gupta and Zhang- Ozsoyoglu [Gup94, ZO93] (see below), 2 it is easy, to see that Q 1 Q 2 since if the right hand side of the following implication is false, then X 1 = X 2 : (X 1 < X 2 ) (X 1 < X 3 ) (X 2 < X 4 ) (X 3 < X 5 ) (X 4 < X 1 ) (X 5 < X 2 ). Similarly, we can prove that Q 2 Q 1. Notice that a single containment mapping does not suffice to check containment of two CQACs. Let Q 1 and Q 2 be two CQACs. There are two algorithms in the literature to test whether Q 2 Q 1 : a) the test of canonical databases and b) the test based on logical implication [GSUW94, Klu88]. We briefly present each of them for completeness (in our proofs we make explicit use of the test based on the canonical databases). a) Containment Test based on Canonical Databases. This test is a generalization of the Levy-Sagiv test [LS93], and it is due to Klug [Klu88]. The test produces an exponential number of canonical databases, each of which could be a counterexample to the containment. Suppose we want to test Q 1 Q 2. We do the following: 1. Consider all partitions of the variables of Q 1. For each partition, consider all possible total orders of the blocks of this partition, and assign to each block of the partition a unique constant so that the assigned constants satisfy the total order. Then substitute the variables in each block of the partition by the corresponding constant. Thus we obtain a number of canonical databases D 1, D 2,..., D k, one database for each different order in each partition. Each D i consists of the frozen subgoals of Q 1 excluding the subgoals having comparison predicates. 2. For all D i (together with the corresponding total order) that make the body of Q 1 true, test whether Q 2 (D i ) includes the frozen head of Q 1. The frozen head of Q 1 is obtained by making the same substitution of constants for variables that yielded D i. 3. Q 1 Q 2 if and only if (2) holds. 2 Technically the containment testing requires normalization of the two queries. We can show that in this example the argument also holds after the normalization.

8 b) Containment Test based on Logical Implication. This test was introduced by Gupta [Gup94] and Zhang-Ozsoyoglu [ZO93] in parallel work. First we need to normalize both queries Q 1 and Q 2 to Q 1, Q 2 respectively as follows: 1. Replace each occurrence of a shared variable X in the ordinary subgoals except the first occurrence, by a new distinct variable X i, and add X = X i to the comparisons of the query. 2. Replace each constant c in the query by a new distinct variable Z, and add Z = c to the comparisons of the query. The following theorem states the main claim of the second algorithm. Theorem 2.1. [GSUW94, Klu88] Let Q 1, Q 2 be CQACs and Q 1 = Q 1,0 + β 1, Q 2 = Q 2,0 + β 2 be the queries after normalization. Let µ 1,..., µ k be all the mappings (homomorphisms) from Q 1,0 to Q 2,0. Then Q 2 Q 1 if and only if the following logical implication ϕ is true: ϕ : β 2 µ 1 (β 1)... µ k (β 1). That is, the comparisons in the normalized query Q 2 logically imply (denoted ) the disjunction of the images of the comparisons of the normalized query Q 1 under these mappings. Notice that in Theorem 2.1, the or operation ( ) in the implication is critical, since there might not be a single mapping µ i from Q 1,0 to Q 2,0, such that β 2 µ i (β 1 ). Another important observation is the following. To test the containment of two queries Q 1 and Q 2, using the result in Theorem 2.1, we need to normalize them first. Introducing more comparisons to the queries in the normalization can make the implication test computationally more expensive. Thus, we may want to have a containment result that does not require the queries to be normalized. [ALM04] presented two cases, in which even if Q 1 and Q 2 are not normalized, we still have Q 2 Q 1 if and only if β 2 µ 1 (β 1 )... µ k (β 1 ), where the µ i s are all the homomorphisms from the normal subgoals in Q 1 to the normal subgoals in Q Containment test for unions of CQACs. To prove Proposition 2.1 we need to use a test for checking containment between two unions of CQACs. Figure 3 shows a procedure to test if Q 1 Q 2, where Q 1, Q 2 are unions of CQACs, which was first presented in [Klu88]. Input: Two unions of CQACs: Q 1 = Q i 1, Q 2 = Q j 2. Output: Boolean value to show if Q 1 Q 2. Method: (1) For every CQAC query Q i 1 in Q 1 do: (a) Consider all partitions of the variables of Q i 1. For each partition, consider all possible total orders of the blocks of this partition, and assign to each block of the partition a unique constant so that the assigned constants satisfy the total order. Then substitute the variables in each block of the partition by the corresponding constant. Thus we obtain a number of canonical databases D 1, D 2,..., D k, one database for each different order in each partition. Each D j consists of the frozen subgoals of Q i 1 excluding the subgoals having comparison predicates. (b) For all D j (together with the corresponding total order) that make the body of Q i 1 true, test whether Q 2 (D j ) includes the frozen head of Q i 1. The frozen head of Q i 1 is obtained by making the same substitution of constants for variables that yielded D j. (2) If step (1) holds for all Q i 1 in Q 1, then output yes, otherwise, output no. Figure 3: Procedure for testing containment of unions of CQACs Theorem 2.2. The procedure in Figure 3 can correctly test if Q 1 Q 2 [Klu88]. Proof. Suppose the test output yes. Let D be a database and let t be a tuple in the answers of query Q 1 on D, i.e., t is in Q 1 (D). Then, there is a CQAC query Q i 1 in Q 1, such that t is in Q i 1(D). Hence there is a

9 homomorphism from the ordinary subgoals in the body of Q i 1 to D, which maps the head of Q i 1 on t and satisfies the arithmetic comparisons. The image of this homomorphism is isomorphic to one of the databases D j that make the body of Q i 1 true. Thus the answers of Q 2 on D also contain the tuple t. Therefore, Q 1 Q 2. Suppose Q 1 Q 2. Then every database D j that makes the body of some Q i 1 true is such that the frozen head of Q i 1 is in the answers of Q 1. Since Q 1 Q 2, the frozen head of Q i 1 is in the answers of Q 2 too. Hence the test will report yes. 2.6 Proof of Proposition 2.1. First we need the following lemma which proves a property of q J. Lemma 2.1. Let K be an instance and q K its corresponding boolean query. Then the following hold: a) for every canonical database D of q K such that q K (D) is true, there is a t-instance induced by K such that there is a surjective and injective t-homomorphism from K to D; b) for every t-instance K i induced by K, there is a canonical database D of q K, where q K (D) is true, and there is a surjective and injective t-homomorphism from K i to D. Let J be a universal solution and q J its corresponding boolean query (a union of CQACs). Then the following hold: a) for every CQAC query qi J in q J it holds: for every canonical database D of qi J such that qi J (D) is true, there is a t-instance J j k induced by a solution J j in J, such that there is a surjective and injective t-homomorphism from J jk to D; and b) for every solution J j in J it holds: for every t-instance J jk induced by J j, there is a CQAC qi J in q J, such that there is a canonical database D of qi J where qi J (D) is true, and there is a surjective and injective t-homomorphism from J jk to D. Proof. We will prove each item separately. (a) Let D be a canonical database of q K. Hence there is a t-homomorphism h from q K to D which is surjective. Since q K is the corresponding boolean query of K, K essentially is obtained after renaming the variables of q K. Hence t-homomorphism h implies a t-homomorphism h from K to D, which is surjective. (b) Let K D be a t-instance induced by K. Hence there is a t-homomorphism h from K to K D which is surjective. Since q K is the corresponding boolean query of K, for the same reason as above, t-homomorphism h implies a t-homomorphism h from q K to K D, which is surjective. This is a consequence of the claims in the first item and the way we extend the definition of the corresponding boolean query to refer also to unions of solutions. We proceed to the proof of Proposition 2.1. Proof. First item: By the definition of universal solution, the following holds: For every t-solution J 1 induced from a solution in J, there exists a solution J 1 in J such that there exists a t-homomorphism from J 1 to J 1. By Lemma 2.1, this statement is equivalent to the containment test for unions of CQACs presented in Section 2.5. Hence q J is contained in q J. The same reasoning also applies to the other direction to prove q J q J. Second item: Suppose SOL(I) SOL(I ). For every t-solution J i induced from a solution in J, since J i is in SOL(I ), there is a t-homomorphism from a solution in J to J i, because J is a universal solution for I. Hence, according to the containment test of union of CQACs presented in Section 2.5, q J is contained in q J. For the other direction, assume q J is contained in q J. By Lemma 2.1, for every t-solution J j induced by a solution in J, there is a solution J j in J and a t-homomorphism h that maps J j to J j (denoted by (I) in Figure 4). Let J be a solution for I. We must show that J is a solution for I, which amounts to showing that I, J satisfies Σ st and J satisfies Σ t. Since J is a solution for I, we know J satisfies Σ t. So it suffices to show that I, J satisfies Σ st. Consider a tgd-ac d : ϕ(x) β ϕ ψ(x, y) β ψ in Σ st. To show I, J satisfies d, we need to show that for every t-solution Ji induced from J, I, Ji satisfies d. For every t-solution J i induced from a solution in J, I, J i satisfies d since we deal with source-to-target constraints (i.e. the schemata are disjoint), it follows that for every vector a of constants such that I satisfies

10 ϕ( a) β ϕ, there is a vector b of elements of J i such that J i satisfies ψ( a, b) β ψ (denoted by (II) in Figure 4). Since J is a universal solution for I, by definition of universal solution, for t-solution Ji induced from J there is a t-homomorphism h from a t-solution J j in J to Ji (denoted by (III) in Figure 4). Also, as we observed at the beginning of this direction, for J j there is a solution J j in J such that there is a t-homomorphism h from J j to J j (denoted by (I) in Figure 4). Note that atomic formulas are preserved under t-homomorphisms. So applying t-homomorphism h followed by h (i.e, the composition h h) to a we get again a because a contains constants only. It follows that I, Ji satisfies the tgd-ac d because J i satisfies ψ( a, h h b) β ψ (denoted by (IV) in Figure 4). This argument applies to every t-solution induced from J, hence J satisfies the tgd-ac d. II III I II IV Figure 4: Proof of Proposition Computing Universal Solutions 3.1 AC-Chase. We define a chase procedure, called AC-chase, which deals with arithmetic comparisons and computes a universal solution. The procedure shares the main idea of the chase procedure in the literature [MMS79, BV84b, BV84a]. However, AC-chase produces a tree rather than a sequence for the following reason. Since the result of an AC-chase step on a t-instance may not be a t-instance, whenever necessary, we need to replace this resulting instance with the set of its induced t-instances. definition 3.1. (AC-Chase Step) Let K be a t-instance. We have three stages: Stage I: We construct an instance K by adding relational facts and comparisons to K. We have two possible cases: a) (acgd) Let d be an acgd φ(x) β φ (x 1 θx 2 ). Let h be a t-homomorphism from φ(x) β φ to K = K 0 + β K which cannot be extended. We say that d can be applied to K with t-homomorphism h. We add h(x 1 θx 2 ) to K and produce K. b) (tgd-ac) Let d : φ(x) β φ ψ(x, y) β ψ be a tgd-ac. Let h be a t-homomorphism from φ(x) β φ to K, such that there is no extension of h to a t-homomorphism from φ(x) β φ ψ(x, y) β ψ to K. We say that d can be applied to K with t-homomorphism h. Let h 0 be the underlined homomorphism of h. Let K 0 be the union of K 0 with the set of facts obtained by: (a) extending h 0 to a homomorphism h 0, such that each variable in y is assigned a fresh labeled null, followed by (b) taking the image of the atoms on the rhs of d under h 0. We add to K 0 all the comparisons h 0(x 1 θx 2 ) for each comparison x 1 θx 2 in β ψ and obtain K. Stage II: We check K for consistency. We distinguish two cases: 1. If K is not consistent we say that the result of applying d to K with h is failure and write K d,h. 2. Otherwise, we say that the result of applying d to K with h is K, and write K d,h K. Stage III: We produce all induced t-instances K 1, K 2,..., K n of K in compact form and write also (equivalently) K d,h {K 1, K 2,..., K n}. The following example illustrates an AC-chase step. example 3.1. Consider the following tgd-ac d and a t-instance K: d: R(A) S(B), A < B; K: {R(x), T (y, 3), x < y < 3}.

11 x and y are labeled nulls. There is a t-homomorphism h from the lhs of d to K which maps A to x. There is no extension of this mapping to a t-homomorphism from the rhs of d to K. Applying d to K under h derives one fact S(b) with a comparison x < b, where b is a fresh labeled null. The new instance is K : {R(x), T (y, 3), S(b), x < y < 3, x < b}. Stage II checks that K is consistent, which it is, because there are the following values that satisfy the comparisons: x = 1, y = 2, b = 5. Hence this is a successful AC-chase step. K is not a t-instance, since it does not define a total order of b w.r.t. the labeled null b and the constant 3. Hence in stage III, we produce all induced t-instances of K. In this case, we have five induced t-instances K i where each is as follows: K i : {R(x), T (y, 3), S(b), x < y < 3, x < b, C i}, where C i is a conjunction of comparisons: C 1 : b < y, C 2 : b = y, C 3 : y < b < 3, C 4 : b = 3, C 1 : b > 3. definition 3.2. (AC-Chase) Let Σ be a set of tgd-acs and acgds, and J be a t-instance on the same database schema. An AC-chase tree of J with Σ is a tree (finite or infinite) such that: The root is J. Each node in the tree is a t-instance. For every node K in the tree, let {K 1,..., K l } be the set of its children. Then there must exist a tgd-ac or an acgd d in Σ, a t-homomorphism h from the lhs of d to K, and an instance K, such that K d,h K, and K 1,..., K l are all the t-instances induced by K. A finite AC-chase of J with Σ is a finite chase tree such that each leaf is either, or satisfies Σ. The result of a finite AC-chase is the union of all the leaf nodes that are not. If the result is not empty, we refer to it as the result of a successful AC-chase. Theorem 3.1 states that AC-chase can be used to compute a universal solution. Theorem 3.1. Let M = (S, T, Σ st, Σ t ) be a data exchange setting with arithmetic comparisons, and I be a source ground instance. 1. Let L = { I, J 1, I, J 2,..., I, J k } be the result of a successful finite AC-chase of I, with Σ = Σ st Σ t, then J = {J 1, J 2,..., J k } is a universal t-solution. 2. If the result of a finite chase of I, with Σ = Σ st Σ t is empty, then there is no solution. The proof of the Theorem 3.1 makes use of a basic property of an AC-chase step, as described in the following lemma. This property has been used and proven in [BV84b, MUV84, FKMP03] in different contexts or settings that are more restricted than ours. Lemma 3.1 essentially proves that the instance K produced in Stage II of the AC-chase Step t-homomorphically maps on a solution E as long as the instance K that created K in this step, t-homomorphically maps on E. However, in Stage III of the AC-chase Step we replace K by its induced instances. Thus, we need to prove that at least one of those induced instances t-homomorphically maps to E. This is done in Lemma 3.2. Lemma 3.1. Let d be a tgd-ac of the form, K be a t-instance, and K d,h K be an AC-chase step (where K ) with a t-homomorphism h from the lhs of d to K. Let E be a t-instance such that: (i) E satisfies d; and (ii) K t-homomorphically maps to E. Then, K t-homomorphically maps to E. Proof. Let d be: φ(x) β φ ψ(x, y) β ψ. Since the result of composing two t-homomorphisms is still a t- homomorphism, µ h is a t-homomorphism from φ(x) β φ to E, as illustrated in Figure 5. Since E satisfies d, there exists an extension of µ h, denoted by h, which is a t-homomorphism from φ(x) β φ ψ(x, y) β ψ to E. For each variable y y, denote by Λ y the labeled null replacing y in the AC-chase step. Define ν on Var(K ) to E as follows: ν(λ) = µ(λ) if Λ Var(K), and ν(λ y ) = h (y) for y y. Now we show that ν is a t-homomorphism from K to E. 1. ν maps the relational facts of K to relational facts of E. This claim is true for those relational facts of K coming from K, since µ is a t-homomorphism. Let T (x 0, y 0 ) be an arbitrary relational atom in the conjunction ψ(x, y). (Here x 0 and y 0 contain variables in x and y, respectively.) Then K contains, in addition to those relational facts of K, a relational fact T (h(x 0 ), Λ y0 ). The image under ν of this fact is, by definition of ν, the fact T (µ(h(x 0 )), h (y 0 )). Since h (x 0 ) = µ(h(x 0 )), this is the same as T (h (x 0 ), h (y 0 )). But h homomorphically maps all relational atoms in φ(x) ψ(x, y), in particular, T (x 0, y 0 ), into relational facts of E.

12 d: φ( x) β yψ( x, y) ϕ β ψ h µoh h,d K K' h': extension of µoh µ ν t-instance: E Figure 5: Proof of Lemma ν preserves the order of K. We use the symbol θ to denote a comparison operator, including <,, =,, >, or. Let α 1 θα 2 be an arbitrary comparison in the conjunction β ψ. Without loss of generality, we have to consider the following cases: α 1 x and α 2 x. Then K contains the comparison h(α 1 )θh(α 2 ). Similarly, the image under ν of this comparison is, by definition of ν, the comparison µ(h(α 1 ))θµ(h(α 2 )). This is the same as h (α 1 )θh (α 2 ). By the definition of h, this image is true in E. α 1 x and α 2 y. Then K contains the comparison h(α 1 )θh(λ α2 ). Then the image under ν of this comparison is, by definition of ν, the comparison µ(h(α 1 ))θh (α 2 ), which is still the same as h (α 1 )θh (α 2 ). Using the same reasoning above, this image is also true in E. α 1 y and α 2 y. The reasoning is similar to the above. Thus, ν is a t-homomorphism from K to E. Lemma 3.2. Let K be an instance (not necessarily t-instance) and E be a t-instance. If there is a t-homomorphism h from K to E, then there is a t-instance K i induced by K, such that h is also a t-homomorphism from K i to E. Proof. For every two values (constant or labeled null) α 1 and α 2 in K, let h(α 1 )θh(α 2 ) be the order between their images in E under the mapping h. For each comparison C in K, we can see that α 1 θα 2 is not contradictory to C, since otherwise the image of the comparison C under h will be contradictory to the comparison h(α 1 )θh(α 2 ) in E. If α 1 θα 2 is not in K, we add it to K. We repeat this addition step for every pair of values in K, and get a t-instance induced by K. Clearly h is also a t-homomorphism from this t-instance to E. We proceed to the proof of Theorem 3.1. Proof. Claim 1: By Definition 3.2, every leaf node I, J i in the finite AC-chase tree is a t-instance that satisfies Σ. By definition of t-solution, J i is also a t-solution. Now consider every ground solution E. Notice that the identity mapping is a t-homomorphism from I, to I, E. By Lemmas 3.1 and 3.2, there exists a J i in J, such that there is a t-homomorphism from I, J i to I, E. Thus, J i can t-homomorphically map to E. Therefore, J is a universal t-solution. Claim 2: If the result of a finite AC-chase of I, with Σ = Σ st Σ t is empty, then each leaf node in the corresponding AC-chase tree, which is, corresponds to a set of derived facts that contain contradictory facts. Suppose there exists a t-solution E. Following the same argument as in the proof of Claim 1, there is a leaf node L, such that there exists a t-homomorphism ν from I, F L to I, E, in which F L is the set of facts corresponding to the leaf node L. Since the image of the pair of contradictory comparisons under ν is still a pair of contradictory comparisons, we have that E contains a pair of contradictory comparisons, which is a contradiction to the fact that E is a ground instance. The following is an example that a) illustrates an AC-chase tree and b) is used in Proposition 3.1 to prove that in this simple case, there is no universal solution which contains a single instance.

13 example 3.2. Consider the following data exchange setting. Target acgd d 1 : E(X, Y, Z), X < Z X = Y Target acgd d 2 : E(X, Y, Z), X = Z X = 5 Source-to-target tgd-ac d 3 : A(Y ) E(X, Y, Z). Let I = {A(8)} be the source instance. Then, after chasing with d 3, we get E(X, 8, Z). When we chase with d 1, we get the solution {E(8, 8, Z), 8 < Z}, whereas when we chase with d 2 we get the solution {E(5, 8, 5)}. These two solutions together with {E(X, 8, Z), X > Z} comprise a universal solution. Still, there is no singleton universal solution as Proposition 3.1 proves. In Figure 6 we have the AC-chase tree for this example. The three t-solutions that comprise the universal solution are the leaf nodes in the oval. A(8) d3 E(X,8,Z) E(X,8,Z),X<Z E(X,8,Z),X=Z E(X,8,Z),X>Z d1 d2 E(8,8,Z), 8<Z E(5,8,5) Figure 6: AC-Chase Tree for Example 3.1. Proposition 3.1. There is a data exchange setting with arithmetic comparisons which uses a single tgd in the source-to-target constraints and two acgds as target constraints such that there is no universal solution which contains one instance. Proof. We use the Example 3.2. We have proved that any two universal solutions are equivalent if viewed as Boolean queries. Hence if there exists a single instance universal solution then its corresponding Boolean query q J is equivalent to the union of Boolean queries: Q 1 : E(8, 8, Z), 8 < Z, Q 2 : E(5, 8, 5) and Q 3 : E(X, 8, Z), X > Z. Let J be this single universal solution. Then J must have no E atom with a constant in the first argument because otherwise there is either no mapping on Q 1 or no mapping on Q 2 since these two have different constants in the first arguments. Thus let E(A, C, B) be an atom of J. Then (using the containment test with canonical databases), there is a canonical database D of q J in which A < B. Hence, there is no mapping from Q 3. However, there is no mapping from Q 1 or Q 2 either since constants should map to the same constants. The following is an example continued from Example 2.2, where we mentioned that J 1 is a universal solution of the setting of Example 2.1. example 3.3. To prove that J 1 is a universal solution of the setting of Example 2.1, apply an AC-chase step to the source instance I = {E(2, 5), E(4, 7)} with t-homomorphism: X 2, Y 5. This step will produce a number of t-instances (as a consequence of stage III). To each of these t-instances, apply an AC-chase step with t-homomorphism: X 4, Y Complexity. We investigate the conditions under which an AC-chase procedure terminates and we study the complexity of the problem. We show that an AC-chase tree is finite when the target dependencies is a set of

14 weakly acyclic tgd-acs and a set of acgds. The definition of weak acyclicity for tgd-acs is an extension of the definition of weak acyclicity for tgds as in [FKMP03] 3. definition 3.3. (Weakly acyclic set of tgd-acs) Let Σ be a set of tgd-acs over a database schema in compact form. We construct a directed graph (called the dependency graph) as follows: (1) There is a node for every pair (R, A) with a relation symbol R of the schema and an attribute A of R. We call such a pair (R, A) a position. (2) Add edges as follows: for every tgd-ac φ(x) β φ ψ(x, y) β ψ in Σ and for every x in x that occurs in ψ(x, y) β ψ, we call x a propagated variable. For such a propagated variable x, for every occurrence of x in φ(x) in position (R, A i ), do the following: (i) For every occurrence of x in ψ(x, y) in position (S, B j ), add an edge (R, A i ) (S, B j ) (if it does not already exist); (ii) In addition, for every existentially quantified variable y in y and for every occurrence of y in ψ(x, y) in position (T, C k ), add a special edge (R, A i ) (T, C k ) (if it does not already exist). We say Σ is weakly acyclic if the dependency graph has no cycle going through a special edge. The above definition differs from the one in [FKMP03] in that a variable from the lhs of a tgd-ac that appears in an arithmetic comparison on the rhs affects acyclicity even if this variable does not appear in any relational atom on the rhs. Example 3.4 shows such a case. Theorem 3.2. Let M = (S, T, Σ st, Σ t ) be a data exchange setting with arithmetic comparisons, such that Σ t is the union of a weakly acyclic set of tgd-acs and a set of acgds and let I be a ground source instance. Then the following hold: (1) There is a polynomial in the size of I that bounds the depth of every AC-chase beginning with I. (2) There is a nondeterministically polynomial-time algorithm such that it tests whether a t-solution for I exists. (3) AC-chase takes exponential time to compute a universal t-solution for I. Before proceeding to the proof of Theorem 3.2 we start with an example about weak acyclicity which shows why we had to modify the definition in the presence of arithmetic comparisons. example 3.4. Consider a database schema with a relation R(A) and a relation S(B). We consider first the following set of tgds: Σ = {d 1, d 2 }, where d 1 : R(X) S(Y ) and d 2 : S(Y ) R(Y ). The dependency graph is weakly acyclic so and a chase on any input terminates. Now we change d 1 slightly, i.e., we consider the following set of tgd-acs: Σ = {d 1, d 2 }, where d 1 : R(X) S(Y ), X < Y. The dependency graph of Σ is not weakly acyclic. In fact, it is easy to see that every AC-chase on the instance {R(1)} using Σ does not terminate because every time we apply tgd-ac d 1 we must add an atom with a value for its argument which is larger than the value in the atom we added in the previous application of d 1. Theorem 3.2 covers interesting cases that arise in many applications. For instance, the class of weakly acyclic tgd-acs includes the class of full tgd-acs (no existentially quantified variables). To illustrate the technical part of the proof of Theorem 3.2, we first prove part (1) of the theorem (as Theorem 3.3) and then the other parts can be easily proved from part (1), and are included again in separate statements (Theorem 3.4 and Theorem 3.5). Theorem 3.3. Let Σ be a set of weakly cyclic tgd-acs. Then, there is a polynomial in the size of a t-instance K that bounds the depth of every AC-chase tree of K with Σ. Proof. We start with introducing some notations used in the proof. Every node in the dependency graph of Σ is of the form (R, A), where R is a relation symbol of the schema and A an attribute of R. We call such a pair (R, A) a position. An incoming path for (R, A) is a path (finite or infinite) ending in (R, A). We define the rank of (R, A), denoted by rank(r, A), as the maximum number of special edges on any incoming path. rank(r, A) is finite since Σ is weakly acyclic. We use the following notations in the proof. r: maximum rank rank(r, A) over all positions (R, A); n: number of distinct values (constants or labeled nulls) in the t-instance K; D: number of dependencies in Σ. 3 The concept of weak acyclicity for the case of tgds also appears in [DT03] as constraints with stratified witness.

The Structure of Inverses in Schema Mappings

To appear: J. ACM The Structure of Inverses in Schema Mappings Ronald Fagin and Alan Nash IBM Almaden Research Center 650 Harry Road San Jose, CA 95120 Contact email: fagin@almaden.ibm.com Abstract A schema