Dependency-Preserving Normalization of Relational and XML Data (Appendix)

Solmaz Kolahi
Department of Computer Science, University of Toronto
solmaz@cs.toronto.edu

Proof of Theorem 1

($\Rightarrow$) Suppose $(R,\Sigma)$ is in 3NF and $I \in \mathit{inst}(R,\Sigma)$. Having $\mathit{INF}_I(p \mid \Sigma) < 1$ for some $p = (R,t,A)$ in $\mathit{Pos}(I)$ means that there is redundant information in position $p$. Since we assume $\Sigma$ only contains FDs, there must be an FD $X \to A \in \Sigma^+$ and a different tuple $t'$ in $I$ such that $t[X] = t'[X]$, and therefore $t[A] = t'[A]$. This can only happen when $X$ is not a key. Thus, because $(R,\Sigma)$ is in 3NF, $A$ is a prime attribute.

($\Leftarrow$) Assume that there is an FD $X \to A \in \Sigma^+$ such that $X$ is not a key for $R$ and $A$ is not prime. We show that there is an instance $I \in \mathit{inst}(R,\Sigma)$ and a position $p \in \mathit{Pos}(I)$ such that $\mathit{INF}_I(p \mid \Sigma) < 1$. Let $I$ be an instance of $(R,\Sigma)$ containing two tuples $t_1, t_2$ defined as follows. For every $B \in \mathit{sort}(R)$, $t_1[B] = 1$. If $B \in X^+$, then $t_2[B] = 1$; otherwise $t_2[B] = 2$. It is easy to see that $I$ satisfies $\Sigma$, and for position $p = (R,t_1,A)$ we have $\mathit{INF}_I(p \mid \Sigma) < 1$. This contradicts the assumption that for every non-prime attribute $A$ and position $p = (R,t,A)$, we have $\mathit{INF}_I(p \mid \Sigma) = 1$.

Proof of Claim 3.2

For proving this claim, we use the following lemma from [5]:

Lemma 1. Let $\Sigma$ be a set of FDs over a relational schema $R$, $I \in \mathit{inst}(R,\Sigma)$, $p \in \mathit{Pos}(I)$ and $\bar{a} \in \Omega(I,p)$. Then

$$\lim_{k \to \infty} \frac{1}{\log k} \sum_{a \in [1,k]} P(a \mid \bar{a}) \log \frac{1}{P(a \mid \bar{a})}$$

is either 0 or 1.

Let $n = |\mathit{Pos}(I)|$ in Example 1. By definition, for position $p$:

$$\mathit{INF}_I(p \mid \Sigma) = \lim_{k \to \infty} \frac{1}{\log k} \sum_{\bar{a} \in \Omega(I,p)} \frac{1}{2^{n-1}} \sum_{a \in [1,k]} P(a \mid \bar{a}) \log \frac{1}{P(a \mid \bar{a})} = \frac{1}{2^{n-1}} \sum_{\bar{a} \in \Omega(I,p)} \lim_{k \to \infty} \frac{1}{\log k} \sum_{a \in [1,k]} P(a \mid \bar{a}) \log \frac{1}{P(a \mid \bar{a})}.$$

Therefore, according to Lemma 1, to find the information content of position $p$ we have to count the number of $\bar{a}$'s in $\Omega(I,p)$ for which

$$\lim_{k \to \infty} \frac{1}{\log k} \sum_{a \in [1,k]} P(a \mid \bar{a}) \log \frac{1}{P(a \mid \bar{a})} = 1.$$

It is easy to see that this limit is zero iff the value in position $p$ is forced by the constants in $\bar{a}$ and the FDs in $\Sigma$; i.e., for only one value $a$ in the active domain of $I$ we have $P(a \mid \bar{a}) = 1$, and for $a' \neq a$, $P(a' \mid \bar{a}) = 0$.

Fix $\bar{a} \in \Omega(I,p)$. Let $t_0$ denote the tuple in $\bar{a}$ corresponding to the first row of the instance. Suppose $|\{B_j \mid t_0[B_j] \text{ is constant},\ j \in [1,m]\}| = i$. If an arbitrary tuple $t \neq t_0$ in $\bar{a}$ does not force a value for $p$, one of the following cases must hold:

- $t[A]$ is a variable. In this case $t$ can have either constants or variables for the other attributes, so we can have $2^{m+1}$ different shapes for $t$.
- $t[A]$ is a constant. In this case $t$ can only have variables for the attributes in $\{B_j \mid t_0[B_j] \text{ is constant},\ j \in [1,m]\}$, and constants or variables for the other attributes. Therefore, we can have $2^{m-i+1}$ different shapes for $t$.

Now we count the number of $\bar{a}$'s that only contain tuples of the above form. We can have $2\binom{m}{i}$ different $t_0$'s, and for each $t_0$, each of the other tuples can be in $2^{m+1} + 2^{m-i+1}$ different shapes. Furthermore, $i$ can range over $[0,m]$. Therefore, when the number of tuples in the instance is $\mathit{tup}$, the number of $\bar{a}$'s that do not force a value for position $p$ is:

$$\sum_{i=0}^{m} 2\binom{m}{i}\left(2^{m+1} + 2^{m-i+1}\right)^{\mathit{tup}-1}.$$

To obtain the information content of position $p$, we divide this number by $2^{n-1}$, which is the total number of $\bar{a}$'s. Note that $n = \mathit{tup} \cdot (m+2)$. Thus:

$$\mathit{INF}_I(p \mid \Sigma) = \frac{1}{2^{\mathit{tup}(m+2)-1}} \sum_{i=0}^{m} 2\binom{m}{i}\left(2^{m+1} + 2^{m-i+1}\right)^{\mathit{tup}-1} = \frac{1}{2^{\mathit{tup}+m-1}} \sum_{i=0}^{m} \binom{m}{i}\left(1 + 2^{-i}\right)^{\mathit{tup}-1}.$$

The second equality follows by factoring $2^{(m+1)(\mathit{tup}-1)}$ out of each summand, which leaves the prefactor $2 \cdot 2^{(m+1)(\mathit{tup}-1)} / 2^{\mathit{tup}(m+2)-1} = 2^{-(\mathit{tup}+m-1)}$.

Proof of Theorem 3

($\Rightarrow$) Suppose $(D,\Sigma)$ is in XNF and the FD $X \to p.@l$ is in $(D,\Sigma)^+$. Then the FD $X \to p$ is also in $(D,\Sigma)^+$. Let $T$ be an arbitrary XML tree conforming to $D$ and satisfying $\Sigma$. For every two tree tuples $t_1, t_2$ in $T$, if $t_1$ and $t_2$ agree on all paths in $X$ ($t_1(q) = t_2(q)$ for all $q \in X$), then $t_1(p) = t_2(p) = v$.
Trivially, $t_1$ and $t_2$ also agree on every node $v'$ that is an ancestor of $v$ in tree $T$, so for every path $q$ that is a prefix of $p$ and every attribute $@m$ defined for $\mathit{last}(q)$, $T$ satisfies the FDs $X \to q$ and $X \to q.@m$. Therefore, $X \to q.@m$ is in $(D,\Sigma)^+$.

($\Leftarrow$) The proof of the other direction follows from the FDs resulting from the hierarchical representation of relational attributes. Suppose an FD $X \to p.@l$ is in $(D,\Sigma)^+$. Then for every prefix $q$ of $p$ and the attribute $@m$ defined for $\mathit{last}(q)$, the FD $X \to q.@m$ is also in $(D,\Sigma)^+$. Let $T$ be an arbitrary XML tree conforming to $D$ and satisfying $\Sigma$. If two tree tuples $t_1, t_2$ from $T$ agree on all the attributes of elements from the root to $\mathit{last}(p)$, they will agree on the nodes corresponding to element types from the root to $\mathit{last}(p)$ as well. This is because of the FDs of the form $\{p,\ p.\tau_i.@l_i\} \to p.\tau_i$ that are added to $\Sigma$ during the construction of $(D,\Sigma)$. Therefore, for all paths $q$ that are a prefix of $p$, the FD $X \to q$ is in $(D,\Sigma)^+$. In particular, $X \to p \in (D,\Sigma)^+$, and hence $(D,\Sigma)$ is in XNF.

Proof of Theorem 4

Consider the relational schema $R = (A,B,C,D,F)$ and the following set $F$ of FDs over it:

$ABCD \to F$
$FD \to A$
$FC \to B$

It was shown in Example 3 that we cannot find an appropriate ordering of the attributes in order to give a hierarchical XNF representation of $(R,F)$. Now consider an arbitrary non-XNF hierarchical translation $(D,\Sigma)$ of $(R,F)$:

$E = \{r, A, B, C, D, F\}$.
$A = \{@a, @b, @c, @d, @f\}$.
$P(r) = A^*$, $P(A) = B^*$, $P(B) = C^*$, $P(C) = D^*$, $P(D) = F^*$, $P(F) = \epsilon$.
$R(r) = \emptyset$, $R(A) = \{@a\}$, $R(B) = \{@b\}$, $R(C) = \{@c\}$, $R(D) = \{@d\}$, $R(F) = \{@f\}$.
$\Sigma = \{$
  $\{r.A.@a\} \to r.A$,
  $\{r.A,\ r.A.B.@b\} \to r.A.B$,
  $\{r.A.B,\ r.A.B.C.@c\} \to r.A.B.C$,
  $\{r.A.B.C,\ r.A.B.C.D.@d\} \to r.A.B.C.D$,
  $\{r.A.B.C.D,\ r.A.B.C.D.F.@f\} \to r.A.B.C.D.F$,
  $\{r.A.@a,\ r.A.B.@b,\ r.A.B.C.@c,\ r.A.B.C.D.@d\} \to r.A.B.C.D.F.@f$,
  $\{r.A.B.C.D.F.@f,\ r.A.B.C.D.@d\} \to r.A.@a$,
  $\{r.A.B.C.D.F.@f,\ r.A.B.C.@c\} \to r.A.B.@b$
$\}$.

Now we informally show that no matter how we restructure $(D,\Sigma)$ into another XML specification $(D',\Sigma')$, either the new XML specification is not in XNF, or the FDs are not preserved. Suppose there is a dependency-preserving XNF decomposition $(D',\Sigma')$ obtained from the above $(D,\Sigma)$. Assume that $p_A.@a$, $p_B.@b$, $p_C.@c$, $p_D.@d$ and $p_F.@f$ in $\mathit{paths}(D')$ are mapped to $r.A.@a$, $r.A.B.@b$, $r.A.B.C.@c$, $r.A.B.C.D.@d$ and $r.A.B.C.D.F.@f$ in $\mathit{paths}(D)$, respectively.
Neither of $p_A$ and $p_B$ can be a prefix of the other, because if so, one of the functional dependencies $\{p_F.@f,\ p_C.@c\} \to p_A.@a$ or $\{p_F.@f,\ p_D.@d\} \to p_B.@b$ must also be in $(D',\Sigma')^+$, which is not desirable, since neither corresponds to an FD implied by the original set $F$. Therefore, assuming $p_{AB}$ is the longest common prefix of $p_A$ and $p_B$, and $\tau = \mathit{last}(p_{AB})$, $A = \mathit{last}(p_A)$, $B = \mathit{last}(p_B)$, one of the following cases must be true:

- For every XML tree $T \models (D',\Sigma')$ and every element node corresponding to $\tau$ in $T$, there is only one pair of nodes corresponding to elements $A$ and $B$, i.e., $p_{AB} \to \{p_A, p_B\}$. Since the decomposition is dependency-preserving, $\{p_F.@f,\ p_D.@d\} \to p_A.@a \in (D',\Sigma')^+$. Since $(D',\Sigma')$ is in XNF, $\{p_F.@f,\ p_D.@d\} \to p_A$ is also in $(D',\Sigma')^+$, and because $p_{AB}$ is a prefix of $p_A$, $\{p_F.@f,\ p_D.@d\} \to p_{AB} \in (D',\Sigma')^+$. Thus the FDs $\{p_F.@f,\ p_D.@d\} \to p_B$ and $\{p_F.@f,\ p_D.@d\} \to p_B.@b$, which are spurious, should be in $(D',\Sigma')^+$ as well.
- For every XML tree $T \models (D',\Sigma')$ and every element node corresponding to $\tau$ in $T$, there can be more than one pair of nodes corresponding to elements $A$ and $B$. Here, there is no mechanism to bind $@a$'s and $@b$'s, so spurious tree tuples will appear. Since these two attributes participate in a single functional dependency, these extra tuples may lead the XML tree not to satisfy the functional dependency. Therefore, the decomposition cannot be dependency-preserving.

Proof of Proposition 1

We first need to prove the following claim.

Claim. The FD $A_{i_1} \ldots A_{i_k} \to A_i$ is in $F^+$ iff the FD $\{r.G.@A_{i_1}, \ldots, r.G.@A_{i_k}\} \to r.G.@A_i$ is in $(D_R,\Sigma_F)^+$.

The proof of this claim follows from the fact that for each instance $I$ of $R$, there is an XML tree $T_I$ conforming to $D_R$ such that $I \models F$ iff $T_I \models \Sigma_F$. Moreover, for each XML tree $T$ conforming to $D_R$ and satisfying the FD $\{r.G.@A_1, \ldots, r.G.@A_m\} \to r.G$, there is an instance $I_T$ of $R$ such that $T \models \Sigma_F$ iff $I_T \models F$.

Now we prove the proposition for the case of 3NF and X3NF.

($\Rightarrow$) Suppose that $(R,F)$ is in 3NF. We prove that $(D_R,\Sigma_F)$ is in X3NF. Suppose that there is a nontrivial FD $S \to r.G.@A_i$ in $(D_R,\Sigma_F)^+$. The element paths $r$ and $r.G$ cannot be in $S$, because if two tree tuples agree on element paths $r$ or $r.G$, they agree on every other path, and the FD will be trivially satisfied. Therefore, $S \to r.G.@A_i$ is of the form $\{r.G.@A_{i_1}, \ldots, r.G.@A_{i_k}\} \to r.G.@A_i$. By the above claim, there is an FD in $F^+$ of the form $A_{i_1} \ldots A_{i_k} \to A_i$.
Since $(R,F)$ is in 3NF, at least one of the following must be true, and each implies that $(D_R,\Sigma_F)$ is in X3NF.

- The attributes $A_{i_1} \ldots A_{i_k}$ form a key, and hence for every $j \in [1,m]$, the FD $\{r.G.@A_{i_1}, \ldots, r.G.@A_{i_k}\} \to r.G.@A_j$ is in $(D_R,\Sigma_F)^+$. Since $(D_R,\Sigma_F)^+$ contains the FD $\{r.G.@A_1, \ldots, r.G.@A_m\} \to r.G$, it should also contain $\{r.G.@A_{i_1}, \ldots, r.G.@A_{i_k}\} \to r.G$. Therefore $S \to r.G \in (D_R,\Sigma_F)^+$.
- The attribute $A_i$ is prime, so there is an FD in $F^+$ of the form $A_{l_1} \ldots A_{l_t} A_i \to A_1 \ldots A_m$. Therefore, there is an FD $S' \to r.G$ in $(D_R,\Sigma_F)^+$ such that $r.G.@A_i \in S'$. Moreover, since $A_{l_1} \ldots A_{l_t}$ is not a key, $S' - \{r.G.@A_i\} \to r.G \notin (D_R,\Sigma_F)^+$. Thus $r.G.@A_i$ is prime.
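The two relational cases used in this direction (the left-hand side is a key, or the right-hand attribute is prime) can be checked mechanically. The following is a minimal Python sketch, not part of the original appendix: the helper names `fd`, `closure`, `candidate_keys` and `violations_3nf` are hypothetical, and the test data is the schema $R = (A,B,C,D,F)$ with the FDs $ABCD \to F$, $FD \to A$, $FC \to B$ from the proof of Theorem 4. It relies on the standard fact that, for the schema as a whole, 3NF can be tested on the given FDs rather than on the full closure $F^+$.

```python
from itertools import combinations

def fd(lhs, rhs):
    """Hypothetical helper: build an FD as a pair of frozensets of attribute names."""
    return frozenset(lhs), frozenset(rhs)

def closure(attrs, fds):
    """Attribute closure of `attrs` under the FDs in `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def candidate_keys(schema, fds):
    """All minimal attribute sets whose closure is the whole schema (brute force)."""
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            cand = frozenset(combo)
            if any(k <= cand for k in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(cand, fds) == schema:
                keys.append(cand)
    return keys

def violations_3nf(schema, fds):
    """Pairs (X, A) from the given FDs where X is not a superkey and A is not prime."""
    prime = set().union(*candidate_keys(schema, fds))
    bad = []
    for lhs, rhs in fds:
        for a in sorted(rhs - lhs):
            if closure(lhs, fds) != schema and a not in prime:
                bad.append((lhs, a))
    return bad

# Schema and FDs from the proof of Theorem 4.
SCHEMA = frozenset("ABCDF")
FDS = [fd("ABCD", "F"), fd("FD", "A"), fd("FC", "B")]

if __name__ == "__main__":
    print(candidate_keys(SCHEMA, FDS))
    print(violations_3nf(SCHEMA, FDS))
```

Under these assumptions the candidate keys are $CDF$ and $ABCD$, so every attribute is prime and no FD violates 3NF, even though $FD \to A$ and $FC \to B$ have left-hand sides that are not keys.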

($\Leftarrow$) Suppose that $(D_R,\Sigma_F)$ is in X3NF. We prove that $(R,F)$ is in 3NF. Let $A_{i_1} \ldots A_{i_k} \to A_i$ be a nontrivial FD in $F^+$. Then there must be an FD $\{r.G.@A_{i_1}, \ldots, r.G.@A_{i_k}\} \to r.G.@A_i$ in $(D_R,\Sigma_F)^+$ by the above claim. Since $(D_R,\Sigma_F)$ is in X3NF, at least one of the following must be true, and each implies that $(R,F)$ is in 3NF.

- The FD $\{r.G.@A_{i_1}, \ldots, r.G.@A_{i_k}\} \to r.G$ is in $(D_R,\Sigma_F)^+$, which easily implies that $A_{i_1} \ldots A_{i_k}$ is a key.
- The attribute path $r.G.@A_i$ is prime, so there is a nontrivial FD $S \to p$ in $(D_R,\Sigma_F)^+$ such that $r.G.@A_i \in S$, $S$ is minimal, and $p$ is an element path. Since this FD is not trivial, the element path $p$ cannot be $r$, and there is no element path in $S$, so $S \to p$ is of the form $\{r.G.@A_{l_1}, \ldots, r.G.@A_{l_t},\ r.G.@A_i\} \to r.G$. Thus, for every $j \in [1,m]$, the FD $\{r.G.@A_{l_1}, \ldots, r.G.@A_{l_t},\ r.G.@A_i\} \to r.G.@A_j$ is in $(D_R,\Sigma_F)^+$, and this implies that $A_{l_1} \ldots A_{l_t} A_i$ is a key for $R$. Since $S$ is minimal, $A_{l_1} \ldots A_{l_t} A_i$ is a candidate key, and hence $A_i$ is prime.

Proof of Proposition 2

Let $(D,\Sigma)$ be a hierarchical translation of the relational specification $(R,F)$ obtained from any ordering of the attributes in $R$. We first need to prove the following claims.

Claim. Any FD $S \to p$ in $(D,\Sigma)^+$ is equivalent to an FD of the form $S' \to p$ where $S'$ only contains attribute paths.

The proof of this claim follows from the following fact: for every XML tree $T$ conforming to $D$ and satisfying $\Sigma$, and every two tree tuples $t_1, t_2$ in $T$, $t_1$ and $t_2$ agree on an element path $q \in \mathit{paths}(D)$ iff they agree on every attribute path $q'.@m$ such that $q'$ is a prefix of $q$. This is because of the FDs of the form $\{p,\ p.\tau.@l\} \to p.\tau$ that are added to $\Sigma$ during the construction of $(D,\Sigma)$.

Claim. The FD $A_{i_1} \ldots A_{i_k} \to A_{i_{k+1}}$ is in $F^+$ iff the FD $\{p_{i_1}.@l_{i_1}, \ldots, p_{i_k}.@l_{i_k}\} \to p_{i_{k+1}}.@l_{i_{k+1}}$ is in $(D_R,\Sigma_F)^+$, where for every $j \in [1,k+1]$, the last element of $p_{i_j}$ corresponds to relational attribute $A_{i_j}$.

The proof of this claim follows from the fact that in our translation of relational data into XML, tree tuples fully represent relational tuples.
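The statement that tree tuples fully represent relational tuples can be illustrated with a small round trip. This is a minimal Python sketch under loudly stated assumptions: the nested-dict encoding and the names `to_tree` and `tree_tuples` are hypothetical and not taken from the paper. Each nesting level stands for one element type in the chosen attribute ordering, and sibling nodes with the same attribute value are merged, which mimics the effect of the FDs of the form $\{p,\ p.\tau.@l\} \to p.\tau$.

```python
def to_tree(rows, order):
    """Nest relational tuples (dicts) into a tree following the attribute ordering.

    Each level maps an attribute value to the subtree below it, so equal values
    under the same parent share a single node (assumed encoding, for illustration).
    """
    root = {}
    for row in rows:
        node = root
        for attr in order:
            node = node.setdefault(row[attr], {})
    return root

def tree_tuples(tree, order, prefix=()):
    """Enumerate root-to-leaf paths of the nested tree; each path is one tree tuple."""
    if len(prefix) == len(order):
        yield dict(zip(order, prefix))
        return
    for value, child in tree.items():
        yield from tree_tuples(child, order, prefix + (value,))

if __name__ == "__main__":
    order = ("A", "B", "C")
    rows = [
        {"A": 1, "B": 1, "C": 2},
        {"A": 1, "B": 2, "C": 2},
        {"A": 2, "B": 1, "C": 1},
    ]
    tree = to_tree(rows, order)
    # The tree tuples read back from the tree are exactly the original rows.
    print(list(tree_tuples(tree, order)))
```

In this encoding every relational tuple yields exactly one root-to-leaf path and vice versa, which is the property the claim relies on.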
Now we prove the proposition for the case of 3NF and X3NF. Suppose there is an FD of the form $S \to p.@l$ in $(D,\Sigma)^+$. By the first claim, this FD can be written as $\{p_{i_1}.@l_{i_1}, \ldots, p_{i_k}.@l_{i_k}\} \to p.@l$, and by the second claim, the FD $A_{i_1} \ldots A_{i_k} \to A$ should be in $F^+$. Since $(R,F)$ is in 3NF, one of the following cases is true, and each implies that $(D,\Sigma)$ is in X3NF.

- $A_{i_1} \ldots A_{i_k}$ forms a key for $R$, i.e., the values of $A_{i_1} \ldots A_{i_k}$ identify a unique tuple in any instance $I$ of $(R,F)$. Every relational tuple corresponds to a single tree tuple in the XML tree $T$ that represents the relational instance $I$. This means that if two tree tuples $t_1, t_2$ in $T$ agree on $\{p_{i_1}.@l_{i_1}, \ldots, p_{i_k}.@l_{i_k}\}$, they are equal. Thus, $\{p_{i_1}.@l_{i_1}, \ldots, p_{i_k}.@l_{i_k}\}$ implies every element path $q$, and in particular $p$. Therefore, $S \to p$ is in $(D,\Sigma)^+$.

- $A$ is prime. Equivalently, $p.@l$ is contained in a minimal set $S'$ that implies every element path $q$, and hence $p.@l$ is a prime attribute path.
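Returning to the counting argument in the proof of Claim 3.2, the two expressions for $\mathit{INF}_I(p \mid \Sigma)$ can be compared numerically. The following is a minimal Python sketch, not part of the original appendix: the function names are hypothetical, and the exponents ($\mathit{tup}-1$ for the remaining tuples, $n-1$ with $n = \mathit{tup} \cdot (m+2)$ for the total number of $\bar{a}$'s) are assumptions read off the counting argument. Exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction
from math import comb

def inf_by_count(m, tup):
    """Number of non-forcing vectors divided by the total 2^(n-1), n = tup*(m+2).

    Assumed count: sum_i 2*C(m,i) * (2^(m+1) + 2^(m-i+1))^(tup-1).
    """
    n = tup * (m + 2)
    count = sum(
        2 * comb(m, i) * (2 ** (m + 1) + 2 ** (m - i + 1)) ** (tup - 1)
        for i in range(m + 1)
    )
    return Fraction(count, 2 ** (n - 1))

def inf_closed_form(m, tup):
    """Simplified form: (1 / 2^(tup+m-1)) * sum_i C(m,i) * (1 + 2^-i)^(tup-1)."""
    total = sum(
        comb(m, i) * (1 + Fraction(1, 2 ** i)) ** (tup - 1)
        for i in range(m + 1)
    )
    return total / 2 ** (tup + m - 1)

if __name__ == "__main__":
    for m in range(0, 5):
        for tup in range(1, 6):
            assert inf_by_count(m, tup) == inf_closed_form(m, tup)
    print(inf_by_count(1, 2))
```

For a single tuple both expressions evaluate to 1, as expected when no other tuple can force the value in position $p$.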