Dependency-Preserving Normalization of Relational and XML Data (Appendix)

Solmaz Kolahi
Department of Computer Science, University of Toronto
solmaz@cs.toronto.edu

Proof of Theorem 1

($\Rightarrow$) Suppose $(R,\Sigma)$ is in 3NF and $I \in inst(R,\Sigma)$. Having $INF_I(p \mid \Sigma) < 1$ for some position $p = (R,t,A)$ in $Pos(I)$ means that there is redundant information in position $p$. Since we assume $\Sigma$ contains only FDs, there must be an FD $X \to A \in \Sigma^+$ and a different tuple $t'$ in $I$ such that $t[X] = t'[X]$ and therefore $t[A] = t'[A]$. This can only happen when $X$ is not a key. Since $(R,\Sigma)$ is in 3NF, $A$ must then be a prime attribute.

($\Leftarrow$) Assume that there is an FD $X \to A \in \Sigma^+$ such that $X$ is not a key for $R$ and $A$ is not prime. We show that there is an instance $I \in inst(R,\Sigma)$ and a position $p \in Pos(I)$ such that $INF_I(p \mid \Sigma) < 1$. Let $I$ be an instance of $(R,\Sigma)$ containing two tuples $t_1, t_2$ defined as follows. For every $B \in sort(R)$, $t_1[B] = 1$. If $B \in X^+$ then $t_2[B] = 1$, otherwise $t_2[B] = 2$. The two tuples differ because $X$ is not a key, so $X^+ \neq sort(R)$. It is easy to see that $I$ satisfies $\Sigma$ (the tuples agree exactly on $X^+$, which is closed under the FDs), and for position $p = (R,t_1,A)$ we have $INF_I(p \mid \Sigma) < 1$. This contradicts the assumption that for every non-prime attribute $A$ and position $p = (R,t,A)$, we have $INF_I(p \mid \Sigma) = 1$.

Proof of Claim 3.2

For proving this claim, we use the following lemma from [5]:

Lemma 1. Let $\Sigma$ be a set of FDs over a relational schema $R$, $I \in inst(R,\Sigma)$, $p \in Pos(I)$ and $\bar{a} \in \Omega(I,p)$. Then

$$\lim_{k \to \infty} \frac{1}{\log k} \sum_{a \in [1,k]} P(a \mid \bar{a}) \log \frac{1}{P(a \mid \bar{a})}$$

is either 0 or 1.

Let $n = |Pos(I)|$ in Example 1. By definition, for position $p$:

$$INF_I(p \mid \Sigma) = \lim_{k \to \infty} \frac{1}{\log k} \sum_{\bar{a} \in \Omega(I,p)} \frac{1}{2^{n-1}} \sum_{a \in [1,k]} P(a \mid \bar{a}) \log \frac{1}{P(a \mid \bar{a})} = \frac{1}{2^{n-1}} \sum_{\bar{a} \in \Omega(I,p)} \lim_{k \to \infty} \frac{1}{\log k} \sum_{a \in [1,k]} P(a \mid \bar{a}) \log \frac{1}{P(a \mid \bar{a})}.$$
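As a side illustration of the ($\Leftarrow$) construction in the proof of Theorem 1, the two-tuple instance can be built mechanically from the attribute closure of $X$. The sketch below is only a check under stated assumptions: the schema $R(A,B,C)$ with the single FD $B \to A$ is a hypothetical example that does not appear in the paper, and the helper names `closure`, `two_tuple_instance` and `satisfies` are invented here.

```python
def closure(attrs, fds):
    """Closure of a set of attributes under FDs given as (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def two_tuple_instance(sort, x, fds):
    """t1 is 1 everywhere; t2 is 1 on the closure of x and 2 elsewhere."""
    x_plus = closure(x, fds)
    t1 = {b: 1 for b in sort}
    t2 = {b: (1 if b in x_plus else 2) for b in sort}
    return [t1, t2]


def satisfies(instance, fds):
    """True if every pair of tuples agreeing on an FD's lhs also agrees on its rhs."""
    return all(
        any(s[a] != t[a] for a in lhs) or all(s[b] == t[b] for b in rhs)
        for s in instance
        for t in instance
        for lhs, rhs in fds
    )


# Hypothetical schema R(A, B, C) with the single FD B -> A (an assumption).
# Its only key is {B, C}, so X = {B} is not a key and A is not prime.
SORT = {"A", "B", "C"}
FDS = [({"B"}, {"A"})]
instance = two_tuple_instance(SORT, {"B"}, FDS)
```

Here the two tuples agree on $A$ and $B$ and differ on $C$, so the value of $A$ in the first tuple is repeated in the second, which is the redundancy the proof relies on.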

Therefore, according to Lemma 1, to find the information content of position $p$ we have to count the number of $\bar{a}$'s in $\Omega(I,p)$ for which

$$\lim_{k \to \infty} \frac{1}{\log k} \sum_{a \in [1,k]} P(a \mid \bar{a}) \log \frac{1}{P(a \mid \bar{a})} = 1.$$

It is easy to see that this limit is zero iff the value in position $p$ is forced by the constants in $\bar{a}$ and the FDs in $\Sigma$; i.e. for only one value $a$ in the active domain of $I$ we have $P(a \mid \bar{a}) = 1$, and for $a' \neq a$, $P(a' \mid \bar{a}) = 0$.

Fix $\bar{a} \in \Omega(I,p)$. Let $t_0$ denote the tuple in $\bar{a}$ corresponding to the first row of the instance. Suppose $|\{B_j \mid t_0[B_j] \text{ is constant}, j \in [1,m]\}| = i$. If an arbitrary tuple $t \neq t_0$ in $\bar{a}$ does not force a value for $p$, one of the following cases must hold:

- $t[A]$ is a variable. In this case $t$ can have either constants or variables for the other attributes, so there are $2^{m+1}$ different shapes for $t$.
- $t[A]$ is a constant. In this case $t$ can only have variables for the attributes in $\{B_j \mid t_0[B_j] \text{ is constant}, j \in [1,m]\}$, and constants or variables for the other attributes. Therefore, there are $2^{m-i+1}$ different shapes for $t$.

Now we count the number of $\bar{a}$'s that only contain tuples of the above form. There are $2\binom{m}{i}$ different $t_0$'s, and for each $t_0$, each of the other tuples can be in $2^{m+1} + 2^{m-i+1}$ different shapes. Furthermore, $i$ can range over $[0,m]$. Therefore, when the number of tuples in the instance is $tup$, the number of $\bar{a}$'s that do not force a value for position $p$ is:

$$\sum_{i=0}^{m} 2\binom{m}{i} \left(2^{m+1} + 2^{m-i+1}\right)^{tup-1}.$$

To obtain the information content of position $p$, we divide this number by $2^{n-1}$, which is the total number of $\bar{a}$'s. Note that $n = tup\,(m+2)$. Thus:

$$INF_I(p \mid \Sigma) = \frac{1}{2^{tup\,(m+2)-1}} \sum_{i=0}^{m} 2\binom{m}{i} \left(2^{m+1} + 2^{m-i+1}\right)^{tup-1} = \frac{1}{2^{tup+m-1}} \sum_{i=0}^{m} \binom{m}{i} \left(1 + 2^{-i}\right)^{tup-1}.$$

Proof of Theorem 3

($\Rightarrow$) Suppose $(D,\Sigma)$ is in XNF and the FD $X \to p.@l$ is in $(D,\Sigma)^+$. Then the FD $X \to p$ is also in $(D,\Sigma)^+$. Let $T$ be an arbitrary XML tree conforming to $D$ and satisfying $\Sigma$. For every two tree tuples $t_1, t_2$ in $T$, if $t_1$ and $t_2$ agree on all paths in $X$ (that is, $t_1(q) = t_2(q)$ for all $q \in X$), then $t_1(p) = t_2(p) = v$.
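Returning to the count in the proof of Claim 3.2: the simplification from the raw count to the closed form can be checked with exact rational arithmetic. This is a minimal sketch, assuming the reading in which $tup$ is the total number of tuples (so the other tuples contribute an exponent of $tup - 1$) and $n = tup\,(m+2)$; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def non_forcing_count(m, tup):
    """Number of vectors that do not force a value for position p (assumed reading)."""
    return sum(
        2 * comb(m, i) * (2 ** (m + 1) + 2 ** (m - i + 1)) ** (tup - 1)
        for i in range(m + 1)
    )


def inf_by_counting(m, tup):
    """Count divided by 2^(n-1), with n = tup * (m + 2) positions."""
    n = tup * (m + 2)
    return Fraction(non_forcing_count(m, tup), 2 ** (n - 1))


def inf_closed_form(m, tup):
    """Closed form: sum of C(m, i) * (1 + 2^-i)^(tup-1), divided by 2^(tup+m-1)."""
    total = sum(
        comb(m, i) * (1 + Fraction(1, 2 ** i)) ** (tup - 1) for i in range(m + 1)
    )
    return total / 2 ** (tup + m - 1)


example = inf_closed_form(1, 2)  # Fraction(7, 8) for m = 1 and two tuples
```

With a single tuple both expressions evaluate to 1, which matches the intuition that a one-tuple instance carries no redundancy.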
Trivially, $t_1$ and $t_2$ also agree on every node $v'$ that is an ancestor of $v$ in $T$, so for every path $q$ that is a prefix of $p$ and every attribute $@m$ defined for $last(q)$, $T$ satisfies the FDs $X \to q$ and $X \to q.@m$. Therefore, $X \to q.@m$ is in $(D,\Sigma)^+$.

($\Leftarrow$) The proof of the other direction follows from the FDs resulting from the hierarchical representation of relational attributes. Suppose an FD $X \to p.@l$ is in $(D,\Sigma)^+$. Then for every prefix $q$ of $p$ and every attribute $@m$ defined for $last(q)$, the FD $X \to q.@m$ is also in $(D,\Sigma)^+$. Let $T$ be an arbitrary XML tree conforming to $D$ and satisfying $\Sigma$. If two tree tuples $t_1, t_2$ from $T$ agree on all the attributes of elements from the root to $last(p)$, they will agree on the nodes corresponding to element types from the root to $last(p)$ as well. This is because of the FDs of the form $\{p, p.\tau_i.@l_i\} \to p.\tau_i$ that are added to $\Sigma$ during the construction of $(D,\Sigma)$. Therefore, for all paths $q$ that are a prefix of $p$, the FD $X \to q$ is in $(D,\Sigma)^+$. In particular, $X \to p \in (D,\Sigma)^+$, and hence $(D,\Sigma)$ is in XNF.

Proof of Theorem 4

Consider the relational schema $R = (A,B,C,D,F)$ and the following set $F$ of FDs over it:

$ABCD \to F$
$FD \to A$
$FC \to B$

It was shown in Example 3 that we cannot find an ordering of the attributes that gives a hierarchical XNF representation of $(R,F)$. Now consider an arbitrary non-XNF hierarchical translation $(D,\Sigma)$ of $(R,F)$:

$E = \{r, A, B, C, D, F\}$.
$A = \{@a, @b, @c, @d, @f\}$.
$P(r) = A$, $P(A) = B$, $P(B) = C$, $P(C) = D$, $P(D) = F$, $P(F) = \epsilon$.
$R(r) = \emptyset$, $R(A) = \{@a\}$, $R(B) = \{@b\}$, $R(C) = \{@c\}$, $R(D) = \{@d\}$, $R(F) = \{@f\}$.
$\Sigma = \{$
  $\{r.A.@a\} \to r.A$,
  $\{r.A,\ r.A.B.@b\} \to r.A.B$,
  $\{r.A.B,\ r.A.B.C.@c\} \to r.A.B.C$,
  $\{r.A.B.C,\ r.A.B.C.D.@d\} \to r.A.B.C.D$,
  $\{r.A.B.C.D,\ r.A.B.C.D.F.@f\} \to r.A.B.C.D.F$,
  $\{r.A.@a,\ r.A.B.@b,\ r.A.B.C.@c,\ r.A.B.C.D.@d\} \to r.A.B.C.D.F.@f$,
  $\{r.A.B.C.D.F.@f,\ r.A.B.C.D.@d\} \to r.A.@a$,
  $\{r.A.B.C.D.F.@f,\ r.A.B.C.@c\} \to r.A.B.@b$
$\}$.

Now we informally show that no matter how we restructure $(D,\Sigma)$ into another XML specification $(D',\Sigma')$, either the new XML specification is not in XNF, or the FDs are not preserved. Suppose there is a dependency-preserving XNF decomposition $(D',\Sigma')$ obtained from the above $(D,\Sigma)$. Assume that $p_A.@a$, $p_B.@b$, $p_C.@c$, $p_D.@d$ and $p_F.@f$ in $paths(D')$ are mapped to $r.A.@a$, $r.A.B.@b$, $r.A.B.C.@c$, $r.A.B.C.D.@d$ and $r.A.B.C.D.F.@f$ in $paths(D)$, respectively.
Neither of $p_A$ and $p_B$ can be a prefix of the other, because otherwise one of the functional dependencies $\{p_F.@f, p_C.@c\} \to p_A.@a$ or $\{p_F.@f, p_D.@d\} \to p_B.@b$ would also be in $(D',\Sigma')^+$, which is not desirable. Therefore, letting $p_{AB}$ be the longest common prefix of $p_A$ and $p_B$, with $\tau = last(p_{AB})$, $A = last(p_A)$ and $B = last(p_B)$, one of the following cases must be true:

- For every XML tree $T \models (D',\Sigma')$ and every element node corresponding to $\tau$ in $T$, there is only one pair of nodes corresponding to elements $A$ and $B$, i.e. $p_{AB} \to \{p_A, p_B\}$. Since the decomposition is dependency-preserving, $\{p_F.@f, p_D.@d\} \to p_A.@a \in (D',\Sigma')^+$. Since $(D',\Sigma')$ is in XNF, $\{p_F.@f, p_D.@d\} \to p_A$ is also in $(D',\Sigma')^+$, and because $p_{AB}$ is a prefix of $p_A$, $\{p_F.@f, p_D.@d\} \to p_{AB} \in (D',\Sigma')^+$. Thus the FDs $\{p_F.@f, p_D.@d\} \to p_B$ and $\{p_F.@f, p_D.@d\} \to p_B.@b$, which are spurious, should be in $(D',\Sigma')^+$ as well.
- For every XML tree $T \models (D',\Sigma')$ and every element node corresponding to $\tau$ in $T$, there can be more than one pair of nodes corresponding to elements $A$ and $B$. Here, there is no mechanism to bind the $@a$'s and $@b$'s, so spurious tree tuples will appear. Since these two attributes participate in a single functional dependency, these extra tuples may cause the XML tree not to satisfy the functional dependency. Therefore, the decomposition cannot be dependency-preserving.

Proof of Proposition 1

We first need to prove the following claim.

Claim. The FD $A_{i_1} \ldots A_{i_k} \to A_i$ is in $F^+$ iff the FD $\{r.G.@A_{i_1}, \ldots, r.G.@A_{i_k}\} \to r.G.@A_i$ is in $(D_R,\Sigma_F)^+$.

The proof of this claim follows from the fact that for each instance $I$ of $R$, there is an XML tree $T_I$ conforming to $D_R$ such that $I \models F$ iff $T_I \models \Sigma_F$. Moreover, for each XML tree $T$ conforming to $D_R$ and satisfying the FD $\{r.G.@A_1, \ldots, r.G.@A_m\} \to r.G$, there is an instance $I_T$ of $R$ such that $T \models \Sigma_F$ iff $I_T \models F$.

Now we prove the proposition for the case of 3NF and X3NF.

($\Rightarrow$) Suppose that $(R,F)$ is in 3NF. We prove that $(D_R,\Sigma_F)$ is in X3NF. Suppose that there is a nontrivial FD $S \to r.G.@A_i$ in $(D_R,\Sigma_F)^+$. The element paths $r$ and $r.G$ cannot be in $S$, because if two tree tuples agree on element paths $r$ or $r.G$, they agree on every other path, and the FD will be trivially satisfied. Therefore, $S \to r.G.@A_i$ is of the form $\{r.G.@A_{i_1}, \ldots, r.G.@A_{i_k}\} \to r.G.@A_i$. By the above claim, there is an FD in $F^+$ of the form $A_{i_1} \ldots A_{i_k} \to A_i$.
Since $(R,F)$ is in 3NF, at least one of the following holds, and each implies that $(D_R,\Sigma_F)$ is in X3NF:

- The attributes $A_{i_1} \ldots A_{i_k}$ form a key, and hence for every $j \in [1,m]$, the FD $\{r.G.@A_{i_1}, \ldots, r.G.@A_{i_k}\} \to r.G.@A_j$ is in $(D_R,\Sigma_F)^+$. Since $(D_R,\Sigma_F)^+$ contains the FD $\{r.G.@A_1, \ldots, r.G.@A_m\} \to r.G$, it should also contain $\{r.G.@A_{i_1}, \ldots, r.G.@A_{i_k}\} \to r.G$. Therefore $S \to r.G \in (D_R,\Sigma_F)^+$.
- The attribute $A_i$ is prime, so there is an FD in $F^+$ of the form $A_{l_1} \ldots A_{l_t} A_i \to A_1 \ldots A_m$ whose left-hand side is a candidate key. Therefore, there is an FD $S' \to r.G$ in $(D_R,\Sigma_F)^+$ such that $r.G.@A_i \in S'$. Moreover, since $A_{l_1} \ldots A_{l_t}$ is not a key, $S' - \{r.G.@A_i\} \to r.G \notin (D_R,\Sigma_F)^+$. Thus $r.G.@A_i$ is prime.
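The relational side of this case analysis (for each FD, either the left-hand side is a key or the right-hand attribute is prime) can be sketched as a small checker. This is an illustration under assumptions, not the paper's procedure: the helper names are invented, candidate keys are found by brute force, and only the listed FDs are tested. The sample inputs are the schema from the proof of Theorem 4 and a hypothetical schema $R(A,B,C)$ with $B \to A$.

```python
from itertools import combinations


def fd(lhs, rhs):
    """Build an FD from two strings of single-letter attribute names."""
    return frozenset(lhs), frozenset(rhs)


def closure(attrs, fds):
    """Attribute closure under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def candidate_keys(sort, fds):
    """Brute-force minimal keys, enumerating subsets by increasing size."""
    keys = []
    for size in range(1, len(sort) + 1):
        for combo in combinations(sorted(sort), size):
            s = frozenset(combo)
            if closure(s, fds) >= sort and not any(k <= s for k in keys):
                keys.append(s)
    return keys


def violations_3nf(sort, fds):
    """Pairs (lhs, attribute) where lhs is not a superkey and the attribute is not prime."""
    prime = set().union(*candidate_keys(sort, fds))
    return [
        (lhs, a)
        for lhs, rhs in fds
        for a in sorted(rhs - lhs)
        if not closure(lhs, fds) >= sort and a not in prime
    ]


# Schema from the proof of Theorem 4.
THM4_SORT = frozenset("ABCDF")
THM4_FDS = [fd("ABCD", "F"), fd("FD", "A"), fd("FC", "B")]

# Hypothetical schema (an assumption): R(A, B, C) with B -> A.
TOY_SORT = frozenset("ABC")
TOY_FDS = [fd("B", "A")]
```

On the Theorem 4 schema the candidate keys come out as $\{A,B,C,D\}$ and $\{C,D,F\}$, so every attribute is prime and no violation is reported, while the toy schema reports $B \to A$.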

($\Leftarrow$) Suppose that $(D_R,\Sigma_F)$ is in X3NF. We prove that $(R,F)$ is in 3NF. Let $A_{i_1} \ldots A_{i_k} \to A_i$ be a nontrivial FD in $F^+$. Then there must be an FD $\{r.G.@A_{i_1}, \ldots, r.G.@A_{i_k}\} \to r.G.@A_i$ in $(D_R,\Sigma_F)^+$ by the above claim. Since $(D_R,\Sigma_F)$ is in X3NF, at least one of the following holds, and each implies that $(R,F)$ is in 3NF:

- The FD $\{r.G.@A_{i_1}, \ldots, r.G.@A_{i_k}\} \to r.G$ is in $(D_R,\Sigma_F)^+$, which easily implies that $A_{i_1} \ldots A_{i_k}$ is a key.
- The attribute path $r.G.@A_i$ is prime, so there is a nontrivial FD $S \to p$ in $(D_R,\Sigma_F)^+$ such that $r.G.@A_i \in S$, $S$ is minimal, and $p$ is an element path. Since this FD is not trivial, the element path $p$ cannot be $r$, and there is no element path in $S$, so $S \to p$ is of the form $\{r.G.@A_{l_1}, \ldots, r.G.@A_{l_t}, r.G.@A_i\} \to r.G$. Thus, for every $j \in [1,m]$, the FD $\{r.G.@A_{l_1}, \ldots, r.G.@A_{l_t}, r.G.@A_i\} \to r.G.@A_j$ is in $(D_R,\Sigma_F)^+$, and this implies that $A_{l_1} \ldots A_{l_t} A_i$ is a key for $R$. Since $S$ is minimal, $A_{l_1} \ldots A_{l_t} A_i$ is a candidate key, and hence $A_i$ is prime.

Proof of Proposition 2

Let $(D,\Sigma)$ be a hierarchical translation of the relational specification $(R,F)$ obtained from any ordering of the attributes in $R$. We first need to prove the following claims.

Claim. Any FD $S \to p$ in $(D,\Sigma)^+$ is equivalent to an FD of the form $S' \to p$ where $S'$ only contains attribute paths.

The proof of this claim follows from the following fact: for every XML tree $T$ conforming to $D$ and satisfying $\Sigma$, and every two tree tuples $t_1, t_2$ in $T$, $t_1$ and $t_2$ agree on an element path $q \in paths(D)$ iff they agree on every attribute path $q'.@m$ such that $q'$ is a prefix of $q$. This is because of the FDs of the form $\{p, p.\tau.@l\} \to p.\tau$ that are added to $\Sigma$ during the construction of $(D,\Sigma)$.

Claim. The FD $A_{i_1} \ldots A_{i_k} \to A_{i_{k+1}}$ is in $F^+$ iff the FD $\{p_{i_1}.@l_{i_1}, \ldots, p_{i_k}.@l_{i_k}\} \to p_{i_{k+1}}.@l_{i_{k+1}}$ is in $(D,\Sigma)^+$, where for every $j \in [1,k+1]$, the last element of $p_{i_j}$ corresponds to the relational attribute $A_{i_j}$.

The proof of this claim follows from the fact that in our translation of relational data into XML, tree tuples fully represent relational tuples.
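The remark that tree tuples fully represent relational tuples can be illustrated with a round trip through a nested structure. This is a rough sketch only: it models the XML tree as nested Python dictionaries with one node per distinct value prefix, which is an assumption suggested by the FDs $\{p, p.\tau.@l\} \to p.\tau$, and the names `nest` and `tree_tuples` are hypothetical.

```python
def nest(instance, order):
    """Group relational tuples into a nested tree, one level per attribute in `order`."""
    tree = {}
    for row in instance:
        node = tree
        for attr in order:
            # Siblings are identified by their attribute value, so equal prefixes share a node.
            node = node.setdefault((attr, row[attr]), {})
    return tree


def tree_tuples(tree, order):
    """Read each root-to-leaf path of the nested tree back as a relational tuple."""
    def walk(node, acc):
        if len(acc) == len(order):
            yield dict(acc)
            return
        for (attr, value), child in node.items():
            yield from walk(child, acc + [(attr, value)])

    return list(walk(tree, []))


# Hypothetical three-attribute instance (an assumption, not from the paper).
ORDER = ["A", "B", "C"]
ROWS = [
    {"A": 1, "B": 1, "C": 1},
    {"A": 1, "B": 2, "C": 1},
    {"A": 2, "B": 1, "C": 3},
]
TREE = nest(ROWS, ORDER)
BACK = tree_tuples(TREE, ORDER)
```

Each root-to-leaf path corresponds to exactly one relational tuple, so reading the paths back recovers the original instance as a set.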
Now we prove the proposition for the case of 3NF and X3NF. Suppose there is an FD of the form $S \to p.@l$ in $(D,\Sigma)^+$. By the first claim, this FD can be written as $\{p_{i_1}.@l_{i_1}, \ldots, p_{i_k}.@l_{i_k}\} \to p.@l$, and by the second claim, the FD $A_{i_1} \ldots A_{i_k} \to A$ should be in $F^+$. Since $(R,F)$ is in 3NF, one of the following cases holds, and each implies that $(D,\Sigma)$ is in X3NF:

- $A_{i_1} \ldots A_{i_k}$ forms a key for $R$, i.e. the values of $A_{i_1} \ldots A_{i_k}$ identify a unique tuple in any instance $I$ of $(R,F)$. Every relational tuple corresponds to a single tree tuple in the XML tree $T$ that represents the relational instance $I$. This means that if two tree tuples $t_1, t_2$ in $T$ agree on $\{p_{i_1}.@l_{i_1}, \ldots, p_{i_k}.@l_{i_k}\}$, they are equal. Thus, $\{p_{i_1}.@l_{i_1}, \ldots, p_{i_k}.@l_{i_k}\}$ implies every element path $q$, and in particular $p$. Therefore, $S \to p$ is in $(D,\Sigma)^+$.
- $A$ is prime. Equivalently, $p.@l$ is contained in a minimal set $S'$ that implies every element path $q$, and hence $p.@l$ is a prime attribute path.
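The hierarchical translation used in the proofs of Theorem 4 and Proposition 2 follows a regular pattern that can be generated from an attribute ordering. The sketch below is modeled on the $\Sigma$ listed in the proof of Theorem 4 and is an assumption about the general construction, which is defined in the main paper rather than in this appendix: one structural FD per nesting level, plus one translated FD per relational FD. The function name and the convention that element $X$ carries attribute `@x` are illustrative choices.

```python
def hierarchical_sigma(order, fds, root="r"):
    """Generate XML FDs for a chain r.X1.X2... built from `order`.

    `fds` holds (lhs, rhs) pairs: lhs is an iterable of attribute names, rhs one name.
    Returns a list of (frozenset_of_paths, path) pairs.
    """
    elem_path = {}
    prefix = root
    for name in order:
        prefix = f"{prefix}.{name}"
        elem_path[name] = prefix
    attr_path = {name: f"{elem_path[name]}.@{name.lower()}" for name in order}

    sigma = []
    parent = root
    for name in order:
        # Structural FD {p, p.X.@x} -> p.X; the root is omitted from the left-hand
        # side, as in the first FD listed in the proof of Theorem 4.
        lhs = {attr_path[name]} if parent == root else {parent, attr_path[name]}
        sigma.append((frozenset(lhs), elem_path[name]))
        parent = elem_path[name]

    for lhs, rhs in fds:
        # Each relational FD becomes an FD between the corresponding attribute paths.
        sigma.append((frozenset(attr_path[a] for a in lhs), attr_path[rhs]))
    return sigma


SIGMA = hierarchical_sigma(
    ["A", "B", "C", "D", "F"],
    [("ABCD", "F"), ("FD", "A"), ("FC", "B")],
)
```

For the ordering $A, B, C, D, F$ this produces eight FDs, five structural and three translated, in the order they are listed in the proof of Theorem 4.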