Schema Refinement
Yanlei Diao, UMass Amherst
Slides courtesy of R. Ramakrishnan and J. Gehrke

Revisit a Previous Example
[ER diagram: Employees with an ISA hierarchy over Hourly_Emps and Contract_Emps; attribute labels shown: ssn, name, lot, rating, hourly_wages, hours_worked, contractid]
- Consider the relation obtained from Hourly_Emps:
  - Hourly_Emps(ssn, name, lot, rating, hrly_wages, hrs_worked)
  - Denote the schema by listing the initials of all its attributes: SNLRWH

Example (Contd.)

  S            N          L   R  W   H
  123-22-3666  Attishoo   48  8  10  40
  231-31-5368  Smiley     22  8  10  30
  131-24-3650  Smethurst  35  5  7   30
  434-26-3751  Guldu      35  5  7   32
  612-67-4134  Madayan    35  8  10  40

- Rating (R) determines hourly wages (W):
  - Redundant storage: the same (R, W) pair is stored once per employee.
  - Update: can we change W in just the first tuple with rating 8?
  - Insertion: how do we insert an employee without knowing the hourly wage for that employee's rating? How do we insert the hourly wage for rating 10 when no employee has it?
  - Deletion: what if we delete all employees with rating 5? The wage for rating 5 is lost as well.

Will Two Smaller Tables be Better?

Hourly_Emps
  S            N          L   R  W   H
  123-22-3666  Attishoo   48  8  10  40
  231-31-5368  Smiley     22  8  10  30
  131-24-3650  Smethurst  35  5  7   30
  434-26-3751  Guldu      35  5  7   32
  612-67-4134  Madayan    35  8  10  40

Hourly_Emps2
  S            N          L   R  H
  123-22-3666  Attishoo   48  8  40
  231-31-5368  Smiley     22  8  30
  131-24-3650  Smethurst  35  5  30
  434-26-3751  Guldu      35  5  32
  612-67-4134  Madayan    35  8  40

Wages
  R  W
  8  10
  5  7
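
To make the question on this slide concrete, here is a minimal Python sketch (not part of the original slides) that projects the Hourly_Emps rows onto SNLRH and RW and then joins the two pieces back on R. The variable names are illustrative assumptions; the data is copied from the tables above.

```python
# Hourly_Emps rows as (S, N, L, R, W, H), copied from the slide.
hourly_emps = [
    ("123-22-3666", "Attishoo", 48, 8, 10, 40),
    ("231-31-5368", "Smiley", 22, 8, 10, 30),
    ("131-24-3650", "Smethurst", 35, 5, 7, 30),
    ("434-26-3751", "Guldu", 35, 5, 7, 32),
    ("612-67-4134", "Madayan", 35, 8, 10, 40),
]

# Project onto SNLRH (drop W) and onto RW; sets remove duplicate tuples.
hourly_emps2 = {(s, n, lot, r, h) for (s, n, lot, r, w, h) in hourly_emps}
wages = {(r, w) for (_, _, _, r, w, _) in hourly_emps}

# Natural join of the two projections on the shared attribute R.
rejoined = {
    (s, n, lot, r, w, h)
    for (s, n, lot, r, h) in hourly_emps2
    for (r2, w) in wages
    if r == r2
}

print(sorted(wages))                 # each (R, W) pair is now stored once
print(rejoined == set(hourly_emps))  # the join recovers the original rows
```

On this instance the Wages table holds two rows instead of five repeated (R, W) pairs, and the join gives back exactly the original relation.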

The Evils of Redundancy
- Redundant storage causes several operation anomalies: update, insert, and delete anomalies.
- Functional dependencies, a new type of integrity constraint (IC), can be used to identify schemas with such problems.
  - ICs we have seen: domain constraints, key constraints, foreign key constraints, general constraints (checks or assertions).
  - Functional dependencies are a new type of IC, which can be checked using SQL checks or assertions.

1. Functional Dependencies (FDs)
- A functional dependency X → Y holds over relation R if:
  - X and Y are two sets of attributes of R;
  - for every allowable instance r of R: for all t1 ∈ r, t2 ∈ r, π_X(t1) = π_X(t2) implies π_Y(t1) = π_Y(t2).
  - E.g., in Hourly_Emps, X = {R} (rating), Y = {W} (hourly wages), and R → W.
- An FD must hold for all allowable instances of a schema.
- A key constraint is a special form of FD:
  - K is a superkey for R means that K → R.
  - K → R does not require K to be minimal!

FDs in the Hourly_Emps Example
- Hourly_Emps(ssn, name, lot, rating, hrly_wages, hrs_worked)
  - Denoted by SNLRWH
- Some FDs on Hourly_Emps:
  - ssn is the key: S → SNLRWH
  - rating determines hrly_wages: R → W
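
The definition of an FD can be tested against a single instance. The sketch below is an illustration, not part of the slides, and the helper name `fd_holds` is an assumption. Note that an instance can only refute an FD: an FD is a statement about all allowable instances, so passing this check on one instance does not prove it.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}  # maps each lhs value to the first rhs value seen with it
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        if seen.setdefault(x, y) != y:
            return False
    return True


COLUMNS = "SNLRWH"
DATA = [
    ("123-22-3666", "Attishoo", 48, 8, 10, 40),
    ("231-31-5368", "Smiley", 22, 8, 10, 30),
    ("131-24-3650", "Smethurst", 35, 5, 7, 30),
    ("434-26-3751", "Guldu", 35, 5, 7, 32),
    ("612-67-4134", "Madayan", 35, 8, 10, 40),
]
rows = [dict(zip(COLUMNS, t)) for t in DATA]

print(fd_holds(rows, "S", COLUMNS))  # True: S -> SNLRWH is not violated
print(fd_holds(rows, "R", "W"))      # True: R -> W is not violated
print(fd_holds(rows, "L", "R"))      # False: lot 35 appears with ratings 5 and 8
```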

Reasoning About FDs
- Given some FDs, we can usually infer additional FDs:
  - {ssn → did, did → building} implies ssn → building.
- Given a set of FDs F, the closure of F, written F+, is the set of all FDs that are implied by F.
  - All FDs in F+ hold over the relation R.

Axioms and Rules
- Armstrong's Axioms (X, Y, Z are sets of attributes):
  - Reflexivity: If X ⊆ Y, then Y → X
  - Augmentation: If X → Y, then XZ → YZ for any Z
  - Transitivity: If X → Y and Y → Z, then X → Z
- A couple of additional rules (that follow from Armstrong's Axioms):
  - Union: If X → Y and X → Z, then X → YZ
  - Decomposition: If X → YZ, then X → Y and X → Z
- Computing the closure F+ using the axioms/rules:
  - Compute for all FDs.
  - The size of the closure is exponential in the number of attributes!
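
The slide states that Union and Decomposition follow from Armstrong's Axioms without showing why. One possible derivation, written out as a worked LaTeX block (the layout is an editorial sketch; the steps use only the three axioms listed above):

```latex
\begin{align*}
\textbf{Union.}\quad & \text{Given } X \to Y \text{ and } X \to Z: \\
& X \to XY  && \text{(augment } X \to Y \text{ with } X\text{; note } XX = X\text{)} \\
& XY \to YZ && \text{(augment } X \to Z \text{ with } Y\text{)} \\
& X \to YZ  && \text{(transitivity)} \\[6pt]
\textbf{Decomposition.}\quad & \text{Given } X \to YZ: \\
& YZ \to Y,\quad YZ \to Z && \text{(reflexivity, since } Y \subseteq YZ \text{ and } Z \subseteq YZ\text{)} \\
& X \to Y,\quad X \to Z   && \text{(transitivity)}
\end{align*}
```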

Attribute Closure
- What if we just want to check whether a given FD X → Y is in F+?
- Simple algorithm for the attribute closure X+:

    X+ := X
    DO
        if there is U → V in F such that U ⊆ X+, then X+ := X+ ∪ V
    UNTIL no change

  - To check whether a given FD X → Y is in F+, simply check whether Y ⊆ X+.
- Does F = {A → B, B → C, CD → E} imply A → E?
  - Is A → E in the closure F+? Equivalently, is E in A+?
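
The attribute closure loop translates almost directly into code. The following is a minimal Python sketch under the assumption that attributes are single characters and each FD is a `(lhs, rhs)` pair of strings; the function name `attribute_closure` is illustrative. It also answers the question posed on the slide.

```python
def attribute_closure(attrs, fds):
    """Compute X+ for the attribute set attrs under the FDs in fds."""
    closure = set(attrs)
    changed = True
    while changed:  # the DO ... UNTIL no change loop
        changed = False
        for lhs, rhs in fds:
            # U -> V applies when U is already inside the closure.
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure


F = [("A", "B"), ("B", "C"), ("CD", "E")]

a_plus = attribute_closure("A", F)
print(sorted(a_plus))   # ['A', 'B', 'C']
print("E" in a_plus)    # False: F does not imply A -> E

ad_plus = attribute_closure("AD", F)
print("E" in ad_plus)   # True: F does imply AD -> E
```

Since CD → E needs D on the left-hand side and nothing derives D from A, E never enters A+, so A → E is not in F+.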

2. Normal Forms
- Role of FDs in detecting redundancy, for a relation R(A,B,C):
  - No FDs (except key constraints) hold: no redundancy here.
  - Given A → B: two tuples that have the same A value will have the same B value!
- Normal forms: if a relation does not have certain kinds of FDs, certain redundancy-related problems are known not to occur.

Boyce-Codd Normal Form (BCNF)
- Rewrite every FD in the form X → A, where X is a set of attributes and A is a single attribute.
  - Use the decomposition rule.
- Relation R with FDs F is in BCNF if, for every X → A in F+:
  1. A ∈ X (called a trivial FD), or
  2. X is a superkey (i.e., contains a key) for R.
- In BCNF, the only non-trivial FDs are key constraints!
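
The BCNF condition can be checked mechanically with attribute closures. The sketch below is an illustration with assumed helper names (`closure`, `bcnf_violations`). It relies on a standard fact: for the original relation R with FD set F, it is enough to test the FDs in F rather than all of F+. That shortcut does not carry over to relations produced by a decomposition, where the projected FDs must be considered.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds (pairs of attribute strings)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(schema, fds):
    """Return the FDs in fds that are non-trivial and whose LHS is not a superkey."""
    schema = set(schema)
    violations = []
    for lhs, rhs in fds:
        trivial = set(rhs) <= set(lhs)
        superkey = closure(lhs, fds) >= schema
        if not trivial and not superkey:
            violations.append((lhs, rhs))
    return violations


hourly_fds = [("S", "SNLRWH"), ("R", "W")]
print(bcnf_violations("SNLRWH", hourly_fds))  # [('R', 'W')]
print(bcnf_violations("RW", [("R", "W")]))    # []
```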

Boyce-Codd Normal Form (contd.)
- Can we infer the value marked by "?"?

  X  Y   A
  x  y1  a
  x  y2  ?

  - If X → A holds, the value must be a, but then the relation is not in BCNF: the two tuples agree on X and differ on Y, so X is not a superkey.
  - A relation in BCNF can't have such an X → A.
- Relation in BCNF:
  - Every field of every tuple records information that can't be inferred using FDs from other fields.
  - No redundancy can be detected using FDs!

Another Example
- When is redundancy possible?
  - Reserves(Sailor, Boat, Date, Credit_card), written SBDC, with S → C and C → S.
  - Keys are SBD and CBD.
  - It is not in BCNF: neither S nor C is a superkey.
  - For each reservation of sailor S, the same (S, C) pair is stored.

3. Decomposing a Relation Scheme
- A decomposition of R breaks R into two or more relations such that:
  - Each new relation contains a subset of the attributes of R.
  - Every attribute of R appears in at least one new relation.
- Decompositions should be used only when:
  - R has redundancy-related problems (it is not in BCNF), and
  - we can afford the joins in later queries (a performance penalty).

Example Decomposition
- Hourly_Emps (SNLRWH)
  - FDs: S → SNLRWH and R → W.
  - R → W violates BCNF, and it causes repeated (R, W) storage.
- To fix this, create a relation RW and remove W from the main schema:
  - (SNLRWH) is decomposed into (SNLRH) and (RW).

(1) Lossless Join Decompositions
- A decomposition of schema R into R1 and R2 is lossless-join w.r.t. a set of FDs F if, for every relation instance r that satisfies F:
  - r = π_R1(r) ⋈ π_R2(r)
- It is always true that r ⊆ π_R1(r) ⋈ π_R2(r).
  - A bad decomposition can cause r ⊂ π_R1(r) ⋈ π_R2(r): the join contains spurious tuples.

  r             π_AB(r)    π_BC(r)    π_AB(r) ⋈ π_BC(r)
  A  B  C       A  B       B  C       A  B  C
  1  2  3       1  2       2  3       1  2  3
  4  5  6       4  5       5  6       4  5  6
  7  2  8       7  2       2  8       7  2  8
                                      1  2  8
                                      7  2  3
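
The spurious tuples in the table above can be reproduced in a few lines. This is a minimal sketch, with an assumed helper name `project`, that projects the slide's instance onto AB and BC and joins the projections on B.

```python
def project(rows, schema, attrs):
    """Project a set of tuples over schema onto attrs, removing duplicates."""
    idx = [schema.index(a) for a in attrs]
    return {tuple(row[i] for i in idx) for row in rows}


# Instance r over schema ABC, copied from the slide.
r = {(1, 2, 3), (4, 5, 6), (7, 2, 8)}

ab = project(r, "ABC", "AB")
bc = project(r, "ABC", "BC")

# Natural join of the two projections on the shared attribute B.
joined = {(a, b, c) for (a, b) in ab for (b2, c) in bc if b == b2}

print(r <= joined)         # True: the original tuples are always recovered
print(sorted(joined - r))  # [(1, 2, 8), (7, 2, 3)]: spurious tuples
```

The value B = 2 appears in two tuples with different A and C values, so joining on B pairs each A with each C and invents two tuples that were never in r.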

A Simple Test for Lossless Join
- Theorem: a decomposition of R into R1 and R2 is lossless-join w.r.t. F iff F+ contains:
  - R1 ∩ R2 → R1, or
  - R1 ∩ R2 → R2
  (the intersection of R1 and R2 is a (super)key of one of them).
- The algorithm for a lossless-join decomposition is based on the above result:
  - If U → V holds over R and violates the BCNF definition, the decomposition into UV and R − V is lossless-join.
  - Reason: with U and V disjoint, the intersection of UV and R − V is U, and U → UV.
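
The theorem turns into a short closure test: compute the closure of R1 ∩ R2 and see whether it covers R1 or R2. The sketch below uses assumed helper names (`closure`, `is_lossless`) and single-character attributes; it is an illustration of the test, not code from the slides.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds (pairs of attribute strings)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_lossless(r1, r2, fds):
    """True if (R1 ∩ R2) -> R1 or (R1 ∩ R2) -> R2 is in F+."""
    common_closure = closure(set(r1) & set(r2), fds)
    return set(r1) <= common_closure or set(r2) <= common_closure


hourly_fds = [("S", "SNLRWH"), ("R", "W")]
print(is_lossless("SNLRH", "RW", hourly_fds))  # True: R -> RW

# ABC split into AB and BC, as in the spurious-tuple example.
print(is_lossless("AB", "BC", []))             # False: B determines nothing
print(is_lossless("AB", "BC", [("B", "C")]))   # True: B -> BC
```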

(2) Dependency Preserving Decomposition
- Contracts(Contractid, Supplierid, projectid, Deptid, Partid, Qty, Value), written CSJDPQV, with FDs:
  - C is the key.
  - JP → C: a project buys a given part using a single contract.
  - SD → P: a department buys at most one part from a supplier.
- What are the keys? Is it in BCNF?
  - Keys: C, JP, SDJ.
  - Not in BCNF due to SD → P.
- Lossless-join BCNF decomposition: SDP, CSJDQV
  - Problem: checking JP → C requires an assertion (using a join).
  - How to write this assertion in SQL?
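
The keys listed for Contracts can be confirmed by brute force: try attribute subsets in increasing size and keep those whose closure is the whole schema and that contain no smaller key. This is a sketch with assumed helper names (`closure`, `candidate_keys`); it is exponential in the number of attributes, which is acceptable for a seven-attribute schema.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of attrs under fds (pairs of attribute strings)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(schema, fds):
    """Return all minimal attribute sets whose closure covers schema."""
    schema = set(schema)
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            attrs = set(combo)
            if any(key <= attrs for key in keys):
                continue  # contains a smaller key, so not minimal
            if closure(attrs, fds) >= schema:
                keys.append(attrs)
    return ["".join(sorted(key)) for key in keys]


contracts_fds = [("C", "CSJDPQV"), ("JP", "C"), ("SD", "P")]
print(candidate_keys("CSJDPQV", contracts_fds))  # ['C', 'JP', 'DJS']
```

The result matches the slide: C, JP, and SDJ (printed here in sorted order as DJS).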

Dependency Preserving Decomposition
- The projection of an FD set F onto a decomposed relation R1, written F_R1, contains all U → V such that:
  (i) U and V are both in R1, and
  (ii) U → V is in the closure F+.
- A decomposition of R into R1 and R2 is dependency preserving if (F_R1 ∪ F_R2)+ = F+.
- It is important to consider F+ (not F!) in this definition:
  - ABC with A → B, B → C, C → A, decomposed into AB and BC.
  - Is this dependency preserving? Is C → A preserved?
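
Whether a single FD survives a decomposition can be tested without materializing F+. The sketch below uses a standard textbook technique, stated here as an assumption since the slides do not give an algorithm: grow the left-hand side by repeatedly taking closures restricted to each decomposed relation, then check whether the right-hand side was reached. Helper names (`closure`, `is_preserved`) are illustrative.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds (pairs of attribute strings)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_preserved(fd, decomposition, fds):
    """True if fd follows from the FDs projected onto the decomposed relations."""
    lhs, rhs = fd
    result = set(lhs)
    changed = True
    while changed:
        changed = False
        for rel in decomposition:
            rel = set(rel)
            # What the projected FDs on rel let us derive from result.
            gained = closure(result & rel, fds) & rel
            if not gained <= result:
                result |= gained
                changed = True
    return set(rhs) <= result


abc_fds = [("A", "B"), ("B", "C"), ("C", "A")]
print(is_preserved(("C", "A"), ["AB", "BC"], abc_fds))  # True

contracts_fds = [("C", "CSJDPQV"), ("JP", "C"), ("SD", "P")]
print(is_preserved(("JP", "C"), ["SDP", "CSJDQV"], contracts_fds))  # False
```

In the ABC example, C → A is preserved even though A and C never appear together: F+ contains C → B (which lives in BC) and B → A (which lives in AB). Looking only at F, not F+, would miss both.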

Algorithm for Decomposition into BCNF
- Relation R with FDs F.
  - If X → Y violates BCNF, decompose R into R1 = XY and R2 = R − Y.
  - For each Ri, compute F_Ri and check whether Ri is in BCNF.
  - If not, pick an FD violating BCNF and keep decomposing Ri.
- Repeated application of this process yields a lossless-join decomposition into BCNF relations.
  - But it is not necessarily dependency-preserving!
  - So we need to check whether some FDs are lost.
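
The decomposition loop can be sketched recursively. The version below is an illustration under stated assumptions, not the slides' own code: instead of computing F_Ri explicitly, it searches the subsets X of each Ri and uses the closure of X under the original F, restricted to Ri, to find a violating FD. That search is exponential in the size of Ri, and the helper names (`closure`, `find_violation`, `bcnf_decompose`) are hypothetical.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of attrs under fds (pairs of attribute strings)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def find_violation(rel, fds):
    """Return (X, Y) with X -> Y violating BCNF on rel, or None."""
    rel = set(rel)
    for size in range(1, len(rel)):
        for combo in combinations(sorted(rel), size):
            x = set(combo)
            y = (closure(x, fds) & rel) - x  # what X determines inside rel
            if y and x | y != rel:           # non-trivial and X is not a superkey
                return x, y
    return None


def bcnf_decompose(rel, fds):
    """Split rel into XY and rel - Y until no relation has a violation."""
    rel = set(rel)
    violation = find_violation(rel, fds)
    if violation is None:
        return [rel]
    x, y = violation
    return bcnf_decompose(x | y, fds) + bcnf_decompose(rel - y, fds)


hourly_fds = [("S", "SNLRWH"), ("R", "W")]
result = bcnf_decompose("SNLRWH", hourly_fds)
print(sorted("".join(sorted(r)) for r in result))  # ['HLNRS', 'RW']
```

For Hourly_Emps the sketch reproduces the decomposition into SNLRH and RW. For schemas with several violating FDs, the relations it returns depend on which violation is found first, so its output need not match a decomposition worked by hand.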

Steps of BCNF Decomposition
- Contracts(CSJDPQV), key C, JP → C, SD → P, J → S.
  1. Keys: C, JP, DJ.
  2. Normal form: not in BCNF; SD → P and J → S violate BCNF.
  3. Decomposition: to deal with SD → P, decompose into SDP and CSJDQV.
     - SDP is in BCNF, but CSJDQV is not, because:
       1. Projection of FDs and keys: keys C and DJ, and J → S.
       2. Normal form: not in BCNF; J → S violates BCNF.
       3. Decomposition: for J → S, decompose CSJDQV into JS and CJDQV.
     - JS is in BCNF. So is CJDQV.
- If several FDs violate BCNF, the order of dealing with them could lead to very different sets of relations!

BCNF and Dependency Preservation
- Is a lossless-join BCNF decomposition dependency-preserving?
  - CSJDPQV with key C, JP → C, SD → P, and J → S.
  - CSJDPQV is decomposed into SDP, JS, and CJDQV.
  - What about the lost FD, JP → C?
    a) Use an assertion. Bad performance on inserts/updates.
    b) Add JPC as a new relation to preserve JP → C. This introduces redundancy across relations. In general, the new relation may not be in BCNF.
    c) Don't decompose. Deal with insertion/deletion anomalies.
  - The DBA should examine the tradeoffs.
- There may not exist a lossless-join, dependency-preserving decomposition into BCNF.
  - But there is always a lossless-join, dependency-preserving decomposition into 3NF (see the textbook for more).