DESIGN THEORY FOR RELATIONAL DATABASES. csc343, Introduction to Databases Renée J. Miller and Fatemeh Nargesian and Sina Meraji Winter 2018

Similar documents
Chapter 3 Design Theory for Relational Databases

Information Systems for Engineers. Exercise 8. ETH Zurich, Fall Semester Hand-out Due

Design theory for relational databases

Design Theory for Relational Databases

Relational Database Design

Relational Design Theory

CSC 261/461 Database Systems Lecture 13. Spring 2018

CS54100: Database Systems

Lossless Joins, Third Normal Form

SCHEMA NORMALIZATION. CS 564- Fall 2015

Database Design and Implementation

Functional Dependencies and Normalization

Design Theory for Relational Databases

Chapter 3 Design Theory for Relational Databases

UVA UVA UVA UVA. Database Design. Relational Database Design. Functional Dependency. Loss of Information

CSC 261/461 Database Systems Lecture 11

Design Theory for Relational Databases. Spring 2011 Instructor: Hassan Khosravi

Normal Forms. Dr Paolo Guagliardo. University of Edinburgh. Fall 2016

Lectures 6. Lecture 6: Design Theory

10/12/10. Outline. Schema Refinements = Normal Forms. First Normal Form (1NF) Data Anomalies. Relational Schema Design

Design Theory. Design Theory I. 1. Normal forms & functional dependencies. Today s Lecture. 1. Normal forms & functional dependencies

Normalization. October 5, Chapter 19. CS445 Pacific University 1 10/05/17

CS122A: Introduction to Data Management. Lecture #13: Relational DB Design Theory (II) Instructor: Chen Li

CSE 344 AUGUST 6 TH LOSS AND VIEWS

Consider a relation R with attributes ABCDEF GH and functional dependencies S:

CSC 261/461 Database Systems Lecture 10 (part 2) Spring 2018

Consider a relation R with attributes ABCDEF GH and functional dependencies S:

Introduction. Normalization. Example. Redundancy. What problems are caused by redundancy? What are functional dependencies?

INF1383 -Bancos de Dados

Chapter 11, Relational Database Design Algorithms and Further Dependencies

CSC 261/461 Database Systems Lecture 8. Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101

Schema Refinement and Normal Forms

FUNCTIONAL DEPENDENCY THEORY. CS121: Relational Databases Fall 2017 Lecture 19

CMPT 354: Database System I. Lecture 9. Design Theory

CS 186, Fall 2002, Lecture 6 R&G Chapter 15

Normal Forms (ii) ICS 321 Fall Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa

DECOMPOSITION & SCHEMA NORMALIZATION

Relational Database Design Theory Part II. Announcements (October 12) Review. CPS 116 Introduction to Database Systems

Functional Dependency Theory II. Winter Lecture 21

Relational Database Design

Schema Refinement and Normal Forms. The Evils of Redundancy. Schema Refinement. Yanlei Diao UMass Amherst April 10, 2007

Schema Refinement & Normalization Theory

Introduction to Data Management. Lecture #7 (Relational DB Design Theory II)

Functional Dependency and Algorithmic Decomposition

Introduction to Data Management CSE 344

Constraints: Functional Dependencies

Functional Dependencies. Applied Databases. Not all designs are equally good! An example of the bad design

CS322: Database Systems Normalization

Schema Refinement and Normal Forms

CSC 261/461 Database Systems Lecture 12. Spring 2018

CSE 544 Principles of Database Management Systems

CS 464/564 Introduction to Database Management System Instructor: Abdullah Mueen

Schema Refinement and Normal Forms. Why schema refinement?

Lecture #7 (Relational Design Theory, cont d.)

Databases 2012 Normalization

Schema Refinement and Normal Forms. Case Study: The Internet Shop. Redundant Storage! Yanlei Diao UMass Amherst November 1 & 6, 2007

Relational Design Theory II. Detecting Anomalies. Normal Forms. Normalization

11/1/12. Relational Schema Design. Relational Schema Design. Relational Schema Design. Relational Schema Design (or Logical Design)

CSE 303: Database. Outline. Lecture 10. First Normal Form (1NF) First Normal Form (1NF) 10/1/2016. Chapter 3: Design Theory of Relational Database

CSE 344 AUGUST 3 RD NORMALIZATION

Practice and Applications of Data Management CMPSCI 345. Lecture 16: Schema Design and Normalization

Schema Refinement and Normal Forms

FUNCTIONAL DEPENDENCY THEORY II. CS121: Relational Databases Fall 2018 Lecture 20

11/6/11. Relational Schema Design. Relational Schema Design. Relational Schema Design. Relational Schema Design (or Logical Design)

The Evils of Redundancy. Schema Refinement and Normalization. Functional Dependencies (FDs) Example: Constraints on Entity Set. Refining an ER Diagram

Introduction to Data Management CSE 344

CSE 344 MAY 16 TH NORMALIZATION

The Evils of Redundancy. Schema Refinement and Normal Forms. Example: Constraints on Entity Set. Functional Dependencies (FDs) Example (Contd.

The Evils of Redundancy. Schema Refinement and Normal Forms. Example: Constraints on Entity Set. Functional Dependencies (FDs) Refining an ER Diagram

Schema Refinement and Normal Forms

Schema Refinement and Normal Forms. The Evils of Redundancy. Functional Dependencies (FDs) CIS 330, Spring 2004 Lecture 11 March 2, 2004

Chapter 10. Normalization Ext (from E&N and my editing)

Chapter 7: Relational Database Design

Schema Refinement and Normal Forms. Chapter 19

Relational-Database Design

Schema Refinement and Normal Forms. The Evils of Redundancy. Functional Dependencies (FDs) [R&G] Chapter 19

Constraints: Functional Dependencies

The Evils of Redundancy. Schema Refinement and Normal Forms. Functional Dependencies (FDs) Example: Constraints on Entity Set. Example (Contd.

Schema Refinement. Yanlei Diao UMass Amherst. Slides Courtesy of R. Ramakrishnan and J. Gehrke

Schema Refinement and Normal Forms Chapter 19

HKBU: Tutorial 4

Chapter 7: Relational Database Design. Chapter 7: Relational Database Design

CSE 132B Database Systems Applications

CS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #15: BCNF, 3NF and Normaliza:on

Normal Forms 1. ICS 321 Fall Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa

Normaliza)on and Func)onal Dependencies

Introduction to Management CSE 344

Chapter 8: Relational Database Design

Information Systems (Informationssysteme)

Schema Refinement. Feb 4, 2010

Functional Dependencies and Normalization

Relational Design: Characteristics of Well-designed DB

Functional Dependencies and Normalization. Instructor: Mohamed Eltabakh

Review: Keys. What is a Functional Dependency? Why use Functional Dependencies? Functional Dependency Properties

Normal Forms Lossless Join.

Database Design and Normalization

Design Theory: Functional Dependencies and Normal Forms, Part I Instructor: Shel Finkelstein

Introduction to Database Systems CSE 414. Lecture 20: Design Theory

Schema Refinement & Normalization Theory: Functional Dependencies INFS-614 INFS614, GMU 1

CAS CS 460/660 Introduction to Database Systems. Functional Dependencies and Normal Forms 1.1

Transcription:

DESIGN THEORY FOR RELATIONAL DATABASES csc343, Introduction to Databases Renée J. Miller and Fatemeh Nargesian and Sina Meraji Winter 2018 1

Introduction There are always many different schemas for a given set of data. E.g., you could combine or divide tables. How do you pick a schema? Which is better? What does better mean? Fortunately, there are some principles to guide us. 2

Database Design Theory It allows us to improve a schema systematically. General idea: Express constraints on the relationships between attributes Use these to decompose the relations Ultimately, get a schema that is in a normal form that guarantees good properties. Normal in the sense of conforming to a standard. The process of converting a schema to a normal form is called normalization. 3

Part I: Functional Dependency Theory 4

A poorly designed table part manufacturer manaddress seller selleraddress price 1983 Hammers R Us 99 Pinecrest ABC 1229 Bloor W 5.59 8624 Lee Valley 102 Vaughn ABC 1229 Bloor W 23.99 9141 Hammers R Us 99 Pinecrest ABC 1229 Bloor W 12.50 1983 Hammers R Us 99 Pinecrest Walmart 5289 St Clair W 4.99 In any domain, there may be relationships between attribute values. Perhaps: Every part has 1 manufacturer Every manufacture has 1 address Every seller has 1 address If so, this table will have redundant data. 5

Principle: Avoid redundancy Redundant data can lead to anomalies. part manufacturer manaddress seller selleraddress price 1983 Hammers R Us 99 Pinecrest ABC 1229 Bloor W 5.59 8624 Lee Valley 102 Vaughn ABC 1229 Bloor W 23.99 9141 Hammers R Us 99 Pinecrest ABC 1229 Bloor W 12.50 1983 Hammers R Us 99 Pinecrest Walmart 5289 St Clair W 4.99 Update anomaly: if Hammers R Us moves and we update only one tuple, the data is inconsistent. Deletion anomaly: If ABC stops selling part 8624 and Lee Valley makes only that one part, we lose track of its 6 address.

Definition of FD Suppose R is a relation, and X and Y are subsets of the attributes of R. X Y asserts that: If two tuples agree on all the attributes in set X, they must also agree on all the attributes in set Y. We say that X Y holds in R, or X functionally determines Y. An FD constrains what can go in a relation. 7

More formally A B means: tuples t 1, t 2, (t 1 [A] = t 2 [A]) (t 1 [B] = t 2 [B]) Or equivalently: tuples t 1, t 2 such that (t 1 [A] = t 2 [A]) (t 1 [B] t 2 [B]) 8

Generalization to multiple attributes A 1 A 2 A m B 1 B 2 B n means: tuples t 1, t 2, (t 1 [A 1 ] = t 2 [A 1 ] t 1 [A m ] = t 2 [A m ]) (t 1 [B 1 ] = t 2 [B 1 ] t 1 [B n ] = t 2 [B n ] ) Or equivalently: tuples t 1, t 2 such that (t 1 [A 1 ] = t 2 [A 1 ] t 1 [A m ] = t 2 [A m ]) (t 1 [B 1 ] = t 2 [B 1 ] t 1 [B n ] = t 2 [B n ] ) 9

Why functional dependency? dependency because the value of Y depends on the value of X functional because there is a mathematical function that takes a value for X and gives a unique value for Y 10

Equivalent sets of FDs When we write a set of FDs, we mean that all of them hold. We can very often rewrite sets of FDs in equivalent ways. When we say S 1 is equivalent to S 2 we mean that: S 1 holds in a relation iff S 2 does. 11

Splitting rules for FDs Can we split the RHS of an FD and get multiple, equivalent FDs? Can we split the LHS of an FD and get multiple, equivalent FDs? 12

Coincidence or FD? An FD is an assertion about every instance of the relation. You can t know it holds just by looking at one instance. You must use knowledge of the domain to determine whether an FD holds. 13

FDs are closely related to keys Suppose K is a set of attributes for relation R. Recall definition of superkey: a set of attributes for which no two rows can have the same values. A claim about FDs: K is a superkey for R iff K functionally determines all of R. 14

FDs are a generalization of keys key: X R Every attribute Functional dependency: X Y Not necessarily every attribute An FD can be more subtle. 15

Inferring FDs Given a set of FDs, we can often infer further FDs. This will be handy when we apply FDs to the problem of database design. Big task: given a set of FDs, infer every other FD that must also hold. Simpler task: given a set of FDs, check whether a given FD must also hold. 16

Examples If A B and B C hold, must A C hold? If A H, C F, and F G A D hold, must F A D hold? must C G F H hold? If H GD, HD CE, and BD A hold, must EH C hold? Aside: we are not generating new FDs, but testing a specific possible one. 17

Method 1: Prove an FD follows using first principles You can prove it by referring back to The FDs that you know hold, and The definition of functional dependency. But the Closure Test is easier. 18

Method 2: Prove an FD follows using the Closure Test Assume you know the values of the LHS attributes, and figure out everything else that is determined. If it includes the RHS attributes, then you know that LHS RHS This is called the closure test. 19

Y is a set of attributes, S is a set of FDs. Return the closure of Y under S. Attribute_closure(Y, S): Initialize Y + to Y Repeat until no more changes occur: If there is an FD LHS RHS in S such that LHS is in Y + : Add RHS to Y + Return Y + 20

Visualizing attribute closure LHS RHS Y + new Y + If LHS is in Y + and LHS RHS holds, we can add RHS to Y + 21

S is a set of FDs; LHS RHS is a single FD. Return true iff LHS RHS follows from S. FD_follows(S, LHS RHS): Y+ = Attribute_closure(LHS, S) return (RHS is in Y + ) 22

Projecting FDs Later, we will learn how to normalize a schema by decomposing relations. This is the whole point of this theory. We will need to know what FDs hold in the new, smaller, relations. We must project our FDs onto the attributes of our new relations. 23

Example R(A1,..., An) Set of attributes: A Decompose into: - R1(B1,..., Bk) Set of attributes: B, and - R2(C1,..., Cm) Set of attributes: C B C = A, R1 R2 = R A R1 = π B (R) R2 = π C (R) R1 R2 B C 24

S is a set of FDs; L is a set of attributes. Return the projection of S onto L: all FDs that follow from S and involve only attributes from L. Project(S, L): Initialize T to {}. For each subset X of L: Compute X + Close X and see what we get. For every attribute A in X + : If A is in L: X A is only relevant if A is in L (we know X is). Return T. add X A to T. 25

A few optimizations No need to add X A if A is in X itself. It s a trivial FD. These subsets of X won t yield anything, so no need to compute their closures: the empty set the set of all attributes Neither are big savings, but 26

An important optimization If we find X + = all attributes, we can ignore any superset of X. It can only give use weaker FDs (with more on the LHS). This is a big time saver! 27

Projection is expensive Even with these optimizations, projection is still expensive. Suppose R 1 has n attributes. How many subsets of R 1 are there? 28

Minimal Basis We saw earlier that we can very often rewrite sets of FDs in equivalent ways. Example: S 1 = {A BC} is equivalent to S 2 = {A B, A C}. Given a set of FDs S, we may want to find a minimal basis: A set of FDs that is equivalent, but has no redundant FDs, and no FDs with unnecessary attributes on the LHS. 29

S is a set of FDs. Return a minimal basis for S. Minimal_basis(S): 1. Split the RHS of each FD 2. For each FD X Y where X 2: If you can remove an attribute from X and get an FD that follows from S: 3. For each FD f : Do so! (It s a stronger FD.) If S {f } implies f : Remove f from S. 30

Some comments on minimal basis Often there are multiple possible results. Depends on the order in which you consider the possible simplifications. After you identify a redundant FD, you must not use it when computing subsequent closures. 31

and less intuitive When you are computing closures to decide whether the LHS of an FD X Y can be simplified, continue to use that FD. You must do (2) and (3) in that order. Otherwise, must repeat until no changes occur. 32

Part II: Using FD Theory to do Database Design 33

Recall that poorly designed table? part manufacturer manaddress seller selleraddress price 1983 Hammers R Us 99 Pinecrest ABC 1229 Bloor W 5.59 8624 Lee Valley 102 Vaughn ABC 1229 Bloor W 23.99 9141 Hammers R Us 99 Pinecrest ABC 1229 Bloor W 12.50 1983 Hammers R Us 99 Pinecrest Walmart 5289 St Clair W 4.99 We can now express the relationships as FDs: part manufacturer manufacturer address seller address The FDs tell us there can be redundancy, thus the design is bad. That s why we care about FDs. 34

Decomposition To improve a badly-designed schema R(A 1, A n ), we will decompose it into smaller relations R1(B 1, B j ) and R2(C 1, C k ) such that: R1 = π B1, Bj (R) R2 = π C1, Ck (R) {B 1, B j } U {C 1, C k } = {A 1, A n } R1 R2 = R 35

R(A 1, A n ) Set of attributes: A Decompose into: - R1(B 1, B j ) Set of attributes: B, and - R2(C 1, C k ) Set of attributes: C B C = A, R1 R2 = R A R1 = π B (R) R2 = π C (R) R1 R2 B C 36

But which decomposition? Decomposition can definitely improve a schema. But which decomposition? There are many possibilities. And how can we be sure a new schema doesn t exhibit other anomalies? Boyce-Codd Normal Form guarantees it. 37

Boyce-Codd Normal Form We say a relation R is in BCNF if for every nontrivial FD X Y that holds in R, X is a superkey. Remember: nontrivial means Y is not contained in X. Remember: a superkey doesn t have to be minimal. [Exercise] 38

Intuition In other words, BCNF requires that: Only things that functionally determine everything can functionally determine anything. Why is the BCNF property valuable? Note: FDs are not the problem. They are facts! The schema (in the context of the FDs) is the problem. 39

R is a relation; F is a set of FDs. Return the BCNF decomposition of R, given these FDs. BCNF_decomp(R, F): If an FD X Y in F violates BCNF Compute X +. Replace R by two relations with schemas: R 1 = X + R 2 = R (X + X ) Project the FD s F onto R 1 and R 2. Recursively decompose R 1 and R 2 into BCNF. [Example] 40

Decomposition Picture 1) Start with the LHS of the violating FD. X + 2) Close the LHS to get one new relation X R - X + + X 3) Everything except the new stuff is the other new relation. X is in both new relations to make a connection between them. 41

Some comments on BCNF decomp If more than one FD violates BCNF, you may decompose based on any one of them. So there may be multiple results possible. The new relations we create may not be in BCNF. We must recurse. We only keep the relations at the leaves. How does the decomposition step help? [Exercise] 42

BCNF Every attribute depends on: The key The whole key And nothing but the key so help me Codd. 43

More speed-ups When projecting FDs onto a new relation, check each new FD: Does the new relation violate BCNF because of this FD? If so, abort the projection. You are about to discard this relation anyway (and decompose further). 44

Properties of Decompositions 45

What we want from a decomposition 1. No anomalies. 2. Lossless Join : It should be possible to a) project the original relations onto the decomposed schema b) then reconstruct the original by joining. We should get back exactly the original tuples. 3. Dependency Preservation : All the original FD s should be satisfied. 46

What is lost in a lossy join? For any decomposition, it is the case that: r r 1 r n I.e., we will get back every tuple. But it may not be the case that: r r 1 r I.e., we can get spurious tuples. [Exercise]

What BCNF decomposition offers 1. No anomalies : (Due to no redundancy) 2. Lossless Join : (Section 3.4.1 argues this) 3. Dependency Preservation : 48

The BCNF property does not guarantee lossless join If you use the BCNF decomposition algorithm, a lossless join is guaranteed. If you generate a decomposition some other way you have to check to make sure you have a lossless join even if your schema satisfies BCNF! We ll learn an algorithm for this check later.

Preservation of dependencies BCNF decomposition does not guarantee preservation of dependencies. I.e., in the schema that results, it may be possible to create an instance that: satisfies all the FDs in the final schema, but violates one of the original FDs. Why? Because the algorithm goes too far breaks relations down too much. [Exercise] 50

3NF is less strict than BCNF 3 rd Normal Form (3NF) modifies the BCNF condition to be less strict. An attribute is prime if it is a member of any key. X A violates 3NF iff X is not a superkey and A is not prime. I.e., it s ok if X is not a superkey as long as A is prime. [Exercise] 51

F is a set of FDs; L is a set of attributes. Synthesize and return a schema in 3 rd Normal Form. 3NF_synthesis(F, L): Construct a minimal basis M for F. For each FD X Y in M Define a new relation with schema X Y. If no relation is a superkey for L Add a relation whose schema is some key. [Example] 52

3NF synthesis doesn t go too far BCNF decomposition doesn t stop decomposing until in all relations: if X A then X is a superkey. 3NF generates relations where: X A and yet X is not a superkey, but A is at least prime. [Example] 53

What a 3NF decomposition offers 1. No anomalies : 2. Lossless Join : 3. Dependency Preservation : Neither BCNF nor 3NF can guarantee all three! We must be satisfied with 2 of 3. Decompose too far can t enforce all FDs. Not far enough can have redundancy. We consider a schema good if it is in either BCNF or 3NF. 54

How can we get anomalies? 3NF synthesis guarantees that the resulting schema will be in 3 rd normal form. This allows FDs with a non-superkey on the LHS. This allows redundancy, and thus anomalies. 55

How do we know? that the algorithm guarantees: 3NF: A property of minimal bases [see the textbook for more] Preservation of dependencies: Each FD from a minimal basis is contained in a relation, thus preserved. Lossless join: We ll return to this once we know how to test for lossless join. 56

Synthesis vs decomposition 3NF synthesis: We build up the relations in the schema from nothing. BCNF decomposition: We start with a bad relation schema and break it down. 57

Testing for a Lossless Join If we project R onto R 1, R 2,, R k, can we recover R by rejoining? We will get all of R. Any tuple in R can be recovered from its projected fragments. This is guaranteed. But will we get only R? Can we get a tuple we didn t have in R? This part we must check. 58

Aside: when we don t need to test for lossless Join Both BCNF decomposition and 3NF synthesis guarantee lossless join. So we don t need to test for lossless join if the schema was generated via BCNF decomposition or 3NF synthesis. But merely satisfying BCNF or 3NF does not guarantee a lossless join! 59

The Chase Test Suppose tuple t appears in the join. Then t is the join of projections of some tuples of R, one for each R i of the decomposition. Can we use the given FD s to show that one of these tuples must be t? [Example] 60

Setup for the Chase Test Start by assuming t = abc. For each i, there is a tuple s i of R that has a, b, c, in the attributes of R i. s i can have any values in other attributes. We ll use the same letter as in t, but with a subscript, for these components. 61

The algorithm 1. If two rows agree in the left side of a FD, make their right sides agree too. 2. Always replace a subscripted symbol by the corresponding unsubscripted one, if possible. 3. If we ever get a completely unsubscripted row, we know any tuple in the project-join is in the original (i.e., the join is lossless). 4. Otherwise, the final tableau is a counterexample (i.e., the join is lossy). 5. [Exercise] 62