Data Dependencies in the Presence of Difference

Data Dependencies in the Presence of Difference Tsinghua University sxsong@tsinghua.edu.cn

Outline: Introduction, Application, Foundation, Discovery, Conclusion and Future Work

Introduction: Motivation

- Data dependencies were traditionally used for the quality of the schema: schema design, integrity constraints, query optimization, etc.
- More recently, data dependencies have been used for the quality of the data: data cleaning, data repairing, record matching, etc.

Table: Example instance of Employee

      name       institute         title      salary  ssn
t1    John Depp  Tech. Univ.       Professor  60      111
t2    J. Depp    Technical Univ.   Professor  60      111
t3    J.C. Depp  Tech. University  Prof.      3       111
t4    R. Depp    Western Univ.     Lecturer   30      222

Introduction: Motivation (continued)

- Conventional dependencies, used for schema-oriented issues, rely on an identification (equality) function. For example, for the fd $\text{title} \to \text{salary}$:
  - $t_1[\text{title}]$: Professor $=$ $t_2[\text{title}]$: Professor
  - $t_1[\text{salary}]$: 60 $=$ $t_2[\text{salary}]$: 60
- Data-oriented practice needs difference semantics on numerical or text values, e.g., similar or dissimilar:
  - title: Professor $\approx$ Prof.
  - salary: 60k vs. 3k

Introduction: Differential Dependencies, Syntax

- We propose a novel type of dependencies, differential dependencies (dds), of the form $\phi_L[X] \to \phi_R[Y]$.
- $\phi_L[X]$ and $\phi_R[Y]$ are differential functions, which specify distance constraints on attributes $X$ and $Y$ of $R$, respectively.
- Constraints on difference: for any two tuples $(t_1, t_2)$ from an instance of $R$, if their value differences (measured by a given distance metric) on attributes $X$ agree with the differential function $\phi_L[X]$, written $(t_1, t_2) \asymp \phi_L[X]$, then their value differences on $Y$ should also agree with the differential function $\phi_R[Y]$, written $(t_1, t_2) \asymp \phi_R[Y]$.

Introduction: Example

- A dd in a credit card transaction database can be
  $dd_1$: $[\text{cardno}(= 0) \wedge \text{position}(\ge 60)] \to [\text{transtime}(\ge 20)]$
- $\text{cardno}(= 0)$ states that two transactions have the same credit card number (the difference on attribute cardno is 0).
- $\text{position}(\ge 60)$ and $\text{transtime}(\ge 20)$ are differential functions specified on attributes position and transtime, respectively.
- Constraints on difference: if two transactions with the same cardno are at least 60 km apart (e.g., in two different cities), they probably happened at different times, so the difference in transtime should be at least 20 minutes.
- If two card transactions do not satisfy $dd_1$, one of them could be fraudulent.

Introduction: Example

- A dd in a flight price database, in decision support systems:
  $dd_2$: $[\text{date}(\le 7)] \to [\text{price}(\le 100)]$
  It states that the price difference between any two days within a week should not exceed 100 dollars.
- Instead of a week, another dd may specify
  $dd_3$: $[\text{date}(> 7, \le 30)] \to [\text{price}(> 100, \le 900)]$
  This is the price difference constraint for two days that are more than a week but at most a month apart.
- Both $dd_2$ and $dd_3$ are specified on the same embedded attributes, $\text{date} \to \text{price}$, but with different constraint semantics (week and month).

Introduction: Related Work

- Conditional functional dependencies (cfds), $(X \to A, t_p)$: make fds, which originally hold for the whole table, valid only for the set of tuples specified by the conditions. Example: ([country, zip] $\to$ [street], < Finland, >)
- Metric functional dependencies (mfds), $X \xrightarrow{\delta} A$: similarity metrics on the right-hand side, for violation detection. Example: $\text{name} \xrightarrow{2} \text{address}$
- Matching dependencies (mds), [X ] [A ]: similarity semantics on the left-hand side, for record matching. Example: [name ] [addr ] [tel ]

Introduction: Comparison

- cfds introduce a condition extension, which still relies on identification semantics.
- mfds and mds consider similarity semantics, on either the determinant attributes $X$ or the dependent attributes $Y$.
- Our differential dependencies (dds), $\phi_L[X] \to \phi_R[Y]$:
  - address more general difference constraints with various semantics: similar (e.g., $\text{price}(\le 100)$ in $dd_2$), dissimilar/different (e.g., $\text{transtime}(\ge 20)$ in $dd_1$), or even more complicated ones (e.g., $\text{date}(> 7, \le 30)$ in $dd_3$);
  - allow setting difference constraints on both the determinant attributes $X$ and the dependent attributes $Y$.

Application: Example, Violation Detection

- Goal: find the tuples that violate dependencies.
- According to $dd_2$: $[\text{date}(\le 7)] \to [\text{price}(\le 100)]$, the pair $t_3, t_4$ is detected as a violation of $dd_2$.
- fds cannot express such constraints on difference:
  - $t_3, t_4$ cannot be detected by the fd $\text{date} \to \text{price}$;
  - $t_1, t_2$ are detected as violations of the fd by mistake.

Table: Example of a price database

Tuple  Date        Price
t1     2010.06.01  1,000
t2     2010.06.01  1,050
t3     2010.08.02  2,000
t4     2010.08.03  3,000
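The detection described on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names `dd_violations` and `fd_violations` are hypothetical, and the distance metrics (absolute difference in days for date, absolute difference for price) are assumptions, since the slides do not fix a metric.

```python
from datetime import date
from itertools import combinations

# Price table from the slide: (tuple id, date, price).
rows = [
    ("t1", date(2010, 6, 1), 1000),
    ("t2", date(2010, 6, 1), 1050),
    ("t3", date(2010, 8, 2), 2000),
    ("t4", date(2010, 8, 3), 3000),
]

def dd_violations(rows, max_days=7, max_price_diff=100):
    """Pairs that agree with date(<= max_days) but not with price(<= max_price_diff).

    Assumed metrics: |difference in days| and |difference in price|.
    """
    out = []
    for (id1, d1, p1), (id2, d2, p2) in combinations(rows, 2):
        if abs((d1 - d2).days) <= max_days and abs(p1 - p2) > max_price_diff:
            out.append((id1, id2))
    return out

def fd_violations(rows):
    """Pairs with identical date but different price (the fd date -> price)."""
    return [
        (id1, id2)
        for (id1, d1, p1), (id2, d2, p2) in combinations(rows, 2)
        if d1 == d2 and p1 != p2
    ]

print(dd_violations(rows))  # [('t3', 't4')]
print(fd_violations(rows))  # [('t1', 't2')]
```

Under these assumed metrics the dd flags only the pair (t3, t4), while the fd flags (t1, t2), which matches the behaviour stated on the slide.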

Application: Evaluation, Violation Detection

- dds are compared with fds, which use identification functions.
- Differential functions on the right-hand side $Y$ detect violations more accurately: the detection precision is higher than with fds.
- Differential functions on the left-hand side $X$ address more tuples with violations: the detection recall using dds is higher than with fds.

Figure: Precision and recall of FDs and DDs on Cora, over data instances I1 to I6.

Application: Example, Data Partition

- Goal: optimize data partition queries.
- Integrity constraints (e.g., fds or candidate keys) can be used to optimize the evaluation of queries, which is known as semantic query optimization.
- Consider a group-by query on distance conditions:
  SELECT * FROM Employee GROUP BY institute(<= 5), title(<= 6)
- According to $[\text{institute}(\le 5)] \to [\text{institute}(\le 5) \wedge \text{title}(\le 6)]$, the query can be rewritten using $\text{institute}(\le 5)$ only:
  SELECT * FROM Employee GROUP BY institute(<= 5)

Application: Evaluation, Data Partition

- The evaluation uses candidate differential key (cdk) dependencies.
- On the x-axis, each element a/b corresponds to a pair of reduced/original differential functions for partitioning queries: a denotes the cardinality of the cdk, and b denotes the cardinality of the original partition scheme.
- The smaller the ratio a/b, the more the performance can be improved.

Figure: Time cost (s) of the original partition scheme and of CDKs on CiteSeer and Cora, over partition schemes a/b.

Application: Example, Record Linkage

- Goal: identify duplicate records (a.k.a. record matching, merge-purge), using dds as matching rules.
- $dd_1$: $[\text{name}(\le 5) \wedge \text{institute}(\le 7)] \to [\text{ssn}(= 0)]$. Tuples $t_1, t_2$, whose name distance is $\le 5$ and institute distance is $\le 7$, probably denote the same employee with identical ssn.
- Another valid matching rule on the same attributes is $dd_2$: $[\text{name}(\le 3) \wedge \text{institute}(\le 15)] \to [\text{ssn}(= 0)]$. Tuples $t_2, t_3$ are detected as duplicates by $dd_2$ but not by $dd_1$.

Table: Example instance of Employee

      name       institute         title      salary  ssn
t1    John Depp  Tech. Univ.       Professor  60      111
t2    J. Depp    Technical Univ.   Professor  60.2    111
t3    J.C. Depp  Tech. University  Prof.      30      111
t4    R. Depp    Western Univ.     Lecturer   30      222

Application: Evaluation, Record Linkage

- dds are compared with mds; mds associate only one differential function with each attribute.
- dds can specify various differential functions on one attribute, so dds cover more matching rules: the recall of dds is significantly higher.
- dds have precision comparable to that of mds, since both are valid matching rules.

Figure: Precision and recall of MDs and DDs on Restaurant, over data instances I1 to I6.

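Foundation: Differential Function, Intersection

- The intersection of $\phi_1[Z]$ and $\phi_2[Z]$ on the same attributes $Z$ is $\phi_3[Z] = \phi_1[Z] \wedge \phi_2[Z]$:
  - for any $(t_1, t_2) \asymp \phi_1[Z]$ and $(t_1, t_2) \asymp \phi_2[Z]$, we have $(t_1, t_2) \asymp \phi_3[Z]$;
  - for any $(t_1, t_2) \not\asymp \phi_1[Z]$ or $(t_1, t_2) \not\asymp \phi_2[Z]$, we have $(t_1, t_2) \not\asymp \phi_3[Z]$.
- Example: $[\text{name}(\le 9)] \wedge [\text{name}(\le 7)] = [\text{name}(\le 7)]$
- To apply intersection between $\phi_1[X]$ and $\phi_2[Y]$ on different attributes $X$ and $Y$, let $Z = X \cap Y$. Then
  $\phi_1[X] \wedge \phi_2[Y] = (\phi_1[X \setminus Z] \wedge \phi_1[Z]) \wedge (\phi_2[Z] \wedge \phi_2[Y \setminus Z])$
  $= \phi_1[X \setminus Z] \wedge (\phi_1[Z] \wedge \phi_2[Z]) \wedge \phi_2[Y \setminus Z]$.
- Example: $[\text{name}(\le 5) \wedge \text{address}(\le 12)] \wedge [\text{address}(\le 10)] = [\text{name}(\le 5) \wedge \text{address}(\le 10)]$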
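As a rough illustration of the intersection rule, the sketch below represents a differential function as a Python dict that maps each attribute to a closed distance range `(lo, hi)`. This representation and the name `intersect` are assumptions made for the example; strict bounds such as `> 7` are not modelled here.

```python
def intersect(phi1, phi2):
    """Intersection of two differential functions.

    Each function maps an attribute name to a closed distance range (lo, hi).
    On shared attributes (Z in the slide) the ranges are intersected;
    attributes that appear in only one function are kept unchanged.
    """
    result = dict(phi1)
    for attr, (lo2, hi2) in phi2.items():
        if attr in result:
            lo1, hi1 = result[attr]
            lo, hi = max(lo1, lo2), min(hi1, hi2)
            if lo > hi:
                raise ValueError(f"empty distance range on {attr!r}")
            result[attr] = (lo, hi)
        else:
            result[attr] = (lo2, hi2)
    return result

print(intersect({"name": (0, 9)}, {"name": (0, 7)}))
# {'name': (0, 7)}
print(intersect({"name": (0, 5), "address": (0, 12)}, {"address": (0, 10)}))
# {'name': (0, 5), 'address': (0, 10)}
```

The two calls reproduce the two examples on the slide.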

Foundation: Differential Function, Subsumption

- Intuitively, the semantics of similar subsumes identification: any two values that are identical (with distance $= 0$) can always be interpreted as similar (with distance $\le 9$).

Definition. Let $\phi_1[Z]$ and $\phi_2[Z]$ be two differential functions on attributes $Z$. If any tuple pair $(t_1, t_2) \asymp \phi_2[Z]$ always agrees $(t_1, t_2) \asymp \phi_1[Z]$, we say that $\phi_1[Z]$ subsumes $\phi_2[Z]$, written $\phi_1[Z] \succeq \phi_2[Z]$.

- For example, $\phi_1[\text{name}] = [\text{name}(\le 9)]$ subsumes $\phi_2[\text{name}] = [\text{name}(\le 7)]$, denoted by $[\text{name}(\le 9)] \succeq [\text{name}(\le 7)]$: a distance value of name that agrees with $\le 7$ will always agree with $\le 9$.
- Further examples: $[\text{date}(\le 30)] \succeq [\text{date}(> 7, \le 30)]$; $[\text{addr}(\le 9)] \succeq [\text{addr}(= 0)]$
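Subsumption between range-style differential functions reduces to interval containment. The following sketch assumes each attribute carries a single distance range with optionally strict bounds; the `Range` class and `subsumes` function are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Range:
    lo: float = 0.0
    hi: float = float("inf")
    lo_open: bool = False  # True means "> lo" instead of ">= lo"
    hi_open: bool = False  # True means "< hi" instead of "<= hi"

    def contains(self, other: "Range") -> bool:
        """True if every distance allowed by `other` is also allowed by `self`.

        Assumes `other` is a non-empty range.
        """
        lo_ok = self.lo < other.lo or (
            self.lo == other.lo and (not self.lo_open or other.lo_open))
        hi_ok = self.hi > other.hi or (
            self.hi == other.hi and (not self.hi_open or other.hi_open))
        return lo_ok and hi_ok

def subsumes(phi1, phi2):
    """phi1 subsumes phi2 on the same attributes Z: each range of phi1 contains phi2's."""
    return phi1.keys() == phi2.keys() and all(
        phi1[a].contains(phi2[a]) for a in phi1)

# Examples from the slide.
print(subsumes({"name": Range(hi=9)}, {"name": Range(hi=7)}))                          # True
print(subsumes({"date": Range(hi=30)}, {"date": Range(lo=7, hi=30, lo_open=True)}))    # True
print(subsumes({"addr": Range(hi=9)}, {"addr": Range(lo=0, hi=0)}))                    # True
print(subsumes({"name": Range(hi=7)}, {"name": Range(hi=9)}))                          # False
```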

Foundation: Differential Dependency

- Consider an instance $I$ of relation $R$. $(t_1, t_2) \asymp \phi_L[X]$ denotes that the tuples $(t_1, t_2)$ have distances agreeing with $\phi_L[X]$.
- $I$ satisfies a dd, $I \models \phi_L[X] \to \phi_R[Y]$, if any two tuples $t_1$ and $t_2$ in $I$ with metric distances $(t_1, t_2) \asymp \phi_L[X]$ must also agree $(t_1, t_2) \asymp \phi_R[Y]$.
- $I$ satisfies a set $\Sigma$ of dds, $I \models \Sigma$, if $I \models \phi_L[X] \to \phi_R[Y]$ for each $\phi_L[X] \to \phi_R[Y] \in \Sigma$.

Proposition. For two differential functions $\phi_L[X]$ and $\phi_R[Y]$, if $Y \subseteq X$ and $\phi_R[Y] \succeq \phi_L[Y]$, then $\phi_L[X] \to \phi_R[Y]$.

- This is a trivial dd, which always holds, e.g., $[\text{name}(\le 5) \wedge \text{address}(\le 10)] \to [\text{address}(\le 12)]$.

Foundation: Logical Implication

Example. Consider two dds,
  $dd_4$: $[\text{name}(\le 7)] \to [\text{address}(\le 1)]$,
  $dd_5$: $[\text{address}(\le 5)] \to [\text{salary}(\le 50)]$.

- For any two tuples $t_1$ and $t_2$ with name distance $\le 7$, according to $dd_4$ their distance on address should be $\le 1$, so $(t_1, t_2)$ agree with $\text{address}(\le 5)$ as well.
- The salary distance of $t_1$ and $t_2$ should then be $\le 50$ according to $dd_5$.
- We can therefore imply another dd, $dd_6$: $[\text{name}(\le 7)] \to [\text{salary}(\le 50)]$.

Foundation: Implication Problem

- Let $\Sigma_1$ and $\Sigma_2$ be two sets of dds.
- $\Sigma_1$ logically implies $\Sigma_2$, written $\Sigma_1 \models \Sigma_2$, if for every relation instance $I$, $I \models \Sigma_1$ implies $I \models \Sigma_2$.
- $\Sigma_1$ and $\Sigma_2$ are equivalent, written $\Sigma_1 \equiv \Sigma_2$, if $\Sigma_1 \models \Sigma_2$ and $\Sigma_2 \models \Sigma_1$.
- The implication problem: given a consistent set $\Sigma$ of dds and another dd $\phi_L[X] \to \phi_R[Y]$, decide whether $\Sigma$ implies this dd, $\Sigma \models \phi_L[X] \to \phi_R[Y]$.
- For example, $\{dd_4, dd_5\} \models dd_6$.

Foundation: Implication Based on Subsumption

Given a dd $\phi_L[X] \to \phi_R[Y]$:

- $\phi_1[Z] \to \phi_R[Y]$ can be implied if $X \subseteq Z$ and $\phi_L[X] \succeq \phi_1[X]$;
- $\phi_L[X] \to \phi_1[Z]$ can be implied if $Z \subseteq Y$ and $\phi_1[Z] \succeq \phi_R[Z]$.

For example, the dd $[\text{name}(\le 7)] \to [\text{address}(\le 1)]$ implies

- $[\text{name}(\le 5)] \to [\text{address}(\le 1)]$
- $[\text{name}(\le 7)] \to [\text{address}(\le 2)]$

Foundation: Differential Key

- Key: $t_1[R] = t_2[R]$ follows from $t_1[K] = t_2[K]$ on a key $K \subseteq R$.
- A differential key $\phi_2[K]$ relative to $\phi_1[R]$ is a differential function that can determine $\phi_1[R]$.
- A differential key dependency is $\phi_2[K] \to \phi_1[R]$ with $K \subseteq R$ and $\phi_2[K] \succeq \phi_1[K]$.
- For example, $[\text{position}(\le 20)]$ is a differential key relative to $[\text{position}(\le 20) \wedge \text{area}(\le 5)]$, according to the differential key dependency
  $[\text{position}(\le 20)] \to [\text{position}(\le 20) \wedge \text{area}(\le 5)]$.

Foundation: Candidate Differential Key

- A naïve key relative to $\phi_1[R]$ is $\phi_1[R]$ itself.
- A candidate differential key (cdk) $\phi_c[K]$ is an irreducible differential key relative to $\phi_1[R]$: there does not exist any $\phi_2[L]$ such that $L \subseteq K$, $\phi_2[L] \succeq \phi_c[L]$ and $\phi_2[L] \to \phi_1[R]$.
- A cdk not only has minimal cardinality, like candidate keys for fds, but also should not be subsumed by other keys.
- cdks are useful in applications such as data partition.

Discovery: Discovery Problem

- Discovery from data: given a relation instance $I$, discover the candidate differential keys and a minimal cover of the differential dependencies that hold in $I$.
- Hardness:
  - a minimal cover of the fds that hold in a relation instance $I$ can be exponentially large in the number of attributes;
  - fds can be considered special cases of dds in which all the differential constraints are set to $= 0$;
  - since dds subsume fds, a minimal cover of dds could be exponentially large as well.

Discovery: Negative Pruning

- Motivation: prune candidate dds, in order to avoid evaluating all possible $\phi_L[X] \to \phi_R[Y]$ in $I$.

Lemma. For any $\phi_1[V], \phi_2[Z]$ with $V \subseteq Z$ and $\phi_1[V] \succeq \phi_2[V]$, if $I \not\models \phi_2[Z] \to \phi_R[Y]$, then $I \not\models \phi_1[V] \to \phi_R[Y]$.

- Example: if $[\text{name}(\le 5)] \to \phi_R[Y]$ does not hold in $I$, then $[\text{name}(\le 7)] \to \phi_R[Y]$ does not hold either, and need not be evaluated in $I$.
- Worst case: all the candidates hold in the given instance $I$, so nothing is pruned.

Discovery: Positive Pruning

Lemma. For any $\phi_1[V], \phi_2[W]$ with $W \subseteq V$ and $\phi_2[W] \succeq \phi_1[W]$, if $I \models \phi_2[W] \to \phi_R[Y]$, then $I \models \phi_1[V] \to \phi_R[Y]$.

- Example: if $[\text{name}(\le 7)] \to \phi_R[Y]$ holds in $I$, then $[\text{name}(\le 5)] \to \phi_R[Y]$ must hold as well, and need not be evaluated in $I$.
- Worst case: none of the candidates hold in the given instance $I$, so nothing is pruned.
- Hybrid approach: use both positive and negative pruning, by turns.
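To make the two pruning lemmas concrete, here is a small sketch for the simplest case: candidates of the form $[\text{attr}(\le d)] \to \phi_R[Y]$ over a single left-hand-side attribute with a fixed right-hand side. The binary search used to alternate between the two prunings is a choice made for this illustration and is not the hybrid algorithm evaluated in the slides; `classify`, `holds`, and the sample `pairs` are hypothetical.

```python
def classify(thresholds, holds):
    """Decide for each LHS threshold d whether [attr(<= d)] -> phi_R holds.

    `holds(d)` evaluates one candidate against the instance (the costly step).
    Positive pruning: if d holds, every smaller threshold holds too.
    Negative pruning: if d fails, every larger threshold fails too.
    Returns the status of every threshold and the number of evaluations used.
    """
    ts = sorted(thresholds)
    status, evaluated = {}, 0
    lo, hi = 0, len(ts) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        evaluated += 1
        if holds(ts[mid]):
            for t in ts[lo:mid + 1]:
                status[t] = True    # positive pruning
            lo = mid + 1
        else:
            for t in ts[mid:hi + 1]:
                status[t] = False   # negative pruning
            hi = mid - 1
    return status, evaluated

# Hypothetical tuple pairs: (distance on the LHS attribute, agrees with phi_R?).
pairs = [(2, True), (4, True), (6, False), (9, True)]

def holds(d):
    # The candidate holds if every pair within LHS distance d agrees with phi_R.
    return all(ok for dist, ok in pairs if dist <= d)

status, evaluated = classify(range(1, 10), holds)
print(status)     # thresholds 1..5 hold, 6..9 do not
print(evaluated)  # 3 evaluations instead of 9
```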

Discovery: Instance Exclusion

- Motivation: avoid evaluating the entire $I$.
- If one differential function subsumes another, the set of tuple pairs agreeing with the former is a superset of the set agreeing with the latter.
- Consider all the pairs of tuples in $I$: $D(I) = \{(t_i, t_j) \mid t_i, t_j \in I\}$.
- Given any dd $\phi_L[X] \to \phi_R[Y]$, we define
  $D(I, \phi_L[X], \neg\phi_R[Y]) = \{(t_i, t_j) \in D(I) \mid (t_i, t_j) \asymp \phi_L[X], (t_i, t_j) \not\asymp \phi_R[Y]\}$,
  that is, the tuple pairs agreeing with $\phi_L[X]$ but not agreeing with $\phi_R[Y]$.

Discovery: Instance Exclusion (continued)

Lemma. An instance $I$ satisfies a dd, $I \models \phi_L[X] \to \phi_R[Y]$, iff $D(I, \phi_L[X], \neg\phi_R[Y]) = \emptyset$.

- During discovery, for a candidate $\phi_L[X] \to \phi_R[Y]$, we have to evaluate whether $D(I, \phi_L[X], \neg\phi_R[Y]) = \emptyset$.

Discovery: Instance Exclusion (continued)

Lemma. For any $\phi_1[V], \phi_2[W]$ with $W \subseteq V$ and $\phi_2[W] \succeq \phi_1[W]$, we have $D(I, \phi_1[V], \neg\phi_R[Y]) \subseteq D(I, \phi_2[W], \neg\phi_R[Y])$.

- Suppose that $D(I, \phi_2[W], \neg\phi_R[Y])$ has already been computed. Instead of considering the entire $D(I)$, use $D(I, \phi_2[W], \neg\phi_R[Y])$ to compute $D(I, \phi_1[V], \neg\phi_R[Y])$.
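The instance-exclusion idea can be sketched as a filter that is applied to an already computed set of violating pairs rather than to all pairs. The sketch below reuses the price table from the violation-detection example, with dates encoded as day offsets from 2010.06.01; the helper names and the absolute-difference metrics are assumptions for illustration.

```python
from itertools import combinations

# Price table as (day offset from 2010.06.01, price).
rows = {"t1": (0, 1000), "t2": (0, 1050), "t3": (62, 2000), "t4": (63, 3000)}
all_pairs = list(combinations(rows, 2))  # D(I), unordered pairs

def date_within(limit):
    return lambda a, b: abs(rows[a][0] - rows[b][0]) <= limit

def price_within(limit):
    return lambda a, b: abs(rows[a][1] - rows[b][1]) <= limit

def violating_pairs(pairs, lhs, rhs):
    """D(I, lhs, not rhs) restricted to `pairs`: agree with lhs but not with rhs."""
    return [p for p in pairs if lhs(*p) and not rhs(*p)]

# General LHS date(<= 70), evaluated once over all pairs.
d_general = violating_pairs(all_pairs, date_within(70), price_within(100))

# Specific LHS date(<= 7) is subsumed by date(<= 70), so by the lemma it is
# enough to scan d_general instead of all_pairs.
d_specific = violating_pairs(d_general, date_within(7), price_within(100))

print(len(all_pairs), len(d_general))  # 6 5
print(d_specific)                      # [('t3', 't4')]
print(len(d_specific) == 0)            # False: the dd does not hold in this instance
```

On this tiny instance the saving is only one pair, but the same filter applies unchanged when the already computed set is much smaller than $D(I)$.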

Discovery: Experiments

- Evaluate the time performance of the discovery approaches.
- The approaches scale well with the number of tuples in an instance $I$: the cost is $O(n^2)$ with respect to the number of tuples $n$ in $I$.
- Instance exclusion performs well.

Figure: DDs discovery performance on various instances $I$. Time cost (s) of BF, BU-Ne, TD-Po, Hybrid, IE-TD-Po, and IE-Hybrid on CiteSeer and Cora, over data instances I1 to I6.

Discovery: Experiments (continued)

- Discovery cost increases exponentially in the number of attributes in a schema.
- The pruning approaches can achieve an improvement of several orders of magnitude compared with the brute-force one.

Figure: DDs discovery performance on various schemas $R$. Time cost (s, log scale) of BF, BU-Ne, TD-Po, Hybrid, IE-TD-Po, and IE-Hybrid on CiteSeer and Cora, over relation schemas R1 to R6.

Conclusion and Future Work: Conclusions

We propose a novel class of dependencies, differential dependencies (dds), which specify constraints on distance.

Theory:
- formal definitions of dds and differential keys
- subsumption order relation of differential functions
- reasoning about dds:
  - consistency of dds, NP-complete
  - implication of dds, co-NP-complete
  - closure of a differential function
  - a sound and complete inference system, with proof
  - minimal cover for dds

Practice:
- discovery of dds and differential keys from data
- application of dds and differential keys

Conclusion and Future Work: Future Work

- Approximate differential dependencies:
  - almost hold in a data instance
  - evaluation measure, efficient computation
- Implication of approximate differential dependencies
- Hardness analysis of computing the error measure
- Approximation algorithms for computing the error measure
- Experiments on approximation validation
- Further extensions:
  - data repairing with dds
  - conditioning dds on a subset of tuples
  - integrity rules in dataspaces

Closure

- The closure of $\phi_L[X]$ under $\Sigma$, $(\phi_L[X])^+$, is also a differential function: the intersection of the set of differential functions that can be determined by $\phi_L[X]$ according to the dds in $\Sigma$,
  $(\phi_L[X])^+ = \bigwedge \{\phi_R[Y] \mid \Sigma \models \phi_L[X] \to \phi_R[Y]\}$.
- The closure of $[\text{name}(\le 7)]$ under $\{dd_4, dd_5\}$ is $[\text{name}(\le 7) \wedge \text{address}(\le 1) \wedge \text{salary}(\le 50)]$.
- It is natural that $\phi_L[X] \to (\phi_L[X])^+$.

Closure (continued)

- To imply a dd is essentially to compute the corresponding closure $(\phi_L[X])^+$ of $\phi_L[X]$.

Lemma. Let $\Sigma$ be a set of dds and $\phi_1[Z] = (\phi_L[X])^+$ be the closure of $\phi_L[X]$ with respect to $\Sigma$. For a dd $\phi_L[X] \to \phi_R[Y]$, $\Sigma \models \phi_L[X] \to \phi_R[Y]$ iff $Y \subseteq Z$ and $\phi_R[Y] \succeq \phi_1[Y]$.

- For example, $[\text{salary}(\le 50)]$ subsumes the projection on salary of the closure of $[\text{name}(\le 7)]$ under $\{dd_4, dd_5\}$, so $\{dd_4, dd_5\}$ implies $dd_6$: $[\text{name}(\le 7)] \to [\text{salary}(\le 50)]$.
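Once a closure is available, the test in the lemma is a simple containment check. The sketch below assumes the closure has already been computed (computing it is the hard part and is not shown) and represents each differential function as a dict from attribute to a closed distance range `(lo, hi)`; `implied_by_closure` is a hypothetical helper name.

```python
def implied_by_closure(closure, rhs):
    """Check the lemma's condition for a candidate right-hand side.

    Every attribute of `rhs` must appear in the closure (Y is a subset of Z),
    and each rhs range must contain the closure's range on that attribute
    (phi_R[Y] subsumes phi_1[Y]). Only closed ranges (lo, hi) are modelled.
    """
    return all(
        attr in closure
        and lo <= closure[attr][0]
        and closure[attr][1] <= hi
        for attr, (lo, hi) in rhs.items()
    )

# Closure of [name(<= 7)] under {dd4, dd5}, as given on the slide.
closure = {"name": (0, 7), "address": (0, 1), "salary": (0, 50)}

print(implied_by_closure(closure, {"salary": (0, 50)}))  # True: dd6 is implied
print(implied_by_closure(closure, {"salary": (0, 40)}))  # False: tighter than the closure
print(implied_by_closure(closure, {"title": (0, 3)}))    # False: attribute not in the closure
```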

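Inference System

A1. If $Y \subseteq X$ and $\phi_L[Y] = \phi_R[Y]$, then $\Sigma \vdash_I \phi_L[X] \to \phi_R[Y]$.
A2. If $\Sigma \vdash_I \phi_L[X] \to \phi_R[Y]$, then $\Sigma \vdash_I \phi_L[X] \wedge \phi_1[Z] \to \phi_R[Y] \wedge \phi_1[Z]$.
A3. If $\Sigma \vdash_I \phi_L[X] \to \phi_1[Z]$, $\phi_1[Z] \preceq \phi_2[Z]$ and $\Sigma \vdash_I \phi_2[Z] \to \phi_R[Y]$, then $\Sigma \vdash_I \phi_L[X] \to \phi_R[Y]$.
A4. If $\Sigma \vdash_I \phi_L[X] \wedge \phi_i[B] \to \phi_R[Y]$, $1 \le i \le k$, and $(\Sigma, \neg\phi_1[B] \wedge \dots \wedge \neg\phi_k[B])$ is inconsistent, then $\Sigma \vdash_I \phi_L[X] \to \phi_R[Y]$.

Theorem. The set $I$ of inference rules is
- sound: if $\Sigma \vdash_I \phi_L[X] \to \phi_R[Y]$ then $\Sigma \models \phi_L[X] \to \phi_R[Y]$, and
- complete: if $\Sigma \models \phi_L[X] \to \phi_R[Y]$ then $\Sigma \vdash_I \phi_L[X] \to \phi_R[Y]$,
for logical implication of dds.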

Example: Inference

We consider a set $\Sigma$ of dds as follows:
  $dd_7$: $[d(\ge 1, \le 7) \wedge p(< 10)] \to [a(\le 150)]$,
  $dd_8$: $[p(\ge 10)] \to [a(\le 100)]$.
Let $dd_9$ be another dd,
  $dd_9$: $[d(\ge 1, \le 7)] \to [a(\le 150)]$.
We show that $\Sigma \vdash_I dd_9$ can be proved by the following steps.

1. $[d(\ge 1, \le 7) \wedge p(\ge 10)] \to [d(\ge 1, \le 7) \wedge a(\le 100)]$, by A2 and $dd_8$
2. $[d(\ge 1, \le 7) \wedge a(\le 150)] \to [a(\le 150)]$, by A1
3. $[d(\ge 1, \le 7) \wedge p(\ge 10)] \to [a(\le 150)]$, by A3 and steps 1, 2
4. $[d(\ge 1, \le 7)] \to [a(\le 150)]$, by A4, step 3 and $dd_7$

Minimal Cover

A minimal cover $\Sigma_c$ for $\Sigma$ is a set of dds such that $\Sigma_c$ is logically equivalent to $\Sigma$, i.e., $\Sigma_c \equiv \Sigma$, and $\Sigma_c$ is minimal according to the following properties:

C1. (left-reduced) For any $\phi_L[X] \to \phi_R[Y] \in \Sigma_c$, there does not exist any $\phi_1[W]$ such that $W \subseteq X$, $\phi_1[W] \succeq \phi_L[W]$ and $\Sigma_c \models \phi_1[W] \to \phi_R[Y]$.
C2. (right-subsumed) For any $\phi_L[X] \to \phi_R[Y] \in \Sigma_c$, there does not exist any $\phi_1[W]$ such that $Y \subseteq W$, $\phi_1[Y] \preceq \phi_R[Y]$ and $\Sigma_c \models \phi_L[X] \to \phi_1[W]$.
C3. (non-redundant) There does not exist a cover $\Sigma'$ of $\Sigma$ such that $\Sigma' \subset \Sigma_c$.

Example: Minimal Cover

- Consider $\Sigma = \{dd_4, dd_5, dd_6\}$ in Example 2. A minimal cover can be $\Sigma_c = \{dd_4, dd_5\}$:
  - $\Sigma_c$ can imply $dd_6$;
  - after removing $dd_4$ or $dd_5$ from $\Sigma_c$, it is no longer a cover of $\Sigma$.
- Consider $\Sigma = \{dd_7, dd_8, dd_9\}$ in Example 9. A minimal cover can be $\Sigma_c = \{dd_8, dd_9\}$:
  - $\{dd_7, dd_8\}$ is not a minimal cover, since $dd_7$ is not left-reduced and can be implied by $dd_9$ using the augmentation rule A2.