Database Design and Implementation

Similar documents
On Factorisation of Provenance Polynomials

Provenance Semirings. Todd Green Grigoris Karvounarakis Val Tannen. presented by Clemens Ley

Query Evaluation on Probabilistic Databases. CSE 544: Wednesday, May 24, 2006

Relational completeness of query languages for annotated databases

Provenance for Database Transforma1ons

Provenance for Aggregate Queries

A Dichotomy. in in Probabilistic Databases. Joint work with Robert Fink. for Non-Repeating Queries with Negation Queries with Negation

Path Queries under Distortions: Answering and Containment

Relational Algebra on Bags. Why Bags? Operations on Bags. Example: Bag Selection. σ A+B < 5 (R) = A B

CS 347 Parallel and Distributed Data Processing

Database design and implementation CMPSCI 645. Lecture 14: Data Provenance

Correlated subqueries. Query Optimization. Magic decorrelation. COUNT bug. Magic example (slide 2) Magic example (slide 1)

CS 347 Distributed Databases and Transaction Processing Notes03: Query Processing

P Q1 Q2 Q3 Q4 Q5 Tot (60) (20) (20) (20) (60) (20) (200) You are allotted a maximum of 4 hours to complete this exam.

CS 188: Artificial Intelligence Spring Announcements

Logic and Databases. Phokion G. Kolaitis. UC Santa Cruz & IBM Research Almaden. Lecture 4 Part 1

Databases 2011 The Relational Algebra

Topics in Probabilistic and Statistical Databases. Lecture 2: Representation of Probabilistic Databases. Dan Suciu University of Washington

Schema Refinement & Normalization Theory: Functional Dependencies INFS-614 INFS614, GMU 1

Query answering using views

GAV-sound with conjunctive queries

Announcements. CS 188: Artificial Intelligence Spring Bayes Net Semantics. Probabilities in BNs. All Conditional Independences

CSE 562 Database Systems

Topics in Probabilistic and Statistical Databases. Lecture 9: Histograms and Sampling. Dan Suciu University of Washington

Relational Algebra and Calculus

INTRODUCTION TO RELATIONAL DATABASE SYSTEMS

A Toolbox of Query Evaluation Techniques for Probabilistic Databases

PUG: A Framework and Practical Implementation for Why & Why-Not Provenance (extended version)

Enhancing the Updatability of Projective Views

12/3/2010 REVIEW ALGEBRA. Exam Su 3:30PM - 6:30PM 2010/12/12 Room C9000

Database Design and Normalization

Query Processing. 3 steps: Parsing & Translation Optimization Evaluation

Provenance-Based Analysis of Data-Centric Processes

Desirable properties of decompositions 1. Decomposition of relational schemes. Desirable properties of decompositions 3

Data Cleaning and Query Answering with Matching Dependencies and Matching Functions

Instructor: Sudeepa Roy

Tractable Lineages on Treelike Instances: Limits and Extensions

Properties of Real Numbers

The Complexity of Causality and Responsibility for Query Answers and non-answers

Sect Properties of Real Numbers and Simplifying Expressions

UVA UVA UVA UVA. Database Design. Relational Database Design. Functional Dependency. Loss of Information

Outline. Approximation: Theory and Algorithms. Application Scenario. 3 The q-gram Distance. Nikolaus Augsten. Definition and Properties

CMPT 354: Database System I. Lecture 9. Design Theory

Database Design and Implementation

Schema Refinement and Normal Forms

Adding and Subtracting Terms

Relational Database Design

Schema Refinement and Normal Forms

Quantifying Causal Effects on Query Answering in Databases

Precalculus Chapter P.1 Part 2 of 3. Mr. Chapman Manchester High School

Data Cleaning and Query Answering with Matching Dependencies and Matching Functions

6.830 Lecture 11. Recap 10/15/2018

The Evils of Redundancy. Schema Refinement and Normal Forms. Example: Constraints on Entity Set. Functional Dependencies (FDs) Example (Contd.

The Evils of Redundancy. Schema Refinement and Normal Forms. Example: Constraints on Entity Set. Functional Dependencies (FDs) Refining an ER Diagram

Datalog : A Family of Languages for Ontology Querying

Queries and Materialized Views on Probabilistic Databases

Schedule. Today: Jan. 17 (TH) Jan. 24 (TH) Jan. 29 (T) Jan. 22 (T) Read Sections Assignment 2 due. Read Sections Assignment 3 due.

A Comparative Study of Noncontextual and Contextual Dependencies

Provenance Analysis for Missing Answers and Integrity Repairs

Inverting Proof Systems for Secrecy under OWA

Schema Refinement and Normal Forms

Factorised Representations of Query Results: Size Bounds and Readability

Design Theory for Relational Databases. Spring 2011 Instructor: Hassan Khosravi

Schema Refinement and Normal Forms. Chapter 19

Relational Database: Identities of Relational Algebra; Example of Query Optimization

Today. Vector Clocks and Distributed Snapshots. Motivation: Distributed discussion board. Distributed discussion board. 1. Logical Time: Vector clocks

Wavelets for Efficient Querying of Large Multidimensional Datasets

Schema Refinement and Normal Forms. Why schema refinement?

Database Applications (15-415)

Tuple Relational Calculus

Count-Min Tree Sketch: Approximate counting for NLP

CS54100: Database Systems

Introduction to Data Management. Lecture #7 (Relational DB Design Theory II)

Constraints: Functional Dependencies

Annotation algebras for RDFS

Relational Algebra 2. Week 5

The Query Containment Problem: Set Semantics vs. Bag Semantics. Phokion G. Kolaitis University of California Santa Cruz & IBM Research - Almaden

Functional. Dependencies. Functional Dependency. Definition. Motivation: Definition 11/12/2013

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

Exam 1. March 12th, CS525 - Midterm Exam Solutions

Schema Refinement & Normalization Theory

Compositions of Tree Series Transformations

Justifications for Logic Programming

SYMMETRIC POLYNOMIALS

Lectures 6. Lecture 6: Design Theory

Compositions of Bottom-Up Tree Series Transformations

In-Database Factorised Learning fdbresearch.github.io

LESSON 8.1 RATIONAL EXPRESSIONS I

Scalable Uncertainty Management

Schema Refinement and Normal Forms. The Evils of Redundancy. Functional Dependencies (FDs) [R&G] Chapter 19

Introduction to Data Management. Lecture #12 (Relational Algebra II)

Brief Tutorial on Probabilistic Databases

Functional Dependencies and Normalization. Instructor: Mohamed Eltabakh

review To find the coefficient of all the terms in 15ab + 60bc 17ca: Coefficient of ab = 15 Coefficient of bc = 60 Coefficient of ca = -17

Semantic Optimization Techniques for Preference Queries

CS2742 midterm test 2 study sheet. Boolean circuits: Predicate logic:

Lineage implementation in PostgreSQL

BCNF revisited: 40 Years Normal Forms

Logical Provenance in Data-Oriented Workflows (Long Version)

An Optimal Algorithm for l 1 -Heavy Hitters in Insertion Streams and Related Problems

Transcription:

Database Design and Implementation CS 645 Data provenance

Provenance provenance, n. The fact of coming from some particular source or quarter; origin, derivation [Oxford English Dictionary] Data provenance / lineage [BunemanKhannaTan 01]: aims to explain how a particular result was derived. Data-intensive science Worry about provenance

Motivation Data integration [WangMadnick90, LeeBressanMadnick98] Data Warehousing [CuiWidonWiener00] Scientific Data Management [BunemanKhannaTan01] Determines trust on results Ensure reliability, quality of data Repeatability/verifiability Avoid effort duplication Understanding transport of annotations

Example of data provenance A typical question: For a given database query Q, a database D and a tuple t in the output of Q(D), which parts of D contribute to t? R Emp John Susan Anna Dept D01 D02 D04 S Did D01 D02 D03 Mgr Mary Ken Ed Q Q = select r.a, r.b, s.c from R r, S s where r.b = s.b Emp Dept Mgr John D01 Mary Susan D02 Ken The question can be applied to attribute values, tables, etc.

Two approaches Eager or annotation-based Changes the transformation from Q to Q to carry extra information Source data not needed after transformation Annotation-based Q Q Extra information Lazy or non-annotation based Q is unchanged Good when extra storage is an issue Recomputation and access to source required

Types of provenance Why What DB tuples contribute to the presence of each result tuple? How By what process is each output tuple produced from the DB instance? Where Where (from what attribute of what tuple) does each output tuple value come from?

Why-provenance example a.name, DISTINCT a.phone a.name, a.phone

Lineage for an output tuple t is a subset of the input tuples which are relevant to the output tuple DISTINCT a.name, a.phone Lineage: {t1, t5, t6} Problem: Not very precise. e.g., lineage above does not specify that t5 and t6 do not both need to exist.

Why provenance DISTINCT a.name, a.phone Witness of t: Any subset of the database sufficient to reconstruct tuple t in the query result. Witness basis: Leaves of the proof tree showing how result tuple t is generated Lineage: {t1, t5, t6} {t1, t5} {t1, t6} {t1, t2, t6, t8} {{t1, t5}, {t1, t6}}

Why: query rewriting t1 t t2 t3 Why(Q, I, t): {{t 1 }} Why(Q, I, t): {{t 1 }, {t 1, t 2 }} Minimal witness basis: Minimal witnesses in the witness basis

The view deletion problem D a database instance and V=Q(D) a view defined over D. Find a set of tuples ΔD to remove from D so that a specific tuple t is removed from the view Minimize the number of side-effects in the view View side-effect problem Hard: queries with joins and projection or union PTIME: the rest Minimize the number of tuples deleted from D Source side-effect problem Same dichotomy [BunemanKhannaTan. PODS 2002]

How provenance Identifies witness tuples and the operations performed on them to produce each result tuple Expresses operations using provenance semirings MERGE (+): union or projection JOIN ( ): joins

Propagating annotations (1) R A B C a b c S D B E d b e Join (on B) R S A B C D E a b c d e The annotation means joint use of the data annotated by p and the data annotated by r

Propagating annotations (2) R A B C a b c Union R S A B C a b c p + r S A B C a b c The annotation p + r means alternative use of the data annotated by p and the data annotated by r

Propagating annotations (3) R A B C a b c 1 a b c 2 p r Project π AB R A B a b p + r + s a b c 3 s + denotes alternative use of data

An example (SPJU) R A B C a b c d b e f g e p r s Q = σ C=e π AC (π AB R π BC R π AC R π BC R) A C a c a e d c d e f e For selection, multiply with annotation 0 and 1.

Example

Example

Example

Example

Example

Example

Example

Example

Example

Example

Back to example R A B C a b c d b e f g e p r s Q A C a c a e d c d e f e

Applying the laws: polynomials R A B C a b c p Q A C a e pr d b e r d e 2r 2 + rs f g e s f e rs + 2s 2 Polynomials with coefficients in and annotation tokens as indeterminates p, r, s capture a very general form of provenance

How to read this provenance R A B C a b c p Q A C a e pr d b e r d e 2r 2 + rs f g e s f e rs + 2s 2 3 ways to derive (d e) 2 of the ways use only r, but they use it twice the 3 rd uses r once and s once

Deletion Propagation R A B C a b c p Q A C a e pr Q A C a e 0 Q A C f e 2s 2 d b e r d e 2r 2 + rs d e 0 f g e s f e rs + 2s 2 f e 2s 2 Delete (d b e) from R Set r to 0!

Some useful commutative semirings Set Semantics Bag Semantics Probabilistic events Access Control Public Top Secret

Example: access control where a c 2p 2 a b c p d b e r f g e s p=p, r=s, s=t q a d d f e c e e pr pr 2r 2 +rs 2s 2 +rs a b c d b e f g e P S T q a a d d c e c e P S S S Evaluate with p=p, r=s, s=t using min for +, max for f e T User with secret clearance

Where provenance Identifies witness cells Important for annotations SELECT * FROM R WHERE A <> 5 UNION SELECT A, 7 AS B FROM R WHERE A= 5 UPDATE R SET B=7 WHERE A=5 R A B 3. 4 5 6? A B 3. 4 5 7

Color algebra [Geerts, Kementsietsidis, Milano 06] A B 3 4 5 6 P[Q] A B 3 4 5 7 Q = SELECT * FROM R WHERE A <> 5 UNION SELECT A, 7 AS B FROM R WHERE A= 5

Color algebra A B 3 4 5 6 P[Q] A B 3 4 5 7 Q = UPDATE R SET B=7 WHERE A=5

Where provenance and semirings R u A x B y C 1 a 1 b 1 c 1 S v B 1 C 1 b z c 1 m π AC (π AB R (π BC R S)) A 1 C 1 a 1 c 1 u 2 p 2 xy 2 + uvpmxyz 1 is a neutral annotation, used when we don t bother to track data

Different annotations à Different tuples R A B C a b c d b e z f g e w p r s π C σ C=e π AC (π AB R π BC R) C e z e w pr+r 2 s 2

Wrap up: issues and directions Archiving Compression Generalizations Program Slicing [Cheney07] Negative Provenance Why Not? [SIGMOD09], Artemis [PVLDB09] Causality