Provenance Semirings. Todd Green, Grigoris Karvounarakis, Val Tannen. Presented by Clemens Ley.

Provenance: place of origin.
Semiring: an algebraic structure, e.g. $(\mathbb{N}, +, *, 0, 1)$.

Outline

1. Data provenance by example
2. Relational algebra for data provenance
3. Datalog for data provenance

Data Provenance

Data provenance aims to explain how a particular query result was obtained.

R:    A  B  C   annotation
      a  b  c   p

S:    D  B  E   annotation
      d  b  e   q

Join on B, $R \bowtie S$:

      A  B  C  D  E   annotation
      a  b  c  d  e   $p * q$

The annotation $p * q$ means: the tuple was obtained from both p and q.

Data Provenance (2)

R:    A  B  C   annotation
      a  b  c   p

S:    A  B  C   annotation
      a  b  c   q

Union, $R \cup S$:

      A  B  C   annotation
      a  b  c   $p + q$

The annotation $p + q$ means: the tuple was obtained from either p or q.

Data Provenance (3)

R:    A  B  C   annotation
      a  b  c   p
      a  b  c   q
      a  b  c   r

Projection, $\pi_{AB}(R)$:

      A  B   annotation
      a  b   $p + q + r$

The annotation $p + q + r$ means: the tuple was obtained from either p, q, or r.

Data Provenance (4)

$Q = \sigma_{C=e}\, \pi_{AC}\big( \pi_{AC} R \bowtie \pi_{BC} R \;\cup\; \pi_{AB} R \bowtie \pi_{BC} R \big)$

R:    A  B  C   annotation
      a  b  c   p
      d  b  e   r
      f  g  e   s

Q:    A  C   annotation
      a  c   $(p^2 + p^2) * 0$
      a  e   $(pr) * 1$
      d  c   $(rp) * 0$
      d  e   $(r^2 + rs + r^2) * 1$
      f  e   $(s^2 + rs + s^2) * 1$

For selection, multiply by 1 or 0.

Why would this be useful?

$Q = \sigma_{C=e}\, \pi_{AC}\big( \pi_{AC} R \bowtie \pi_{BC} R \;\cup\; \pi_{AB} R \bowtie \pi_{BC} R \big)$

R:    A  B  C   annotation
      a  b  c   p := 2
      d  b  e   r := 5
      f  g  e   s := 1

Q:    A  C   annotation
      a  c   $(p^2 + p^2) * 0 = 0$
      a  e   $(pr) * 1 = 10$
      d  c   $(rp) * 0 = 0$
      d  e   $(r^2 + rs + r^2) * 1 = 55$
      f  e   $(s^2 + rs + s^2) * 1 = 7$

For bag semantics, consider annotations as multiplicities.

Why would this be useful?

R:    A  B  C   annotation
      a  b  c   $p := b_1$
      d  b  e   $r := b_2$
      f  g  e   $s := b_3$

Q:    A  C   annotation
      a  c   $(p^2 + p^2) * 0 = ((b_1 \wedge b_1) \vee (b_1 \wedge b_1)) \wedge \mathit{false}$
      a  e   $(pr) * 1 = (b_1 \wedge b_2) \wedge \mathit{true}$
      d  c   $(rp) * 0 = (b_2 \wedge b_1) \wedge \mathit{false}$
      d  e   $(r^2 + rs + r^2) * 1 = (b_2 \vee (b_2 \wedge b_3) \vee b_2) \wedge \mathit{true}$
      f  e   $(s^2 + rs + s^2) * 1 = (b_3 \vee (b_2 \wedge b_3) \vee b_3) \wedge \mathit{true}$

For incomplete databases, consider annotations as Boolean values, $*$ as $\wedge$, $+$ as $\vee$, 1 as true, and 0 as false.

Data Structure

Relations are mappings from tuples to annotations in K; we require that $R(t) \neq 0$ for only finitely many tuples t.

Intuitively:
  + means alternative use and corresponds to union
  * means joint use and corresponds to join
  0 and 1 are special annotations

But what is a query language for such relations? What is (K, +, *, 0, 1), and how are annotations computed?

Positive Algebra

Definition 3.2. Let $(K, +, \cdot, 0, 1)$ be an algebraic structure with two binary operations and two distinguished elements. The operations of the positive algebra are defined as follows:

empty relation: For any set of attributes $U$, there is $\emptyset : U\text{-Tup} \to K$ such that $\emptyset(t) = 0$.

union: If $R_1, R_2 : U\text{-Tup} \to K$ then $R_1 \cup R_2 : U\text{-Tup} \to K$ is defined by
  $(R_1 \cup R_2)(t) \overset{\mathrm{def}}{=} R_1(t) + R_2(t)$

projection: If $R : U\text{-Tup} \to K$ and $V \subseteq U$ then $\pi_V R : V\text{-Tup} \to K$ is defined by
  $(\pi_V R)(t) \overset{\mathrm{def}}{=} \sum_{t = t' \text{ on } V \text{ and } R(t') \neq 0} R(t')$
(here $t = t'$ on $V$ means $t'$ is a $U$-tuple whose restriction to $V$ is the same as the $V$-tuple $t$; note also that the sum is finite since $R$ has finite support).

Positive Algebra (2)

selection: If $R : U\text{-Tup} \to K$ and the selection predicate $P$ maps each $U$-tuple to either 0 or 1 then $\sigma_P R : U\text{-Tup} \to K$ is defined by
  $(\sigma_P R)(t) \overset{\mathrm{def}}{=} R(t) \cdot P(t)$
Which $\{0, 1\}$-valued functions are used as selection predicates is left unspecified, except that we assume that false (the constantly 0 predicate) and true (the constantly 1 predicate) are always available.

natural join: If $R_i : U_i\text{-Tup} \to K$, $i = 1, 2$, then $R_1 \bowtie R_2$ is the $K$-relation over $U_1 \cup U_2$ defined by
  $(R_1 \bowtie R_2)(t) \overset{\mathrm{def}}{=} R_1(t_1) \cdot R_2(t_2)$
where $t_1 = t$ on $U_1$ and $t_2 = t$ on $U_2$ (recall that $t$ is a $U_1 \cup U_2$-tuple).

renaming: If $R : U\text{-Tup} \to K$ and $\beta : U \to U'$ is a bijection then $\rho_\beta R$ is a $K$-relation over $U'$ defined by
  $(\rho_\beta R)(t) \overset{\mathrm{def}}{=} R(t \circ \beta)$
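The operations of Definition 3.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Semiring` class, the `(attrs, rows)` representation of a K-relation and all function names are made up for this sketch. It runs the example query Q from the earlier slides under bag semantics (p := 2, r := 5, s := 1) and under the Boolean semiring; the projection onto A, C is pushed inside the union so that both operands have the same attribute list.

```python
class Semiring:
    """A commutative semiring (K, +, *, 0, 1), given by its operations."""
    def __init__(self, add, mul, zero, one):
        self.add, self.mul, self.zero, self.one = add, mul, zero, one

NAT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)             # bag semantics
BOOL = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)  # set semantics

# A K-relation is a pair (attrs, rows); rows maps value tuples to annotations.
# Tuples annotated with 0 are dropped, so the support stays finite.
def make(K, attrs, rows):
    return (tuple(attrs), {t: k for t, k in rows.items() if k != K.zero})

def union(K, r1, r2):
    (attrs, d1), (attrs2, d2) = r1, r2
    assert attrs == attrs2, "union needs identical attribute lists"
    out = dict(d1)
    for t, k in d2.items():
        out[t] = K.add(out.get(t, K.zero), k)
    return make(K, attrs, out)

def project(K, r, keep):
    attrs, d = r
    idx = [attrs.index(a) for a in keep]
    out = {}
    for t, k in d.items():
        s = tuple(t[i] for i in idx)
        out[s] = K.add(out.get(s, K.zero), k)  # sum over all t' agreeing with s
    return make(K, keep, out)

def select(K, r, pred):
    attrs, d = r
    out = {t: K.mul(k, K.one if pred(dict(zip(attrs, t))) else K.zero)
           for t, k in d.items()}
    return make(K, attrs, out)

def join(K, r1, r2):
    (a1, d1), (a2, d2) = r1, r2
    attrs = a1 + tuple(a for a in a2 if a not in a1)
    out = {}
    for t1, k1 in d1.items():
        row1 = dict(zip(a1, t1))
        for t2, k2 in d2.items():
            row2 = dict(zip(a2, t2))
            if all(row1[a] == row2[a] for a in a1 if a in a2):
                row = {**row1, **row2}
                out[tuple(row[a] for a in attrs)] = K.mul(k1, k2)
    return make(K, attrs, out)

def Q(K, R):
    AC, BC, AB = ("A", "C"), ("B", "C"), ("A", "B")
    left = join(K, project(K, R, AC), project(K, R, BC))
    right = join(K, project(K, R, AB), project(K, R, BC))
    both = union(K, project(K, left, AC), project(K, right, AC))
    return select(K, both, lambda row: row["C"] == "e")

tuples = [("a", "b", "c"), ("d", "b", "e"), ("f", "g", "e")]
R_bag = make(NAT, ("A", "B", "C"), dict(zip(tuples, [2, 5, 1])))
R_set = make(BOOL, ("A", "B", "C"), dict(zip(tuples, [True, True, True])))
print(Q(NAT, R_bag)[1])   # multiplicities of (a,e), (d,e), (f,e)
print(Q(BOOL, R_set)[1])
```

Because every operation only uses `K.add`, `K.mul`, `K.zero` and `K.one`, switching semantics means switching the `Semiring` value and nothing else.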

What is K?

Proposition 3.4. The following RA identities:
  union is associative, commutative and has identity $\emptyset$;
  join is associative, commutative and distributive over union;
  projections and selections commute with each other as well as with unions and joins (when applicable);
  $\sigma_{\mathit{false}}(R) = \emptyset$ and $\sigma_{\mathit{true}}(R) = R$
hold for the positive algebra on K-relations if and only if $(K, +, \cdot, 0, 1)$ is a commutative semiring.

Def. A commutative semiring is a structure (K, +, *, 0, 1) where
  + is commutative and associative, with identity 0
  * is commutative and associative, with identity 1
  * distributes over +
  a * 0 = 0 * a = 0

Note that the list does not contain idempotence of union and of self-join, as these fail for bag semantics.

What is K?

Examples of commutative semirings:
  the natural numbers: $(\mathbb{N}, +, *, 0, 1)$
  the Booleans: $(\mathbb{B}, \vee, \wedge, \mathit{false}, \mathit{true})$
  subsets of a set: $(\mathcal{P}(\Omega), \cup, \cap, \emptyset, \Omega)$
  the naturals with infinity: $(\mathbb{N}^\infty, +, *, 0, 1)$
  polynomials in X: $(\mathbb{N}[X], +, *, 0, 1)$
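For the finite examples, the semiring laws can be checked by brute force. The sketch below is an illustration only; the function name `is_commutative_semiring` is hypothetical, and for the natural numbers it tests the laws on a small sample rather than proving them. It also shows why the order of the distinguished elements matters: with $\vee$ as + and $\wedge$ as *, the element *true* cannot play the role of 0.

```python
from itertools import product

def is_commutative_semiring(elems, add, mul, zero, one):
    """Check the commutative-semiring laws on the given (finite) sample."""
    elems = list(elems)
    for a in elems:
        if add(a, zero) != a or mul(a, one) != a or mul(a, zero) != zero:
            return False
    for a, b in product(elems, repeat=2):
        if add(a, b) != add(b, a) or mul(a, b) != mul(b, a):
            return False
    for a, b, c in product(elems, repeat=3):
        if add(add(a, b), c) != add(a, add(b, c)):
            return False
        if mul(mul(a, b), c) != mul(a, mul(b, c)):
            return False
        if mul(a, add(b, c)) != add(mul(a, b), mul(a, c)):
            return False
    return True

def lor(a, b):
    return a or b

def land(a, b):
    return a and b

# Booleans with + = or, * = and, 0 = False, 1 = True.
booleans_ok = is_commutative_semiring([False, True], lor, land, False, True)

# Same operations, but with the roles of 0 and 1 swapped: the laws fail.
booleans_swapped = is_commutative_semiring([False, True], lor, land, True, False)

# Subsets of Omega = {1, 2} with union and intersection.
omega = frozenset({1, 2})
subsets = [frozenset(), frozenset({1}), frozenset({2}), omega]
powerset_ok = is_commutative_semiring(
    subsets, lambda a, b: a | b, lambda a, b: a & b, frozenset(), omega)

# Natural numbers, checked on the sample 0..4 only.
naturals_ok = is_commutative_semiring(
    range(5), lambda a, b: a + b, lambda a, b: a * b, 0, 1)

print(booleans_ok, booleans_swapped, powerset_ok, naturals_ok)
```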

The fundamental property of RA

For every query q and every homomorphism of commutative semirings $h : K_1 \to K_2$, the following diagram commutes:

  $K_1$-data  ---h--->  $K_2$-data
      |                     |
      q                     q
      v                     v
  $K_1$-data  ---h--->  $K_2$-data

That is, $h(q(R)) = q(h(R))$ for every $K_1$-relation R.

This works only if q is in RA+. It does not generalize, e.g., to negation.

Recall: a semiring homomorphism is a mapping $h : K_1 \to K_2$ such that
  $h(1_{K_1}) = 1_{K_2}$
  $h(0_{K_1}) = 0_{K_2}$
  $h(a +_{K_1} b) = h(a) +_{K_2} h(b)$
  $h(a *_{K_1} b) = h(a) *_{K_2} h(b)$

Which semiring do we choose?

Definition 4.1. Let X be the set of tuple ids of a (usual) database instance I. The positive algebra provenance semiring for I is the semiring of polynomials with variables (a.k.a. indeterminates) from X and coefficients from $\mathbb{N}$, with the operations defined as usual: $(\mathbb{N}[X], +, \cdot, 0, 1)$.

But why?

A nice property of N[X]

If K is a commutative semiring, then any function on tokens, $f : X \to K$, extends uniquely to a homomorphism $h : \mathbb{N}[X] \to K$.

Example: Let $K = (\mathbb{N}, +, *, 0, 1)$, $X = \{p, q\}$, and let $f : \{p, q\} \to K$ be given by f(p) = 3 and f(q) = 5. Then the extension of f is $h : \mathbb{N}[X] \to K$, which evaluates the polynomial (e.g. h(p + 2q) = 13).

Nice + Fundamental

R (annotated in N[X]):            R (annotated in N):
  A  B  C                           A  B  C
  a  b  c   p      --- h --->       a  b  c   3
  a  d  e   q                       a  d  e   5

     | query $\pi_A$                   | query $\pi_A$
     v                                 v

$\pi_A(R)$:                       $\pi_A(R)$:
  A                                 A
  a   p + q        --- h --->       a   8

Example: Let $K = (\mathbb{N}, +, *, 0, 1)$, $X = \{p, q\}$, and let $f : \{p, q\} \to K$ be given by f(p) = 3 and f(q) = 5. Then the extension of f is $h : \mathbb{N}[X] \to K$, which evaluates the polynomial (e.g. h(p + 2q) = 13).
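The square on this slide can be replayed in code. The following sketch makes several assumptions for brevity: polynomials are stored as a `Counter` from monomials (tuples of variable names) to coefficients, only polynomial addition is implemented because projection needs nothing else, and the helper names `var`, `evaluate` and `project_A` are invented here.

```python
from collections import Counter

def var(x):
    """The polynomial consisting of the single variable x."""
    return Counter({(x,): 1})

def evaluate(poly, f):
    """Extension h of the token valuation f to polynomials over N."""
    total = 0
    for mono, coef in poly.items():
        term = coef
        for x in mono:
            term *= f[x]
        total += term
    return total

def project_A(rel, add, zero):
    """Projection onto the first attribute; annotations are combined with add."""
    out = {}
    for t, k in rel.items():
        out[(t[0],)] = add(out.get((t[0],), zero), k)
    return out

f = {"p": 3, "q": 5}

# h(p + 2q), as in the example on the slide.
p_plus_2q = Counter({("p",): 1, ("q",): 2})
print(evaluate(p_plus_2q, f))

# Abstractly tagged relation R with annotations p and q.
R_poly = {("a", "b", "c"): var("p"), ("a", "d", "e"): var("q")}

# Path 1: query first (in N[X]), then apply h.
projected = project_A(R_poly, lambda a, b: a + b, Counter())
query_then_h = {t: evaluate(k, f) for t, k in projected.items()}

# Path 2: apply h first, then query (in N).
R_nat = {t: evaluate(k, f) for t, k in R_poly.items()}
h_then_query = project_A(R_nat, lambda a, b: a + b, 0)

print(query_then_h, h_then_query)
```

Both paths annotate the tuple (a) with 8, which is the commutation the diagram expresses.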

Free the semiring!

Nice implies: For every commutative semiring K and every K-relation R, there is an abstractly tagged $\mathbb{N}[X]$-relation $\bar{R}$ and a homomorphism $\mathrm{Eval}_v : \mathbb{N}[X] \to K$ that maps $\bar{R}$ to R.

$\bar{R}$:                         R:
  A  B                               A  B
  a  b   p    --- Eval_v --->        a  b   3
  a  d   q                           a  d   5

     | query q                          | query q
     v                                  v

$q(\bar{R})$:                      q(R):
  A  B                               A  B
  a  b   pq   --- Eval_v --->        a  b   15

Theorem 4.3. For any RA+ query q we have $q(R) = \mathrm{Eval}_v \circ q(\bar{R})$.

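Instantiation of Positive Algebra

  $(\mathbb{B}, \vee, \wedge, \mathit{false}, \mathit{true})$: set semantics
  $(\mathbb{N}, +, *, 0, 1)$: bag semantics
  $(\mathcal{P}(\Omega), \cup, \cap, \emptyset, \Omega)$: probabilistic events
  $(\mathrm{BoolExp}(P), \vee, \wedge, \mathit{false}, \mathit{true})$: conditional tables
  $(A, \min, \max, 0, P)$ where $A = P < C < S < T < 0$: access control levels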

More nice

Example: $2x^2y + xy + 5y^2 + z$

  N[X] (most informative):                       $2x^2y + xy + 5y^2 + z$
  B[X], drop coefficients:                       $x^2y + xy + y^2 + z$
  Trio(X), drop exponents:                       $3xy + 5y + z$
  Why(X), drop both exponents and coefficients:  $xy + y + z$
  Lin(X), collapse terms:                        $xyz$
  PosBool(X), apply absorption (ab + b = b):     $y + z$   (least informative)

A path downward from $K_1$ to $K_2$ indicates that there exists an onto (surjective) semiring homomorphism $h : K_1 \to K_2$.
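The maps in this hierarchy are simple to write down for a single polynomial. The sketch below is illustrative only: the representation (a dict from monomials, written as tuples of (variable, exponent) pairs, to coefficients) and the function names are assumptions, and the special handling of the zero polynomial in Lin(X) is left out.

```python
# 2x^2y + xy + 5y^2 + z
poly = {
    (("x", 2), ("y", 1)): 2,
    (("x", 1), ("y", 1)): 1,
    (("y", 2),): 5,
    (("z", 1),): 1,
}

def to_bx(p):
    """B[X]: drop coefficients, keep exponents."""
    return set(p)

def to_trio(p):
    """Trio(X): drop exponents; coefficients of merged monomials add up."""
    out = {}
    for mono, coef in p.items():
        vars_ = frozenset(v for v, _ in mono)
        out[vars_] = out.get(vars_, 0) + coef
    return out

def to_why(p):
    """Why(X): drop both exponents and coefficients."""
    return {frozenset(v for v, _ in mono) for mono in p}

def to_lin(p):
    """Lin(X): collapse all terms into one set of variables."""
    return frozenset().union(*to_why(p))

def to_posbool(p):
    """PosBool(X): absorption removes every term that contains a smaller one."""
    why = to_why(p)
    return {m for m in why if not any(other < m for other in why)}

print(to_trio(poly))
print(to_why(poly))
print(to_lin(poly))
print(to_posbool(poly))
```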

Datalog

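Syntax and Semantics

Figure 6: Datalog with bag semantics

(a) Program:  Q(x,y) :- R(x,z), R(z,y)

(b) edb R:    a a   2
              a b   3
              b b   4

(c) idb Q:    a a   2 * 2 = 4
              a b   2 * 3 + 3 * 4 = 18
              b b   4 * 4 = 16

Derivation trees (root, then leaves):
  Q(a,b)  from  R(a,a), R(a,b)
  Q(a,b)  from  R(a,b), R(b,b)
  Q(a,a)  from  R(a,a), R(a,a)
  Q(b,b)  from  R(b,b), R(b,b)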

What annotations do we need?

How about: the tag of an answer tuple is the sum, over all its derivation trees, of the product of the tags of the leaves of each tree.

  $q(R)(t) = \sum_{\tau \text{ yields } t} \; \prod_{t' \in \mathrm{leaves}(\tau)} R(t')$

Problem: A tuple may have infinitely many derivation trees. Hence we need to work in semirings in which infinite sums are defined.
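For a non-recursive program the sum over derivation trees is finite and can be computed directly. The sketch below reproduces the numbers of Figure 6 under one assumption that should be stated clearly: the program is taken to be the single rule Q(x,y) :- R(x,z), R(z,y), which is consistent with the derivation trees shown on the slides. The function names are hypothetical.

```python
from collections import defaultdict

# edb relation R with multiplicities, as in Figure 6 (b).
R = {("a", "a"): 2, ("a", "b"): 3, ("b", "b"): 4}

def derivation_trees(R):
    """Yield (head, leaves) for the assumed rule Q(x,y) :- R(x,z), R(z,y).

    Every derivation tree has exactly two leaves, one per body atom.
    """
    for (x, z1) in R:
        for (z2, y) in R:
            if z1 == z2:
                yield (x, y), [(x, z1), (z2, y)]

def annotate(R):
    """Tag of a tuple = sum over its derivation trees of the product of leaf tags."""
    Q = defaultdict(int)
    for head, leaves in derivation_trees(R):
        product = 1
        for leaf in leaves:
            product *= R[leaf]
        Q[head] += product
    return dict(Q)

Q = annotate(R)
print(Q)
```

Q(a,b) has two derivation trees, contributing 2 * 3 and 3 * 4, which is where the 18 comes from.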

ω-continuous semirings

Def. (Natural preorder) $x \leq y$ iff there is z such that $x + z = y$.
  A preorder is reflexive and transitive, but not necessarily antisymmetric (antisymmetry: $x \leq y$ and $y \leq x$ implies $x = y$).

Def. (Naturally ordered semiring) A semiring is naturally ordered if the natural preorder is an order.
  E.g. $\mathbb{Z}$ is not naturally ordered because $-5 \leq 5$ and $5 \leq -5$, but $-5 \neq 5$.

Def. (ω-complete) A naturally ordered semiring is ω-complete when chains $x_1 \leq x_2 \leq x_3 \leq \dots$ have suprema.

In ω-complete naturally ordered semirings, we can make sense of infinite sums:

  $\sum_{n \in \mathbb{N}} a_n \overset{\mathrm{def}}{=} \sup_{m \in \mathbb{N}} \Big( \sum_{i=0}^{m} a_i \Big)$

Def. (ω-continuous) An ω-complete semiring is ω-continuous when * and + preserve suprema (e.g. $\sup_i (a_i + b) = (\sup_i a_i) + b$).

Lemma. Over ω-continuous semirings, functions defined by polynomials have least fixed points.
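As a small worked instance of the infinite-sum definition (chosen here for illustration, not taken from the slides), consider summing infinitely many copies of 1 in $\mathbb{N}^\infty$, and infinitely many copies of *true* in the Booleans, where + is $\vee$:

```latex
\[
\sum_{n \in \mathbb{N}} 1
  \;=\; \sup_{m \in \mathbb{N}} \Big( \sum_{i=0}^{m} 1 \Big)
  \;=\; \sup_{m \in \mathbb{N}} \, (m + 1)
  \;=\; \infty
  \qquad \text{in } \mathbb{N}^\infty ,
\]
\[
\sum_{n \in \mathbb{N}} \mathit{true}
  \;=\; \sup_{m \in \mathbb{N}} \Big( \bigvee_{i=0}^{m} \mathit{true} \Big)
  \;=\; \mathit{true}
  \qquad \text{in } \mathbb{B} .
\]
```

In $\mathbb{N}$ itself the first chain has no supremum, which is why the naturals have to be extended with $\infty$ before infinite sums make sense.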

Semantics of annotated Datalog

Definition 5.1. Let $(K, +, \cdot, 0, 1)$ be a commutative ω-continuous semiring. To keep notation simple, let q be a datalog query with one argument (it is easy to generalize to multiple arguments). For any K-relation R define

  $q(R)(t) = \sum_{\tau \text{ yields } t} \; \prod_{t' \in \mathrm{leaves}(\tau)} R(t')$

where $\tau$ ranges over all q-derivation trees for t and $t'$ ranges over all the leaves of $\tau$.

For every Datalog query q and every ω-continuous homomorphism of commutative semirings $h : K_1 \to K_2$, the following diagram commutes:

  $K_1$-data  ---h--->  $K_2$-data
      |                     |
      q                     q
      v                     v
  $K_1$-data  ---h--->  $K_2$-data

The Datalog provenance semiring

Problem: there can be infinitely many derivation trees for one tuple, hence infinite sums in annotations.

In particular, there are two kinds of infinite summations:
  infinitely many copies of the same monomial: coefficients in $\mathbb{N}^\infty = \mathbb{N} \cup \{\infty\}$
  infinitely many different monomials: formal power series $K[[X]]$

Formal power series: basically polynomials with infinite summation.

Definition 6.1. Let X be the set of tuple ids of a database instance I. The datalog provenance semiring for I is the commutative ω-continuous semiring of formal power series $\mathbb{N}^\infty[[X]]$.

Fixed Point Semantics

(a) Program:
      Q(x, y) :- R(x, y)
      Q(x, y) :- Q(x, z), Q(z, y)

(b) R (bag):        (c) Q(R) (bag):
      a b   2             a b   8
      a c   3             a c   3
      c b   2             c b   2
      b d   1             b d   $\infty$
      d d   1             d d   $\infty$
                          a d   $\infty$

(d) R (abstract):   (e) Q(R) (abstract):   (f) Equations:
      a b   m             a b   x                x = m + yz
      a c   n             a c   y                y = n
      c b   p             c b   z                z = p
      b d   r             b d   u                u = r + uv
      d d   s             d d   v                v = s + v^2
                          a d   w                w = xu + wv

Transform the immediate consequence operator of Q into a union of conjunctive queries; here $R \cup (Q \bowtie_{2=1} Q)$.
Apply this RA query to R and Q. Equate!
This leads to a system of polynomial equations in $\mathbb{N}^\infty[[m, n, p, r, s]][x, y, z, u, v, w]$.
As $\mathbb{N}^\infty[[m, n, p, r, s]]$ is ω-continuous, these equations have least fixed points that can be computed.
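The least fixed point of the equation system can be approached by Kleene iteration, starting from all zeros and applying the right-hand sides repeatedly. The sketch below does this for the bag instance (m = 2, n = 3, p = 2, r = 1, s = 1). It is an illustration, not the algorithm of the paper: a bounded number of rounds only shows which components stabilize and which keep growing, and the function names are hypothetical.

```python
def step(state, m=2, n=3, p=2, r=1, s=1):
    """One application of the right-hand sides of the equation system."""
    x, y, z, u, v, w = state
    return (m + y * z, n, p, r + u * v, s + v * v, x * u + w * v)

def kleene(rounds):
    """Kleene chain: the all-zero state followed by `rounds` applications of step."""
    states = [(0, 0, 0, 0, 0, 0)]
    for _ in range(rounds):
        states.append(step(states[-1]))
    return states

states = kleene(4)
for i, st in enumerate(states):
    print(i, dict(zip("xyzuvw", st)))
```

The components x, y, z reach 8, 3, 2 after two rounds and stay there. The components u, v, w keep increasing, so the supremum of their chains in $\mathbb{N}^\infty$ is $\infty$.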

Decidability

A tuple can have an annotation in any of the classes below. It is decidable which class the annotation of a tuple belongs to.

  $\mathbb{N}^\infty[[X]]$    $\mathbb{N}^\infty[X]$    $\mathbb{N}[[X]]$    $\mathbb{N}[X]$

Decidability: Case N[X]

Claim: Let Q be a Datalog program, D a database, and R a relation in the intensional schema of Q. Then $R(t) \notin \mathbb{N}[X]$ iff t has a derivation tree T w.r.t. Q and D, of height less than (# of atoms + 2), that has a path with two occurrences of the same atom a.

Proof:

Assume such a tree exists, with a path on which the atom a occurs twice. Then the tree obtained by repeating the part between the two occurrences of a is also a derivation tree, now with three occurrences of a on that path. Repeating this step yields infinitely many derivation trees for t.

Conversely, assume no such tree exists. In particular, trees of height (# of atoms + 1) have no path with two repeated atoms. Hence, by the pigeonhole principle, there are no derivation trees of height greater than or equal to (# of atoms + 1). Thus there are only finitely many derivation trees.

Also decidable:
  given $t \in q(I)$ and a monomial $\mu$, the coefficient of $\mu$ in the power series that is the provenance of t is computable (including when it is $\infty$);
  testing whether all coefficients are .

Not decidable: testing whether all coefficients are 1.

Conclusion

A versatile framework for provenance computation.
Specializes to many known systems for provenance.
In a sense, the most general among frameworks that use semirings.
Provides semantics for positive datalog under rich semantics (e.g. bag semantics).

Thank You!