A Dichotomy. in in Probabilistic Databases. Joint work with Robert Fink. for Non-Repeating Queries with Negation Queries with Negation

Dichotomy for Non-Repeating Queries with Negation Queries with Negation in in Probabilistic Databases Robert Dan Olteanu Fink and Dan Olteanu Joint work with Robert Fink Uncertainty in Computation Simons Institute for the Theory of Computing PODS, erkeley June 24, Oct 2014 5, 2016 1 / 20 28

Outline Probabilistic Databases 101 The Dichotomy The Interesting but Hard Queries The Easy Queries Leftovers 2 / 28

Tuple-Independent Probabilistic Databases Tuple-independent database of n tuples (t i ) i [n] : Each tuple t i associated with an independent oolean random variable x i. P(x i = true) gives the probability that t i exists in the database. Possible-worlds semantics: Each possible world defined by an assignment θ of the variables (x i ) i [n] : It consists of all tuples t i for which θ(x i ) = true. It has probability P(θ) = Π i [n] P(x i = θ(x i )). tuple-independent database with n tuples has 2 n possible worlds. 3 / 28

Relational lgebra Popular database query language since Codd times. lgebra carrier: set of all finite relations lgebra operations: π (projection), (Cartesian product), (set difference), (join), σ (selection), (set union), δ (renaming) s expressive as domain relational calculus (RC) In this talk: Relational algebra fragment 1R Included: Equality joins, selections, projections, difference Excluded: Repeating relation symbols, unions Examples of (oolean) 1R queries: re there combinations of tuples in (R, T ) that are not in (U, V )? [( ) ( )] π R() T () U() V () [( ) ( )] R() T () U() V () (in RC) Does relation S hold hands with both R and T? π [ R() S(, ) T () ] [ R() S(, ) T () ] (in RC) 4 / 28

The Query Evaluation Problem For any oolean 1R query Q and tuple-independent database D: Compute the probability that Q is true in a random world of D. The case of non-oolean queries can be reduced to the oolean case. We are interested in the data complexity of this problem. Fix the query Q and take the database D as input. 5 / 28

Outline Probabilistic Databases 101 The Dichotomy The Interesting but Hard Queries The Easy Queries Leftovers 6 / 28

Data complexity of any 1R query Q in tuple-independent databases: Polynomial time if Q is hierarchical and #P-hard otherwise. 7 / 28

Hierarchical 1R Queries (oolean) 1R query Q is hierarchical if For every pair of distinct query variables and in Q, there is no triple of relation symbols R, S, and T in Q such that: R has but not, S has both and, and T has but not. 8 / 28

Examples Hierarchical queries: π [( R() S(, ) ) T (, ) ] π [( R() T () ) ( U() V () )] π [ (M() N() ) [( R() T () ) ( U() V () )] ] π [ π [ M() N() ] π [( R() T () ) ( U() V () )] ] 10 / 28

Examples Hierarchical queries: π [( R() S(, ) ) T (, ) ] π [( R() T () ) ( U() V () )] π [ (M() N() ) [( R() T () ) ( U() V () )] ] π [ π [ M() N() ] π [( R() T () ) ( U() V () )] ] Non-hierarchical queries: [ ] π R() S(, ) T () [ ( ) ] π π R() S(, ) T () [ ( ) ] π T () π R() S(, ) [ π X () [ ( )] ] R() π T () S(, ) 10 / 28

Outline Probabilistic Databases 101 The Dichotomy The Interesting but Hard Queries The Easy Queries Leftovers 11 / 28

Hardness Proof Idea Reduction from #P-hard model counting problem for positive bipartite DNF: Given a non-hierarchical 1R query Q and ny positive bipartite DNF formula Ψ over disjoint sets X and Y of random variables. #Ψ can be computed using linearly many calls to an oracle for P(Q), where Q is evaluated on tuple-independent databases of sizes linear in the size of Ψ. 12 / 28

Simple Case Input formula and query: Ψ = x 1 y 1 x 1 y 2 x 2 y 1 over sets X = {x 1, x 2}, Y = {y 1, y 2} ] Q = π [R() S(, ) T () Construct a database D such that Ψ becomes the grounding of Q wrt D: Column holds formulas over random variables. We use for true and for false. Variables also used as constants for and. S(x i, y j, ): x i y j is a clause in Ψ. R(x i, x i ) and T (y j, y j ): x i is a variable in X and y j is a variable in Y. R x 1 x 1 x 2 x 2 T y 1 y 1 y 2 y 2 S x 1 y 1 x 1 y 2 x 2 y 1 R S T x 1 y 1 x 1 y 1 x 1 y 2 x 1 y 2 x 2 y 1 x 2 y 1 π [ R S T ] () Ψ 13 / 28

Surprising Case Input formula and query: Ψ = x 1 y 1 x 1 y 2 over sets X = {x 1}, Y = {y 1, y 2} [ ( ) ] Q = π R() π T () S(, ) Construct a database D such that Ψ becomes the grounding of Q wrt D: S(a, b, ): Clause a has variable b in Ψ. R(a, ) and T (b, b): a is a clause and b is a variable in Ψ. R T S T S π (T S) R π (T S) 1 2 x 1 x 1 y 1 y 1 y 2 y 2 1 x 1 1 y 1 2 x 1 2 y 2 1 x 1 x 1 1 y 1 y 1 2 x 1 x 1 2 y 2 y 2 1 x 1 y 1 2 x 1 y 2 1 x 1 y 1 2 x 1 y 2 14 / 28

More Involved Case Input formula and query: Ψ = x 1 y 1 x 1 y 2 x 2 y 1 over sets X = {x 1, x 2}, Y = {y 1, y 2} ] Q = π [S(, ) R() T () We need a different reduction gadget: Use additional random variables Z = {z 1,..., z E }, one per clause in Ψ = ψ 1 ψ E. Construct a database D such that the grounding of Q wrt D is Υ = [ E i=1 z ] E i ψ i = i=1 (z i ψ i ). R x 1 x 1 x 2 x 2 T y 1 x 1 y 2 y 2 S x 1 y 1 z 1 x 1 y 2 z 2 x 2 y 1 z 3 S R T x 1 y 1 z 1 (x 1 y 1 ) x 1 y 2 z 2 (x 1 y 2 ) x 2 y 1 z 3 (x 2 y 1 ) π [ S R T ] () E i=1 z i ψ i Compute #Ψ using linearly many calls to the oracle for P Q = 1 P(Υ). 15 / 28

The Small Print (1/2) Ψ = (i,j) E x iy j = ψ 1 ψ E over disjoint variable sets X and Y Let Θ be the set of assignments of variables X Y that satisfy Ψ: #Ψ = 1. θ Θ:θ =Ψ Partition Θ into disjoint sets Θ 0 Θ E, such that θ Θ i if and only if θ satisfies exactly i clauses of Ψ: #Ψ = θ Θ 1 :θ =Ψ 1 + + θ Θ E :θ =Ψ 1 = Θ 1 + + Θ E. Θ 1,..., Θ E can be computed using an oracle for P Υ : E E Υ = z i ψ i or, equivalently Υ = (z i ψ i ) i=1 i=1 16 / 28

The Small Print (2/2) Express the probability of Υ = E i=1 (z i ψ i ) as a function of Θ 1,..., Θ E : Fix the probabilities of variables in X Y to 1/2 and of variables in Z to p z. Then: ( ) ( ) E exactly k clauses exactly k clauses P Υ = P Υ P k=0 of Ψ are satisfied of Ψ are satisfied }{{} p E k z = 1 X + Y E p E k z Θ k 2 k=0 } {{ } 1 X + Y Θ k 2 This is a polynomial in p z of degree E, with coefficients Θ 0,..., Θ E. The coefficients can be derived from E + 1 pairs (p z, P Υ ) using Lagrange s polynomial interpolation formula. E + 1 oracle calls to P Υ suffice to determine #Ψ = E i=0 Θ i. 17 / 28

Hard Query Patterns There are 48 (!) minimal non-hierarchical query patterns. inary trees with leaves,, and and inner nodes or. Some are symmetric and need not be considered separately: and can be exchanged, joins are commutative and associative. Still, many cases left to consider due to the difference operator. P 1.1 P 1.2 P 1.3 P 1.4 P 5.1 P 5.2 P 5.3 P 5.4............ There is a database construction scheme for each pattern. Each non-hierarchical query Q matches a pattern P x.y. 18 / 28

Non-hierarchical Queries Match Minimal Hard Patterns Each non-hierarchical query Q matches a pattern P x.y: There is a total mapping from P x.y to Q s parse tree that is identity on inner nodes and, preserves ancestor-descendant relationships, maps leaves to relations: to R(); to S(, ); and to T (). π Pattern P 5.3 Query Q X () R() π T () S(, ) The match preserves the grounding of the query pattern: Q and P x.y have the same grounding for any database using our construction scheme. 19 / 28

Outline Probabilistic Databases 101 The Dichotomy The Interesting but Hard Queries The Easy Queries Leftovers 20 / 28

Evaluation of Hierarchical 1R Queries pproach based on knowledge compilation For any database D, the probability P Q(D) of a 1R query Q is the probability P Ψ of Q s grounding Ψ. Compile Ψ into ODD(Ψ) in polynomial time. Compute probability of ODD(Ψ) in time linear in its size. Distinction from existing tractability results [O. & Huang 2008]: 1R without negation: Grounding formulas are read-once. Read-once formulas admit linear-size ODs. 1R : Grounding formulas are not read-once. They admit ODs of sizes linear in the database size but exponential in the query size. 21 / 28

The Inner Workings From hierarchical 1R to RC-hierarchical -consistent RC : Translate query Q into an equivalent disjunction of disjunction-free existential relational calculus queries Q 1 Q k. k can be very large for queries with projection under difference! RC-hierarchical: For each X (Q ), every relation symbol in Q has variable X. Each of the disjuncts yields a poly-size ODD. -consistent: The nesting order of the quantifiers is the same in Q 1,, Q k. ll ODDs have compatible variable orders and their disjunction is a poly-size ODD. The ODD width grows exponentially with k, its height stays linear in the size of the database. Width = maximum number of edges crossing the section between any two consecutive levels. 22 / 28

Query Evaluation Example (1/3) Consider the following query and tuple-independent database: Q = π [ (R() T () ) ( U() V () ) ] R T U V R T R T U V 1 r 1 2 r 2 1 t 1 2 t 2 1 u 1 2 u 2 1 v 1 2 v 2 1 1 r 1 t 1 1 2 r 1 t 2 2 1 r 2 t 1 2 2 r 2 t 2 1 1 r 1 t 1 (u 1 v 1 ) 1 2 r 1 t 2 (u 1 v 2 ) 2 1 r 2 t 1 (u 2 v 1 ) 2 2 r 2 t 2 (u 2 v 2 ) The grounding of Q is: Ψ = r 1 [ t1 ( u 1 v 1 ) t 2 ( u 1 v 2 ) ] r 2 [ t1 ( u 2 v 1 ) t 2 ( u 2 v 2 ) ]. Variables entangle in Ψ beyond read-once factorization. This is the pivotal intricacy introduced by the difference operator. 23 / 28

Query Evaluation Example (2/3) Translate Q = π [ (R() T () ) ( U() V () ) ] into RC : ( ) ( ) Q RC = R() U() T () R() T () V () }{{}}{{} Q 1 oth Q 1 and Q 2 are RC-hierarchical. Q 1 Q 2 is -consistent: Same order for Q 1 and Q 2. Query grounding: Q 2. Ψ = (r 1 u 1 r 2 u 2 ) (t 1 t 2 ) (r 1 r 2 ) (t 1 v 1 t 2 v 2 ) }{{}}{{} Ψ 1 Ψ 2. oth Ψ 1 and Ψ 2 admit linear-size ODDs. The two ODDs have compatible orders and their disjunction is an ODD whose width is the product of the widths of the two ODDs. 24 / 28

Query Evaluation Example (3/3) Compile grounding formula into ODD: Ψ = (r 1 u 1 r 2 u 2 ) (t 1 t 2 ) (r 1 r 2 ) (t 1 v 1 t 2 v 2 ) }{{}}{{} Ψ 1 Ψ 2. r 1 r 1 r 1 u 1 u 1 r 2 r 2 r 2 r 2 u 2 u 2 u 2 t 1 t 1 = t 1 t 1 v 1 v 1 t 2 t 2 t 2 t 2 v 2 v 2 Ψ 1 Ψ 2 = Ψ 25 / 28

Outline Probabilistic Databases 101 The Dichotomy The Dichotomy The Interesting but Hard Queries The Interesting but Hard Queries The Easy Queries The Easy Queries Leftovers Leftovers 18 / 20 26 / 28

Dichotomies eyond 1R Some known dichotomies Non-repeating CQ, UCQ [Dalvi & Suciu 2004, 2010] Quantified queries, ranking queries [O.& team 2011, 2012] Non-repeating relational algebra = 1R + union. Hierarchical property not enough, consistency also needed. π [(R() S 1(, ) T () S 2(, )) S(, )] is hard, though it is equivalent to a union of two hierarchical 1R queries. Non-repeating relational calculus S(x, y) R(x) is tractable, S(x, y) (R(x) T (y)) is hard. oth are non-repeating, yet not expressible in 1R. Possible (though expensive) approach: Translate to RC and check RC-hierarchical and -consistency. Full relational algebra (or full relational calculus) It is undecidable whether the union of two equivalent relational algebra queries, one hard and one tractable, is tractable. 27 / 28

Thank you! 28 / 28