Path Queries under Distortions: Answering and Containment

Path Queries under Distortions: Answering and Containment
Gösta Grahne, Concordia University
Alex Thomo, Suffolk University
Foundations of Information and Knowledge Systems (FoIKS '04)

Postulate 1: The world is a database, and a database is a graph.
Fact 1: Regular path queries are at the core of querying graph databases.
Fact 2: Query containment is instrumental in query optimization and information integration.
Postulate 2: Query optimization and information integration are the future.

Viewing Data as Graphs
Relational data: tuples are the nodes, foreign keys are the edges.
Object-oriented data
Linked Web pages
XML
AI: semantic networks
[Figure: example graph whose nodes and edge labels include Dept of CS, People, Foto Afrati, Timos Sellis, Classes, Papers, Databases, Automata.]

Regular Path Queries and Distortions
Q: _*. Foto Afrati. Classes. Automata
This will fetch the node of the automata course of Foto Afrati. However, suppose the user gives:
Q: _*. Foto Afrati. Automata
For this query the answer is empty! We could distort the query by applying an edit operation, in this case an insertion of Classes.
[Figure: graph with Dept of CS, People, Foto Afrati, Timos Sellis, Classes, Papers, Databases, and two Automata nodes.]
However, Foto Afrati could also have some automata papers. But we (the DBA) know that Foto Afrati is a database person, so she probably doesn't have many automata papers. On the other hand, there at NTU, it's Foto who always teaches Automata. To reflect these facts about the world, the DBA could write:
_*. ( (Foto Afrati. Automata, 1, Foto Afrati. Classes. Automata) + (Foto Afrati. Automata, 5, Foto Afrati. Papers. Automata) ). _*

Graph Database DB
A set of objects/nodes D, with edges labeled with symbols from a database alphabet $\Delta$.
[Figure: example database DB over nodes a, b, c, d with edges labeled R, S, T, shown next to the answer ans(Q, DB).]
Query Q: a regular language over $\Delta$. For example, Q = ST + T + RR.
$ans(Q, DB) = \{(x, y) \in D \times D$ : there is a path from x to y in DB labeled by a word in Q$\}$

Computing the Answer
[Figure: query automaton over states $p_0$, $p_1$, $p_2$ and database DB over nodes a, b, c, d, with edges labeled R, S, T.]
Construct an automaton $A_Q$ with $p_0$ as initial state. Compute the set $Reach_a$ as follows.
1. Initialize $Reach_a$ to $\{(a, p_0)\}$.
2. Repeat 3 until $Reach_a$ no longer changes.
3. Choose a pair $(b, p) \in Reach_a$. If there is a transition $(p, R, p')$ in $A_Q$, and there is an edge $(b, R, b')$ in DB, then add the pair $(b', p')$ to $Reach_a$.
Finally, $ans(Q, a, DB) = \{(a, b) : (b, p) \in Reach_a$, and p is a final state in $A_Q\}$.
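The three steps above can be sketched in Python as a worklist search over (node, state) pairs. This is a minimal illustration, not the authors' implementation: the edge-list encoding, the function name `answer_from`, and the sample database and automaton are all assumptions made for the example.

```python
from collections import deque

def answer_from(a, db_edges, nfa_trans, initial, finals):
    """Return the set of nodes b such that (a, b) is in ans(Q, a, DB)."""
    reach = {(a, initial)}          # step 1: Reach_a = {(a, p0)}
    frontier = deque(reach)
    while frontier:                 # step 2: repeat until nothing changes
        b, p = frontier.popleft()   # step 3: choose a pair (b, p)
        for (src, label, dst) in db_edges:
            if src != b:
                continue
            for (q, r, q2) in nfa_trans:
                # matching NFA transition (p, R, p') and DB edge (b, R, b')
                if q == p and r == label and (dst, q2) not in reach:
                    reach.add((dst, q2))
                    frontier.append((dst, q2))
    return {b for (b, p) in reach if p in finals}

# Hypothetical data: a small DB and an automaton for the query RTT.
DB = [("a", "R", "b"), ("b", "T", "c"), ("c", "T", "d"), ("a", "S", "d")]
RTT = [("p0", "R", "p1"), ("p1", "T", "p2"), ("p2", "T", "p3")]

print(answer_from("a", DB, RTT, "p0", {"p3"}))
```

Only pairs reachable from the source node are ever materialized, which is the point of the construction.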

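Distortion Transducer T
[Figure: transducer T over states $s_0$, $s_1$, $s_2$ with transitions labeled T/S,1; RT/RS,2; T/R,0. Database DB over nodes a, b, c, d with edges labeled R, S, T. Automaton for the query Q = RTT: $p_0 \xrightarrow{R} p_1 \xrightarrow{T} p_2 \xrightarrow{T} p_3$.]
$ans_T(Q, DB) = \{(a, d, 2), (c, b, 2)\}$
$d_T(u, w) = \inf\{k$ : u goes to w through T by k distortions$\}$
$ans_T(Q, DB) = \{(a, b, k) : k = \inf\{d_T(u, w) : u \in Q,\ a \xrightarrow{w} b \in DB\}\}$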

Lazy Dijkstra Algorithm on Cartesian Product
[Figure: the query automaton ($p_0$ to $p_3$), the transducer ($s_0$ to $s_2$), and the database DB, with the explored part of their product: $(p_0, s_0, a, 0) \xrightarrow{RT/RS,2} (p_2, s_1, c, 2) \xrightarrow{T/R,0} (p_3, s_2, d, 2)$.]
Although the full Cartesian product has 4*3*4 = 48 states, we needed only 3 states starting from a.

A Sketch
Construct an automaton $A_Q$ with $p_0$ as initial state. Compute the set $Reach_a$ as follows.
1. Initialize $Reach_a$ to $\{(p_0, s_0, a, 0, false)\}$. /* The boolean flag marks membership in the set of nodes for which we know the exact cost from the source. */
2. Repeat 3 until $Reach_a$ no longer changes.
3. Choose a $(p, s, b, k, false) \in Reach_a$, where k is minimal.
   If there is a transition $(p, R, p')$ in $A_Q$, a transition $(s, R/S, s', n)$ in T, and an edge $(b, S, b')$ in DB, then:
   add $(p', s', b', k+n, false)$ to $Reach_a$ if there is no $(p', s', b', \_, \_)$ in $Reach_a$;
   relax the weight of any successor of $(p, s, b, k, false)$ in $Reach_a$.
   Update $(p, s, b, k, false)$ to $(p, s, b, k, true)$.
Finally, $ans_T(Q, a, DB) = \{(a, b, k) : (p, s, b, k, true) \in Reach_a$, p is a final state in $A_Q$, and s is a final state in T$\}$.
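The sketch above can be written as a Dijkstra search over triples (query state, transducer state, database node) that are generated on demand. The following Python version is an illustration under stated assumptions: transducer transitions read and write exactly one symbol (no multi-symbol or empty labels), weights are non-negative integers, and the tuple encodings, the function name `approx_answer_from`, and the sample data are hypothetical.

```python
import heapq

def approx_answer_from(a, db_edges, nfa, transducer):
    """Return {b: k} with k the least distortion cost of reaching b from a."""
    q_trans, q_init, q_finals = nfa
    t_trans, t_init, t_finals = transducer
    out_edges = {}
    for (b, label, b2) in db_edges:
        out_edges.setdefault(b, []).append((label, b2))

    best = {(q_init, t_init, a): 0}   # tentative costs (flag false)
    settled = set()                   # exact costs known (flag true)
    heap = [(0, q_init, t_init, a)]
    answers = {}
    while heap:
        k, p, s, b = heapq.heappop(heap)      # minimal unsettled element
        if (p, s, b) in settled:
            continue
        settled.add((p, s, b))
        if p in q_finals and s in t_finals:
            answers[b] = min(k, answers.get(b, k))
        for (p1, r, p2) in q_trans:
            if p1 != p:
                continue
            for (s1, r_in, s_out, n, s2) in t_trans:
                if s1 != s or r_in != r:
                    continue
                for (label, b2) in out_edges.get(b, []):
                    if label != s_out:
                        continue
                    key = (p2, s2, b2)
                    # add the successor, or relax it if a cheaper cost is found
                    if key not in settled and k + n < best.get(key, float("inf")):
                        best[key] = k + n
                        heapq.heappush(heap, (k + n, p2, s2, b2))
    return answers

# Hypothetical data: query RTT, a one-state transducer that copies symbols
# at cost 0 and may rewrite T to S at cost 1.
NFA = ([("p0", "R", "p1"), ("p1", "T", "p2"), ("p2", "T", "p3")], "p0", {"p3"})
TR = ([("s", "R", "R", 0, "s"), ("s", "S", "S", 0, "s"),
       ("s", "T", "T", 0, "s"), ("s", "T", "S", 1, "s")], "s", {"s"})
DB2 = [("a", "R", "b"), ("b", "T", "c"), ("c", "S", "d"), ("c", "T", "e")]

print(approx_answer_from("a", DB2, NFA, TR))
```

On this data node e is reached undistorted (cost 0), while node d is reached only by rewriting the last T to S (cost 1).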

In other words, the priority queue of Dijkstra's algorithm is brought into memory on demand (lazily).
Complexity: if we keep the set $Reach_a$ in main memory, we avoid accessing objects in secondary memory more than once. Data complexity (i.e. the number of I/Os) is all we care about in databases! And it is linear!

Redefining Query Containment
Classical case: $Q_1 \subseteq Q_2$ iff $ans(Q_1, DB) \subseteq ans(Q_2, DB)$ on any DB.
We can provide the answers of $Q_1$ as answers for $Q_2$ and be certain that they will be valid for $Q_2$ on any DB.
Suppose now that $Q_1 \not\subseteq Q_2$. However, by using the distortion transducer, some kind of containment might still hold.

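An Example
$Q_1 = \{R, S\}$, $Q_2 = \{U, V\}$, $T = \{(U/R, 1), (V/S, 3)\}$
Suppose $(a, b, 0) \in ans_T(Q_1, DB)$. What could the DB be?
[Figure: three candidate databases $DB_1$, $DB_2$, $DB_3$, each with edges from a to b; the edge label sets shown are {R, T}, {R}, and {S}.]
$(a, b, 0) \in ans_T(Q_2, DB_3)$
$(a, b, 1) \in ans_T(Q_2, DB_1)$
$(a, b, 3) \in ans_T(Q_2, DB_2)$
$Q_1 \not\subseteq Q_2$. However, for any DB, if $(a, b, 0) \in ans_T(Q_1, DB)$ then $(a, b, m) \in ans_T(Q_2, DB)$, where $m \le 0 + 3$.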

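Another Example
$Q_1 = \{RRR\}$, $Q_2 = \{RST\}$
T is the edit transducer, with transitions such as S/R,1; T/R,1; R/R,0.
Suppose $(a, b, 1) \in ans_T(Q_1, DB)$. What could the DB be?
$DB_1$: $a \xrightarrow{R} \cdot \xrightarrow{R,S} \cdot \xrightarrow{T} b$, so $(a, b, 0) \in ans_T(Q_2, DB_1)$
$DB_2$: $a \xrightarrow{U} \cdot \xrightarrow{R,S} \cdot \xrightarrow{R,T} b$, so $(a, b, 1) \in ans_T(Q_2, DB_2)$
$DB_3$: $a \xrightarrow{U} \cdot \xrightarrow{R} \cdot \xrightarrow{R,T} b$, so $(a, b, 2) \in ans_T(Q_2, DB_3)$
$DB_4$: $a \xrightarrow{U} \cdot \xrightarrow{R} \cdot \xrightarrow{R} b$, so $(a, b, 3) \in ans_T(Q_2, DB_4)$
$Q_1 \not\subseteq Q_2$. However, for any DB, if $(a, b, 1) \in ans_T(Q_1, DB)$ then $(a, b, m) \in ans_T(Q_2, DB)$, where $m \le 1 + 2$.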

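Query Containment (Continued)
$Q_1 \subseteq_{(T,k)} Q_2$ iff $(a, b, n) \in ans_T(Q_1, DB) \Rightarrow (a, b, m) \in ans_T(Q_2, DB)$ and $m \le n + k$, on any DB.
$Q_1 \subseteq Q_2 \Rightarrow Q_1 \subseteq_{(T,1)} Q_2 \Rightarrow \ldots \Rightarrow Q_1 \subseteq_{(T,k)} Q_2 \Rightarrow Q_1 \subseteq_{(T,k+1)} Q_2 \Rightarrow \ldots \Rightarrow Q_1 \subseteq T(Q_2)$
What's the k?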

A tool for deciding k-containment
We devise a method for constructing $Q^{(T,k)}$: the language of all Q-words distorted by T with cost at most k.
Clearly $Q^{(T,k-1)} \subseteq Q^{(T,k)}$. In this way we control how much bigger we need to make $Q_2$.
Suppose that k is the smallest number such that $Q_1 \subseteq Q_2^{(T,k)}$.
If $d_T$ satisfies the triangle inequality, we show that: $Q_1 \subseteq_{(T,k)} Q_2$ iff $Q_1 \subseteq Q_2^{(T,k)}$.

About the Triangle Property of T
There are transducers whose word distance doesn't satisfy the triangle property. E.g. $\{(R, 1, S), (S, 2, T), (R, 5, T)\}$, with transitions R/S,1; S/T,2; R/T,5: $d_T(R, S) = 1$, $d_T(S, T) = 2$, but $d_T(R, T) = 5 > 3$.
Nevertheless, there are large classes which possess the triangle property.
The pure edit distance transducers. E.g. $\{(R, 1, S), (S, 1, T), (R, 1, T), (S, 1, R), (R, 1, \epsilon), (\epsilon, 1, R)\}$.
Transducers whose distortion inputs and outputs do not intersect. Such transducers are idempotent with respect to composition:
$(T \cup T_{id}) \circ (T \cup T_{id}) = (T \circ T) \cup (T \circ T_{id}) \cup (T_{id} \circ T) \cup (T_{id} \circ T_{id}) = T \cup T_{id}$
In general, an idempotent transducer has the triangle property:
$u\,T\,v \wedge v\,T\,w \Rightarrow u\,(T \circ T)\,w \Rightarrow u\,T\,w$
Hence, $d_T(u, w) = d_{T \circ T}(u, w) \le d_T(u, v) + d_T(v, w)$.
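The counterexample above can be checked mechanically. The sketch below is only an illustration restricted to single-symbol distortions: it assumes $d_T(u, u) = 0$, treats missing pairs as infinitely distant, and the dictionary encoding and the name `triangle_violations` are hypothetical.

```python
from itertools import product

def triangle_violations(dist):
    """List triples (u, v, w) with d(u, w) > d(u, v) + d(v, w)."""
    symbols = sorted({x for pair in dist for x in pair})
    inf = float("inf")

    def d(u, w):
        # assumed: zero cost to keep a symbol, infinite cost if no rule exists
        return 0 if u == w else dist.get((u, w), inf)

    return [(u, v, w) for u, v, w in product(symbols, repeat=3)
            if d(u, w) > d(u, v) + d(v, w)]

# The transducer {(R,1,S), (S,2,T), (R,5,T)} from the slide.
skewed = {("R", "S"): 1, ("S", "T"): 2, ("R", "T"): 5}
# The substitution part of the pure edit example (epsilon rules left out).
edit_like = {("R", "S"): 1, ("S", "T"): 1, ("R", "T"): 1, ("S", "R"): 1}

print(triangle_violations(skewed))
print(triangle_violations(edit_like))
```

The first call reports the single offending triple (R, S, T); the second reports none.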

Triangle Property (Continued)
The class of transducers with $range(T) \cap dom(T) = \emptyset$ is indeed practical.
Recall that it is the DBA who writes the regular expression for the distortion transducer. It is common sense that the DBA surely has an idea about the DB. Hence, we can consider that all the words in range(T) match DB paths.
On the other hand, the words of dom(T) can be considered not to have a direct match in the database; otherwise, why would the system administrator want them to be translated?

$Q_1 \subseteq^0_{(T,k)} Q_2$
However, if we restrict ourselves to reasoning about those tuples in the answer of $Q_1$ with weight 0, then we don't need the triangle property for T. We obtain a relaxed definition of k-containment:
$Q_1 \subseteq^0_{(T,k)} Q_2$ iff $(a, b, 0) \in ans_T(Q_1, DB) \Rightarrow (a, b, m) \in ans_T(Q_2, DB)$ and $m \le k$, on any DB.
Clearly, the tuples $(a, b, 0) \in ans_T(Q_1, DB)$ mainly correspond to the tuples of the pure answer of $Q_1$ on DB.
We are able to prove that $Q_1 \subseteq^0_{(T,k)} Q_2$ iff $Q_1 \subseteq Q_2^{(T,k)}$ (even when the triangle property doesn't hold).

Computing $Q^{(T,k)}$ - I
First we obtain a weighted transduction of Q by T.
Let $A_Q = (P_Q, \Delta, \tau_Q, p_{Q,0}, F_Q)$ be an $\epsilon$-free NFA for Q.
Let $T = (P_T, \Delta, \tau_T, p_{T,0}, F_T)$ be in standard form.
We construct the weighted transduction automaton of Q by T as $A = (P, \Delta, \tau, p_0, F)$, where
$P = P_Q \times P_T$, $p_0 = (p_{Q,0}, p_{T,0})$, $F = F_Q \times F_T$,
$\tau = \{((p, q), S, k, (p', q')) : (p, R, p') \in \tau_Q,\ (q, R, S, k, q') \in \tau_T\} \cup \{((p, q), S, k, (p, q')) : (q, \epsilon, S, k, q') \in \tau_T\}$
Now we should find all the paths in A whose weight is at most k. We denote this by k(A).
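The transition relation of the product automaton can be built directly from its definition. The Python sketch below is illustrative only: transitions are plain tuples, the empty string stands in for the empty word, the NFA state set is read off its transitions, and the names `EPS` and `transduction_transitions` are assumptions made for the example.

```python
EPS = ""  # assumed encoding of the empty word

def transduction_transitions(nfa_trans, t_trans):
    """Transitions ((p, q), S, k, (p', q')) of the weighted transduction automaton."""
    nfa_states = {p for (p, _, _) in nfa_trans} | {p2 for (_, _, p2) in nfa_trans}
    trans = set()
    for (q, r_in, s_out, k, q2) in t_trans:
        if r_in == EPS:
            # the transducer emits S without consuming a query symbol:
            # the NFA component stays where it is
            for p in nfa_states:
                trans.add(((p, q), s_out, k, (p, q2)))
        else:
            # the transducer rewrites R to S: both components advance
            for (p, r, p2) in nfa_trans:
                if r == r_in:
                    trans.add(((p, q), s_out, k, (p2, q2)))
    return trans

# Hypothetical data: NFA for the single word R; T copies R, rewrites R to S
# at cost 1, and may insert U at cost 2.
nfa_r = [("p0", "R", "p1")]
t_rules = [("t0", "R", "R", 0, "t0"), ("t0", "R", "S", 1, "t0"),
           ("t0", EPS, "U", 2, "t0")]
product_trans = transduction_transitions(nfa_r, t_rules)
for tr in sorted(product_trans):
    print(tr)
```

The initial state and the final states of the product are then the pairs of initial states and of final states, as in the definition.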

Computing $Q^{(T,k)}$ - II
Let $A^h$ be the sub-automaton consisting of all the paths with weight h.
$k(A) = A^0 \cup A^1 \cup \ldots \cup A^k$
We suppose that all the weights in A are 0 or 1. If not, e.g. for $(p, R, m, q)$, we replace the transition by $(p, R, 1, r_1), \ldots, (r_{m-1}, R, 1, q)$.
We number the states of A: $1, 2, \ldots, n$.
$A_{ij}$ is A, but with initial state i and final state j.
$0(A)$ is A keeping only the 0-weighted transitions.
$1_{ij}(A)$ is the elementary two-state (i and j) automaton with the 1-weighted transitions from i to j.

Computing $Q^{(T,k)}$ - III
$k(A) = A^0 \cup A^1 \cup \ldots \cup A^k$
$A^0 = 0(A)$, and for $1 \le h \le k$:
$A^h = \bigcup_{i \in S,\ j \in F} A^h_{ij}$
$A^h_{ij} = \bigcup_{m \in \{1,\ldots,n\}} A^{h/2}_{im} \cdot A^{h/2}_{mj}$ for h even
$A^h_{ij} = \bigcup_{m \in \{1,\ldots,n\}} A^{(h-1)/2}_{im} \cdot A^{(h+1)/2}_{mj}$ for h odd
$A^1_{ij} = \bigcup_{\{m,l\} \subseteq \{1,\ldots,n\}} 0(A)_{im} \cdot 1_{ml}(A) \cdot 0(A)_{lj}$
$A^1_{ij}$ consists of the A-paths starting from state i and traversing any number of 0-weighted arcs up to some state m, then a 1-weighted arc going to some state l, and after that any number of 0-weighted arcs ending up in state j.
$A^{h/2}_{im}$: all the h/2-weighted paths of A going from state i to some state m.
$A^{h/2}_{mj}$: all the h/2-weighted paths of A going from that state m to state j.
Since m ranges over all the possible states, $A^h_{ij}$ consists of all the possible h-weighted paths from state i to state j.
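The halving recurrence can be tried out on a simplified version of the problem. The Python sketch below tracks only whether some path of weight exactly h exists between two states (Boolean matrices), not the languages the slides compute; the matrix encoding, the function names, and the three-state example are assumptions made for illustration.

```python
def bool_mul(x, y):
    """Boolean matrix product: concatenating path segments through a midpoint m."""
    n = len(x)
    return [[any(x[i][m] and y[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

def zero_closure(zero):
    """Reachability using any number (including none) of 0-weighted arcs."""
    n = len(zero)
    c = [[i == j or zero[i][j] for j in range(n)] for i in range(n)]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                c[i][j] = c[i][j] or (c[i][m] and c[m][j])
    return c

def exact_weight(zero, one, h):
    """result[i][j] is True iff some path from i to j has total weight exactly h."""
    if h == 0:
        return zero_closure(zero)
    if h == 1:
        z = zero_closure(zero)
        return bool_mul(bool_mul(z, one), z)   # 0-arcs, one 1-arc, 0-arcs
    lo = exact_weight(zero, one, h // 2)
    hi = lo if h % 2 == 0 else exact_weight(zero, one, h - h // 2)
    return bool_mul(lo, hi)

# Hypothetical 3-state automaton: 0 -1-> 1 -0-> 2 -1-> 0 (weights only).
F, T = False, True
zero_arcs = [[F, F, F], [F, F, T], [F, F, F]]
one_arcs = [[F, T, F], [F, F, F], [T, F, F]]

print(exact_weight(zero_arcs, one_arcs, 2))
```

Splitting at the midpoint is sound because every arc adds at most 1 to the weight, so each path of weight h passes through a prefix of weight exactly h/2 (rounded down).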

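Computing $Q^{(T,k)}$ - IV
E.g. suppose that A is:
[Figure: two-state automaton A over states 1 and 2 with transitions labeled $\epsilon$,0; S,0; $\epsilon$,0; and R,1. Below it, $0(A)$ keeps the transitions $\epsilon$,0; S,0; $\epsilon$,0, and $1_{12}(A)$ is the two-state automaton with the single transition R,1 from state 1 to state 2.]
$A^1_{12} = 0(A)_{12} \cdot 1_{22}(A) \cdot 0(A)_{22} \ \cup\ 0(A)_{11} \cdot 1_{12}(A) \cdot 0(A)_{22} = \{R\}$
$A^1_{22} = \{SR\}$
$A^1 = A^1_{12} = \{R\}$
$A^2_{12} = A^1_{12} \cdot A^1_{22} \ \cup\ A^1_{11} \cdot A^1_{12} = \{R.SR\}$
[The values given for $A^1_{11}$ and $A^1_{21}$ are not legible in the transcription.]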

Computing $Q^{(T,k)}$ - V
From $A^h_{ij} = \bigcup_{m \in \{1,\ldots,n\}} A^{h/2}_{im} \cdot A^{h/2}_{mj}$ (for simplicity assume h is a power of 2):
$A^2_{ij}$ is a union of n automata of size $2p$ (p is polynomial in n)
$A^4_{ij}$ is a union of n automata of size $4np$
$A^8_{ij}$ is a union of n automata of size $8n^2 p$
$A^h_{ij}$ is a union of n automata of size $h\,n^{\log h - 1} p$
Hence, the size of $A^h_{ij}$ is $h\,n^{\log h} p$.
Had we used the equivalent $A^h_{ij} = \bigcup_{m \in \{1,\ldots,n\}} A^{h-1}_{im} \cdot A^1_{mj}$, we would get $p\,n^h$!
Conclusion: computing $Q^{(T,k)}$ is polynomial in n and sub-exponential in k.
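The two growth rates can be compared numerically. The sketch below assumes a simple accounting in which concatenating two automata adds their sizes and a union over n midpoints multiplies by n; the function names and the sample values of n and p are hypothetical, and under this accounting the halving scheme has the closed form $h\,n^{\log h} p$ for h a power of 2.

```python
def size_halving(h, n, p):
    """Size bound for A^h_ij built from two halves of weight h/2 (h a power of 2)."""
    if h == 1:
        return p
    return n * 2 * size_halving(h // 2, n, p)

def size_linear(h, n, p):
    """Size bound for A^h_ij built as A^(h-1)_im followed by A^1_mj."""
    if h == 1:
        return p
    return n * (size_linear(h - 1, n, p) + p)

for h in (2, 4, 8, 16):
    print(h, size_halving(h, n=3, p=5), size_linear(h, n=3, p=5))
```

For h = 2 both schemes give the same bound; from h = 4 on the linear scheme grows markedly faster.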

A broader perspective: semirings
In the transducer, the weights were natural numbers and the specific operations were addition (+) along a path, and minimum (min) applied to path weights.
This can be generalized to other weight sets and to other operations. The weights, elements of a set K, can be multiplied along a path using an operation $\otimes$, and then summarized using an operation $\oplus$.
Semirings: $(K, \oplus, \otimes, 0, 1)$
$(K, \oplus, 0)$ is a commutative monoid with 0 as the identity element.
$(K, \otimes, 1)$ is a monoid with 1 as the identity element for $\otimes$.
$\otimes$ distributes over $\oplus$: $(a \oplus b) \otimes c = (a \otimes c) \oplus (b \otimes c)$, $c \otimes (a \oplus b) = (c \otimes a) \oplus (c \otimes b)$.
0 is an annihilator for $\otimes$: $a \otimes 0 = 0$.

In general, we can apply the Approximate Answering algorithm with any transducer whose weights are from a bounded semiring.
The semiring in focus
Tropical semiring: $(K, \oplus, \otimes, 0, 1)$, where $K = \mathbb{N} \cup \{\infty\}$, $\oplus = \min$, $\otimes = +$, $0 = \infty$, $1 = 0$.
$(a \oplus b) \otimes c = \min(a, b) + c = \min(a + c, b + c) = (a \otimes c) \oplus (b \otimes c)$, hence $\otimes$ distributes over $\oplus$.
Why does Dijkstra's algorithm work? It is based on the assumption that no shortest path needs to traverse a cycle! This is true for the tropical semiring, because it is a bounded semiring.
Boundedness is defined as: $1 \oplus a = 1$ for each a (i.e. $\min(0, a) = 0$).
Hence, if we have a cycle with weight a, we don't gain anything by traversing it: $1 \oplus a \oplus (a \otimes a) \oplus (a \otimes a \otimes a) \oplus \ldots = 1$

Other semirings
Probabilistic: $([0,1], \max, \times, 0, 1)$
Fuzzy: $([0,1], \max, \min, 0, 1)$
Both of them are bounded.
However, if we define the probabilistic semiring as $(\mathbb{R}, +, \times, 0, 1)$, then we no longer have a bounded semiring.
Note: if C* is the weight of the shortest path, we produce min(C*, 1) as the answer from the Dijkstra algorithm.
In such cases, we can use the Floyd-Warshall algorithm, which doesn't require boundedness.
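The semirings named on these slides can be written down as data and the boundedness condition $1 \oplus a = 1$ spot-checked on sample weights. This is a small illustrative sketch: the `Semiring` class and its `is_bounded_on` method are hypothetical names, and testing a handful of samples can refute boundedness but does not prove it.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]    # summarizes alternative paths
    times: Callable[[Any, Any], Any]   # combines weights along a path
    zero: Any                          # identity of plus, annihilator of times
    one: Any                           # identity of times

    def is_bounded_on(self, samples):
        """Check 1 (+) a == 1 on the given sample weights only."""
        return all(self.plus(self.one, a) == self.one for a in samples)

tropical = Semiring(min, lambda a, b: a + b, float("inf"), 0)
probabilistic = Semiring(max, lambda a, b: a * b, 0.0, 1.0)
fuzzy = Semiring(max, min, 0.0, 1.0)
real_sum = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)

print(tropical.is_bounded_on([0, 1, 5, float("inf")]))
print(probabilistic.is_bounded_on([0.0, 0.3, 1.0]))
print(fuzzy.is_bounded_on([0.0, 0.3, 1.0]))
print(real_sum.is_bounded_on([0.5]))
```

The last check fails because 1 + 0.5 differs from 1, which matches the remark that the sum-based probabilistic semiring is not bounded.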

Future work
The Floyd-Warshall algorithm is impractical for sparse graphs, and it is not known how to adapt it to secondary memory.
Extending the algorithm for computing $Q^{(T,k)}$ to other semirings.

References
Gösta Grahne, Alex Thomo. Query Answering and Containment for Regular Path Queries under Distortions. FoIKS 2004: 98-115.
Gösta Grahne, Alex Thomo. Approximate Reasoning in Semistructured Data. KRDB 2001.