Representing and Querying Correlated Tuples in Probabilistic Databases

Similar documents
Representing and Querying Correlated Tuples in Probabilistic Databases

Probabilistic Databases

Bayesian Networks BY: MOHAMAD ALSABBAGH

StarAI Full, 6+1 pages Short, 2 page position paper or abstract

EE562 ARTIFICIAL INTELLIGENCE FOR ENGINEERS

Approximating the Partition Function by Deleting and then Correcting for Model Edges (Extended Abstract)

Bayesian belief networks. Inference.

Probabilistic and Bayesian Analytics

TDT70: Uncertainty in Artificial Intelligence. Chapter 1 and 2

Probabilistic Reasoning. (Mostly using Bayesian Networks)

Part I. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS

Introduction to Probabilistic Graphical Models

6.867 Machine learning, lecture 23 (Jaakkola)

The PITA System for Logical-Probabilistic Inference

What is (certain) Spatio-Temporal Data?

Symmetry Breaking for Relational Weighted Model Finding

Automatic Differentiation Equipped Variable Elimination for Sensitivity Analysis on Probabilistic Inference Queries

Belief Update in CLG Bayesian Networks With Lazy Propagation

Published in: Tenth Tbilisi Symposium on Language, Logic and Computation: Gudauri, Georgia, September 2013

Average-case analysis of a search algorithm for estimating prior and posterior probabilities in Bayesian networks with extreme probabilities

Bayesian networks: approximate inference

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang

bound on the likelihood through the use of a simpler variational approximating distribution. A lower bound is particularly useful since maximization o

Machine Learning for Data Science (CS4786) Lecture 24

CS Lecture 3. More Bayesian Networks

A Tutorial on Learning with Bayesian Networks

Human-level concept learning through probabilistic program induction

Graphical models: parameter learning

Bagging During Markov Chain Monte Carlo for Smoother Predictions

arxiv: v1 [stat.ml] 15 Nov 2016

Uncertainty and Bayesian Networks

CS 188: Artificial Intelligence. Bayes Nets

Bayes Nets III: Inference

Prof. Dr. Ralf Möller Dr. Özgür L. Özçep Universität zu Lübeck Institut für Informationssysteme. Tanya Braun (Exercises)

Objectives. Probabilistic Reasoning Systems. Outline. Independence. Conditional independence. Conditional independence II.

BLOG: Probabilistic Models with Unknown Objects. Brian Milch Harvard CS 282 November 29, 2007

Artificial Intelligence Bayes Nets: Independence

Global Optimisation with Gaussian Processes. Michael A. Osborne Machine Learning Research Group Department o Engineering Science University o Oxford

Probabilistic Reasoning Systems

Machine Learning 4771

Uncertain Knowledge and Bayes Rule. George Konidaris

Announcements. CS 188: Artificial Intelligence Fall Causality? Example: Traffic. Topology Limits Distributions. Example: Reverse Traffic

Undirected Graphical Models

Sparse Forward-Backward for Fast Training of Conditional Random Fields

The Parameterized Complexity of Approximate Inference in Bayesian Networks

Tópicos Especiais em Modelagem e Análise - Aprendizado por Máquina CPS863

Independencies. Undirected Graphical Models 2: Independencies. Independencies (Markov networks) Independencies (Bayesian Networks)

CS 188: Artificial Intelligence Spring Announcements

Bayesian Networks Inference with Probabilistic Graphical Models

Possibilistic Networks: Data Mining Applications

Learning in Markov Random Fields An Empirical Study

Implementing Machine Reasoning using Bayesian Network in Big Data Analytics

Bayesian Networks. instructor: Matteo Pozzi. x 1. x 2. x 3 x 4. x 5. x 6. x 7. x 8. x 9. Lec : Urban Systems Modeling

Semantics of Ranking Queries for Probabilistic Data and Expected Ranks

Intelligent Systems: Reasoning and Recognition. Reasoning with Bayesian Networks

Local Expectation Gradients for Doubly Stochastic. Variational Inference

Announcements. CS 188: Artificial Intelligence Spring Bayes Net Semantics. Probabilities in BNs. All Conditional Independences

Probabilistic Graphical Networks: Definitions and Basic Results

Outline. CSE 573: Artificial Intelligence Autumn Bayes Nets: Big Picture. Bayes Net Semantics. Hidden Markov Models. Example Bayes Net: Car

Fractional Belief Propagation

Dynamic Structures for Top-k Queries on Uncertain Data

An Introduction to Reversible Jump MCMC for Bayesian Networks, with Application

Ch.6 Uncertain Knowledge. Logic and Uncertainty. Representation. One problem with logical approaches: Department of Computer Science

Probabilistic Graphical Models

On First-Order Knowledge Compilation. Guy Van den Broeck

Planning by Probabilistic Inference

Partial inconsistency and vector semantics: sampling, animation, and program learning

Announcements. Inference. Mid-term. Inference by Enumeration. Reminder: Alarm Network. Introduction to Artificial Intelligence. V22.

Efficient Inference for Complex Queries on Complex Distributions

PMR Learning as Inference

Towards an extension of the PC algorithm to local context-specific independencies detection

k-selection Query over Uncertain Data

Bayesian Inference. Definitions from Probability: Naive Bayes Classifiers: Advantages and Disadvantages of Naive Bayes Classifiers:

A Stochastic l-calculus

Natural Gradients via the Variational Predictive Distribution

Kyle Reing University of Southern California April 18, 2018

12 : Variational Inference I

Improved Dynamic Schedules for Belief Propagation

Sampling Methods (11/30/04)

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

The Veracity of Big Data

Announcements. CS 188: Artificial Intelligence Spring Probability recap. Outline. Bayes Nets: Big Picture. Graphical Model Notation

Finding Frequent Items in Probabilistic Data

Probabilistic Representation and Reasoning

Directed Graphical Models or Bayesian Networks

Bayesian belief networks

QSQL: Incorporating Logic-based Retrieval Conditions into SQL

Black-box α-divergence Minimization

Lossless Online Bayesian Bagging

Bayesian networks. Chapter Chapter

Probabilistic Partial Evaluation: Exploiting rule structure in probabilistic inference

Learning Semi-Markovian Causal Models using Experiments

On analysis of the unicity of Jeffrey s rule of conditioning in a possibilistic framework

PROBABILISTIC REASONING SYSTEMS

Markov Networks. l Like Bayes Nets. l Graphical model that describes joint probability distribution using tables (AKA potentials)

Lecture 10: Introduction to reasoning under uncertainty. Uncertainty

Density Propagation for Continuous Temporal Chains Generative and Discriminative Models

Bayesian Machine Learning

Lecture 21: Spectral Learning for Graphical Models

Compressing Kinetic Data From Sensor Networks. Sorelle A. Friedler (Swat 04) Joint work with David Mount University of Maryland, College Park

Transcription:

Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen Amol Deshpande Department of Computer Science University of Maryland, College Park. International Conference on Data Engineering, 2007 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 1/26

Introduction Abundance of uncertain data. Numerous approaches proposed to handle uncertainty. However, most models make assumptions about data uncertainty that restricts applicability. Need for a Database model with a model of uncertainty that can capture correlations simple and intuitive semantics that are readily understood and defines precise answers to every query Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 2/26

Introduction Abundance of uncertain data. Numerous approaches proposed to handle uncertainty. However, most models make assumptions about data uncertainty that restricts applicability. Need for a Database model with a model of uncertainty that can capture correlations simple and intuitive semantics that are readily understood and defines precise answers to every query Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 2/26

Motivations for correlated data Applications: Dirty databases [CP87, DS96, AFM06]: Arise while trying to integrate data from various sources. DB1: John $1200 DB2: John $1600 Name Salary John $1200 } mutually John $1600 exclusive Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 3/26

Motivations for correlated data Applications: Dirty databases [CP87, DS96, AFM06]: Arise while trying to integrate data from various sources. Sensor Networks [DGM + 04]: Often show strong spatial correlations, e.g., nearby sensors report similiar values. 3 1 2 5 4 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 3/26

Motivations for correlated data Applications: Dirty databases [CP87, DS96, AFM06]: Arise while trying to integrate data from various sources. Sensor Networks [DGM + 04]: Often show strong spatial correlations, e.g., nearby sensors report similiar values. More applications: Pervasive computing, approximate string matching in DB systems [DS04] etc. Additional motivation: Correlated tuples arise while evaluating queries [DS04]. Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 3/26

Outline 1 Motivation. 2 Probabilistic databases with correlations. 3 Query evaluation. 4 Qualitative comparison with other approaches. 5 Experiments. 6 Conclusion and future work. Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 4/26

Representing correlated tuples: Ingredients Which theory to use? Probability theory, Dempster-Schafer theory, fuzzy Logic, logic. At what level do we represent uncertainty? Tuple level, attribute level. Approach to representing correlations? Probabilistic graphical models (include as special cases: Bayesian networks and Markov networks) Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 5/26

Representing correlated tuples: Ingredients Which theory to use? Probability theory, Dempster-Schafer theory, fuzzy Logic, logic. At what level do we represent uncertainty? Tuple level, attribute level. Approach to representing correlations? Probabilistic graphical models (include as special cases: Bayesian networks and Markov networks) Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 5/26

Review: Independent tuple-based probabilistic databases possible worlds semantics [DS04] Example from Dalvi and Suciu [DS04]: S A B prob s 1 m 1 0.6 s 2 n 1 0.5 T C D prob t 1 1 p 0.4 possible worlds instance probability {s 1, s 2, t 1 } 0.12 {s 1, s 2 } 0.18 {s 1, t 1 } 0.12 {s 1 } 0.18 {s 2, t 1 } 0.08 {s 2 } 0.12 {t 1 } 0.08 0.12 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 6/26

Representing correlations: Basic ideas Two concepts from probabilistic graphical models literature [Pea88]: Tuple-based Random Variables Associate every tuple t with a boolean valued random variable X t. Factors f (X) is a function of a (small) set of random variables X. 0 f (X) 1 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 7/26

Representing correlations: Basic ideas Two concepts from probabilistic graphical models literature [Pea88]: Tuple-based Random Variables Associate every tuple t with a boolean valued random variable X t. Factors f (X) is a function of a (small) set of random variables X. 0 f (X) 1 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 7/26

Representing correlations: Basic ideas Two concepts from probabilistic graphical models literature [Pea88]: Tuple-based Random Variables Associate every tuple t with a boolean valued random variable X t. Factors f (X) is a function of a (small) set of random variables X. 0 f (X) 1 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 7/26

Representing correlations: Basic ideas - contd. Associate with each tuple in the probabilistic database a random variable. Define factors on (sub)sets of tuple-based random variables to encode correlations. The probability of an instantiation of the database is given by the product of all the factors. Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 8/26

Representing correlations: Mutual exclusivity example Suppose we want to represent mutual exclusivity between tuples s 1 and t 1. In particular, let us try to represent the possible worlds: S A B prob s 1 m 1 0.6 s 2 n 1 0.5 T C D prob t 1 1 p 0.4 possible worlds instance probability {s 1, s 2, t 1 } 0 {s 1, s 2 } 0.3 {s 1, t 1 } 0 {s 1 } 0.3 {s 2, t 1 } 0.2 {s 2 } 0 {t 1 } 0.2 0 X t1 X s1 f 1 0 0 0 0 1 0.6 1 0 0.4 1 1 0 X s2 f 2 0 0.5 1 0.5 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 9/26

Representing correlations: Mutual exclusivity example Suppose we want to represent mutual exclusivity between tuples s 1 and t 1. In particular, let us try to represent the possible worlds: S A B prob s 1 m 1 0.6 s 2 n 1 0.5 T C D prob t 1 1 p 0.4 possible worlds instance probability {s 1, s 2, t 1 } 0 {s 1, s 2 } 0.3 {s 1, t 1 } 0 {s 1 } 0.3 {s 2, t 1 } 0.2 {s 2 } 0 {t 1 } 0.2 0 X t1 X s1 f 1 0 0 0 0 1 0.6 1 0 0.4 1 1 0 X s2 f 2 0 0.5 1 0.5 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 9/26

Representing correlations: Mutual exclusivity example Suppose we want to represent mutual exclusivity between tuples s 1 and t 1. In particular, let us try to represent the possible worlds: S A B prob s 1 m 1 0.6 s 2 n 1 0.5 T C D prob t 1 1 p 0.4 possible worlds instance probability {s 1, s 2, t 1 } 0 {s 1, s 2 } 0.3 {s 1, t 1 } 0 {s 1 } 0.3 {s 2, t 1 } 0.2 {s 2 } 0 {t 1 } 0.2 0 X t1 X s1 f 1 0 0 0 0 1 0.6 1 0 0.4 1 1 0 X s2 f 2 0 0.5 1 0.5 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 9/26

Representing correlations: Mutual exclusivity example Suppose we want to represent mutual exclusivity between tuples s 1 and t 1. In particular, let us try to represent the possible worlds: S A B prob s 1 m 1 0.6 s 2 n 1 0.5 T C D prob t 1 1 p 0.4 possible worlds instance probability {s 1, s 2, t 1 } 0 {s 1, s 2 } 0.3 {s 1, t 1 } 0 {s 1 } 0.3 {s 2, t 1 } 0.2 {s 2 } 0 {t 1 } 0.2 0 X t1 X s1 f 1 0 0 0 0 1 0.6 1 0 0.4 1 1 0 X s2 f 2 0 0.5 1 0.5 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 9/26

Representing correlations: Mutual exclusivity example Suppose we want to represent mutual exclusivity between tuples s 1 and t 1. In particular, let us try to represent the possible worlds: S A B prob s 1 m 1 0.6 s 2 n 1 0.5 T C D prob t 1 1 p 0.4 possible worlds instance probability {s 1, s 2, t 1 } 0 {s 1, s 2 } 0.3 {s 1, t 1 } 0 {s 1 } 0.3 {s 2, t 1 } 0.2 {s 2 } 0 {t 1 } 0.2 0 X t1 X s1 f 1 0 0 0 0 1 0.6 1 0 0.4 1 1 0 X s2 f 2 0 0.5 1 0.5 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 9/26

Representing correlations: Mutual exclusivity example Suppose we want to represent mutual exclusivity between tuples s 1 and t 1. In particular, let us try to represent the possible worlds: S A B prob s 1 m 1 0.6 s 2 n 1 0.5 T C D prob t 1 1 p 0.4 possible worlds instance probability {s 1, s 2, t 1 } 0 {s 1, s 2 } 0.3 {s 1, t 1 } 0 {s 1 } 0.3 {s 2, t 1 } 0.2 {s 2 } 0 {t 1 } 0.2 0 X t1 X s1 f 1 0 0 0 0 1 0.6 1 0 0.4 1 1 0 X s2 f 2 0 0.5 1 0.5 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 9/26

Representing correlations: Mutual exclusivity example Suppose we want to represent mutual exclusivity between tuples s 1 and t 1. In particular, let us try to represent the possible worlds: S A B prob s 1 m 1 0.6 s 2 n 1 0.5 T C D prob t 1 1 p 0.4 possible worlds instance probability {s 1, s 2, t 1 } 0 {s 1, s 2 } 0.3 {s 1, t 1 } 0 {s 1 } 0.3 {s 2, t 1 } 0.2 {s 2 } 0 {t 1 } 0.2 0 X t1 X s1 f 1 0 0 0 0 1 0.6 1 0 0.4 1 1 0 X s2 f 2 0 0.5 1 0.5 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 9/26

Representing correlations: Positive correlation example Suppose we want to represent positive correlation between t 1 and s 1. In particular, let us try to represent the possible worlds: S A B s 1 m 1 s 2 n 1 T C D t 1 1 p possible worlds instance probability {s 1, s 2, t 1 } 0.2 {s 1, s 2 } 0.1 {s 1, t 1 } 0.2 {s 1 } 0.1 {s 2, t 1 } 0 {s 2 } 0.2 {t 1 } 0 0.2 X t1 X s1 f 1 0 0 0.4 0 1 0.2 1 0 0 1 1 0.4 X s2 f 2 0 0.5 1 0.5 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 10/26

Representing correlations: Positive correlation example Suppose we want to represent positive correlation between t 1 and s 1. In particular, let us try to represent the possible worlds: S A B s 1 m 1 s 2 n 1 T C D t 1 1 p possible worlds instance probability {s 1, s 2, t 1 } 0.2 {s 1, s 2 } 0.1 {s 1, t 1 } 0.2 {s 1 } 0.1 {s 2, t 1 } 0 {s 2 } 0.2 {t 1 } 0 0.2 X t1 X s1 f 1 0 0 0.4 0 1 0.2 1 0 0 1 1 0.4 X s2 f 2 0 0.5 1 0.5 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 10/26

Representing correlations: Positive correlation example Suppose we want to represent positive correlation between t 1 and s 1. In particular, let us try to represent the possible worlds: S A B s 1 m 1 s 2 n 1 T C D t 1 1 p possible worlds instance probability {s 1, s 2, t 1 } 0.2 {s 1, s 2 } 0.1 {s 1, t 1 } 0.2 {s 1 } 0.1 {s 2, t 1 } 0 {s 2 } 0.2 {t 1 } 0 0.2 X t1 X s1 f 1 0 0 0.4 0 1 0.2 1 0 0 1 1 0.4 X s2 f 2 0 0.5 1 0.5 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 10/26

Representing correlations: Positive correlation example Suppose we want to represent positive correlation between t 1 and s 1. In particular, let us try to represent the possible worlds: S A B s 1 m 1 s 2 n 1 T C D t 1 1 p possible worlds instance probability {s 1, s 2, t 1 } 0.2 {s 1, s 2 } 0.1 {s 1, t 1 } 0.2 {s 1 } 0.1 {s 2, t 1 } 0 {s 2 } 0.2 {t 1 } 0 0.2 X t1 X s1 f 1 0 0 0.4 0 1 0.2 1 0 0 1 1 0.4 X s2 f 2 0 0.5 1 0.5 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 10/26

Representing correlations: Positive correlation example Suppose we want to represent positive correlation between t 1 and s 1. In particular, let us try to represent the possible worlds: S A B s 1 m 1 s 2 n 1 T C D t 1 1 p possible worlds instance probability {s 1, s 2, t 1 } 0.2 {s 1, s 2 } 0.1 {s 1, t 1 } 0.2 {s 1 } 0.1 {s 2, t 1 } 0 {s 2 } 0.2 {t 1 } 0 0.2 X t1 X s1 f 1 0 0 0.4 0 1 0.2 1 0 0 1 1 0.4 X s2 f 2 0 0.5 1 0.5 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 10/26

Probabilistic graphical model representation Definition A probabilistic graphical model is graph whose nodes represent random variables and edges represent correlations [Pea88]. X t1 X s1 X t1 X s1 X t1 X s1 X s2 Complete Ind. Example X s2 Mutual Exclusivity Example X s2 Positive Correlation Example Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 11/26

Outline 1 Motivation. 2 Probabilistic databases with correlations. 3 Query evaluation. 4 Qualitative comparison with other approaches. 5 Experiments. 6 Conclusion and future work. Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 12/26

Query evaluation: Basic ideas Treat intermediate tuples as regular tuples. Carefully represent correlations between intermediate tuples, base tuples and result tuples to construct a probabilistic graphical model. Cast the probability computations resulting from query evaluation to inference in probabilistic graphical models. Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 13/26

Query evaluation example Compute D (S B=C T ) S: A B s 1 m 1 s 2 n 1 f s1, f s2 T : C D t 1 1 p S B=C T f AND i 1,s 1,t 1, f AND i 2,s 2,t 1 A B C D i 1 m 1 1 p i 2 n 1 1 p D (S B=CT ) r 1 D p f t1 f OR r 1,i 1,i 2 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 14/26

Query evaluation example Compute D (S B=C T ) S: A B s 1 m 1 s 2 n 1 f s1, f s2 T : C D t 1 1 p S B=C T f AND i 1,s 1,t 1, f AND i 2,s 2,t 1 A B C D i 1 m 1 1 p i 2 n 1 1 p D (S B=CT ) r 1 D p f t1 f OR r 1,i 1,i 2 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 14/26

Query evaluation example Compute D (S B=C T ) S: A B s 1 m 1 s 2 n 1 f s1, f s2 T : C D t 1 1 p S B=C T f AND i 1,s 1,t 1, f AND i 2,s 2,t 1 A B C D i 1 m 1 1 p i 2 n 1 1 p D (S B=CT ) r 1 D p f t1 f OR r 1,i 1,i 2 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 14/26

Query evaluation example Compute D (S B=C T ) S: A B s 1 m 1 s 2 n 1 f s1, f s2 T : C D t 1 1 p S B=C T f AND i 1,s 1,t 1, f AND i 2,s 2,t 1 A B C D i 1 m 1 1 p i 2 n 1 1 p D (S B=CT ) r 1 D p f t1 f OR r 1,i 1,i 2 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 14/26

Query evaluation example Compute D (S B=C T ) S: A B s 1 m 1 s 2 n 1 f s1, f s2 T : C D t 1 1 p S B=C T f AND i 1,s 1,t 1, f AND i 2,s 2,t 1 A B C D i 1 m 1 1 p i 2 n 1 1 p D (S B=CT ) r 1 D p f t1 f OR r 1,i 1,i 2 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 14/26

Query evaluation example Compute D (S B=C T ) S: A B s 1 m 1 s 2 n 1 f s1, f s2 T : C D t 1 1 p S B=C T f AND i 1,s 1,t 1, f AND i 2,s 2,t 1 A B C D i 1 m 1 1 p i 2 n 1 1 p D (S B=CT ) r 1 D p f t1 f OR r 1,i 1,i 2 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 14/26

Query evaluation example: Probabilistic graphical model Query evaluation problem in Prob. Databases: Compute the probability of the result tuple summed over all possible worlds of the database [DS04]. Equivalent problem in prob. graph. models: marginal probability computation. Thus we can use inference algorithms (e.g., VE [ZP94]). X s1 0 0.4 1 0.6 X t1 0 0.4 1 0.4 X s2 0 0.5 1 0.5 X s1 X t1 X s2 X i1 X i2 X r1 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 15/26

Query evaluation example: Probabilistic graphical model Query evaluation problem in Prob. Databases: Compute the probability of the result tuple summed over all possible worlds of the database [DS04]. Equivalent problem in prob. graph. models: marginal probability computation. Thus we can use inference algorithms (e.g., VE [ZP94]). X s1 0 0.4 1 0.6 X t1 0 0.4 1 0.4 X s2 0 0.5 1 0.5 X s1 X t1 X s2 X i1 X i2 X s1, X s2, X i1, X i2, X t1 X r1 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 15/26

Query evaluation example: Probabilistic graphical model Query evaluation problem in Prob. Databases: Compute the probability of the result tuple summed over all possible worlds of the database [DS04]. Equivalent problem in prob. graph. models: marginal probability computation. Thus we can use inference algorithms (e.g., VE [ZP94]). X t1 X s2 X i1 X t1 0 0 1 0 1 0.4 1 0 0 1 1 0.6 0 0.4 1 0.4 X t1 0 0.5 1 0.5 X s2 X i1 X i2 X s1, X s2, X i1, X i2, X t1 X r1 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 15/26

Outline 1 Motivation. 2 Probabilistic databases with correlations. 3 Query evaluation. 4 Qualitative comparison with other approaches. 5 Experiments. 6 Conclusion and future work. Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 16/26

Qualitative comparison with other probability computation approaches Compute {}(L J R): L A B l 1 m 1 J B C j 1 1 p j 2 1 q X r1 X j1 X l1 X j2 X i1 X i2 X r2 R C D r 1 p 1 r 2 q 1 X i3 X res X i4 Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 17/26

Qualitative comparison with extensional semantics [FR97, DS04] X j1 X l1 X j2 X r1 X i1 X i2 X r2 Is not guaranteed to match possible world semantics. X i3 X i4 Safe plans [DS04] return tree-structured graphical models. X res Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 18/26

Qualitative comparison with extensional semantics [FR97, DS04] 0.5 0.5 0.5 X j1 X l1 X j2 0.5 0.5 X r1 X i1 X i2 X r2 Is not guaranteed to match possible world semantics. X i3 X i4 Safe plans [DS04] return tree-structured graphical models. X res Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 18/26

Qualitative comparison with extensional semantics [FR97, DS04] 0.5 0.5 0.5 X j1 X l1 X j2 0.5 X r1 X i1 0.25 0.25 X i2 0.5 X r2 Is not guaranteed to match possible world semantics. 0.125 X 0.125 i3 X i4 Safe plans [DS04] return tree-structured graphical models. 0.234375 X res Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 18/26

Qualitative comparison with intensional semantics [FR97, DS04] X j1 X l1 X j2 X r1 X i1 X i2 X r2 Work with a subgraph of the graph. model. X i3 X i4 Inference algorithms can exploit graph structure better. X res (X r1 X j1 X l1 ) (X l1 X j2 X r2 ) Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 19/26

Outline 1 Motivation. 2 Probabilistic databases with correlations. 3 Query evaluation. 4 Qualitative comparison with other approaches. 5 Experiments. 6 Conclusion and future work. Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 20/26

Experiments with correlated bibliographic data 1 0.8 Recall 0.6 0.4 0.2 0 0 100 200 300 400 500 600 700 800 900 N Database contains 860 publications from CiteSeer [GBL98]. Searched for publications for given (misspelt) author name. Naturally involves mutual exclusivity correlations. Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 21/26

Experiments with TPC-H data Run-times (sec) 70 60 50 40 30 20 Full Query Bare Query 10 0 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Ran experiments on randomly generated TPC-H dataset of size 10MB. Q7 and Q8 are queries without safe plans [DS04]. Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 22/26

Aggregate Computations probability 0.0014 0.02516 0.0012 0.001 0.0008 0.0006 0.0004 0.0002 0 1.7 1.8 1.9 2 2.1 2.2 average Computes the average of an attribute of 500 synthetic tuples. Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 23/26

Outline 1 Motivation. 2 Probabilistic databases with correlations. 3 Query evaluation. 4 Qualitative comparison with other approaches. 5 Experiments. 6 Conclusion and future work. Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 24/26

Conclusion & future work Introduced probabilistic databases with correlated tuples. Utilized ideas from probabilistic graphical models to represent such correlations. Cast the query evaluation problem as an inference problem. Future Work: Add more representation power: Attribute uncertainty and correlated attributes. Ways to restructure graphical model to speed up inference. Share/reuse computation to speed up inference. Explore the use of approximate inference methods [GRS96, JGJS99]. Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 25/26

Thank You. Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 26/26

References Periklis Andritsos, Ariel Fuxman, and Renee J. Miller. Clean answers over dirty databases. In International Conference on Data Engineering, 2006. Roger Cavallo and Michael Pittarelli. The theory of probabilistic databases. In International Conference on Very Large Data Bases, 1987. Amol Deshpande, Carlos Guestrin, Sam Madden, Joseph M. Hellerstein, and Wei Hong. Model-driven data acquisition in sensor networks. In International Conference on Very Large Data Bases, 2004. Debabrata Dey and Sumit Sarkar. A probabilistic relational model and algebra. ACM Transactions on Database Systems., 1996. Nilesh Dalvi and Dan Suciu. Efficient query evaluation on probabilistic databases. In International Conference on Very Large Data Bases, 2004. Norbert Fuhr and Thomas Rolleke. Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 26/26

A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Transactions on Information Systems, 1997. C. Lee Giles, Kurt Bollacker, and Steve Lawrence. Citeseer: An automatic citation indexing system. In Conference on Digital Libraries, 1998. Walter R. Gilks, Sylvia Richardson, and David J. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman & Hall, 1996. Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 1999. Judaea Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988. Nevin Lianwen Zhang and David Poole. A simple approach to bayesian network computations. In Canadian Conference on Artificial Intelligence, 1994. Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen and Amol Deshpande 26/26