Corroborating Information from Disagreeing Views

Similar documents
The Veracity of Big Data

Probabilistic Models to Reconcile Complex Data from Inaccurate Data Sources

Mobility Analytics through Social and Personal Data. Pierre Senellart

Leveraging Data Relationships to Resolve Conflicts from Disparate Data Sources. Romila Pradhan, Walid G. Aref, Sunil Prabhakar

Influence-Aware Truth Discovery

Approximating Global Optimum for Probabilistic Truth Discovery

arxiv: v1 [cs.db] 14 May 2017

Towards Confidence in the Truth: A Bootstrapping based Truth Discovery Approach

Discovering Truths from Distributed Data

Online Truth Discovery on Time Series Data

Truth Discovery and Crowdsourcing Aggregation: A Unified Perspective

Time Series Data Cleaning

Online Data Fusion. Xin Luna Dong AT&T Labs-Research Xuan Liu National University of Singapore

REGIONAL PATTERNS OF KIS (KNOWLEDGE INTENSIVE SERVICES) ACTIVITIES: A EUROPEAN PERSPECTIVE

The Web is Great. Divesh Srivastava AT&T Labs Research

2017 AP European History Summer Work

Automated Fine-grained Trust Assessment in Federated Knowledge Bases

Estimating Local Information Trustworthiness via Multi-Source Joint Matrix Factorization

THE POWER OF CERTAINTY

Chapter 6. Frequent Pattern Mining: Concepts and Apriori. Meng Jiang CSE 40647/60647 Data Science Fall 2017 Introduction to Data Mining

EMEA Rents and Yields MarketView

Mining Newsgroups Using Networks Arising From Social Behavior by Rakesh Agrawal et al. Presented by Will Lee

Truth Discovery for Spatio-Temporal Events from Crowdsourced Data

Community Detection. fundamental limits & efficient algorithms. Laurent Massoulié, Inria

APPLICATIONS OF MINING HETEROGENEOUS INFORMATION NETWORKS

Terms in Time and Times in Context: A Graph-based Term-Time Ranking Model

Exploitation of Physical Constraints for Reliable Social Sensing

Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent

COMP Intro to Logic for Computer Scientists. Lecture 3

TO BE OR NOT TO BE GREEN? THE CHALLENGE OF URBAN SUSTAINABLE DEVELOPMENT IN THE POST-SOCIALIST CITY. CASE STUDY: CENTRAL AND EASTERN EUROPE

Patch similarity under non Gaussian noise

[Title removed for anonymity]

6.041/6.431 Fall 2010 Final Exam Wednesday, December 15, 9:00AM - 12:00noon.

A Representation of Time Series for Temporal Web Mining

Primal Dual Algorithms for Halfspace Depth

Quantum Simultaneous Contract Signing

Truth Discovery for Passive and Active Crowdsourcing. Jing Gao 1, Qi Li 1, and Wei Fan 2 1. SUNY Buffalo; 2 Baidu Research Big Data Lab

Predicates and Quantifiers

Yahoo! Labs Nov. 1 st, Liangjie Hong, Ph.D. Candidate Dept. of Computer Science and Engineering Lehigh University

CSE 5243 INTRO. TO DATA MINING

An Overview of Alternative Rule Evaluation Criteria and Their Use in Separate-and-Conquer Classifiers

NP-Hardness reductions

Liangjie Hong, Ph.D. Candidate Dept. of Computer Science and Engineering Lehigh University Bethlehem, PA

MobiHoc 2014 MINIMUM-SIZED INFLUENTIAL NODE SET SELECTION FOR SOCIAL NETWORKS UNDER THE INDEPENDENT CASCADE MODEL

CSCI-567: Machine Learning (Spring 2019)

Assessment of Multi-Hop Interpersonal Trust in Social Networks by 3VSL

Analysis Based on SVM for Untrusted Mobile Crowd Sensing

Densest subgraph computation and applications in finding events on social media

AP Human Geography Syllabus

Constraint-based Subspace Clustering

Dynamic Truth Discovery on Numerical Data

Application and verification of the ECMWF products Report 2007

Distributed data management with the rule-based language: Webdamlog

NetBox: A Probabilistic Method for Analyzing Market Basket Data

PATTERNS OF THE ADDED VALUE AND THE INNOVATION IN EUROPE WITH SPECIAL REGARDS ON THE METROPOLITAN REGIONS OF CEE

Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques

Round-off Error Analysis of Explicit One-Step Numerical Integration Methods

Equations and Solutions

Weighted Voting Games

Lecture 15: A Brief Look at PCP

Improving Quality of Crowdsourced Labels via Probabilistic Matrix Factorization

Metropolitan Areas in Italy

DS504/CS586: Big Data Analytics Graph Mining II

Truth and Logic. Brief Introduction to Boolean Variables. INFO-1301, Quantitative Reasoning 1 University of Colorado Boulder

Large-Scale Behavioral Targeting

Your quiz in recitation on Tuesday will cover 3.1: Arguments and inference. Your also have an online quiz, covering 3.1, due by 11:59 p.m., Tuesday.

Temporal Multi-View Inconsistency Detection for Network Traffic Analysis

Chapter 9. PSPACE: A Class of Problems Beyond NP. Slides by Kevin Wayne Pearson-Addison Wesley. All rights reserved.

CSE 5243 INTRO. TO DATA MINING

Spatial Extension of the Reality Mining Dataset

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search

Grade 7/8 Math Circles November 28/29/30, Math Jeopardy

9. PSPACE 9. PSPACE. PSPACE complexity class quantified satisfiability planning problem PSPACE-complete

From statistics to data science. BAE 815 (Fall 2017) Dr. Zifei Liu

9. PSPACE. PSPACE complexity class quantified satisfiability planning problem PSPACE-complete

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Uncovering the Latent Structures of Crowd Labeling

Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques

1) The set of the days of the week A) {Saturday, Sunday} B) {Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Sunday}

Monica Brezzi (with (with Justine Boulant and Paolo Veneri) OECD EFGS Conference Paris 16 November 2016

Aggregating Ordinal Labels from Crowds by Minimax Conditional Entropy. Denny Zhou Qiang Liu John Platt Chris Meek

Distributed Systems. 06. Logical clocks. Paul Krzyzanowski. Rutgers University. Fall 2017

Clustering non-stationary data streams and its applications

Enabling Accurate Analysis of Private Network Data

Homework 1. Spring 2019 (Due Tuesday January 22)

Discovering Geographical Topics in Twitter

Database Privacy: k-anonymity and de-anonymization attacks

AHMED, ALAA HASSAN, M.S. Fusing Uncertain Data with Probabilities. (2016) Directed by Dr. Fereidoon Sadri. 86pp.

Social Choice and Networks

SHAKE MAPS OF STRENGTH AND DISPLACEMENT DEMANDS FOR ROMANIAN VRANCEA EARTHQUAKES

Optimizing Abstaining Classifiers using ROC Analysis. Tadek Pietraszek / 'tʌ dek pɪe 'trʌ ʃek / ICML 2005 August 9, 2005

UC Berkeley International Conference on GIScience Short Paper Proceedings

CSE 20: Discrete Mathematics

The School Geography Curriculum in European Geography Education. Similarities and differences in the United Europe.

The Evolution and Evaluation of City Brand Models and Rankings

Private Key Cryptography. Fermat s Little Theorem. One Time Pads. Public Key Cryptography

Directorate C: National Accounts, Prices and Key Indicators Unit C.3: Statistics for administrative purposes

Statistical Privacy For Privacy Preserving Information Sharing

Distributed Mining of Frequent Closed Itemsets: Some Preliminary Results

Earthquake early warning: Adding societal value to regional networks and station clusters

Transcription:

Corroboration A. Galland WSDM 2010 1/26 Corroborating Information from Disagreeing Views Alban Galland 1 Serge Abiteboul 1 Amélie Marian 2 Pierre Senellart 3 1 INRIA Saclay Île-de-France 2 Rutgers University 3 Télécom ParisTech February 4, 2010, WSDM

Corroboration A. Galland WSDM 2010 Introduction 2/26 Motivating Example What are the capital cities of European countries? France Italy Poland Romania Hungary Alice Paris Rome Warsaw Bucharest Budapest Bob? Rome Warsaw Bucharest Budapest Charlie Paris Rome Katowice Bucharest Budapest David Paris Rome Bratislava Budapest Sofia Eve Paris Florence Warsaw Budapest Sofia Fred Rome?? Budapest Sofia George Rome??? Sofia

Corroboration A. Galland WSDM 2010 Introduction 3/26 Voting Information: redundance France Italy Poland Romania Hungary Alice Paris Rome Warsaw Bucharest Budapest Bob? Rome Warsaw Bucharest Budapest Charlie Paris Rome Katowice Bucharest Budapest David Paris Rome Bratislava Budapest Sofia Eve Paris Florence Warsaw Budapest Sofia Fred Rome?? Budapest Sofia George Rome??? Sofia Frequence P. 0.67 R. 0.80 W. 0.60 Buch. 0.50 Bud. 0.43 R. 0.33 F. 0.20 K. 0.20 Bud. 0.50 S. 0.57 B. 0.20

Corroboration A. Galland WSDM 2010 Introduction 4/26 Evaluating Trustworthiness of Sources Information: redundance, trustworthiness of sources (= average frequence of predicted correctness) France Italy Poland Romania Hungary Trust Alice Paris Rome Warsaw Bucharest Budapest 0.60 Bob? Rome Warsaw Bucharest Budapest 0.58 Charlie Paris Rome Katowice Bucharest Budapest 0.52 David Paris Rome Bratislava Budapest Sofia 0.55 Eve Paris Florence Warsaw Budapest Sofia 0.51 Fred Rome?? Budapest Sofia 0.47 George Rome??? Sofia 0.45 Frequence P. 0.70 R. 0.82 W. 0.61 Buch. 0.53 Bud. 0.46 weighted R. 0.30 F. 0.18 K. 0.19 Bud. 0.47 S. 0.54 by trust B 0.20

Corroboration A. Galland WSDM 2010 Introduction 5/26 Iterative Fixpoint Computation Information: redundance, trustworthiness of sources with iterative fixpoint computation France Italy Poland Romania Hungary Trust Alice Paris Rome Warsaw Bucharest Budapest 0.65 Bob? Rome Warsaw Bucharest Budapest 0.63 Charlie Paris Rome Katowice Bucharest Budapest 0.57 David Paris Rome Bratislava Budapest Sofia 0.54 Eve Paris Florence Warsaw Budapest Sofia 0.49 Fred Rome?? Budapest Sofia 0.39 George Rome??? Sofia 0.37 Frequence P. 0.75 R. 0.83 W. 0.62 Buch. 0.57 Bud. 0.51 weighted R. 0.25 F. 0.17 K. 0.20 Bud. 0.43 S. 0.49 by trust B 0.19

Corroboration A. Galland WSDM 2010 Introduction 6/26 Context and problem Context: Set of sources stating facts (Possible) functional dependencies between facts Fully unsupervised setting: we do not assume any information on truth values of facts or inherent trust in sources Problem: determine which facts are true and which facts are false Real world applications: query answering, source selection, data quality assessment on the web, making good use of the wisdom of crowds

Corroboration A. Galland WSDM 2010 Introduction 7/26 Outline Introduction Model Algorithms Experiments Conclusion

Corroboration A. Galland WSDM 2010 Model 8/26 Outline Introduction Model Algorithms Experiments Conclusion

Corroboration A. Galland WSDM 2010 Model 9/26 General Model Set of facts F = ff 1 :::f n g Examples: Paris is capital of France, Rome is capital of France, Rome is capital of Italy Set of views (= sources) V = fv 1 :::V m g, where a view is a partial mapping from F to {T, F} Example: : Paris is capital of France ^ Rome is capital of France Objective: find the most likely real world W given V where the real world is a total mapping from F to {T, F} Example: Paris is capital of France ^ : Rome is capital of France ^ Rome is capital of Italy ^...

Corroboration A. Galland WSDM 2010 Model 10/26 Generative Probabilistic Model V i, f j '(V i )'(f j )? "(V i )"(f j ) 1 '(V i )'(f j ) 1 "(V i )"(f j ) :W(f j ) W(f j ) '(V i )'(f j ): probability that V i forgets f j "(V i )"(f j ): probability that V i makes an error on f j Number of parameters: n + 2(n + m) Size of data: 'nm with ' the average forget rate

Corroboration A. Galland WSDM 2010 Model 11/26 Obvious Approach Method: use this generative model to find the most likely parameters given the data Inverse the generative model to compute the probability of a set of parameters given the data Not practically applicable: Non-linearity of the model and boolean parameter W(f j ) ) equations for inversing the generative model very complex Large number of parameters (n and m can both be quite large) ) Any exponential technique unpractical ) Heuristic fix-point algorithms

Corroboration A. Galland WSDM 2010 Algorithms 12/26 Outline Introduction Model Algorithms Experiments Conclusion

Corroboration A. Galland WSDM 2010 Algorithms 13/26 Baselines Counting (does not look at negative statements, popularity) 8 >< >: T F jfv i : V i (f j ) = T gj if max f jfv i : V i (f ) = T gj > otherwise Voting (adapted only with negative statements) 8 >< >: T F jfv i : V i (f j ) = T gj if jfv i : V i (f j ) = T _ V i (f j ) = F gj > 0:5 otherwise TruthFinder [YHY07]: heuristic fix-point method from the literature

Corroboration A. Galland WSDM 2010 Algorithms 14/26 3-Estimates Iterative estimation of 3 kind of parameters: truth value of facts error rate or trustworthiness of sources hardness of facts Tricky normalization to ensure stability

Corroboration A. Galland WSDM 2010 Algorithms 15/26 Functional dependencies So far, the models and algorithms are about positive and negative statements, without correlation between facts How to deal with functional dependencies (e.g., capital cities)? pre-filtering: When a view states a value, all other values governed by this FD are considered stated false. If I say that Paris is the capital of France, then I say that neither Rome nor Lyon nor... is the capital of France. post-filtering: Choose the best answer for a given FD.

Corroboration A. Galland WSDM 2010 Experiments 16/26 Outline Introduction Model Algorithms Experiments Conclusion

Corroboration A. Galland WSDM 2010 Experiments 17/26 Datasets Synthetic dataset: large scale and higly customizable Real-world datasets: General-knowledge quiz Biology 6th-grade test Search-engines results Hubdub

Corroboration A. Galland WSDM 2010 Experiments 18/26 Hubdub (1/2) http://www.hubdub.com/ 357 questions, 1 to 20 answers, 473 participants

Corroboration A. Galland WSDM 2010 Experiments 19/26 Hubdub (2/2) Number of errors (no post-filtering) Number of errors (with post-filtering) Voting 278 292 Counting 340 327 TruthFinder 458 274 3-Estimates 272 270

Corroboration A. Galland WSDM 2010 Experiments 20/26 General-Knowledge Quiz (1/2) http://www.madore.org/~david/quizz/quizz1.html 17 questions, 4 to 14 answers, 601 participants

Corroboration A. Galland WSDM 2010 Experiments 21/26 General-Knowledge Quiz (2/2) Number of errors (no post-filtering) Number of errors (with post-filtering) Voting 11 6 Counting 12 6 TruthFinder - - 3-Estimates 9 0

Corroboration A. Galland WSDM 2010 Conclusion 22/26 Outline Introduction Model Algorithms Experiments Conclusion

Corroboration A. Galland WSDM 2010 Conclusion 23/26 In brief We believe truth discovery is an important problem, we do not claim we have solved it completely Collection of fix-point methods (see paper), one of them (3-Estimates) performing remarkably and regularly well Cool real-world applications! All code and datasets available from http://datacorrob.gforge.inria.fr/

Corroboration A. Galland WSDM 2010 Conclusion 24/26 Thanks. Foundations of Web data management

Corroboration A. Galland WSDM 2010 25/26 Perspectives Exploiting dependencies between sources [DBES09] Numerical values (1:77m and 1:78m cannot be seen as two completely contradictory statements for a height) No clear functional dependencies, but a limited number of values for a given object (e.g., phone numbers) Pre-existing trust, e.g., in a social network Clustering of facts, each source being trustworthy for a given field

Corroboration A. Galland WSDM 2010 26/26 References I Xin Luna Dong, Laure Berti-Equille, and Divesh Srivastava. Integrating conflicting data: The role of source dependence. In Proc. VLDB, Lyon, France, August 2009. Xiaoxin Yin, Jiawei Han, and Philip S. Yu. Truth discovery with multiple conflicting information providers on the Web. In Proc. KDD, San Jose, California, USA, August 2007.