The Web is Great. Divesh Srivastava AT&T Labs Research


A Lot of Information on the Web

Information Can Be Erroneous
The story, marked "Hold for release - Do not use", was sent in error to the news service's thousands of corporate clients.

Information Can Be Erroneous
Maurice Jarre (1924-2009), French conductor and composer
"One could say my life itself has been one long soundtrack. Music was my life, music brought me to life, and music is how I will be remembered long after I leave this life. When I die there will be a final waltz playing in my head and that only I can hear."
2:29, 30 March 2009

False Information Can Be Propagated
UA's bankruptcy
Chicago Tribune, 2002
Sun-Sentinel.com
Google News
Bloomberg.com
The UAL stock plummeted to $3 from $12.5

Study on Two Domains
Belief of clean data
Poor data quality can have big impact

          #Sources   Period    #Objects   #Local-attrs   #Global-attrs   Considered items
Stock     55         7/2011    1000*20    333            153             16000*20
Flight    38         12/2011   1200*31    43             15              7200*31

Study on Two Domains: Stock
Search "stock price quotes"
Sources: 200 (search results), 89 (deep web), 76 (GET method), 55 (no JavaScript)
1000 "Objects": a stock with a particular symbol on a particular day
  30 from Dow Jones Index
  100 from NASDAQ100 (3 overlaps)
  873 from Russell 3000
Attributes: 333 (local), 153 (global), 21 (provided by > 1/3 sources), 16 (no change after market close)

Study on Two Domains: Flight
Search "flight status"
Sources: 38
  3 airline websites (AA, UA, Continental)
  8 airport websites (SFO, DEN, etc.)
  27 third-party websites (Orbitz, Travelocity, etc.)
1200 "Objects": a flight with a particular flight number on a particular day from a particular departure city
  Departing or arriving at the hub airports of AA/UA/Continental
Attributes: 43 (local), 15 (global), 6 (provided by > 1/3 sources)
  scheduled dept/arr time, actual dept/arr time, dept/arr gate

Q1. Is There a Lot of Redundant Data?

Q2. Is the Data Consistent?
Tolerance to 1% value difference
Inconsistency on 50% items after removing StockSmart

Q2. Is the Data Consistent? (II)
Entropy measures distribution of different values
Quite low entropy: one value provided more often than others
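To make the entropy measure concrete, here is a minimal sketch that computes the Shannon entropy of the values provided for a single data item. The slide does not give the exact formula or the log base, so base 2 and the function name `value_entropy` are assumptions for illustration; the sample values are taken from the flight tables later in the talk.

```python
from collections import Counter
from math import log2


def value_entropy(values):
    """Shannon entropy (in bits, an assumed base) of the values for one data item."""
    counts = Counter(values)
    total = len(values)
    # Each distinct value contributes -p * log2(p), where p is its share of sources.
    return sum(-(c / total) * log2(c / total) for c in counts.values())


consistent = ["9:20AM", "9:20AM", "9:20AM"]   # all sources agree
dominant = ["6:15PM", "6:15PM", "6:22PM"]     # one value provided more often
scattered = ["9:40PM", "9:52PM", "8:33PM"]    # every source differs

for name, vals in [("consistent", consistent),
                   ("dominant", dominant),
                   ("scattered", scattered)]:
    print(name, round(value_entropy(vals), 3))
```

Entropy is zero when all sources agree and grows as the values spread out, so "quite low entropy" corresponds to the case where one value dominates.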

Q2. Is the Data Consistent? (III)
Deviation measures difference of numerical values
High deviation: 13.4 for Stock, 13.1 min for Flight

Why Such Inconsistency? I. Semantic Ambiguity
Screenshots from Yahoo! Finance and Nasdaq
Day's Range: 93.80 - 95.71 (same on both)
52-week range: 25.38 - 95.71 on one, 25.38 - 93.72 on the other

Why Such Inconsistency? II. Instance Ambiguity

Why Such Inconsistency? III. Out-of-Date Data
4:05 pm vs. 3:57 pm

Why Such Inconsistency? IV. Unit Error
76,821,000 vs. 76.82B

Why Such Inconsistency? V. Pure Error

FlightView   FlightAware   Orbitz
6:15 PM      6:15 PM       6:22 PM
9:40 PM      8:33 PM       9:54 PM

Why Such Inconsistency?
Random sample of 20 data items and 5 items with the largest # of values in each domain

Q3. Do Sources Have High Accuracy?
Not high on average: .86 for Stock and .8 for Flight
Gold standard
  Stock: vote on data from Google Finance, Yahoo! Finance, MSN Money, NASDAQ, Bloomberg
  Flight: from airline websites

Q3-2. What About Authoritative Sources?
Reasonable but not so high accuracy
Medium coverage

Q4. Is There Copying or Data Sharing Between Deep Web Sources?

Q4-2. Is Copying or Data Sharing Mainly on Accurate Data?

Basic Solution: Voting
Only 70% correct values are provided by over half of the sources
.908 voting precision for Stock; i.e., wrong values for 1500 data items
.864 voting precision for Flight; i.e., wrong values for 1000 data items

Improvement I. Using Source Accuracy

            S1        S2        S3
Flight 1    7:02PM    6:40PM    7:02PM
Flight 2    5:43PM    5:43PM    5:50PM
Flight 3    9:20AM    9:20AM    9:20AM
Flight 4    9:40PM    9:52PM    8:33PM
Flight 5    6:15PM    6:15PM    6:22PM

Higher accuracy; more trustable
Naïve voting obtains an accuracy of 80%
Considering accuracy obtains an accuracy of 100%
Challenges:
1. How to decide source accuracy?
2. How to leverage accuracy in voting?
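As a rough illustration of naïve voting on the three-source table above, the sketch below takes the most frequent value per flight and returns `None` when the top values tie. The names `CLAIMS` and `majority_vote` are hypothetical, and treating a tie as "undecided" is an assumption; the slide does not say how ties are broken.

```python
from collections import Counter

# Values from the three-source flight table on the slide.
CLAIMS = {
    "Flight 1": {"S1": "7:02PM", "S2": "6:40PM", "S3": "7:02PM"},
    "Flight 2": {"S1": "5:43PM", "S2": "5:43PM", "S3": "5:50PM"},
    "Flight 3": {"S1": "9:20AM", "S2": "9:20AM", "S3": "9:20AM"},
    "Flight 4": {"S1": "9:40PM", "S2": "9:52PM", "S3": "8:33PM"},
    "Flight 5": {"S1": "6:15PM", "S2": "6:15PM", "S3": "6:22PM"},
}


def majority_vote(claims):
    """Return the most frequent value, or None when first place is tied."""
    ranked = Counter(claims.values()).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # assumed tie handling: leave the item undecided
    return ranked[0][0]


fused = {item: majority_vote(claims) for item, claims in CLAIMS.items()}
decided = sum(value is not None for value in fused.values())

for item, value in fused.items():
    print(item, value)
print(f"decided {decided} of {len(fused)} items")
```

Voting settles four of the five flights. Flight 4 has three distinct values, so a plain vote has nothing to go on, which is where source accuracy can help.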

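Source Accuracy: Bayesian Analysis
Goal: $\Pr(v_i(D)\ \text{true} \mid \Phi_D(\bar{S}))$, for each $D$, $v_i(D)$
According to Bayes Rule, we need to know $\Pr(\Phi_D(\bar{S}) \mid v_i(D)\ \text{true})$ and $\Pr(v_i(D)\ \text{true})$, for each $v_i(D)$
$\Pr(\Phi_D(\bar{S}) \mid v_i(D)\ \text{true})$ can be computed as:
\[ \prod_{S \in \bar{S}(v_i(D))} A(S) \cdot \prod_{S \in \bar{S} \setminus \bar{S}(v_i(D))} \frac{1 - A(S)}{n} \]
\[ \Pr(v_i(D)\ \text{true} \mid \Phi_D(\bar{S})) = \frac{e^{Conf(v_i(D))}}{\sum_{v_0 \in val(D)} e^{Conf(v_0(D))}} \]
\[ Conf(v_i(D)) = \sum_{S \in \bar{S}(v_i(D))} \ln \frac{n A(S)}{1 - A(S)} \]
\[ A(S) = \mathrm{Avg}_{v_i(D) \in S} \Pr(v_i(D)\ \text{true} \mid \Phi_D(\bar{S})) \]

Computing Source Accuracy
Source accuracy $A(S)$:
\[ A(S) = \mathrm{Avg}_{v_i(D) \in S} \Pr(v_i(D)\ \text{true} \mid \Phi) \]
$v_i(D) \in S$: $S$ provides value $v_i$ on data item $D$
$\Phi$: observations on all data items by sources $\bar{S}$
$\Pr(v_i(D)\ \text{true} \mid \Phi)$: probability of $v_i(D)$ being true
How to compute $\Pr(v_i(D)\ \text{true} \mid \Phi)$?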

Using Source Accuracy in Data Fusion
Input: data item $D$, $val(D) = \{v_0, v_1, \ldots, v_n\}$, $\Phi$
Output: $\Pr(v_i(D)\ \text{true} \mid \Phi)$, for $i = 0, \ldots, n$ (sum = 1)
Based on Bayes Rule, need $\Pr(\Phi \mid v_i(D)\ \text{true})$
Under independence, need $\Pr(\Phi_D(S) \mid v_i(D)\ \text{true})$
  If $S$ provides $v_i$: $\Pr(\Phi_D(S) \mid v_i(D)\ \text{true}) = A(S)$
  If $S$ does not: $\Pr(\Phi_D(S) \mid v_i(D)\ \text{true}) = (1 - A(S))/n$
Challenge: How to handle inter-dependence between source accuracy and value probability?

Data Fusion Using Source Accuracy
Source accuracy:
\[ A(S) = \mathrm{Avg}_{v(D) \in S} \Pr(v(D) \mid \Phi) \]
Value probability:
\[ \Pr(v(D) \mid \Phi) = \frac{e^{C(v(D))}}{\sum_{v_0 \in val(D)} e^{C(v_0(D))}} \]
Source vote count:
\[ A'(S) = \ln \frac{n A(S)}{1 - A(S)} \]
Value vote count:
\[ C(v(D)) = \sum_{S \in \bar{S}(v(D))} A'(S) \]
Continue until source accuracy converges
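The iteration between source accuracy and value probability can be sketched as follows, applied to the three-source flight table from the earlier slide. This is an illustrative sketch, not the authors' implementation: the number of false values `N_FALSE`, the starting accuracy, the clamping of accuracies to [0.01, 0.99], and normalizing only over the values that were actually observed are all assumptions made here to keep the example small.

```python
from math import exp, log

CLAIMS = {
    "Flight 1": {"S1": "7:02PM", "S2": "6:40PM", "S3": "7:02PM"},
    "Flight 2": {"S1": "5:43PM", "S2": "5:43PM", "S3": "5:50PM"},
    "Flight 3": {"S1": "9:20AM", "S2": "9:20AM", "S3": "9:20AM"},
    "Flight 4": {"S1": "9:40PM", "S2": "9:52PM", "S3": "8:33PM"},
    "Flight 5": {"S1": "6:15PM", "S2": "6:15PM", "S3": "6:22PM"},
}
N_FALSE = 10            # assumed n: number of false values per data item
INITIAL_ACCURACY = 0.8  # assumed starting accuracy for every source


def fuse(claims, n=N_FALSE, init_acc=INITIAL_ACCURACY, max_rounds=100, tol=1e-6):
    sources = sorted({s for by_src in claims.values() for s in by_src})
    acc = {s: init_acc for s in sources}
    probs = {}
    for _ in range(max_rounds):
        # Source vote count: A'(S) = ln(n * A(S) / (1 - A(S)))
        vote = {s: log(n * a / (1 - a)) for s, a in acc.items()}
        probs = {}
        for item, by_src in claims.items():
            # Value vote count: C(v) = sum of A'(S) over sources providing v
            count = {}
            for s, v in by_src.items():
                count[v] = count.get(v, 0.0) + vote[s]
            # Value probability: softmax of C over the observed values
            top = max(count.values())
            weights = {v: exp(c - top) for v, c in count.items()}
            z = sum(weights.values())
            probs[item] = {v: w / z for v, w in weights.items()}
        # Source accuracy: average probability of the values the source provides
        new_acc = {}
        for s in sources:
            ps = [probs[item][by_src[s]]
                  for item, by_src in claims.items() if s in by_src]
            # Clamp (an assumption) so the log above stays finite
            new_acc[s] = min(max(sum(ps) / len(ps), 0.01), 0.99)
        change = max(abs(new_acc[s] - acc[s]) for s in sources)
        acc = new_acc
        if change < tol:
            break
    fused = {item: max(p, key=p.get) for item, p in probs.items()}
    return fused, acc


fused, accuracy = fuse(CLAIMS)
for item, value in fused.items():
    print(item, value)
print({s: round(a, 3) for s, a in accuracy.items()})
```

With these assumed settings S1 ends up with the highest estimated accuracy, so its value wins on Flight 4, the one item a plain vote cannot decide.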

Results on Stock Data (I)
Sources ordered by recall (coverage * accuracy)
Among various methods, the Bayesian-based method (Accu) performs best at the beginning, but in the end obtains a final precision (= recall) of .900, worse than Vote (.908)

Results on Stock Data (II)
AccuSim obtains a final precision of .929, higher than Vote and any other method (around .908)
This translates to 350 more correct values

Results on Stock Data (III)

Results on Flight Data
Accu/AccuSim obtain final precision of .831/.833, both lower than Vote (.857)
WHY??? What is that magic source?

Copying on Erroneous Data

            S1        S2        S3        S4        S5
Flight 1    7:02PM    6:40PM    7:02PM    7:02PM    8:02PM
Flight 2    5:43PM    5:43PM    5:50PM    5:50PM    5:50PM
Flight 3    9:20AM    9:20AM    9:20AM    9:20AM    9:20AM
Flight 4    9:40PM    9:52PM    8:33PM    8:33PM    8:33PM
Flight 5    6:15PM    6:15PM    6:22PM    6:22PM    6:22PM

"A lie told often enough becomes the truth." Vladimir Lenin
Higher accuracy; more trustable
Considering source accuracy can be worse when there is copying

Improvement II. Ignoring Copied Data
(same five-source table as above)
It is important to detect copying and ignore copied values in fusion
Challenges:
1. How to detect copying?
2. How to leverage copying in voting?
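The effect of copying on a plain vote can be seen directly on the five-source table. The sketch below votes once over all sources and once with S4 and S5 left out. Treating S4 and S5 as the copiers is an assumption made here for illustration (the table shows them agreeing with S3 on nearly every flight); in practice copying would have to be detected first. The names `vote` and `ASSUMED_COPIERS` are hypothetical.

```python
from collections import Counter

# Values from the five-source flight table on the slide.
CLAIMS = {
    "Flight 1": {"S1": "7:02PM", "S2": "6:40PM", "S3": "7:02PM", "S4": "7:02PM", "S5": "8:02PM"},
    "Flight 2": {"S1": "5:43PM", "S2": "5:43PM", "S3": "5:50PM", "S4": "5:50PM", "S5": "5:50PM"},
    "Flight 3": {"S1": "9:20AM", "S2": "9:20AM", "S3": "9:20AM", "S4": "9:20AM", "S5": "9:20AM"},
    "Flight 4": {"S1": "9:40PM", "S2": "9:52PM", "S3": "8:33PM", "S4": "8:33PM", "S5": "8:33PM"},
    "Flight 5": {"S1": "6:15PM", "S2": "6:15PM", "S3": "6:22PM", "S4": "6:22PM", "S5": "6:22PM"},
}
ASSUMED_COPIERS = frozenset({"S4", "S5"})  # assumption: both copy from S3


def vote(claims, ignore=frozenset()):
    """Most frequent value among non-ignored sources, or None on a tie."""
    ranked = Counter(v for s, v in claims.items() if s not in ignore).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


with_copies = {item: vote(c) for item, c in CLAIMS.items()}
without_copies = {item: vote(c, ASSUMED_COPIERS) for item, c in CLAIMS.items()}

for item in CLAIMS:
    print(item, with_copies[item], without_copies[item])
```

With all five sources counted, S3's values win on Flights 2, 4, and 5 purely because they are repeated. Once the assumed copies are ignored, the vote returns to the three-source outcome.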

Copying?
Are Source 1 and Source 2 dependent? Not necessarily

        Source 1 on USA Presidents    Source 2 on USA Presidents
1st     George Washington             George Washington
2nd     John Adams                    John Adams
3rd     Thomas Jefferson              Thomas Jefferson
4th     James Madison                 James Madison
...
41st    George H.W. Bush              George H.W. Bush
42nd    William J. Clinton            William J. Clinton
43rd    George W. Bush                George W. Bush
44th    Barack Obama                  Barack Obama

Copying? Common Errors
Are Source 1 and Source 2 dependent? Very likely

        Source 1 on USA Presidents    Source 2 on USA Presidents
1st     George Washington             George Washington
2nd     Benjamin Franklin             Benjamin Franklin
3rd     John F. Kennedy               John F. Kennedy
4th     Abraham Lincoln               Abraham Lincoln
...
41st    George W. Bush                George W. Bush
42nd    Hillary Clinton               Hillary Clinton
43rd    Dick Cheney                   Dick Cheney
44th    Barack Obama                  John McCain

Copying Detection: Bayesian Analysis
Data items shared by S1 and S2 fall into three groups: different values ($O_d$); same values that are TRUE ($O_t$); same values that are FALSE ($O_f$)
Goal: $\Pr(S_1 \perp S_2 \mid \Phi)$, $\Pr(S_1 \sim S_2 \mid \Phi)$ (sum = 1)
According to Bayes Rule, we need to know $\Pr(\Phi \mid S_1 \perp S_2)$, $\Pr(\Phi \mid S_1 \sim S_2)$
Key: compute $\Pr(\Phi_D \mid S_1 \perp S_2)$, $\Pr(\Phi_D \mid S_1 \sim S_2)$ for each $D \in S_1 \cap S_2$

Pr       Independence                             Copying
$O_t$    $A^2$                                    $A \cdot c + A^2 (1 - c)$
$O_f$    $\frac{(1-A)^2}{n}$                      $(1 - A) \cdot c + \frac{(1-A)^2}{n} (1 - c)$
$O_d$    $P_d = 1 - A^2 - \frac{(1-A)^2}{n}$      $P_d (1 - c)$

$A$: source accuracy; $n$: #wrong values; $c$: copy rate
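The table of per-item probabilities can be turned into a posterior probability of copying with a few lines of code. The sketch below is illustrative only: the slide gives no parameter values or prior, so `acc=0.8`, `n=100`, `c=0.8`, and a prior of 0.5 are assumptions, as is the function name `copy_posterior`. The counts come from the two presidents slides: eight shared correct values in the first case; one shared correct value, six shared errors, and one disagreement in the second.

```python
from math import exp, log


def copy_posterior(k_true, k_false, k_diff, acc=0.8, n=100, c=0.8, prior=0.5):
    """Posterior probability of copying, given counts of shared items.

    k_true / k_false: items where both sources give the same true / false value.
    k_diff: items where the two sources give different values.
    All default parameter values are assumptions for illustration.
    """
    # Per-item likelihoods under independence
    ind_t = acc ** 2
    ind_f = (1 - acc) ** 2 / n
    p_d = 1 - ind_t - ind_f
    # Per-item likelihoods under copying with copy rate c
    cop_t = acc * c + ind_t * (1 - c)
    cop_f = (1 - acc) * c + ind_f * (1 - c)
    cop_d = p_d * (1 - c)
    # Bayes rule in log-odds form, to stay numerically stable
    log_odds = (log(prior / (1 - prior))
                + k_true * log(cop_t / ind_t)
                + k_false * log(cop_f / ind_f)
                + k_diff * log(cop_d / p_d))
    if log_odds >= 0:
        return 1 / (1 + exp(-log_odds))
    e = exp(log_odds)
    return e / (1 + e)


all_correct = copy_posterior(k_true=8, k_false=0, k_diff=0)
common_errors = copy_posterior(k_true=1, k_false=6, k_diff=1)
all_different = copy_posterior(k_true=0, k_false=0, k_diff=8)

print(round(all_correct, 3), round(common_errors, 3), round(all_different, 6))
```

Under these assumed settings, sharing only correct values gives moderate evidence of copying (a posterior of roughly 0.8), while sharing false values pushes the posterior to nearly 1, and disagreeing on every item pushes it close to 0. This matches the "not necessarily" versus "very likely" contrast in the presidents example.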

Results on Flight Data
AccuCopy obtains a final precision of .943, much higher than Vote (.864)
This translates to 570 more correct values

Results on Flight Data (II)

Take Aways
Deep Web data is not fully trustable
Deep Web sources have different accuracies
Copying is common
Truth finding on the Deep Web can leverage source accuracy, copying relationships, and value similarity

Important Direction: Source Selection
Peaks happen before integrating all sources
How to find the best set of sources while balancing quality gain and integration cost?

Acknowledgements
Joint work with:
Xin Luna Dong, Yifan Hu, Ken Lyons (AT&T)
Laure Berti-Equille (IRD)
Xian Li, Weiyi Meng (SUNY Binghamton)

Selected research papers:
Truth Finding on the Deep Web: Is the Problem Solved? PVLDB 2013.
Global detection of complex copying relationships between sources. PVLDB 2010.
Integrating conflicting data: the role of source dependence. PVLDB 2009.