Believe it Today or Tomorrow? Detecting Untrustworthy Information from Dynamic Multi-Source Data

SDM 15, Vancouver, CAN
Believe it Today or Tomorrow? Detecting Untrustworthy Information from Dynamic Multi-Source Data
Houping Xiao (1), Yaliang Li (1), Jing Gao (1), Fei Wang (2), Liang Ge (3), Wei Fan (4), Long Vu (5), and Deepak Turaga (5)
(1) SUNY at Buffalo; (2) University of Connecticut; (3) Google; (4) Baidu Big Data Lab; (5) IBM T.J. Watson

Outline
- Motivation
- Challenges
- Proposed Two-Step Framework
  - Step-1: Joint Nonnegative Tensor Factorization
  - Step-2: Inconsistency Score Calculation
- Experiments
- Conclusions

Motivation
- Multiple information sources. Example: hotel ratings can be obtained from multiple websites, such as Priceline, Orbitz, and Tripadvisor.
- Questions: Which pieces of information are trustworthy? Which objects receive reliable information?
- Our solution: calculate the degree to which each object receives inconsistent information across sources. A lower degree of inconsistency indicates more reliable information.

Motivating Example
[Figure: motivating example]

How to Find Inconsistent Ratings
- Easy comparisons fall short:
  - Aggregating the ratings of all users into average ratings loses information.
  - Users are not matched across platforms, so the ratings of individual users cannot be compared.
- Our solution: identify user groups and compare at the group level.
  - In each source, users can be partitioned into groups so that users in the same group share similar rating patterns over objects.
  - The underlying user groups and the ratings given by each group should be consistent across sources.

Importance of Time
- Observation: multi-source data can arrive continuously, with constantly changing distributions.
- Motivating example: [Figure: four panels, one per period: Sep-Nov, Dec-Feb, Mar-May, and Jun-Aug]

Solutions
- Baseline: model each snapshot separately. This is simple, but the temporal connection between timestamps is lost.
- Our solution: consider behavior at the timestamp-cluster level (e.g., hotel ratings could change seasonally).
  - In each source, timestamps can be clustered.
  - User behavior within the same timestamp cluster should be similar.

Proposed Framework
[Figure: data collection yields one rating tensor per source (Tripadvisor, Orbitz, Priceline), with axes labeled user and hotel. Step-1, Joint Nonnegative Tensor Factorization, decomposes each tensor into a core tensor, a user group assignment matrix, an identity matrix, and a timestamp cluster assignment matrix. Step-2, Inconsistency Score Computation, produces the inconsistency vector.]

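Joint Nonnegative Tensor Factorization
[Figure: each source tensor $X_s$, $s = 1, \dots, M$, is factorized into a core tensor $G_s$ and factor matrices $U_{s,1}$, $U_{s,2}$, $U_{s,3}$.]
Objective:
$$\min_{G_s,\, U_{s,i} \ge 0} \; \sum_{s=1}^{M} \big( L_{X_s} + \alpha\, \Omega(G_s, G^{*}) \big)$$
- $L_{X_s} = \big\| X_s - G_s \prod_i \times_i U_{s,i} \big\|_F^2$ measures the factorization error of each tensor.
- $\Omega(G_s, G^{*}) = \| G_s - G^{*} \|_F^2$, where $G^{*} = \frac{1}{M} \sum_{s=1}^{M} G_s$, is a regularization term proposed to learn the consensus information.
- $\alpha$ is the regularization parameter.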
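The objective on this slide can be evaluated numerically. The following is a minimal NumPy sketch, not the authors' implementation: the function names `tucker_reconstruct` and `joint_objective` are hypothetical, all core tensors are assumed to share one shape, and the consensus tensor is taken to be the mean of the per-source core tensors, as the slide defines it. The sketch only evaluates the objective; it does not carry out the nonnegative optimization.

```python
import numpy as np

def tucker_reconstruct(G, U1, U2, U3):
    """Return G x_1 U1 x_2 U2 x_3 U3 for a 3-way core tensor G of shape (C, K, D)."""
    # U1: (N, C), U2: (K', K), U3: (T, D) -> result has shape (N, K', T).
    return np.einsum("ckd,nc,jk,td->njt", G, U1, U2, U3)

def joint_objective(Xs, Gs, Us, alpha):
    """Sum over sources of factorization error plus alpha times distance to the consensus."""
    G_mean = np.mean(Gs, axis=0)  # consensus core tensor: mean over sources
    total = 0.0
    for X, G, (U1, U2, U3) in zip(Xs, Gs, Us):
        fit = np.sum((X - tucker_reconstruct(G, U1, U2, U3)) ** 2)
        reg = np.sum((G - G_mean) ** 2)
        total += fit + alpha * reg
    return total

# Toy data (assumed sizes): two sources with 5 and 6 users, 3 objects, 4 timestamps.
rng = np.random.default_rng(0)
C, K, D = 2, 3, 2
G = rng.random((C, K, D))
Us = [(rng.random((n, C)), np.eye(K), rng.random((4, D))) for n in (5, 6)]
Xs = [tucker_reconstruct(G, *U) for U in Us]

exact = joint_objective(Xs, [G, G], Us, alpha=0.5)            # identical cores, exact fit
perturbed = joint_objective(Xs, [G, G + 0.1], Us, alpha=0.5)  # second core disagrees
```

With identical core tensors that reproduce the data exactly, both terms vanish; perturbing one core tensor increases both the fit error and the consensus penalty.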

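Inconsistency Score Computation
Inconsistency score of object $k$:
$$I_k = \| S_k - S_{\mathrm{median}} \|_2$$
where
- $S_k = \big( S_k(1), \dots, S_k(s), \dots, S_k(M) \big)$,
- $S_k(s) = \mathrm{similarity}\big( G_s(:, k, :),\; G^{*}(:, k, :) \big)$,
- $S_{\mathrm{median}} = \mathrm{median}\{ S_k,\ k = 1, \dots, K \}$.
[Figure: the $C \times D$ slice for object $k$ is taken from each core tensor $G_1, \dots, G_s, \dots, G_M$ and from the consensus tensor $G^{*}$.]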
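The score computation can be sketched in a few lines of NumPy. Two choices below are assumptions, since the slide does not specify them: the similarity function is taken to be cosine similarity between flattened slices, and the median is taken element-wise over objects, separately for each source. The function names are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity of two vectors (assumed similarity measure); 0.0 if either is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def inconsistency_scores(Gs):
    """Return one score per object from a list of core tensors of shape (C, K, D)."""
    G_mean = np.mean(Gs, axis=0)             # consensus core tensor
    M, K = len(Gs), G_mean.shape[1]
    S = np.empty((K, M))                     # S[k, s]: similarity of object k in source s
    for s, G in enumerate(Gs):
        for k in range(K):
            S[k, s] = cosine_similarity(G[:, k, :].ravel(), G_mean[:, k, :].ravel())
    S_median = np.median(S, axis=0)          # element-wise median over objects (assumption)
    return np.linalg.norm(S - S_median, axis=1)

# Toy example: three sources, five objects; object 2 behaves differently in the first source.
base = np.ones((2, 5, 2))
odd = base.copy()
odd[:, 2, :] = [[1.0, 0.0], [0.0, 0.0]]
scores = inconsistency_scores([odd, base, base])
```

In this toy setting only object 2 deviates from the median similarity profile, so it receives the largest score while the remaining objects score approximately zero.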

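Streaming Data
Observation: multi-source data arrives continuously.
Solution:
- Step 1: obtain $\{U^{o}_{s,t,i}\}$ at time $T$ based on $\{G_{s,t-1}\}$:
$$\min_{\{U_{s,t,i}\}} \; \sum_{s=1}^{M} \big\| X_{s,t} - G_{s,t-1} \prod_i \times_i U_{s,t,i} \big\|_F^2$$
- Step 2: use $\{U^{o}_{s,t,i}\}$ to obtain $\{G_{s,t}\}$ at time $T$:
$$\min_{G_{s,t}} \; \sum_{s=1}^{M} \sum_{t=1}^{T} \big\| X_{s,t} - G_{s,t} \prod_i \times_i U^{o}_{s,t,i} \big\|_F^2 + \alpha \big\| G_{s,t} - G^{*}_{T-1} \big\|_F^2$$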

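Missing Data
Observation: many users may give ratings for only a few hotels at some specific timestamps.
Solution: let $K_s = \{ (i, j, k) : X^{s}_{ijk} \text{ is available} \}$ be the set of index triples of the available entries.
Objective function:
$$\min_{G_s,\, U_{s,i} \ge 0} \; \sum_{s=1}^{M} \sum_{(i,j,k) \in K_s} \Big( X^{s}_{ijk} - \big[ G_s \prod_i \times_i U_{s,i} \big]_{ijk} \Big)^2 + \alpha \big( G^{s}_{ijk} - G^{*}_{ijk} \big)^2$$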
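Restricting the error to available entries can be illustrated as follows. This is a sketch under stated assumptions: missing entries are marked with NaN, the helper names `observed_triples` and `masked_squared_error` are hypothetical, and only the data-fit term is computed (the regularization term is left out).

```python
import numpy as np

def observed_triples(X):
    """Index triples (i, j, k) of the entries that are available (not NaN)."""
    return np.argwhere(~np.isnan(X))

def masked_squared_error(X, X_hat, triples):
    """Squared error summed only over the available entries listed in `triples`."""
    idx = tuple(triples.T)  # convert the (n, 3) array to an index tuple for fancy indexing
    return float(np.sum((X[idx] - X_hat[idx]) ** 2))

# Toy tensor of shape (1, 2, 2) with one missing rating.
X = np.array([[[1.0, np.nan], [2.0, 3.0]]])
X_hat = np.full(X.shape, 2.0)  # stand-in for a reconstruction
triples = observed_triples(X)
err = masked_squared_error(X, X_hat, triples)
```

Because the sum touches only the listed triples, the amount of work scales with the number of available entries rather than with the full tensor size.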

Experiment Set-up
Datasets:
- Synthetic datasets
- Real-world datasets
  - Hotel Rating
  - Network Traffic Flow
  - Weather Forecast

Effectiveness Comparison
F-measure comparison w.r.t. outlier percentage
Method   2%     3%     5%     6%     8%     10%
NHC      1.000  0.500  0.667  0.500  0.600  0.333
JMF      1.000  1.000  0.667  0.750  0.600  0.667
MSDBN    1.000  1.000  1.000  0.750  0.600  0.833
JNNTF    1.000  1.000  1.000  1.000  1.000  1.000
- NHC [1]: Normalized Histogram Comparison
- JMF [1]: Joint Matrix Factorization
- MSDBN [2]: Multi-Source Deep Belief Network
- JNNTF: Joint Non-Negative Tensor Factorization
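For reference, the F-measure reported in this comparison is the harmonic mean of precision and recall. The sketch below assumes that each method outputs a set of detected outlier IDs that is compared against the set of true outliers; how the detected set is selected is not stated on the slide, and the function name and example IDs are hypothetical.

```python
def f_measure(detected, true_outliers):
    """Harmonic mean of precision and recall for a set of detected outlier IDs."""
    detected, true_outliers = set(detected), set(true_outliers)
    hits = len(detected & true_outliers)
    if hits == 0:
        return 0.0
    precision = hits / len(detected)
    recall = hits / len(true_outliers)
    return 2 * precision * recall / (precision + recall)

# Two of three detections are correct and two of three true outliers are found.
score = f_measure({1, 2, 3}, {1, 2, 4})
```

Here precision and recall are both 2/3, so the F-measure is also 2/3, which is how a value such as 0.667 can arise.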

Hotel Rating Dataset
The distribution of hotels' inconsistency scores in Las Vegas and New York City
[Figure: two panels plotting inconsistency score (0 to 1) against hotel ID, one for Las Vegas and one for New York City.]

Case Study
Rating patterns of the most inconsistent hotel and of a consistent hotel in New York City
[Figure: two panels, "Hotel with highest inconsistency score" and "Hotel with low inconsistency score", each plotting rating (0 to 1) against timestamp cluster (1 to 4) for Orbitz, Priceline, and TripAdvisor.]

Network Traffic Flow Dataset
(a) The distribution of hosts' inconsistency scores
(b) Case study: traffic flow patterns of hosts
[Figure: panel (a) plots inconsistency score (0 to 1) against host ID. Panel (b) plots network traffic against timestamp cluster (1 to 5) for Source 1, Source 2, and Source 3, once for a host with high inconsistency score and once for a host with low inconsistency score.]

Weather Forecast Dataset
(a) The distribution of cities' inconsistency scores
(b) Case study: highest temperature patterns of cities
[Figure: panel (a) plots inconsistency score (0 to 1) against city ID. Panel (b) plots high temperature against timestamp cluster (1 to 5) for HAM, Wund, and WWO, once for the city with the highest inconsistency score and once for a city with low inconsistency score.]

Incremental vs. Offline
(a) Efficiency comparison
(b) Average difference of inconsistency score
[Figure: panel (a) plots running time (s, scale x 10^4) against timestamp (1 to 5) for the offline and incremental methods. Panel (b) plots the average difference of inconsistency score (0 to 0.25) against timestamp (1 to 5), with four curves labeled = 2, = 3, = 4, and = 5.]

The Effect of Considering Missing Data
Running time w.r.t. percentage of missing values
Percentage  JNNTF-MD (s)  JNNTF (s)  Ratio
10%         1009.149      1341.918   75.2%
30%         953.113       1394.353   68.4%
50%         722.393       1444.786   50.0%
70%         460.788       1451.650   31.7%
90%         461.383       1445.306   31.9%
- Percentage: the percentage of entries that are missing
- JNNTF: basic algorithm without considering missing values
- JNNTF-MD: the method considering missing values
- Ratio: RunningTime(JNNTF-MD) / RunningTime(JNNTF)
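As a quick arithmetic check, the ratio column can be recomputed from the two running-time columns. The snippet uses only the numbers reported on the slide; the variable names are illustrative.

```python
# (percentage of missing entries, JNNTF-MD seconds, JNNTF seconds), as reported on the slide
rows = [
    ("10%", 1009.149, 1341.918),
    ("30%", 953.113, 1394.353),
    ("50%", 722.393, 1444.786),
    ("70%", 460.788, 1451.650),
    ("90%", 461.383, 1445.306),
]

# Ratio = RunningTime(JNNTF-MD) / RunningTime(JNNTF), formatted as a percentage
ratios = {pct: f"{100 * md / base:.1f}%" for pct, md, base in rows}
```

The recomputed values agree with the reported ratios: the relative cost of JNNTF-MD falls from about three quarters of the JNNTF running time at 10% missing entries to about one third at 70% and above.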

Conclusions
- Developed a multi-source joint tensor factorization framework for detecting untrustworthy information, which takes the importance of the time dimension into consideration.
- Proposed an incremental factorization to dynamically conduct joint tensor factorization for streaming data.
- Proposed an approach to handle missing values by focusing only on available entries.
- Results on synthetic and real-world datasets show the advantage of the proposed framework.

Thank You! Questions?

Back-up

Tensor Factorization
$$X_s \approx G_s \times_1 U_{s,1} \times_2 U_{s,2} \times_3 U_{s,3}$$
- $G_s \in \mathbb{R}^{C \times K \times D}$ is the core tensor, which represents the latent behavior of each user group on each time cluster for each object.
- $U_{s,1} \in \mathbb{R}^{N_s \times C}$ denotes the partition of $N_s$ users into $C$ groups, where $(U_{s,1})_{ij} \ge 0$ and $\sum_{j=1}^{C} (U_{s,1})_{ij} = 1$.
- $U_{s,2} \in \mathbb{R}^{K \times K}$ is an identity matrix, denoting the objects we are interested in.
- $U_{s,3} \in \mathbb{R}^{T_s \times D}$ denotes the partition of $T_s$ timestamps into $D$ clusters, where $(U_{s,3})_{ij} \ge 0$ and $\sum_{j=1}^{D} (U_{s,3})_{ij} = 1$.
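The role of the three factor matrices can be seen in a small example. The sketch below assumes hard assignments (each user belongs to exactly one group and each timestamp to exactly one cluster), which is one special case of the row-sum-to-one constraint above; the helper names `membership_matrix` and `mode_product` and the toy sizes are hypothetical.

```python
import numpy as np

def membership_matrix(labels, n_clusters):
    """One-hot assignment matrix: rows are nonnegative and sum to 1."""
    U = np.zeros((len(labels), n_clusters))
    U[np.arange(len(labels)), labels] = 1.0
    return U

def mode_product(tensor, matrix, mode):
    """Mode-n product: contract the columns of `matrix` with axis `mode` of `tensor`."""
    out = np.tensordot(matrix, tensor, axes=(1, mode))  # contracted axis is replaced, new axis comes first
    return np.moveaxis(out, 0, mode)                    # move the new axis back into position `mode`

C, K, D = 2, 2, 2                                   # user groups, objects, timestamp clusters
G = np.arange(C * K * D, dtype=float).reshape(C, K, D)
U1 = membership_matrix([0, 0, 1], C)                # 3 users: two in group 0, one in group 1
U2 = np.eye(K)                                      # identity over objects
U3 = membership_matrix([0, 0, 1, 1], D)             # 4 timestamps in 2 clusters

X = mode_product(mode_product(mode_product(G, U1, 0), U2, 1), U3, 2)
```

With hard assignments, every user simply inherits the behavior of its group: the first two users have identical slices in `X`, and timestamps in the same cluster share the same values.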

References for Baselines
[1] Ge, Liang, et al. "Estimating local information trustworthiness via multi-source joint matrix factorization." 2012 IEEE 12th International Conference on Data Mining (ICDM). IEEE, 2012.
[2] Ge, Liang, et al. "Multi-source deep learning for information trustworthiness estimation." Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013.

Hotel Rating Dataset
- Data collection: ratings for 111 common hotels in Las Vegas and 210 common hotels in New York City, from January to December 2013, from three popular travel websites: Orbitz, Priceline, and Tripadvisor.
- Goal: detect hotels that receive inconsistent ratings from Orbitz, Priceline, and Tripadvisor.
- Input (three sources):
  - First dimension: user ID
  - Second dimension: hotel ID (for example, for Las Vegas its dimensionality is 110)
  - Third dimension: month ID (12 months in total)
- Output: inconsistency vector for 110 hotels in Las Vegas and 210 hotels in New York City.

Network Traffic Flow Dataset
- Data collection: the network traffic flow dataset is collected from an enterprise network containing 500 hosts.
- Goal: detect hosts whose network traffic flow is inconsistent across months.
- Input (each month's data is treated as a source):
  - First dimension: each weekday
  - Second dimension: host ID
  - Third dimension: each hour of the day
- Output: inconsistency score vector for 500 hosts.

Weather Forecast Dataset
- Data collection: the highest temperatures of 88 US cities are collected from three platforms, HAM weather (HAM), Wunderground (Wund), and World Weather Online (WWO), from Oct. 7, 2013 to Dec. 17, 2013.
- Goal: detect cities that receive inconsistent predicted highest temperatures from HAM, Wund, and WWO.
- Input (three sources: HAM, Wund, and WWO):
  - First dimension: prediction timestamp ID
  - Second dimension: city ID
  - Third dimension: day ID, from Oct. 7, 2013 to Dec. 17, 2013
- Output: inconsistency score vector for 88 cities.