SDM'15, Vancouver, Canada
Believe It Today or Tomorrow? Detecting Untrustworthy Information from Dynamic Multi-Source Data
Houping Xiao (1), Yaliang Li (1), Jing Gao (1), Fei Wang (2), Liang Ge (3), Wei Fan (4), Long Vu (5), and Deepak Turaga (5)
(1) SUNY at Buffalo; (2) University of Connecticut; (3) Google; (4) Baidu Big Data Lab; (5) IBM T.J. Watson
Outline
  - Motivation
  - Challenges
  - Proposed Two-Step Framework
      Step 1: Joint Nonnegative Tensor Factorization
      Step 2: Inconsistency Score Calculation
  - Experiments
  - Conclusions
Motivation
  Multiple information sources
    - Example: hotel ratings can be obtained from multiple websites, such as Priceline, Orbitz, and Tripadvisor.
  Questions
    - Which piece of information is trustworthy?
    - Which objects receive reliable information?
  Our solution
    - Calculate the degree to which each object receives inconsistent information across sources.
    - The lower the degree of inconsistency, the more reliable the information.
Motivating Example
How to Find Inconsistent Ratings
  Easy comparisons
    - Aggregating the ratings of all users into average ratings loses information.
    - Users are not matched across platforms, so ratings cannot be compared user by user.
  Our solution: identify user groups and compare at the group level
    - In each source, users can be partitioned into groups so that users in the same group share similar rating patterns over objects.
    - The underlying user groups, and the ratings given by each group, should be consistent across sources.
Importance of Time
  Observation: multi-source data can arrive continuously, with constantly changing distributions.
  Motivating example (figure): panels for Sep-Nov, Dec-Feb, Mar-May, and Jun-Aug.
Solutions
  Baseline
    - Model each snapshot separately. This is simple, but the temporal connection between timestamps is lost.
  Our solution
    - Consider behavior at the timestamp-cluster level (e.g., hotel ratings could change seasonally).
    - In each source, timestamps can be clustered; user behavior within the same timestamp cluster should be similar.
Proposed Framework
  Figure: overview of the two-step framework. Data collection yields one user-by-hotel tensor over time per source (Tripadvisor, Orbitz, Priceline).
    Step 1: Joint nonnegative tensor factorization decomposes each tensor into a core tensor, a user-group assignment matrix, an identity matrix, and a timestamp-cluster assignment matrix.
    Step 2: Inconsistency score computation produces the inconsistency vector.
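Joint Nonnegative Tensor Factorization
  Figure: each source tensor $X_s$ (size $N_s \times K \times T_s$) is factorized into a core tensor $G_s$ (size $C \times K \times D$) and factor matrices $U_{s,1}$, $U_{s,2}$, $U_{s,3}$.
  Objective:
  $$\min_{G_s,\, U_{s,i} \ge 0} \; \sum_{s=1}^{M} \left( L_{X_s} + \alpha\, \Omega(G_s, \bar{G}) \right)$$
    - $L_{X_s} = \| X_s - G_s \times_i U_{s,i} \|_F^2$ measures the factorization error of each tensor, where $G_s \times_i U_{s,i}$ abbreviates $G_s \times_1 U_{s,1} \times_2 U_{s,2} \times_3 U_{s,3}$.
    - $\Omega(G_s, \bar{G}) = \| G_s - \bar{G} \|_F^2$, where $\bar{G} = \frac{1}{M} \sum_{s=1}^{M} G_s$, is a regularization term for learning the consensus information.
    - $\alpha$ is the regularization parameter.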
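The objective on this slide can be evaluated with a few lines of NumPy. The following is a minimal sketch, not the authors' implementation: the function names `tucker_reconstruct` and `joint_objective` are illustrative, and the code only computes the objective value for given factors; it does not perform the constrained minimization.

```python
import numpy as np

def tucker_reconstruct(G, U1, U2, U3):
    """Return G x_1 U1 x_2 U2 x_3 U3 as an (N, K, T) array.

    G: (C, K, D) core tensor; U1: (N, C); U2: (K, K); U3: (T, D).
    """
    return np.einsum("ckd,nc,jk,td->njt", G, U1, U2, U3)

def joint_objective(Xs, Gs, Us, alpha):
    """Sum over sources of the factorization error plus alpha times the consensus penalty."""
    G_bar = sum(Gs) / len(Gs)  # consensus core tensor: mean of the per-source cores
    total = 0.0
    for X, G, (U1, U2, U3) in zip(Xs, Gs, Us):
        fit = np.sum((X - tucker_reconstruct(G, U1, U2, U3)) ** 2)  # L_{X_s}
        reg = np.sum((G - G_bar) ** 2)                              # Omega(G_s, G_bar)
        total += fit + alpha * reg
    return float(total)
```

When every source is reconstructed exactly and all core tensors coincide, both terms vanish and the objective is zero; any disagreement between the core tensors is charged through the `alpha` term.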
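Inconsistency Score Computation
  Inconsistency score of object $k$:
  $$I_k = \| S_k - S_{\mathrm{median}} \|_2$$
  where
    - $S_k = (S_k(1), \ldots, S_k(s), \ldots, S_k(M))$
    - $S_k(s) = \mathrm{similarity}\big( G_s(:, k, :),\; \bar{G}(:, k, :) \big)$, with $\bar{G}$ the consensus core tensor (the mean of the $G_s$)
    - $S_{\mathrm{median}} = \mathrm{median}\{ S_k,\; k = 1, \ldots, K \}$
  Figure: the $C \times D$ slice for object $k$ is taken from each core tensor $G_1, \ldots, G_s, \ldots, G_M$ and from $\bar{G}$.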
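The score computation can be sketched as follows. The slide does not say which similarity function is used, so cosine similarity between flattened slices is an assumption here, as are the function names and the choice of an element-wise median over objects.

```python
import numpy as np

def cosine(a, b, eps=1e-12):
    """Cosine similarity between two arrays, flattened (assumed similarity measure)."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def inconsistency_scores(Gs):
    """Return a length-K vector of inconsistency scores from M core tensors of shape (C, K, D)."""
    Gs = np.stack(Gs)                      # (M, C, K, D)
    G_bar = Gs.mean(axis=0)                # consensus core tensor
    M, _, K, _ = Gs.shape
    # S[k, s] = similarity between object k's slice in source s and in the consensus
    S = np.array([[cosine(Gs[s][:, k, :], G_bar[:, k, :]) for s in range(M)]
                  for k in range(K)])
    S_median = np.median(S, axis=0)        # element-wise median over objects, shape (M,)
    return np.linalg.norm(S - S_median, axis=1)
```

An object whose slice in one source departs from the consensus gets a similarity vector far from the median vector, and therefore a high score.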
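Streaming Data
  Observation: multi-source data arrives continuously.
  Solution (here $G \times_i U_{\cdot,i}$ abbreviates the mode products over $i = 1, 2, 3$):
    Step 1: obtain $\{U^{o}_{s,t,i}\}$ at time $T$ based on $\{G_{s,t-1}\}$:
    $$\min_{\{U_{s,t,i}\}} \; \sum_{s=1}^{M} \| X_{s,t} - G_{s,t-1} \times_i U_{s,t,i} \|_F^2$$
    Step 2: use $\{U^{o}_{s,t,i}\}$ to obtain $\{G_{s,t}\}$ at time $T$:
    $$\min_{G_{s,t}} \; \sum_{s=1}^{M} \sum_{t=1}^{T} \| X_{s,t} - G_{s,t} \times_i U^{o}_{s,t,i} \|_F^2 + \alpha \| G_{s,t} - \bar{G}_{T-1} \|_F^2$$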
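Step 2 can be illustrated for a single source and a single timestamp. The slide does not specify an optimizer, so projected gradient descent is an assumption here; the function names are hypothetical, and Step 1 (updating the factor matrices) is omitted.

```python
import numpy as np

def reconstruct(G, U):
    """G x_1 U1 x_2 U2 x_3 U3 for a (C, K, D) core and factors (N, C), (K, K), (T, D)."""
    U1, U2, U3 = U
    return np.einsum("ckd,nc,jk,td->njt", G, U1, U2, U3)

def objective(X, G, U, G_prev_bar, alpha):
    """Fit error plus alpha times the distance to the previous consensus core."""
    return float(np.sum((X - reconstruct(G, U)) ** 2)
                 + alpha * np.sum((G - G_prev_bar) ** 2))

def update_core(X, G, U, G_prev_bar, alpha, n_steps=50):
    """Update a nonnegative core tensor with the factor matrices held fixed."""
    U1, U2, U3 = U
    # Lipschitz constant of the gradient; 1/L is a step size that cannot increase the objective.
    L = 2.0 * (np.prod([np.linalg.norm(Ui, 2) ** 2 for Ui in U]) + alpha)
    for _ in range(n_steps):
        resid = reconstruct(G, U) - X
        grad = 2.0 * np.einsum("njt,nc,jk,td->ckd", resid, U1, U2, U3)
        grad += 2.0 * alpha * (G - G_prev_bar)
        G = np.maximum(G - grad / L, 0.0)  # project onto the nonnegative orthant
    return G
```

Because the factor matrices are fixed, only the small core tensor is updated when new data arrives, which is the source of the incremental variant's speed advantage over refitting everything offline.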
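Missing Data
  Observation: many users may give ratings for only a few hotels, at some specific timestamps.
  Solution: let $K_s = \{ (i, j, k) : X^{s}_{ijk} \text{ is available} \}$ be the set of index triples of the observed entries.
  Objective function (with $G_s \times_i U_{s,i}$ abbreviating the mode products over $i = 1, 2, 3$, and $\bar{G}$ the mean of the $G_s$):
  $$\min_{G_s,\, U_{s,i} \ge 0} \; \sum_{s=1}^{M} \left[ \sum_{(i,j,k) \in K_s} \left( X^{s}_{ijk} - \big[ G_s \times_i U_{s,i} \big]_{ijk} \right)^2 + \alpha \sum_{i,j,k} \left( G^{s}_{ijk} - \bar{G}_{ijk} \right)^2 \right]$$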
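Restricting the fit term to the available entries amounts to a masked squared error. A minimal sketch, assuming missing entries are stored as NaN and that a boolean mask marks the observed triples (both conventions, and the function name, are assumptions rather than details from the slides):

```python
import numpy as np

def masked_fit_error(X, X_hat, mask):
    """Squared error summed only over observed entries (mask == True)."""
    diff = np.where(mask, X - X_hat, 0.0)  # unobserved entries contribute nothing
    return float(np.sum(diff ** 2))

# Toy 1 x 2 x 2 tensor with one missing rating.
X = np.array([[[1.0, np.nan], [3.0, 4.0]]])
X_hat = np.array([[[1.0, 2.0], [2.0, 4.0]]])
mask = ~np.isnan(X)
error = masked_fit_error(X, X_hat, mask)
```

Only the three observed entries enter the sum, so the missing rating neither penalizes nor rewards the reconstruction.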
Experiment Set-up
  Datasets:
    - Synthetic datasets
    - Real-world datasets: Hotel Rating, Network Traffic Flow, Weather Forecast
Effectiveness Comparison
  F-measure comparison w.r.t. outlier percentage

  Method   2%      3%      5%      6%      8%      10%
  NHC      1.000   0.500   0.667   0.500   0.600   0.333
  JMF      1.000   1.000   0.667   0.750   0.600   0.667
  MSDBN    1.000   1.000   1.000   0.750   0.600   0.833
  JNNTF    1.000   1.000   1.000   1.000   1.000   1.000

  NHC [1]: Normalized Histogram Comparison
  JMF [1]: Joint Matrix Factorization
  MSDBN [2]: Multi-Source Deep Belief Network
  JNNTF: Joint Non-Negative Tensor Factorization
Hotel Rating Dataset
  Figure: distribution of hotels' inconsistency scores in Las Vegas and New York City. Two panels, one per city; x-axis: hotel ID; y-axis: inconsistency score (0 to 1).
Case Study
  Figure: rating patterns of the most inconsistent hotel and of a consistent hotel in New York City. Two panels (hotel with highest inconsistency score; hotel with low inconsistency score); x-axis: timestamp cluster (1 to 4); y-axis: rating (0 to 1); one series each for Orbitz, Priceline, and TripAdvisor.
Network Traffic Flow Dataset
  Figure (a): distribution of hosts' inconsistency scores. x-axis: host ID; y-axis: inconsistency score (0 to 1).
  Figure (b): case study of hosts' traffic flow patterns. Two panels (host with high inconsistency score; host with low inconsistency score); x-axis: timestamp cluster (1 to 5); y-axis: network traffic; one series each for Source 1, Source 2, and Source 3.
Weather Forecast Dataset
  Figure (a): distribution of cities' inconsistency scores. x-axis: city ID; y-axis: inconsistency score (0 to 1).
  Figure (b): case study of cities' highest-temperature patterns. Two panels (city with highest inconsistency score; city with low inconsistency score); x-axis: timestamp cluster (1 to 5); y-axis: high temperature; one series each for HAM, Wund, and WWO.
Incremental vs. Offline
  Figure (a): efficiency comparison. x-axis: timestamp (1 to 5); y-axis: running time (s), scale x 10^4; series: Offline and Incremental.
  Figure (b): average difference of inconsistency score. x-axis: timestamp (1 to 5); y-axis: average difference (0 to 0.25); one series for each of the parameter values 2, 3, 4, and 5.
The Effect of Considering Missing Data
  Running time w.r.t. percentage of missing values

  Percentage   JNNTF-MD (s)   JNNTF (s)   Ratio
  10%          1009.149       1341.918    75.2%
  30%           953.113       1394.353    68.4%
  50%           722.393       1444.786    50.0%
  70%           460.788       1451.650    31.7%
  90%           461.383       1445.306    31.9%

  Percentage: the percentage of entries that are missing
  JNNTF: the basic algorithm, which does not consider missing values
  JNNTF-MD: the method that considers missing values
  Ratio: RunningTime(JNNTF-MD) / RunningTime(JNNTF)
Conclusions
  - Developed a multi-source joint tensor factorization framework for detecting untrustworthy information that takes the time dimension into account.
  - Proposed an incremental factorization that performs joint tensor factorization dynamically on streaming data.
  - Proposed an approach that handles missing values by focusing only on the available entries.
  - Results on synthetic and real-world datasets show the advantage of the proposed framework.
Thank You! Questions?
Back-up
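Tensor Factorization
  $$X_s \approx G_s \times_1 U_{s,1} \times_2 U_{s,2} \times_3 U_{s,3}$$
    - $G_s \in \mathbb{R}^{C \times K \times D}$ is the core tensor, which represents the latent behavior of each user group on each time cluster for each object.
    - $U_{s,1} \in \mathbb{R}^{N_s \times C}$ denotes the partition of the $N_s$ users into $C$ groups, where $(U_{s,1})_{ij} \ge 0$ and $\sum_{j=1}^{C} (U_{s,1})_{ij} = 1$.
    - $U_{s,2} \in \mathbb{R}^{K \times K}$ is an identity matrix, denoting the objects we are interested in.
    - $U_{s,3} \in \mathbb{R}^{T_s \times D}$ denotes the partition of the $T_s$ timestamps into $D$ clusters, where $(U_{s,3})_{ij} \ge 0$ and $\sum_{j=1}^{D} (U_{s,3})_{ij} = 1$.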
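A toy example may make the roles of the factors concrete. The sketch below uses hard (one-hot) assignments, which are a special case of the nonnegative, rows-sum-to-one constraint above; the helper name `assignment_matrix` and all sizes and values are made up for illustration.

```python
import numpy as np

def assignment_matrix(labels, n_clusters):
    """One-hot rows: entry (i, j) is 1 if item i belongs to cluster j."""
    U = np.zeros((len(labels), n_clusters))
    U[np.arange(len(labels)), labels] = 1.0
    return U

# Toy sizes: N_s = 4 users in C = 2 groups, K = 3 objects, T_s = 6 timestamps in D = 2 clusters.
U1 = assignment_matrix([0, 0, 1, 1], 2)        # user -> group
U2 = np.eye(3)                                 # objects are kept as they are
U3 = assignment_matrix([0, 0, 0, 1, 1, 1], 2)  # timestamp -> cluster
G = np.arange(12, dtype=float).reshape(2, 3, 2)  # (C, K, D) core tensor

# Mode products G x_1 U1 x_2 U2 x_3 U3 written as a single einsum.
X_hat = np.einsum("ckd,nc,jk,td->njt", G, U1, U2, U3)
```

With one-hot assignments, `X_hat[n, j, t]` is simply the core entry for user `n`'s group, object `j`, and timestamp `t`'s cluster, which is why the core tensor can be read as group-level behavior.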
References for Baselines
  [1] L. Ge et al., "Estimating local information trustworthiness via multi-source joint matrix factorization," in Proc. IEEE 12th International Conference on Data Mining (ICDM), 2012.
  [2] L. Ge et al., "Multi-source deep learning for information trustworthiness estimation," in Proc. 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013.
Hotel Rating Dataset
  Data collection: ratings for 111 common hotels in Las Vegas and 210 common hotels in New York City, from January to December 2013, from three popular travel websites: Orbitz, Priceline, and Tripadvisor.
  Goal: detect hotels that receive inconsistent ratings across Orbitz, Priceline, and Tripadvisor.
  Input (three sources):
    - First dimension: user ID
    - Second dimension: hotel ID (for example, in Las Vegas its dimensionality is 110)
    - Third dimension: month ID (12 months in total)
  Output: inconsistency vectors for the 110 hotels in Las Vegas and the 210 hotels in New York City.
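The input layout described above can be sketched as a user-by-hotel-by-month array per source. The record format `(user, hotel, month, rating)` with zero-based indices, the NaN convention for unrated entries, and the function name are all assumptions for illustration; the slides do not describe the preprocessing code.

```python
import numpy as np

def build_rating_tensor(records, n_users, n_hotels, n_months=12):
    """Place each (user, hotel, month, rating) record into a 3-D array; unrated entries stay NaN."""
    X = np.full((n_users, n_hotels, n_months), np.nan)
    for user, hotel, month, rating in records:
        X[user, hotel, month] = rating
    return X

# Two hypothetical ratings from one source.
records = [(0, 1, 2, 4.0), (1, 0, 11, 2.5)]
X = build_rating_tensor(records, n_users=2, n_hotels=3)
```

One such tensor would be built per website, with the hotel dimension shared across sources and the user dimension allowed to differ.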
Network Traffic Flow Dataset
  Data collection: the network traffic flow dataset is collected from an enterprise network containing 500 hosts.
  Goal: detect hosts whose network traffic flow is inconsistent across months.
  Input (each month is treated as a source):
    - First dimension: weekday
    - Second dimension: host ID
    - Third dimension: hour of the day
  Output: inconsistency score vector for the 500 hosts.
Weather Forecast Dataset
  Data collection: the highest temperatures of 88 US cities are collected from three platforms, HAM weather (HAM), Wunderground (Wund), and World Weather Online (WWO), from Oct. 7, 2013 to Dec. 17, 2013.
  Goal: detect cities that receive inconsistent predicted highest temperatures from HAM, Wund, and WWO.
  Input (three sources: HAM, Wund, and WWO):
    - First dimension: prediction timestamp ID
    - Second dimension: city ID
    - Third dimension: day ID, from Oct. 7, 2013 to Dec. 17, 2013
  Output: inconsistency score vector for the 88 cities.