Time Series Data Cleaning


Time Series Data Cleaning Shaoxu Song http://ise.thss.tsinghua.edu.cn/sxsong/

Dirty Time Series Data: Unreliable Readings
- Sensor monitoring
- GPS trajectory
J. Freire, A. Bessa, F. Chirigati, H. T. Vo, K. Zhao: Exploring What not to Clean in Urban Data: A Study Using New York City Taxi Trips. IEEE Data Eng. Bull. 39(2): 63-77 (2016)

Dirty Time Series Data: Misuse
- Flight: accuracy of Travelocity is 0.95
- Stock: accuracy of stock data in Yahoo! Finance is 0.93
Xian Li, Xin Luna Dong, Kenneth B. Lyons, Weiyi Meng, Divesh Srivastava: Truth Finding on the Deep Web: Is the Problem Solved? PVLDB, 6(2) (2013)

Existing cleaning methods: Smoothing Filter
- Moving Average, WMA, EWMA
- Problem: smoothing modifies almost all the data values
[Figure: Observation, Smooth, and Truth series]
E. S. Gardner Jr. Exponential smoothing: The state of the art - Part II. International Journal of Forecasting, 22(4):637-666, 2006.
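To make the smoothing problem concrete, here is a minimal EWMA sketch. The function name `ewma` and the smoothing factor `alpha` are illustrative choices, not taken from the slides. It shows how a single spike drags the clean points after it away from their observed values.

```python
def ewma(x, alpha):
    """Exponentially weighted moving average of the sequence x.

    alpha in (0, 1] is an assumed smoothing factor: larger values
    follow the observations more closely.
    """
    smoothed = [x[0]]
    for value in x[1:]:
        # Blend the new observation with the previous smoothed value.
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


observed = [1.0, 1.0, 1.0, 5.0, 1.0, 1.0]  # one spike at index 3
smoothed = ewma(observed, alpha=0.5)
# Count how many points no longer equal their observation.
changed = sum(1 for o, s in zip(observed, smoothed) if o != s)
```

In this toy run, the spike and both clean points after it are changed, which is the side effect the slide points out.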

Existing cleaning methods: Prediction Model
- Replace the observation with the prediction if the prediction is far from the observation
- Autoregressive (AR) model, AR(I)MA
- Problem: may over-change the data, since a repair is made only when the prediction is far from the observation
[Figure: Observation, AR, and Truth series]
Yamanishi, Kenji, and Jun-ichi Takeuchi. "A unifying framework for detecting outliers and change points from non-stationary time series data." In SIGKDD, pages 676-681, 2002.
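A prediction-based repair can be sketched with an AR(1) model. This is only an illustration under stated assumptions: the coefficient `phi`, intercept `c`, and threshold `tau` are passed in by hand here, whereas in practice they would be estimated from data, and the helper name `ar1_repair` is hypothetical.

```python
def ar1_repair(x, phi, c, tau):
    """Replace x[t] by the AR(1) prediction when the two are far apart.

    phi, c: assumed AR(1) coefficient and intercept (not fitted here).
    tau:    assumed distance threshold that triggers a repair.
    """
    repaired = [x[0]]  # the first point has no history, keep it as is
    for t in range(1, len(x)):
        prediction = c + phi * repaired[-1]
        if abs(prediction - x[t]) > tau:
            repaired.append(prediction)  # far from prediction: overwrite
        else:
            repaired.append(x[t])        # close enough: keep observation
    return repaired


result = ar1_repair([1.0, 1.1, 5.0, 1.2], phi=1.0, c=0.0, tau=1.0)
```

The repaired value is the model's prediction, not the value closest to the observation, which is why such methods can change the data more than necessary.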

Repairing dirty data helps
- Application: time series classification
[Figure: classification accuracy per dataset (SyntheticControl, GunPoint, CBF, FaceAll, OSULeaf, SwedishLeaf, 50words, Trace, TwoPatterns, wafer) for Clean, Dirty, IMR, SCREEN, EWMA, ARX, and AR]

Contents
- Constraint-based method (SIGMOD 2015): large spike errors
- Statistical method (SIGMOD 2016): small errors
- Supervised method (VLDB 2017): consecutive errors
[Figure: Observation and Truth series]

Intuition on Speed Constraints
- Jumps in value are often constrained
  - Daily limit in financial and commodity markets
  - Temperatures in a week
  - Fuel consumption
- Use speed constraints to identify dirty data

SCREEN: Stream Data Cleaning under Speed Constraints
Given
- Time series $x = \{x_1, x_2, \dots\}$, where $x_i$ is observed at time $t_i$
- Constraints $s = (s_{\min}, s_{\max})$ on min/max speeds, within a window $w$
Find a repair $x'$ of $x$
- Constraint satisfaction: for all $i, j$ with $0 < t_j - t_i \le w$, $s_{\min} \le \frac{x'_j - x'_i}{t_j - t_i} \le s_{\max}$
- Change minimization: $\sum_i |x_i - x'_i|$ is minimized
[Figure: Dirty, Smooth, and Screen series over Time; within the window $w$ after $t_k$, the allowed values at $t_i$ lie between $x'_k + s_{\min}(t_i - t_k)$ and $x'_k + s_{\max}(t_i - t_k)$]
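The constraint-satisfaction condition can be checked directly. The sketch below lists every pair of points within the window whose speed falls outside the allowed range. The function name `speed_violations` is illustrative, and timestamps are assumed to be strictly increasing.

```python
def speed_violations(t, x, s_min, s_max, w):
    """Return index pairs (i, j), i < j, with 0 < t[j] - t[i] <= w whose
    speed (x[j] - x[i]) / (t[j] - t[i]) lies outside [s_min, s_max].

    Assumes t is strictly increasing.
    """
    violations = []
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dt = t[j] - t[i]
            if dt > w:
                break  # later points are even further away in time
            speed = (x[j] - x[i]) / dt
            if speed < s_min or speed > s_max:
                violations.append((i, j))
    return violations


pairs = speed_violations(
    t=[0, 1, 2, 3], x=[1.0, 1.1, 5.0, 1.2], s_min=-0.5, s_max=0.5, w=1
)
```

Both pairs that involve the spike at index 2 are reported, while the clean pair (0, 1) is not.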

Employ Existing Repairing Approach
Holistic algorithm
- Repairs relational data under denial constraints
Adaption
- Treat the time series as a relation
- Express speed constraints roughly by denial constraints:
  $\neg(t_j < t_i + w \wedge x_j > x_i + (t_j - t_i)\, s_{\max})$
  $\neg(t_j < t_i + w \wedge x_j < x_i + (t_j - t_i)\, s_{\min})$
Problem
- High computational costs
- Not guaranteed to eliminate all violations

ID  Timestamp  Value
1   1          1.13
2   2          1.24
3   3          1.19
4   4          1.3

X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, pages 458-469, 2013.

A Lightweight Weapon
- Unlike the NP-hard problems in most data repairing scenarios, speed constraint-based repairing can be solved as an LP problem in $O(n^{3.5} L)$ time
  - Considers the entire sequence as a whole (global optimum)
- Online computing over streaming data
  - Considers the local optimum in the current sliding window
  - Uses the Median Principle, in $O(nw)$ time
[Figure: repair cost versus candidate value of $x_k$, with candidates $x_i + s_{\min}(t_k - t_i)$ and $x_i + s_{\max}(t_k - t_i)$ and the median candidate $x_k^{mid}$]
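As a rough illustration of online repair under speed constraints, the sketch below clamps each new point into the range reachable from the previously repaired point. This is a deliberately simplified greedy stand-in, not the Median Principle algorithm of the SCREEN paper: it trusts the first point, ignores the window $w$ (it assumes consecutive points always fall within it), and does not look ahead. The name `clamp_repair` is hypothetical.

```python
def clamp_repair(t, x, s_min, s_max):
    """Greedy one-pass repair: keep x[k] if it is reachable from the
    previous repaired point under the speed limits, otherwise move it
    to the nearest allowed bound.

    Simplifying assumptions (not from the slides): the first point is
    correct, and every consecutive gap t[k] - t[k-1] is within the window.
    """
    repaired = [x[0]]
    for k in range(1, len(x)):
        dt = t[k] - t[k - 1]
        lower = repaired[-1] + s_min * dt  # lowest value allowed at t[k]
        upper = repaired[-1] + s_max * dt  # highest value allowed at t[k]
        repaired.append(min(max(x[k], lower), upper))
    return repaired


t = [0, 1, 2, 3, 4, 5]
x = [1.0, 1.1, 1.2, 5.0, 1.4, 1.5]
fixed = clamp_repair(t, x, s_min=-0.5, s_max=0.5)
```

Only the spike at index 3 is changed, and it is moved to the maximum value the constraint allows. Because the bounds are linear in time, satisfying the constraint between consecutive repaired points also satisfies it between any earlier pair.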

Effectiveness and Efficiency
- Global: the highest accuracy
- Local: much faster than Holistic
- Trade-off
[Figure: RMS error and time cost (s) versus error rate for Global, Local, Holistic, and EWMA]


Further Issue
Speed constraint-based method
- Large spike error: modified to the max/min values allowed
- Small error: fails to identify
[Figure: Observation, SCREEN, and Truth series]

Intuition on Speed Change
- Consider the likelihood of speeds within the allowed range
  - No clear distribution pattern is observed on speeds
- Interesting pattern on speed changes in consecutive points
  - Example: $v_3 = 1$, $v_2 = -3$, so the speed change is $u_3 = v_3 - v_2 = 4$
[Figure: probability distribution of speed (m/s), and p.m.f. and p.d.f. of speed change]

Statistical Approach
- Calculate the likelihood of a sequence w.r.t. the speed change
  - Employ the probability distribution of speed changes
- The cleaning problem is thus to find a repaired sequence with the maximum likelihood about speed change, instead of the minimum change towards speed constraint satisfaction
[Figure: Observation, SCREEN, Likelihood, and Truth series]
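The likelihood of a sequence with respect to speed changes can be sketched as follows. This is a minimal illustration under assumptions: the distribution is an empirical histogram with an assumed `bin_width`, unseen bins get an assumed small `floor` probability, and all function names are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter


def speed_changes(t, x):
    """Speed changes u_i = v_i - v_{i-1}, where v_i is the speed between
    consecutive points."""
    v = [(x[i] - x[i - 1]) / (t[i] - t[i - 1]) for i in range(1, len(x))]
    return [v[i] - v[i - 1] for i in range(1, len(v))]


def fit_pmf(changes, bin_width):
    """Empirical p.m.f. over speed-change bins (assumed histogram model)."""
    counts = Counter(round(u / bin_width) for u in changes)
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}


def log_likelihood(t, x, pmf, bin_width, floor=1e-6):
    """Sum of log-probabilities of the speed changes of x; bins never
    seen when fitting get the assumed floor probability."""
    return sum(
        math.log(pmf.get(round(u / bin_width), floor))
        for u in speed_changes(t, x)
    )


t = list(range(8))
clean = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
dirty = [0.0, 1.0, 2.0, 3.6, 4.0, 5.0, 6.0, 7.0]  # small error at index 3
pmf = fit_pmf(speed_changes(t, clean), bin_width=0.5)
ll_clean = log_likelihood(t, clean, pmf, bin_width=0.5)
ll_dirty = log_likelihood(t, dirty, pmf, bin_width=0.5)
```

The small error keeps every speed below 2, so a speed constraint with $s_{\max} = 2$ would not flag it, yet it produces unlikely speed changes and therefore a lower likelihood.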

Maximum likelihood repair problem
Given
- Time series $x$
- Repair cost budget $\delta$
- Distribution on speed changes
Find a repair $x'$ of $x$ such that
- $\Delta(x, x') \le \delta$
- The likelihood $L(x')$ is maximized
The problem is NP-hard, but pseudo-polynomial time solvable.
Algorithms
- DP, dynamic programming: $O(n\theta_{\max}^{3}\delta)$, exact
- DPC, constant-factor approximation: $O(n^{2}\theta_{\max}^{3})$, for a large budget
- DPL, linear-time heuristic: linear in $n$ for a fixed $d$; fast, higher error
- QP, quadratic programming: approximates the distribution
- SG, simple greedy: $O(\max(n, \delta))$, fastest

Effectiveness and Efficiency
- Significantly better accuracy than SCREEN
- SG is efficient, comparable to SCREEN, and still with better accuracy
[Figure: RMS error and time cost (s) versus data size for DP, DPC, DPL, SG, SCREEN, and QP]


Consecutive Errors
- Speed constraints handle spike errors well, but not consecutive ones
[Figure: Truth, Observation, and SCREEN series; Value versus Time]

Intuition
Supervised by labeled truth of errors
- Labeling by user
  - Check-in: labeled truth versus erroneous location
- Labeling by machine
  - Precise equipment reports accurate air quality data in a relatively long sensing period
  - Crowd and participatory sensing generates unreliable observations in a constant manner
Y. Zheng, F. Liu, and H. Hsieh. U-Air: when urban air quality inference meets big data. In KDD, pages 1436-1444, 2013.

Approach
- Instead of directly modeling the values with an AR (autoregressive) model, ignoring erroneous observations,
- we model and predict the difference between errors and their corresponding labeled truths with an ARX model (autoregressive model with exogenous inputs)
[Figure: Observation, IMR, and AR series; Value versus Time]

Iterative Minimum Repair (IMR)
- Rather than repairing in chronological order, iterative repairing minimally changes one point at a time to obtain the most confident repair
  - Only high-confidence repairs in the former iterations can help the latter repairing
- Major concerns
  - Convergence
  - Incremental computation among iterations
[Figure: Observation, IMR, and ARX series; Value versus Time]
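The iteration can be sketched for an order-1 model on the differences $y_i = x'_i - x_i$ between repairs and observations. This is a simplified illustration under assumptions, not the authors' implementation: the order is fixed to 1, the coefficient is re-estimated by plain least squares in every round, and the threshold `tau`, the cap `max_iter`, and the name `imr_order1` are all hypothetical.

```python
def imr_order1(x, labels, tau=0.1, max_iter=100):
    """Iteratively repair x, changing one unlabeled point per round.

    labels: dict mapping index -> labeled truth value.
    tau:    assumed minimum change for a repair to be accepted.
    """
    n = len(x)
    # y[i] is the current repair offset; nonzero only at labeled errors.
    y = [0.0] * n
    for i, truth in labels.items():
        y[i] = truth - x[i]
    for _ in range(max_iter):
        # Least-squares estimate of phi in y[i] ~ phi * y[i-1].
        num = sum(y[i] * y[i - 1] for i in range(1, n))
        den = sum(y[i - 1] ** 2 for i in range(1, n))
        if den == 0:
            break  # nothing to learn from: no offsets known yet
        phi = num / den
        # Among unlabeled points, pick the smallest change that is >= tau.
        best = None
        for i in range(1, n):
            if i in labels:
                continue  # labeled truths are never overwritten
            candidate = phi * y[i - 1]
            change = abs(candidate - y[i])
            if change >= tau and (best is None or change < best[0]):
                best = (change, i, candidate)
        if best is None:
            break  # converged: no remaining repair exceeds tau
        y[best[1]] = best[2]
    return [xi + yi for xi, yi in zip(x, y)]


observed = [10.0, 10.0, 10.0, 12.0, 12.0, 12.0, 12.0, 10.0, 10.0, 10.0]
labels = {3: 10.0, 4: 10.0}  # two labeled truths inside the error run
repaired = imr_order1(observed, labels)
```

Choosing the minimum accepted change in each round is what makes the repair start next to the labeled points and spread outward, rather than proceeding in chronological order.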

Dealing with consecutive errors
- IMR shows significantly better results when there is a large number of consecutive errors
[Figure: RMS error versus number of consecutive errors for IMR, SCREEN, EWMA, ARX, and AR]


Future Study
More error types
- Periodical
- Timestamp error
  - A single ride takes 20 years: http://gd.sina.com.cn/news/sz/2016-07-08/detail-ifxtwihp9790656.shtml

1. Shaoxu Song, Aoqian Zhang, Jianmin Wang, Philip S. Yu. SCREEN: Stream Data Cleaning under Speed Constraints. ACM SIGMOD International Conference on Management of Data, SIGMOD, 2015.
2. Aoqian Zhang, Shaoxu Song, Jianmin Wang. Sequential Data Cleaning: A Statistical Approach. ACM SIGMOD International Conference on Management of Data, SIGMOD, 2016.
3. Aoqian Zhang, Shaoxu Song, Jianmin Wang, Philip S. Yu. Time Series Data Cleaning: From Anomaly Detection to Anomaly Repairing. International Conference on Very Large Data Bases, VLDB, 2017.

Thanks
Full text available at http://ise.thss.tsinghua.edu.cn/sxsong/