Time Series Data Cleaning Shaoxu Song http://ise.thss.tsinghua.edu.cn/sxsong/
Dirty Time Series Data
- Unreliable readings: sensor monitoring, GPS trajectory
- J. Freire, A. Bessa, F. Chirigati, H. T. Vo, K. Zhao: Exploring What not to Clean in Urban Data: A Study Using New York City Taxi Trips. IEEE Data Eng. Bull. 39(2): 63-77 (2016)
Dirty Time Series Data
- Misuse
- Flight: accuracy of Travelocity is 0.95
- Stock: accuracy of stock data in Yahoo! Finance is 0.93
- Xian Li, Xin Luna Dong, Kenneth B. Lyons, Weiyi Meng, Divesh Srivastava: Truth Finding on the deep web: Is the problem solved? PVLDB, 6(2) (2013)
Existing cleaning methods: Smoothing Filter
- Moving Average, WMA, EWMA
- Problem: modifies almost all the data values
- Figure: Observation, Smooth and Truth series (value 0 to 3 over time points 1 to 31)
- E. S. Gardner Jr. Exponential smoothing: The state of the art, Part II. International Journal of Forecasting, 22(4):637-666, 2006.
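The problem named on this slide can be seen in a few lines. The following is a minimal sketch of EWMA smoothing, assuming a smoothing factor `alpha = 0.5` and a made-up series with a single spike; neither the function name nor the data comes from the talk.

```python
def ewma(values, alpha=0.5):
    """Exponentially weighted moving average (illustrative helper)."""
    smoothed = [values[0]]
    for v in values[1:]:
        # Each output mixes the new observation with the previous output.
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical data: the truth is constant 1.0, with one spike at index 3.
observed = [1.0, 1.0, 1.0, 3.0, 1.0, 1.0, 1.0]
smoothed = ewma(observed)

# Count how many points the filter altered.
changed = sum(1 for o, s in zip(observed, smoothed) if abs(o - s) > 1e-9)
print(smoothed, changed)
```

In this toy run the spike is only pulled part of the way back, while the correct points after it are also altered, which is the behaviour the slide criticises.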
Existing cleaning methods: Prediction Model
- Autoregressive (AR) model, AR(I)MA
- Modify the observation to the prediction if the prediction is far from the observation
- Problem: may over-change the data, owing to the "far distant" criterion
- Figure: Observation, AR and Truth series (value 0 to 3 over time points 1 to 31)
- Yamanishi, Kenji, and Jun-ichi Takeuchi. "A unifying framework for detecting outliers and change points from non-stationary time series data." In SIGKDD, pages 676-681, 2002
Repairing dirty data helps
- Application: time series classification
- Figure: classification accuracy (0 to 1) per dataset (SyntheticControl, GunPoint, CBF, FaceAll, OSULeaf, SwedishLeaf, 50words, Trace, TwoPatterns, wafer) for Clean, Dirty, IMR, SCREEN, EWMA, ARX and AR
Contents
- Constraint-based method (SIGMOD 2015): large spike errors
- Statistical method (SIGMOD 2016): small errors
- Supervised method (VLDB 2017): consecutive errors
- Figure: Observation and Truth series over time points 1 to 31
Intuition on Speed Constraints
- The jump of values is often constrained
- Daily limit in financial and commodity markets
- Temperatures in a week
- Fuel consumption
- Use speed constraints to identify dirty data
- Figure: example series (value 0 to 3 over time points 1 to 31)
SCREEN: Stream Data Cleaning under Speed Constraints
- Given: a time series $x = \{x_1, x_2, \dots\}$ with timestamps $t_i$, and constraints $s = (s_{\min}, s_{\max})$ on the min/max speeds within a window $w$
- Find: a repair $x'$ of $x$
- Constraint satisfaction: for all $i, j$ with $0 < t_j - t_i \le w$, $s_{\min} \le \frac{x'_j - x'_i}{t_j - t_i} \le s_{\max}$
- Change minimization: $\sum_i |x_i - x'_i|$ is minimized
- Figure: value over time; given a repaired point $x'_k$, a point at $t_i$ within the window $w$ must lie between $x'_k + s_{\min}(t_i - t_k)$ and $x'_k + s_{\max}(t_i - t_k)$. Series: Dirty, Smooth, Screen
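To make the constraint-satisfaction condition concrete, here is a small sketch that lists the pairs of points violating a speed constraint. The function name, the sample values and the chosen bounds are illustrative assumptions, and timestamps are assumed to be strictly increasing.

```python
def speed_violations(t, x, s_min, s_max, w):
    """Return index pairs (i, j) with 0 < t[j] - t[i] <= w whose speed
    (x[j] - x[i]) / (t[j] - t[i]) falls outside [s_min, s_max]."""
    violations = []
    n = len(t)
    for i in range(n):
        for j in range(i + 1, n):
            if t[j] - t[i] > w:
                break  # timestamps are sorted, so later points are outside the window
            speed = (x[j] - x[i]) / (t[j] - t[i])
            if speed < s_min or speed > s_max:
                violations.append((i, j))
    return violations

# Hypothetical series with one spike at index 2.
t = [1, 2, 3, 4]
x = [1.0, 1.1, 5.0, 1.3]
found = speed_violations(t, x, s_min=-0.5, s_max=0.5, w=1)
print(found)
```

With a window of one time unit, only the two pairs that touch the spike are reported; a repair is a sequence for which this list is empty.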
Employ Existing Repairing Approach
- Holistic algorithm: repairing relational data under denial constraints
- Adaptation: treat the time series as a relation, and express speed constraints roughly by denial constraints
  $\neg(t_j < t_i + w \wedge x_j > x_i + (t_j - t_i)\, s_{\max})$
  $\neg(t_j < t_i + w \wedge x_j < x_i + (t_j - t_i)\, s_{\min})$
- Example relation:
  ID  Timestamp  Value
  1   1          1.13
  2   2          1.24
  3   3          1.19
  4   4          1.3
- Problem: high computational costs; not guaranteed to eliminate all violations
- X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, pages 458-469, 2013.
A Lightweight Weapon
- Unlike the NP-hard problems in most data repairing scenarios, speed constraint-based repairing can be solved as an LP problem in $O(n^{3.5} L)$
- Global optimum: considers the entire sequence as a whole
- Online computing over streaming data: consider the local optimum in the current sliding window, using the Median Principle in $O(nw)$ time
- Figure: repair cost and value over time for a point $x_k$ at $t_k$ and a point $x_i$ at $t_i$ within the window $w$; candidates $x_i + s_{\min}(t_k - t_i)$ and $x_i + s_{\max}(t_k - t_i)$, with $x_k^{mid}$ and $x_k$ marked
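The online idea can be illustrated with a deliberately simplified sketch. It is not the SCREEN algorithm itself: it drops the look-ahead over the sliding window and only takes the median of the lower bound, the observation and the upper bound implied by the previously repaired point. The function name and data are assumptions made for illustration.

```python
def clamp_repair(t, x, s_min, s_max):
    """Simplified online repair: move each point into the range allowed
    by the previous repaired point (window look-ahead is omitted)."""
    repaired = [x[0]]  # assumption: the first observation is trusted
    for k in range(1, len(x)):
        dt = t[k] - t[k - 1]
        low = repaired[-1] + s_min * dt
        high = repaired[-1] + s_max * dt
        # The median of (low, x[k], high) is x[k] clamped into [low, high].
        repaired.append(sorted((low, x[k], high))[1])
    return repaired

# Hypothetical series with one spike at index 2.
t = [1, 2, 3, 4, 5, 6]
x = [1.0, 1.1, 5.0, 1.3, 1.4, 1.5]
repaired = clamp_repair(t, x, s_min=-0.5, s_max=0.5)
print(repaired)
```

Only the spike is changed, and it is moved to the largest value the constraint allows. Trusting the first observation is a weakness of this simplification; considering the points in the window, as the slide describes, is what addresses it.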
Effectiveness and Efficiency
- Global: the highest accuracy
- Local: much faster than Holistic
- Trade-off between accuracy and time cost
- Figures: RMS error vs. error rate, and time cost (s, log scale) vs. error rate, for Global, Local, Holistic and EWMA
Further Issue
- Speed constraint-based method
- Large spike error: modified to the max/min values allowed
- Small error: fails to identify
- Figure: Observation, SCREEN and Truth series (value 365 to 385 over time points 1540 to 1575)
Intuition on Speed Change
- Consider the likelihood of speeds within the allowed range $[s_{\min}, s_{\max}]$
- No clear distribution pattern is observed on speeds
- Interesting pattern on speed changes in consecutive points
- Example: speeds $v_2 = -3$ and $v_3 = 1$ give the speed change $u_3 = v_3 - v_2 = 4$
- Figure: probability histogram of speeds (m/s), bins up to 11 m/s
- Figure: p.m.f. and p.d.f. of speed changes, bins of width 0.8 from -6.0 to 6.8
Statistical Approach
- Calculate the likelihood of a sequence w.r.t. the speed change, employing the probability distribution of speed changes
- The cleaning problem is thus to find a repaired sequence with the maximum likelihood about speed change, instead of the minimum change towards speed constraint satisfaction
- Figure: Observation, SCREEN, Likelihood and Truth series (value 365 to 385 over time points 1540 to 1575)
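The likelihood of a sequence with respect to speed change can be sketched as follows. This is only an illustration under stated assumptions: speed is taken as the slope between consecutive points, speed change as the difference of consecutive speeds, and the distribution as an empirical histogram with a hypothetical bin width and a small floor probability for unseen bins. All names and data are made up.

```python
import math
from collections import Counter

def speed_changes(t, x):
    """Differences between consecutive speeds of the series."""
    speeds = [(x[i + 1] - x[i]) / (t[i + 1] - t[i]) for i in range(len(x) - 1)]
    return [speeds[i] - speeds[i - 1] for i in range(1, len(speeds))]

def fit_pmf(changes, bin_width):
    """Empirical p.m.f. of speed changes over bins of the given width."""
    counts = Counter(round(u / bin_width) for u in changes)
    return {b: c / len(changes) for b, c in counts.items()}

def log_likelihood(t, x, pmf, bin_width, floor=1e-6):
    """Sum of log-probabilities of the speed changes; unseen bins get `floor`."""
    return sum(math.log(pmf.get(round(u / bin_width), floor))
               for u in speed_changes(t, x))

# Hypothetical training data: a clean series moving at constant speed 1.
pmf = fit_pmf(speed_changes(list(range(8)), [float(v) for v in range(8)]), 0.5)

t = [0, 1, 2, 3, 4]
ll_clean = log_likelihood(t, [0.0, 1.0, 2.0, 3.0, 4.0], pmf, 0.5)
ll_dirty = log_likelihood(t, [0.0, 1.0, 2.4, 3.0, 4.0], pmf, 0.5)  # small error at index 2
print(ll_clean, ll_dirty)
```

The small error keeps every speed within a loose bound such as [0, 2], so a speed constraint of that kind would not flag it, yet it lowers the likelihood sharply. A maximum-likelihood repair would prefer the sequence with the higher score.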
Maximum likelihood repair problem
- Given: a time series $x$, a repair cost budget $\delta$, and a distribution on speed changes
- Find: a repair $x'$ of $x$ such that the repair cost $\Delta(x, x') \le \delta$ and the likelihood $L(x')$ is maximized
- Complexity: NP-hard; pseudo-polynomial time solvable
- DP, dynamic programming: $O(n\,\theta_{\max}^{3}\,\delta)$; exact
- DPC, constant-factor approximation: $O(n^{2}\,\theta_{\max}^{3})$; for a large budget
- DPL, linear-time heuristics (linear in $n$): fast, higher error
- QP, quadratic programming: approximates the distribution
- SG, simple greedy: $O(\max(n, \delta))$; fastest
Effectiveness and Efficiency
- Significantly better accuracy than SCREEN
- SG is efficient, comparable to SCREEN, and still has better accuracy
- Figures: RMS error vs. data size, and time cost (s, log scale) vs. data size (1000 to 13000), for DP, DPC, DPL, SG, SCREEN and QP
Consecutive Errors
- Speed constraints handle spike errors well, but not consecutive ones
- Figure: Truth, Observation and SCREEN series (value 23.5 to 27.5 over time points 360 to 390)
Intuition
- Supervised by the labeled truth of errors
- Labeling by user: a check-in gives the labeled truth for an erroneous location
- Labeling by machine: precise equipment reports accurate air quality data in a relatively long sensing period, while crowd and participatory sensing generates unreliable observations in a constant manner
- Y. Zheng, F. Liu, and H. Hsieh. U-air: when urban air quality inference meets big data. In KDD, pages 1436-1444, 2013.
Approach
- Instead of modeling the values directly with an AR (autoregressive) model, which ignores the erroneous observations
- We model and predict the difference between errors and their corresponding labeled truths with an ARX model (autoregressive model with exogenous inputs)
- Figure: Observation, IMR and AR series (value 23.5 to 27.5 over time points 360 to 390)
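The contrast between the two models can be written out as follows. This is a sketch of the usual forms rather than a quotation from the slides, and the notation is an assumption: $x_t$ is the observation, $y_t$ the labeled truth (or current repair), $x'_t$ and $y'_t$ the predictions, $p$ the order, $\phi_i$ the coefficients, $c$ a constant and $\epsilon_t$ noise.

```latex
\begin{align}
\text{AR}(p):\quad  & x'_t = c + \sum_{i=1}^{p} \phi_i \, x_{t-i} + \epsilon_t \\
\text{ARX}(p):\quad & y'_t = x_t + \sum_{i=1}^{p} \phi_i \, (y_{t-i} - x_{t-i}) + \epsilon_t
\end{align}
```

Under this reading, AR predicts the value itself from past observations, whereas ARX starts from the current observation $x_t$ and predicts only the offset $y_t - x_t$ from the offsets at earlier points.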
Iterative Minimum Repair (IMR)
- Repairs iteratively, rather than in chronological order
- Each iteration minimally changes one point at a time, to obtain the most confident repair
- Only the high-confidence repairs of earlier iterations can help later repairs
- Major concerns: convergence; incremental computation among iterations
- Figure: Observation, IMR and ARX series (value 23.5 to 27.5 over time points 360 to 390)
Dealing with consecutive errors
- IMR shows significantly better results when there is a large number of consecutive errors
- Figure: RMS error (0 to 1) vs. number of consecutive errors (0 to 30) for IMR, SCREEN, EWMA, ARX and AR
Future Study
- More error types
- Periodical
- Timestamp error: a single ride takes 20 years
- http://gd.sina.com.cn/news/sz/2016-07-08/detail-ifxtwihp9790656.shtml
1. Shaoxu Song, Aoqian Zhang, Jianmin Wang, Philip S. Yu. SCREEN: Stream Data Cleaning under Speed Constraints. ACM SIGMOD International Conference on Management of Data, SIGMOD, 2015.
2. Aoqian Zhang, Shaoxu Song, Jianmin Wang. Sequential Data Cleaning: A Statistical Approach. ACM SIGMOD International Conference on Management of Data, SIGMOD, 2016.
3. Aoqian Zhang, Shaoxu Song, Jianmin Wang, Philip S. Yu. Time Series Data Cleaning: From Anomaly Detection to Anomaly Repairing. International Conference on Very Large Data Bases, VLDB, 2017.
Thanks
Full text available at http://ise.thss.tsinghua.edu.cn/sxsong/