Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques

Mahsa Orang, Nematollaah Shiri
27th International Conference on Scientific and Statistical Database Management

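Introduction: Uncertain Time Series
- Standard time series: $u = \langle u_1, \dots, u_n \rangle$, one value per timestamp.
- Uncertain time series: $U = \langle U_1, \dots, U_n \rangle$, where the value at each timestamp is uncertain.
[Figure: value versus timestamp for a standard time series and an uncertain time series.]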

Introduction: Uncertain Time Series
- Model: for every timestamp $i$, $U_i = t_i + E_i$, where $U_i$ is the observed value, $t_i$ the exact value, and $E_i$ the error.
- Where uncertainty arises: multiple readings, forecasting techniques, wireless sensor networks, medical data analysis, privacy concerns, data collection errors, location-based services.

Introduction: Similarity Search Approaches
- Traditional similarity measures use the values only.
- Uncertain similarity measures use the values plus statistical information.

Introduction: Similarity Search Approaches
- Traditional similarity measures outperform uncertain similarity measures [DNMP,12].
- Why? By using a preprocessing step.

Introduction: Preprocessing
- Filtering turns an uncertain time series into a standard time series that is smoother but farther from the exact values.
- Filtering + normalization: closer to the exact values.

Motivation
- Can preprocessing improve the performance of uncertain similarity measures?
- Will uncertain similarity measures outperform traditional ones when preprocessing techniques are used?

Outline
- Uncertain Similarity Measures
- Preprocessing Techniques
- Experimental Evaluation

Uncertain Similarity Measures (Orang & Shiri [OS,14])
- Deterministic: $D(X, Y) = c$, a constant. Example: DUST, Sarangi et al. [SM,10].
- Probabilistic: $D(X, Y) = Z$, a random variable, queried as $P(Z \le c) \ge p$. Examples: PROUD, Yeh et al. [YWY,09]; Uncertain Correlation, Orang & Shiri [OS,12].

Uncertain Similarity Measures: Probabilistic Queries
- Distance query: $P(\mathrm{Eucl}(X, Q) \le d) \ge p$
- Correlation query: $P(\mathrm{Corr}(X, Q) \ge c) \ge p$
- Given a dataset D of uncertain time series, an uncertain time series Q, a similarity threshold s, and a probability threshold p, a probabilistic query searches for the uncertain time series X in D whose similarity to Q is higher than s with a confidence of at least p.
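A probabilistic correlation query of the kind defined above can be approximated by sampling. The sketch below is a Monte Carlo stand-in and not the method of [OS,12]: it assumes independent normal errors with one known standard deviation per series, and the names `pearson`, `prob_corr_at_least`, and `answers_query` are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: treat a constant series as uncorrelated
    return sxy / math.sqrt(sxx * syy)

def prob_corr_at_least(obs_x, obs_q, sd_x, sd_q, c, trials=2000, seed=0):
    """Estimate P(Corr(X, Q) >= c) by resampling around the observations.

    Assumption: each value is normal around its observation, with standard
    deviation sd_x (for X) or sd_q (for Q), independently per timestamp.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample_x = [rng.gauss(v, sd_x) for v in obs_x]
        sample_q = [rng.gauss(v, sd_q) for v in obs_q]
        if pearson(sample_x, sample_q) >= c:
            hits += 1
    return hits / trials

def answers_query(obs_x, obs_q, sd_x, sd_q, c, p, **kwargs):
    """True if X satisfies the query P(Corr(X, Q) >= c) >= p."""
    return prob_corr_at_least(obs_x, obs_q, sd_x, sd_q, c, **kwargs) >= p
```

With small error standard deviations the estimate approaches the answer of the traditional query Corr(x, q) >= c; as the errors grow, fewer series reach a high probability threshold p.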

Preprocessing Techniques
- Moving average filters
- Normalization

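Moving Average Filters [DNMP,12]
Input: $x = \langle x_1, \dots, x_n \rangle$ and a window half-width $w$.
- Simple Moving Average (MA): each value is substituted by the average of adjacent values.
  $x_i^{MA} = \frac{1}{2w+1} \sum_{k=i-w}^{i+w} x_k$, giving $x^{MA} = \langle x_1^{MA}, \dots, x_m^{MA} \rangle$.
- Uncertain Moving Average (UMA): each value is substituted by a weighted average of adjacent values.
  $x_i^{UMA} = \frac{1}{2w+1} \sum_{k=i-w}^{i+w} \frac{x_k}{\sigma_k}$, giving $x^{UMA} = \langle x_1^{UMA}, \dots, x_m^{UMA} \rangle$.
- Uncertain Exponential Moving Average (UEMA): each value is substituted by a weighted average of adjacent values, with weights that decrease exponentially.
  $x_i^{UEMA} = \frac{\sum_{k=i-w}^{i+w} x_k \, e^{-\lambda |k-i|} / \sigma_k}{\sum_{k=i-w}^{i+w} e^{-\lambda |k-i|}}$, giving $x^{UEMA} = \langle x_1^{UEMA}, \dots, x_m^{UEMA} \rangle$.
In these formulas $\sigma_k$ is the standard deviation of the error at timestamp $k$, and $\lambda$ controls how fast the weights decay.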
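The three filters can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation of [DNMP,12]: the function names are hypothetical, `sigma` is assumed to hold the error standard deviation per timestamp, `lam` is the assumed decay parameter, and windows are simply truncated at the ends of the series (an edge-handling choice made here, not taken from the slides).

```python
import math

def _window(i, w, n):
    """Indices k with |k - i| <= w, clipped to the series bounds."""
    return range(max(0, i - w), min(n, i + w + 1))

def moving_average(x, w):
    """MA: plain mean of the values inside the window."""
    n = len(x)
    out = []
    for i in range(n):
        idx = _window(i, w, n)
        out.append(sum(x[k] for k in idx) / len(idx))
    return out

def uncertain_moving_average(x, sigma, w):
    """UMA: each neighbour is divided by its error standard deviation."""
    n = len(x)
    out = []
    for i in range(n):
        idx = _window(i, w, n)
        out.append(sum(x[k] / sigma[k] for k in idx) / len(idx))
    return out

def uncertain_exp_moving_average(x, sigma, w, lam=1.0):
    """UEMA: like UMA, with weights decaying as exp(-lam * |k - i|)."""
    n = len(x)
    out = []
    for i in range(n):
        idx = _window(i, w, n)
        weights = [math.exp(-lam * abs(k - i)) for k in idx]
        num = sum(x[k] * wt / sigma[k] for k, wt in zip(idx, weights))
        out.append(num / sum(weights))
    return out
```

Because UMA and UEMA divide by the error standard deviation, their output is on a different scale than the input whenever the deviations differ from 1, so a rescaling step after filtering is natural.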

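Normalization
- Input: $x = \langle x_1, \dots, x_n \rangle$. Filtering changes the scale and baseline.
- Normal form: $\hat{x} = \langle \hat{x}_1, \dots, \hat{x}_n \rangle$ with $\hat{x}_i = \frac{x_i - \bar{x}}{s_x}$, where $\bar{x}$ is the mean and $s_x$ the standard deviation of $x$.
- Normalization makes similarity measures invariant to scaling and shifting and hence helps them better capture the similarity.
- Filtering + normalization: closer to the exact values.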
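The invariance claim can be checked directly. The sketch below assumes the sample standard deviation (divisor n - 1) for s_x, which is a choice made here; the function name `normalize` is hypothetical.

```python
import statistics

def normalize(x):
    """Return the normal form of x: subtract the mean, divide by the std."""
    mean = statistics.fmean(x)
    s = statistics.stdev(x)  # assumption: sample standard deviation (n - 1)
    if s == 0:
        raise ValueError("a constant series cannot be normalized")
    return [(v - mean) / s for v in x]

# A series and a scaled, shifted copy of it share the same normal form.
x = [1.0, 4.0, 2.0, 8.0, 5.0]
scaled_shifted = [3.0 * v + 10.0 for v in x]
same = all(
    abs(a - b) < 1e-9
    for a, b in zip(normalize(x), normalize(scaled_shifted))
)
```

Here `same` evaluates to `True` for any positive scale factor, which is the sense in which a similarity measure applied to normal forms ignores scaling and shifting.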

Experiment Setup
- Machine: 2.66 GHz CPU, 4 GB RAM.
- Data: 16 UCR datasets [KZHHXWR].
- Uncertain time series generation follows [YWY,09] and [SM,10]. The standard deviation of the uncertain time series is $r \times \sigma$, where $r$ is the standard deviation ratio (SDR), varied from 0.01 to 4, and $\sigma$ is the standard deviation of the given standard time series.
- Error distribution functions: exponential, normal, uniform.
- Performance measures: classification error, F1 score.
- Results are averaged over 10 runs.
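The generation step can be sketched as follows. This is an illustration under stated assumptions, not the procedure of [YWY,09] or [SM,10]: errors are taken to be zero-mean and independent per timestamp, the exponential error is shifted to have zero mean, and the names `sample_error` and `make_uncertain` are hypothetical.

```python
import math
import random
import statistics

def sample_error(dist, sd, rng):
    """Draw one zero-mean error with standard deviation sd."""
    if dist == "normal":
        return rng.gauss(0.0, sd)
    if dist == "uniform":
        half_width = sd * math.sqrt(3.0)  # Uniform(-a, a) has std a / sqrt(3)
        return rng.uniform(-half_width, half_width)
    if dist == "exponential":
        # Exponential with mean sd also has std sd; shift it to zero mean.
        return rng.expovariate(1.0 / sd) - sd
    raise ValueError(f"unknown error distribution: {dist}")

def make_uncertain(series, r, dist="normal", seed=0):
    """Perturb a standard time series with errors of std r * sigma.

    r is the standard deviation ratio (SDR) and sigma is the standard
    deviation of the input series. Returns (observed values, error std).
    """
    rng = random.Random(seed)
    sd = r * statistics.stdev(series)
    observed = [v + sample_error(dist, sd, rng) for v in series]
    return observed, sd
```

A call such as `make_uncertain(series, 0.5, "uniform")` returns the perturbed values together with the error standard deviation, which is the statistical information an uncertain similarity measure or an uncertain filter would consume.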

1) Deterministic Similarity Measures
- Comparison approach: 1NN classification with K-fold cross-validation [DTSWK,08].
- [Figure panels and annotations, in slide order: No Preprocessing; Simple Moving Average; "DUST error < Euclidean error"; Uncertain Moving Average; Uncertain Exponential Moving Average; "DUST error ≈ Euclidean error".]
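The comparison approach can be sketched as a small 1NN classifier evaluated with K-fold cross-validation. The fold assignment (index modulo k) and the function names are assumptions made for illustration; any deterministic distance, such as Euclidean or DUST, could be passed as `dist`.

```python
def sq_euclidean(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def one_nn_cv_error(series, labels, k=5, dist=sq_euclidean):
    """Classification error of 1NN under k-fold cross-validation.

    Assumption: item i belongs to fold i % k. Each test item is labelled
    with the class of its nearest neighbour among the other folds.
    """
    n = len(series)
    errors = 0
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        for i in test_idx:
            nearest = min(train_idx, key=lambda j: dist(series[i], series[j]))
            if labels[nearest] != labels[i]:
                errors += 1
    return errors / n
```

Running the same function with two different distance functions on the same folds gives the pair of error rates that the comparison needs.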

2) Probabilistic Similarity Measures: Uncertain Correlation
[Figure panels and annotations, in slide order: No Preprocessing; Simple Moving Average; "more results"; "no result"; Uncertain Moving Average; Uncertain Exponential Moving Average.]

2) Probabilistic Similarity Measures: Uncertain Correlation
[Figure panels: No Preprocessing; Simple Moving Average; Uncertain Moving Average; Uncertain Exponential Moving Average.]
Annotation: for low probability thresholds, the F1 score is higher than with the simple moving average.

Results per dataset and filter (percentages are the relative change with respect to the None column)

Dataset      None   MA            UMA           UEMA
50words      0.38   0.53 (+39%)   0.50 (+32%)   0.48 (+26%)
Adiac        0.51   0.72 (+41%)   0.74 (+45%)   0.71 (+39%)
Beef         0.49   0.69 (+41%)   0.69 (+41%)   0.62 (+27%)
CBF          0.38   0.48 (+26%)   0.45 (+18%)   0.45 (+18%)
Coffee       0.51   0.73 (+43%)   0.69 (+35%)   0.68 (+33%)
ECG200       0.46   0.62 (+35%)   0.62 (+35%)   0.61 (+33%)
FISH         0.50   0.71 (+42%)   0.74 (+48%)   0.69 (+38%)
FaceFour     0.39   0.53 (+36%)   0.53 (+36%)   0.50 (+28%)
Gun-Point    0.48   0.68 (+42%)   0.69 (+44%)   0.66 (+38%)
Lighting2    0.31   0.42 (+35%)   0.43 (+39%)   0.41 (+32%)
Lighting7    0.35   0.47 (+34%)   0.47 (+34%)   0.45 (+29%)
OSULeaf      0.34   0.47 (+38%)   0.46 (+35%)   0.43 (+26%)
OliveOil     0.51   0.73 (+43%)   0.73 (+43%)   0.66 (+29%)
Swedish      0.46   0.62 (+35%)   0.63 (+37%)   0.60 (+30%)

2) Probabilistic Similarity Measures: Uncertain Correlation
[Figure panels: No Preprocessing; Simple Moving Average; Uncertain Moving Average; Uncertain Exponential Moving Average.]

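2) Probabilistic Similarity Measures: Uncertain Correlation
Queries compared:
- $\mathrm{Corr}(x, y) \ge 0.6$
- $P(\mathrm{Corr}(X, Y) \ge 0.6) \ge 0.8$
- $P(\mathrm{Corr}(X, Y) \ge 0.6) \ge 0.9$
- $P(\mathrm{Corr}(X, Y) \ge 0.6) \ge 0$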

2) Probabilistic Similarity Measures: PROUD

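2) Probabilistic Similarity Measures: PROUD
- Traditional query: $\mathrm{Eucl}(x, y) \le 100$
- Probabilistic query: $P(\mathrm{Eucl}(X, Y) \le 100) \ge 0.8$
- Observed on the slide's two examples: $P(\mathrm{Eucl}(X, Y) \le 100) = 0$ in both cases.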

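2) Probabilistic Similarity Measures: PROUD
- Distance: $\mathrm{Eucl}(X, Y) = \sum_{i=1}^{n} (X_i - Y_i)^2$
- Expected distance: $E(\mathrm{Eucl}(X, Y)) = \mathrm{Eucl}(E(X), E(Y)) + \sum_{i=1}^{n} \left( \mathrm{Var}(X_i) + \mathrm{Var}(Y_i) \right)$, where $E(X) = \langle E(X_1), \dots, E(X_n) \rangle$
- Result on both examples: $P(\mathrm{Eucl}(X, Y) \le 100) = 0$
- Slide annotation: n 300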
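The expected-distance identity explains why the probability collapses to zero: the variance terms inflate the distance even when the expected series coincide. The sketch below illustrates this under assumptions made here, not taken from [YWY,09]: errors are independent and normal, the distance is approximated by a normal distribution (a central-limit argument), and the series length of 300 with unit variances in the usage note is only an illustrative choice. The function names are hypothetical.

```python
import math

def expected_sq_dist(mean_x, mean_y, var_x, var_y):
    """E(Eucl(X, Y)) = Eucl(E(X), E(Y)) + sum of all variances."""
    base = sum((a - b) ** 2 for a, b in zip(mean_x, mean_y))
    inflation = sum(var_x) + sum(var_y)
    return base + inflation

def prob_within(mean_x, mean_y, var_x, var_y, d):
    """Normal approximation of P(Eucl(X, Y) <= d).

    Assumption: X_i - Y_i is normal with mean m and variance s2, so
    Var((X_i - Y_i)^2) = 2 * s2^2 + 4 * m^2 * s2.
    """
    mu = expected_sq_dist(mean_x, mean_y, var_x, var_y)
    var = 0.0
    for a, b, vx, vy in zip(mean_x, mean_y, var_x, var_y):
        s2 = vx + vy
        var += 2.0 * s2 ** 2 + 4.0 * (a - b) ** 2 * s2
    if var == 0.0:
        return 1.0 if mu <= d else 0.0  # no uncertainty: exact comparison
    z = (d - mu) / math.sqrt(var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

For two series of length 300 with identical expected values and unit error variances, `expected_sq_dist` returns 600, so a threshold of 100 lies many standard deviations below the mean and `prob_within` is effectively zero, even though the traditional distance between the expected series is 0.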

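PROUDS: An Enhanced Version of PROUD
- Distance: $\mathrm{Eucl}(X, Y) = \sum_{i=1}^{n} \left( (E(X_i))^2 + (E(Y_i))^2 - 2 X_i Y_i \right)$
- Expected distance: $E(\mathrm{Eucl}(X, Y)) = \mathrm{Eucl}(E(X), E(Y))$
- Result: $P(\mathrm{Eucl}(X, Y) \le 100) = 0.8$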

PROUDS: An Enhanced Version of PROUD
Lemma 1. Given uncertain time series X and Y with normal forms $\hat{X}$ and $\hat{Y}$, the following holds: $\mathrm{Eucl}(\hat{X}, \hat{Y}) = 2(n-1)(1 - \mathrm{Corr}(X, Y))$.
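The identity in Lemma 1 can be checked numerically on ordinary (certain) series. The sketch assumes that Eucl denotes the squared Euclidean distance and that normal forms use the sample standard deviation (divisor n - 1), which is what the factor 2(n - 1) suggests; the function names are hypothetical.

```python
import math
import statistics

def normalize(x):
    """Normal form: subtract the mean, divide by the sample std (n - 1)."""
    mean = statistics.fmean(x)
    s = statistics.stdev(x)
    return [(v - mean) / s for v in x]

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def sq_euclidean(a, b):
    """Squared Euclidean distance."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def lemma_gap(x, y):
    """Absolute difference between the two sides of the identity."""
    n = len(x)
    lhs = sq_euclidean(normalize(x), normalize(y))
    rhs = 2 * (n - 1) * (1 - pearson(x, y))
    return abs(lhs - rhs)
```

The gap is zero up to floating-point rounding because each normal form has squared norm n - 1 and their inner product equals (n - 1) times the correlation, which is the linear relationship between the distance-based and correlation-based measures.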

2) Probabilistic Similarity Measures: Uncertain Correlation

Conclusion
- Uncertain similarity measures can outperform the traditional similarity measures with and without data preprocessing. This indicates the effectiveness of uncertain similarity measures in practice.
- Uncertain similarity measures utilize all the available information to better quantify the similarity.
- Probabilistic similarity measures provide users with more information about the reliability of the result.
- Preprocessing is necessary for similarity search in uncertain time series.
- Simple and uncertain moving average filters improve the performance of the probabilistic measures more than the uncertain exponential moving average filter does.
- We propose an enhancement for the PROUD similarity measure, which improves its performance with and without data preprocessing.
- We found a linear relationship between probabilistic similarity measures.

Future Work
- Research on uncertain time series is new and has mostly focused on modeling and similarity search.
- More work is required in this field, e.g., on indexing and prediction.

References
[DNMP,12] M. Dallachiesa, B. Nushi, K. Mirylenka, and T. Palpanas, Uncertain Time Series Similarity: Return To The Basics, In Proc. of the VLDB Endowment, 5(11):1662-1673, 2012.
[DTSWK,08] H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh, Querying and Mining of Time Series Data: Experimental Comparison of Representations and Distance Measures, In Proc. of the VLDB Endowment, 1(2):1542-1552, 2008.
[KZHHXWR] E. Keogh, Q. Zhu, B. Hu, Y. Hao, X. Xi, L. Wei, and C. A. Ratanamahatana, The UCR Time Series Classification/Clustering Homepage: www.cs.ucr.edu/~eamonn/time_series_data/.
[OS,12] M. Orang and N. Shiri, A Probabilistic Approach To Correlation Queries In Uncertain Time Series Data, In Proc. of CIKM, 2229-2233, 2012.
[OS,14] M. Orang and N. Shiri, An Experimental Evaluation of Similarity Measures for Uncertain Time Series, In Proc. of IDEAS, 2014.
[SM,10] S. R. Sarangi and K. Murthy, DUST: A Generalized Notion of Similarity between Uncertain Time Series, In Proc. of the 16th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 383-392, 2010.
[YWY,09] M. Y. Yeh, K. L. Wu, P. S. Yu, and M. S. Chen, PROUD: A Probabilistic Approach to Processing Similarity Queries over Uncertain Data Streams, In Proc. of the 12th Int. Conf. on Extending Database Technology, 684-695, 2009.

Questions?