Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques

Similar documents
Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques

Uncertain Time-Series Similarity: Return to the Basics

MANAGING UNCERTAINTY IN SPATIO-TEMPORAL SERIES

Fusion of Similarity Measures for Time Series Classification

Temporal and Frequential Metric Learning for Time Series knn Classication

DIT - University of Trento Modeling and Querying Data Series and Data Streams with Uncertainty

Time-Series Analysis Prediction Similarity between Time-series Symbolic Approximation SAX References. Time-Series Streams

Fundamentals of Similarity Search

Time Series Classification Using Time Warping Invariant Echo State Networks

arxiv: v2 [cs.lg] 13 Mar 2018

Believe it Today or Tomorrow? Detecting Untrustworthy Information from Dynamic Multi-Source Data

Learning Time-Series Shapelets

Time Series Data Cleaning

An Alternate Measure for Comparing Time Series Subsequence Clusters

CSE 5243 INTRO. TO DATA MINING

Group Pattern Mining Algorithm of Moving Objects Uncertain Trajectories

Frequency-hiding Dependency-preserving Encryption for Outsourced Databases

DISI - University of Trento Mining and Learning in Sequential Data Streams: Interesting Correlations and Classification in Noisy Settings

Online Appendix for Discovery of Periodic Patterns in Sequence Data: A Variance Based Approach

CSE 5243 INTRO. TO DATA MINING

Recent advances in Time Series Classification

Probabilistic Similarity Search for Uncertain Time Series

Machine Learning on temporal data

Variable Latent Semantic Indexing

Chapter 6. Frequent Pattern Mining: Concepts and Apriori. Meng Jiang CSE 40647/60647 Data Science Fall 2017 Introduction to Data Mining

Iterative Laplacian Score for Feature Selection

Learning in Probabilistic Graphs exploiting Language-Constrained Patterns

Window-based Tensor Analysis on High-dimensional and Multi-aspect Streams

Rare Event Discovery And Event Change Point In Biological Data Stream

A Novel Click Model and Its Applications to Online Advertising

A Representation of Time Series for Temporal Web Mining

Nonsmooth Analysis and Subgradient Methods for Averaging in Dynamic Time Warping Spaces

The τ-skyline for Uncertain Data

UAPD: Predicting Urban Anomalies from Spatial-Temporal Data

Detecting Anomalous and Exceptional Behaviour on Credit Data by means of Association Rules. M. Delgado, M.D. Ruiz, M.J. Martin-Bautista, D.

Probabilistic Frequent Itemset Mining in Uncertain Databases

Predictive Nearest Neighbor Queries Over Uncertain Spatial-Temporal Data

Very Sparse Random Projections

Attraction and Avoidance Detection from Movements Zhenhui Li, Bolin Ding, Fei Wu, Tobias Kin Hou Lei, Roland Kays, and Margaret Crofoot.

Efficient Parallel Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints

Semantics of Ranking Queries for Probabilistic Data and Expected Ranks

JOINT PROBABILISTIC INFERENCE OF CAUSAL STRUCTURE

Finding persisting states for knowledge discovery in time series

Latent Geographic Feature Extraction from Social Media

Style-aware Mid-level Representation for Discovering Visual Connections in Space and Time

FUZZY ASSOCIATION RULES: A TWO-SIDED APPROACH

Searching Dimension Incomplete Databases

Global Journal of Engineering Science and Research Management

Differentially Private Real-time Data Release over Infinite Trajectory Streams

Efficient search of the best warping window for Dynamic Time Warping

Finding Frequent Items in Probabilistic Data

Advanced Techniques for Mining Structured Data: Process Mining

Change Detection in Multivariate Data

Distributed Mining of Frequent Closed Itemsets: Some Preliminary Results

RaRE: Social Rank Regulated Large-scale Network Embedding

Fast Comparison of Software Birthmarks for Detecting the Theft with the Search Engine

On-line Dynamic Time Warping for Streaming Time Series

Mobility Analytics through Social and Personal Data. Pierre Senellart

Mining Positive and Negative Fuzzy Association Rules

Dynamic Time Warping and FFT: A Data Preprocessing Method for Electrical Load Forecasting

ECE 661: Homework 10 Fall 2014

Learning from Temporal Data Using Dynamical Feature Space

Finding Pareto Optimal Groups: Group based Skyline

Introduction to Machine Learning

Correlation Preserving Unsupervised Discretization. Outline

Matrix Factorization Techniques For Recommender Systems. Collaborative Filtering

1 Differential Privacy and Statistical Query Learning

The Truncated Tornado in TMBB: A Spatiotemporal Uncertainty Model for Moving Objects

Improving time-series rule matching performance for detecting energy consumption patterns

Sufficient Dimension Reduction for Longitudinally Measured Predictors

WEATHER DEPENENT ELECTRICITY MARKET FORECASTING WITH NEURAL NETWORKS, WAVELET AND DATA MINING TECHNIQUES. Z.Y. Dong X. Li Z. Xu K. L.

Machine Learning: Pattern Mining

Detection of Highly Correlated Live Data Streams

Streaming multiscale anomaly detection

3.6 NCEP s Global Icing Ensemble Prediction and Evaluation

Introduction to Statistical Inference

Handling Uncertainty in Clustering Art-exhibition Visiting Styles

Reducing The Data Transmission in Wireless Sensor Networks Using The Principal Component Analysis

Available online at ScienceDirect. Procedia Engineering 119 (2015 ) 13 18

Anomaly Detection for the CERN Large Hadron Collider injection magnets

A Data-driven Approach for Remaining Useful Life Prediction of Critical Components

Real-time Sentiment-Based Anomaly Detection in Twitter Data Streams

Tracking the Intrinsic Dimension of Evolving Data Streams to Update Association Rules

Multi-theme Sentiment Analysis using Quantified Contextual

Incremental Latin Hypercube Sampling

A METHOD OF FINDING IMAGE SIMILAR PATCHES BASED ON GRADIENT-COVARIANCE SIMILARITY

Decision trees for stream data mining new results

Modeling Dynamic Evolution of Online Friendship Network

ASSOCIATION RULE MINING AND CLASSIFIER APPROACH FOR QUANTITATIVE SPOT RAINFALL PREDICTION

Application and verification of ECMWF products 2010

Learning to Learn and Collaborative Filtering

Fast Approximation of Probabilistic Frequent Closed Itemsets

Clustering Lecture 1: Basics. Jing Gao SUNY Buffalo

Active network alignment

Classification of Hand-Written Digits Using Scattering Convolutional Network

Orientation Map Based Palmprint Recognition

A Novel Method for Mining Relationships of entities on Web

Intelligent Systems (AI-2)

Asymmetric Correlation Regularized Matrix Factorization for Web Service Recommendation

Mining Temporal Patterns for Interval-Based and Point-Based Events

Transcription:

Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques Mahsa Orang Nematollaah Shiri 27th International Conference on Scientific and Statistical Database Management

Uncertain Time Series Standard Time Series Value Introduction u u,..., u 1 n 1 2 n Timestamp U 1 U 2 U U,..., U 1 n U n??????????????????? Timestamp 2

Uncertain Time Series Introduction U 1 U 2?????? U U,..., U 1 n??????????? U n?? Observed value Exact value i, U i t i E Error i Timestamp Multiple readings Forecasting techniques Wireless sensor network Medical data analysis Privacy concerns Data collection error Location-based services 3

Introduction: Similarity Search Approaches Traditional Similarity Measures Uncertain Similarity Measures Values Values+ Statistical Information 4

Introduction: Similarity Search Approaches Traditional Similarity Measures [DNMP,12] OUTPERFORM Uncertain Similarity Measures Why? Using a Preprocessing Step 5

Introduction: Preprocessing Uncertain Time Series Filtering: Smoother but farther from the exact values Standard Time Series Filtering + Normalization : Closer to the exact values 6

Motivation Can preprocessing improve the performance of uncertain similarity measures? Will uncertain similarity measures outperform traditional ones using preprocessing techniques? 7

Outline Uncertain Similarity Measures Preprocessing Techniques Experimental Evaluation 8

Uncertain Similarity Measures D( X, Y ) c A Constant -Orang & Shiri [OS,14] Deterministic Probabilistic D( X, Y ) Z A Random Variable P( Z c) p DUST -Sarangi et al. [SM,10] PROUD -Yeh et al. [YWY,09] Uncertain Correlation -Orang & Shiri [OS,12] p c 9

Uncertain Similarity Measures D( X, Y ) c A Constant -Orang & Shiri [OS,14] Deterministic Probabilistic D( X, Y ) Z A Random Variable DUST -Sarangi et al. [SM,10] PROUD -Yeh et al. [YWY,09] Uncertain Correlation -Orang & Shiri [OS,12] Probabilistic Queries P( Eucl ( X, Q) d) p P( Corr( X, Q) c) p Given a dataset of uncertain time series D, an uncertain time series Q, a similarity threshold s, and a probability threshold p, a probabilistic query searches for uncertain time series X in D such that similarity between X and Q is higher than s, with a confidence of at least p. 10

Preprocessing Techniques Moving Average Filters Normalization 11

Moving Average Filters x x,..., 1 xn [DNMP,12] Each value is substituted by average of adjacent values. MA i w MA 1 xi xk Simple Moving Average 2w 1 kiw MA MA x x,..., x 1 MA m Each value is substituted by weighted average of adjacent values. UMA i w UMA 1 xk xi Uncertain Moving Average 2w 1 kiw k UMA UMA x x,..., x 1 UMA m UEMA Uncertain Exponential Moving Average UEMA UEMA x x,..., x 1 UEMA m x UEMA i iw kiw x k ( e iw kiw ki e ki Each value is substituted by weighted average of adjacent values, weights decrease exponentially. 12 ) k

Normalization x x,..., 1 xn Filtering changes the scale and baseline xˆ xˆ 1,..., xˆ n xˆ i x i s x x Normalization makes similarity measures invariant to scaling and shifting and hence helps better capture the similarity. Filtering + Normalization : Closer to the exact values 13

Experiment Setup CPU:2.66 GHz, RAM:4GB 16 UCR datasets [KZHHXWR] Uncertain time series generation [YWY,09], [SM,10]: The standard deviation of UTS: r Х σ r: Standard deviation ratio (SDR), varied from 0.01 to 4 σ: The standard deviation of the given standard time series Error distribution function: Exponential, Normal, Uniform Performance measure: Classification error, F1 Score Average over 10 runs 14

1) Deterministic Similarity Measures Comparison approach: 1NN-classification with K-fold cross-validation[dtswk,08] No Preprocessing Simple Moving Average DUST Error < Euclidean Error Uncertain Moving Average Uncertain Exponential moving Average DUST Error ~= Euclidean Error 15

2) Probabilistic Similarity Measures: Uncertain Correlation No Preprocessing Simple Moving Average more results no result Uncertain Moving Average Uncertain Exponential moving Average 16

2) Probabilistic Similarity Measures: Uncertain Correlation No Preprocessing Simple Moving Average Uncertain Moving Average Uncertain Exponential moving Average for low probability thresholds, the F1 score is higher than simple moving average 17

Dataset 50words Adiac Beef CBF Coffee ECG200 FISH FaceFour Gun-Point Lighting2 Lighting7 OSULeaf OliveOil Swedish Filter None MA UMA UEMA 0.38 0.53 (+39%) 0.5 (+32%) 0.48 (+26%) 0.51 0.72 (+41%) 0.74 (+45%) 0.71 (+39%) 0.49 0.69 (+41%) 0.69 (+41%) 0.62 (+27%) 0.38 0.48 (+26%) 0.45 (+18%) 0.45 (+18%) 0.51 0.73 (+43%) 0.69 (+35%) 0.68 (+33%) 0.46 0.62 (+35%) 0.62 (+35%) 0.61 (+33%) 0.5 0.71 (+42%) 0.74 (+48%) 0.69 (+38%) 0.39 0.53 (+36%) 0.53 (+36%) 0.5 (+28%) 0.48 0.68 (+42%) 0.69 (+44%) 0.66 (+38%) 0.31 0.42 (+35%) 0.43 (+39%) 0.41 (+32%) 0.35 0.47 (+34%) 0.47 (34%) 0.45 (+29%) 0.34 0.47 (+38%) 0.46 (+35%) 0.43 (+26%) 0.51 0.73 (+43%) 0.73 (+43%) 0.66 (+29%) 0.46 0.62 (+35%) 0.63 (+37%) 0.60 (+30%) 18

2) Probabilistic Similarity Measures: Uncertain Correlation No Preprocessing Simple Moving Average Uncertain Moving Average Uncertain Exponential moving Average 19

2) Probabilistic Similarity Measures: Uncertain Correlation Corr x, y 0.6 P Corr X, Y 0.6 0.8 P Corr X, Y 0.6 0.9 P Corr X, Y 0.6 0 20

2) Probabilistic Similarity Measures: PROUD 21

2) Probabilistic Similarity Measures: PROUD Eucl x, y 100 P Eucl X, Y 100 0.8 P Eucl X, Y 100 = 0 P Eucl X, Y 100 = 0 22

2) Probabilistic Similarity Measures: PROUD Eucl X, Y = X i Y i 2 n i=1 n 300 E Eucl X, Y = Eucl E(X), E Y + Var X i + Var(Y i ) i=1 E X =< E X 1,, E X n > P Eucl X, Y 100 = 0 P Eucl X, Y 100 = 0 23

PROUDS: An Enhanced Version of PROUD n Eucl X, Y = (E X i 2 + E Y i 2 + 2X i Y i ) i=1 E Eucl X, Y = Eucl E(X), E Y P Eucl X, Y 100 = 0.8 24

PROUDS: An Enhanced Version of PROUD n Eucl X, Y = (E X i 2 + E Y i 2 + 2X i Y i ) i=1 E Eucl X, Y = Eucl E(X), E Y P Eucl Lemma X, Y1. Given 100 uncertain = 0.8 time series X and Y with normal forms X and Y, the following holds. Eucl(X, Y) = 2(n 1)(1 Corr(X, Y)) 25

2) Probabilistic Similarity Measures: Uncertain Correlation 26

Conclusion Uncertain similarity measures can outperform the traditional similarity measures with and without data preprocessing. This indicates the effectiveness of uncertain similarity measures in practice. Uncertain similarity measures utilize all the available information to better quantify the similarity. Probabilistic similarity measures provide the users with more information about reliability of the result. Preprocessing is necessary for similarity search in uncertain time series. Simple and uncertain moving average filters improve the performance of the probabilistic measures more than uncertain exponential moving average filter. We propose an enhancement for the PROUD similarity measure, which improves its performance with and without data preprocessing. We found a linear relationship between probabilistic similarity measures. 27

Future Work Research on uncertain time series is new, mostly focused on modeling and similarity search. More work is required in this field: e.g., indexing, prediction 28

References [DNMP,12] M. Dallachiesa, B. Nushi, K. Mirylenka, and T. Palpanas, Uncertain Time Series Similarity: Return To The Basics, In Proc. of the VLDB Endowment, 5(11):1662-1673, 2012. [DTSWK,08] H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh, Querying and Mining of Time Series Data: Experimental Comparison of Representations and Distance Measures, In Proc. of the VLDB Endowment, 1(2):1542-1552, 2008. [KZHHXWR] E. Keogh, Q. Zhu, B. Hu, Y. Hao, X. Xi, L. Wei, C. A. Ratanamahatana, The UCR Time Series Classification/Clustering Homepage: www.cs.ucr.edu/~eamonn/time_series_data/. [OS,12] M. Orang, and N. Shiri, A Probabilistic Approach To Correlation Queries In Uncertain Time Series Data, In Proc. of CIKM, 2229-2233, 2012. [OS,14] M. Orang, and N. Shiri, An Experimental Evaluation of Similarity Measures for Uncertain Time Series, In Proc. of IDEAS, 2014. [SM,10] S. R. Sarangi, K. Murthy, DUST: A Generalized Notion of Similarity between Uncertain Time Series, In Proc. of the 16th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 383-392, 2010. [YWY,09] M. Y. Yeh, K. L. Wu, P. S. Yu, M. S. Chen, PROUD: A Probabilistic Approach to Processing Similarity Queries over Uncertain Data Streams, In Proc. of the 12th Int. Conf. on Extending Database Technology, 684-695, 2009. 29

QUESTION? 30