Time-Series Streams João Gama LIAAD-INESC Porto, University of Porto, Portugal jgama@fep.up.pt


Outline
1 Time-Series Analysis
2 Prediction: Filters, Neural Nets
3 Similarity between Time-series: Euclidean Distance, Dynamic Time-Warping
4 Symbolic Approximation: SAX
5 References

What are Time Series? Thanks to Eamonn Keogh.

Introduction
Time series analysis refers to applying data analysis techniques to measurements acquired over time. Techniques recently applied to time series include clustering, classification, indexing, and association rules. The focus of classical time series analysis was on forecasting and pattern identification.


Moving Averages
Moving average: the mean of the previous n data points, updated incrementally as
MA_{t+1} = MA_t - x_{t-n+1}/n + x_{t+1}/n
Cumulative moving average: the average of all of the data up to the current data point,
CA_{i+1} = CA_i + (x_{i+1} - CA_i)/(i+1)

Weighted Moving Averages
A weighted moving average applies multiplying factors to give different weights to different data points:
WMA_t = (n·x_t + (n-1)·x_{t-1} + ... + 2·x_{t-n+2} + x_{t-n+1}) / (n + (n-1) + ... + 2 + 1)
Exponential moving average: the weight of each older data point decreases exponentially, giving much more importance to recent observations while still not discarding older observations entirely:
S_t = α·Y_{t-1} + (1-α)·S_{t-1}
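A minimal Python sketch of the three averages above (illustrative only; the function names and the toy series are not from the slides):

# Simple, weighted and exponential moving averages over a list of observations.

def simple_moving_average(x, n):
    """Mean of the most recent n points, for each position where a full window exists."""
    return [sum(x[i - n + 1:i + 1]) / n for i in range(n - 1, len(x))]

def weighted_moving_average(x, n):
    """The newest point in the window gets weight n, the oldest gets weight 1."""
    weights = list(range(1, n + 1))                 # 1, 2, ..., n (oldest to newest)
    denom = sum(weights)                            # n + (n-1) + ... + 1
    return [sum(w * v for w, v in zip(weights, x[i - n + 1:i + 1])) / denom
            for i in range(n - 1, len(x))]

def exponential_moving_average(x, alpha):
    """S_t = alpha * x_{t-1} + (1 - alpha) * S_{t-1}, seeded with the first value."""
    s = [x[0]]
    for t in range(1, len(x)):
        s.append(alpha * x[t - 1] + (1 - alpha) * s[-1])
    return s

series = [10, 12, 11, 13, 15, 14, 16]
print(simple_moving_average(series, 3))       # [11.0, 12.0, 13.0, 14.0, 15.0]
print(weighted_moving_average(series, 3))
print(exponential_moving_average(series, 0.3))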

Filters
A filter is a device that removes some unwanted component or feature from a signal; its defining feature is the complete or partial suppression of some aspect of the signal. Most often this means removing some frequencies and not others, in order to suppress interfering signals and reduce background noise. Two basic types are low-pass filters and high-pass filters.

Low-pass Filters
A low-pass filter passes low-frequency signals but attenuates (reduces the amplitude of) signals with frequencies higher than the cutoff frequency.
function lowpass(x, α)
    y[0] := x[0]
    for i from 1 to n
        y[i] := α * x[i] + (1 - α) * y[i-1]
    return y

Low-pass Filters
y_i = y_{i-1} + α·(x_i - y_{i-1})
The change from one filter output to the next is proportional to the difference between the previous output and the next input. This exponential smoothing property matches the exponential decay seen in the continuous-time system. As expected, as the smoothing parameter α decreases, the output samples respond more slowly to a change in the input samples: the system has more inertia.

Low-pass Filters

High-pass Filters
A high-pass filter passes high-frequency signals but attenuates (reduces the amplitude of) signals with frequencies lower than the cutoff frequency.
function highpass(x, α)
    y[0] := x[0]
    for i from 1 to n
        y[i] := α * y[i-1] + α * (x[i] - x[i-1])
    return y

High-pass Filters
y_i = α·(y_{i-1} + x_i - x_{i-1})
A large α implies that the output decays very slowly but is also strongly influenced by even small changes in the input; still, a constant input (i.e., x[i] - x[i-1] = 0) always decays to zero. A small α implies that the output decays quickly and requires large changes in the input (i.e., x[i] - x[i-1] is large) to change much: the filter can only pass relatively high frequencies because it requires large (i.e., fast) changes and quickly forgets its prior output values.
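A minimal Python sketch of both filters, directly following the pseudocode above (α and the toy step signal are illustrative assumptions):

def lowpass(x, alpha):
    y = [x[0]]
    for i in range(1, len(x)):
        # y[i] = y[i-1] + alpha * (x[i] - y[i-1]): exponential smoothing
        y.append(alpha * x[i] + (1 - alpha) * y[-1])
    return y

def highpass(x, alpha):
    y = [x[0]]
    for i in range(1, len(x)):
        # passes fast changes, decays toward zero on constant input
        y.append(alpha * (y[-1] + x[i] - x[i - 1]))
    return y

signal = [0, 0, 0, 10, 10, 10, 0, 0, 0]
print(lowpass(signal, 0.3))   # smoothed step
print(highpass(signal, 0.3))  # spikes at the step edges, then decays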

High-pass Filters

Neural-Nets and Data Streams
Multilayer neural networks are a general function approximation method: a 3-layer ANN can approximate any continuous function with arbitrary precision.
Training neural networks usually means scanning the training data several times, updating the weights after processing each example (or each epoch). The only reason for multiple scans of the training data is lack of data: small training sets.

Neural-Nets and Data Streams
Neural networks and data streams: stochastic sequential training gives fast training and prediction, since each example is propagated once and its error is back-propagated once, with no overfitting. For each incoming example: first make the prediction, then update the model. The network smoothly adjusts to gradual changes.
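A sketch of this predict-then-update loop, assuming scikit-learn's MLPRegressor (its partial_fit performs one incremental update per call); the toy sine stream below stands in for real sensor data:

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.exceptions import NotFittedError

def stream(n=500, lags=4):
    """Toy stream: predict a noisy sine from its last `lags` values."""
    series = np.sin(np.arange(n + lags) / 10.0) + 0.05 * np.random.randn(n + lags)
    for t in range(lags, n + lags):
        yield series[t - lags:t], series[t]

model = MLPRegressor(hidden_layer_sizes=(4,), solver="sgd",
                     learning_rate_init=0.01, random_state=0)
abs_errors = []
for x_t, y_t in stream():
    x_t = np.asarray(x_t).reshape(1, -1)
    try:
        abs_errors.append(abs(model.predict(x_t)[0] - y_t))  # first: predict
    except NotFittedError:
        pass                                                  # nothing learned yet
    model.partial_fit(x_t, [y_t])                             # second: update once

print("mean absolute prequential error:", np.mean(abs_errors))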

NN and Data Streams

Illustrative Examples

Illustrative Examples

Load Forecast in Data Streams
The goal is to have an any-time prediction of the short-term forecast value for all sensors, for three horizons: next hour, next day, and next week. Although each model produces only one output prediction, after a short period (one week) the system has, at any time:
1 prediction for the next hour;
24 predictions for the next day;
168 predictions for the next week.

Load Forecast in Data Streams
MLP topology: 10 inputs, 4 hidden neurons (tanh activation), and a linear output. The input vector for predicting time k depends on the forecast horizon:
One hour: k minus {1, 2, 3, 4} hours and k minus {7, 14} days.
One day: k minus {1, 6, 7} days, k minus 7 days and 1 hour, and k minus {8, 14} days.
One week: k minus 7 days, k minus 7 days and 1 hour, and k minus {13, 14, 15, 21} days.
The remaining inputs of each model are fed with 4 cyclic variables (sin and cos) identifying the hourly and weekly periods; a sketch of how such a vector could be assembled follows.
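An illustrative sketch (not the authors' code) of how the one-hour-horizon input vector could be assembled from an hourly load buffer plus the four cyclic variables; the helper name and buffer layout are assumptions:

import math

HOUR_LAGS = [1, 2, 3, 4, 7 * 24, 14 * 24]   # k-1h ... k-4h, k-7d, k-14d (in hours)

def input_vector(load, k, hour_of_day, hour_of_week):
    """load: list of hourly values indexed by hour; k: index of the hour to predict."""
    lags = [load[k - h] for h in HOUR_LAGS]
    cyclic = [math.sin(2 * math.pi * hour_of_day / 24),
              math.cos(2 * math.pi * hour_of_day / 24),
              math.sin(2 * math.pi * hour_of_week / 168),
              math.cos(2 * math.pi * hour_of_week / 168)]
    return lags + cyclic                      # 6 lagged loads + 4 cyclic = 10 inputs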

Buffering the Input Data and the Predictions
Online prediction and training are achieved by buffering the input data and the predictions.

Variance Reduction
MLPs are prone to high-variance error, and sensor data is noisy and uncertain.
Dual perturb and combine: perturb each test example several times by adding white noise to the attribute values; for each perturbed version the MLP makes a prediction, and the final prediction is obtained by aggregating these predictions.
Advantages: works online, and is an efficient variance-reduction method.
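A sketch of the dual perturb and combine step, assuming any fitted regressor with a scikit-learn-style predict; the noise level sigma and the number of perturbations are illustrative choices:

import numpy as np

def dual_perturb_predict(model, x, n_perturb=10, sigma=0.01, rng=None):
    """Average the model's predictions over noisy copies of the test example."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x, dtype=float)
    # one prediction per perturbed copy of the test example
    preds = [model.predict((x + rng.normal(0.0, sigma, size=x.shape)).reshape(1, -1))[0]
             for _ in range(n_perturb)]
    return float(np.mean(preds))              # aggregate by averaging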


Thanks to Eamonn Keogh.

Similarity
Similarity measures over time series data are the main step in time series analysis. Euclidean distance and dynamic time warping are the major similarity measures used with time series. Long time series can be computationally hard for analysis tasks, so different time series representations have been proposed to reduce their length.

Euclidean Distance
Given two time series Q = q_1, q_2, ..., q_n and S = s_1, s_2, ..., s_n:
D(Q, S) = sqrt( Σ_{i=1}^{n} (q_i - s_i)^2 )
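A direct Python translation of the formula (assumes the two series have equal length):

import math

def euclidean_distance(q, s):
    return math.sqrt(sum((qi - si) ** 2 for qi, si in zip(q, s)))

print(euclidean_distance([1, 2, 3, 4], [1, 3, 2, 4]))  # sqrt(2)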

Dynamic Time-Warping
Dynamic time warping (DTW) is an algorithm for measuring similarity between two sequences that may vary in time or speed. DTW finds an optimal match between two given sequences (e.g., time series) subject to certain restrictions. The optimization is performed with dynamic programming; for one-dimensional time series the problem can be solved in polynomial time.

Problem Formulation
S. Salvador & P. Chan, FastDTW: Toward Accurate Dynamic Time Warping in Linear Time and Space, Intelligent Data Analysis, 2007.
The dynamic time warping problem is stated as follows: given two time series X = x_1, x_2, ..., x_n and Y = y_1, y_2, ..., y_p, of lengths n and p, construct a warp path W = w_1, w_2, ..., w_K, where K is the length of the warp path and max(n, p) ≤ K < n + p. The k-th element of the warp path is w_k = (i, j), where i is an index from time series X and j is an index from time series Y.

Problem Formulation
The warp path must start at the beginning of each time series, at w_1 = (1, 1), and finish at the end of both time series, at w_K = (n, p). This ensures that every index of both time series is used in the warp path. Another constraint forces i and j to be monotonically increasing along the warp path.

Problem Formulation
The optimal warp path is the minimum-distance warp path, where the distance of a warp path W is
Dist(W) = Σ_{k=1}^{K} dist(w_{ki}, w_{kj})
where dist(w_{ki}, w_{kj}) is the Euclidean distance between the two data points (one from X and one from Y) indexed by the k-th element of the warp path.

Dynamic Time-Warping W = (1,1), (2,1),(3,1), (4,2), (5,3), (6,4), (7,5), (8,6), (9,7), (9,8), (9,9), (9,10), (10,11), (10,12), (11,13), (12,14), (13,15), (14,15), (15,15), (16,16).

Dynamic Time-Warping
Vertical sections of the warp path mean that a single point in time series X is warped to multiple points in time series Y; horizontal sections mean that a single point in Y is warped to multiple points in X. Since a single point may map to multiple points in the other time series, the two time series need not be of equal length. If X and Y were identical time series, the warp path through the matrix would be a straight diagonal line.

Filling the Cost Matrix
To find the minimum-distance warp path, every cell of the cost matrix must be filled. The value of a cell in the cost matrix is
D(i, j) = Dist(i, j) + min[ D(i-1, j), D(i, j-1), D(i-1, j-1) ]
The cost matrix is filled one column at a time, from the bottom up and from left to right. After the entire matrix is filled, a warp path must be found from D(1, 1) to D(n, p).

DTW Algorithm
function DTWDistance(s[1..n], t[1..m])
    int DTW[0..n, 0..m]
    int i, j, cost
    for i := 1 to m
        DTW[0, i] := infinity
    for i := 1 to n
        DTW[i, 0] := infinity
    DTW[0, 0] := 0
    for i := 1 to n
        for j := 1 to m
            cost := d(s[i], t[j])
            DTW[i, j] := cost + minimum(DTW[i-1, j],    // insertion
                                        DTW[i, j-1],    // deletion
                                        DTW[i-1, j-1])  // match
    return DTW[n, m]
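A minimal, runnable Python rendering of the pseudocode, using the absolute difference as the local cost d(s[i], t[j]), plus a small usage example:

def dtw_distance(s, t):
    n, m = len(s), len(t)
    INF = float("inf")
    dtw = [[INF] * (m + 1) for _ in range(n + 1)]
    dtw[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])           # local distance d(s_i, t_j)
            dtw[i][j] = cost + min(dtw[i - 1][j],      # insertion
                                   dtw[i][j - 1],      # deletion
                                   dtw[i - 1][j - 1])  # match
    return dtw[n][m]

a = [1, 2, 3, 4, 3, 2, 1]
b = [1, 1, 2, 3, 4, 3, 2, 1]        # same shape, shifted in time
print(dtw_distance(a, b))           # 0.0: DTW absorbs the shift
print(dtw_distance(a, [2] * 7))     # non-zero for a genuinely different series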

Clustering Bus Travel times

Clustering Bus Travel times
G1: working days from January 1st to July 15th (beginning of the school holiday period)
G2: all non-working days, including holidays and all weekends
G3: working days from September 15th to December 31st
G4: working days from July 15th to September 15th (school holiday period)


Symbolic ApproXimation (SAX)
J. Lin, E. Keogh, S. Lonardi, and B. Chiu, A Symbolic Representation of Time Series, with Implications for Streaming Algorithms, 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2003.
SAX is a fast symbolic approximation of time series. It allows a time series of length n to be transformed into an approximated time series of an arbitrary length w, where w < n. SAX follows three main steps: Piecewise Aggregate Approximation (PAA), symbolic discretization, and distance measurement. SAX is generic, can be combined with any time series analysis technique, and takes linear time.

Piecewise Aggregate Approximation (PAA)
A time series of length n is approximated by PAA to a time series of length w using
c̄_i = (w/n) Σ_{j = (n/w)(i-1)+1}^{(n/w)·i} c_j
where c̄_i is the i-th element of the approximated time series.
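A sketch of PAA for the simple case where w divides n evenly (the general formula above also handles fractional segment boundaries):

def paa(series, w):
    n = len(series)
    seg = n // w                                  # points per segment (assumes w divides n)
    return [sum(series[i * seg:(i + 1) * seg]) / seg for i in range(w)]

print(paa([2, 4, 6, 8, 10, 12, 14, 16], 4))       # [3.0, 7.0, 11.0, 15.0]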

Symbolic Discretization
Requires two parameters: the cardinality of the alphabet of symbols and the number of episodes. Breakpoints are calculated so that they produce equal areas under the Gaussian distribution from one breakpoint to the next. Each PAA value is then mapped to a symbol: if it is less than the smallest breakpoint it is denoted a; if it is greater than the smallest breakpoint and less than the next larger one it is denoted b; and so on.
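A sketch of this discretization step, assuming the series has already been z-normalized (as SAX does) and using scipy's normal quantile function to compute the breakpoints:

from scipy.stats import norm

def sax_symbols(paa_values, alphabet="abcd"):
    a = len(alphabet)
    # a-1 breakpoints giving equal areas under N(0,1): for a=4 roughly [-0.67, 0.0, 0.67]
    breakpoints = [norm.ppf(i / a) for i in range(1, a)]
    word = ""
    for v in paa_values:
        idx = sum(v > bp for bp in breakpoints)   # how many breakpoints v exceeds
        word += alphabet[idx]
    return word

print(sax_symbols([-1.2, -0.1, 0.3, 1.5]))        # "abcd"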

Distance Measure
The following distance measure is applied when comparing two approximated time series:
MINDIST(Q̂, Ĉ) = sqrt(n/w) · sqrt( Σ_{i=1}^{w} dist(q̂_i, ĉ_i)^2 )
It lower-bounds the distance between the original time series. A lookup table is pre-computed and used to find the distance between every pair of symbols.
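A sketch of how the symbol lookup table and MINDIST could be computed (the alphabet size and the example words are illustrative):

import math
from scipy.stats import norm

def symbol_dist_table(alphabet="abcd"):
    a = len(alphabet)
    bp = [norm.ppf(i / a) for i in range(1, a)]    # Gaussian breakpoints
    table = {}
    for r, cr in enumerate(alphabet):
        for c, cc in enumerate(alphabet):
            # identical or adjacent symbols are at distance 0
            table[(cr, cc)] = 0.0 if abs(r - c) <= 1 else bp[max(r, c) - 1] - bp[min(r, c)]
    return table

def mindist(q_word, c_word, n, table):
    """n is the length of the original series, len(q_word) == len(c_word) == w."""
    w = len(q_word)
    s = sum(table[(qi, ci)] ** 2 for qi, ci in zip(q_word, c_word))
    return math.sqrt(n / w) * math.sqrt(s)

table = symbol_dist_table()
print(mindist("abba", "dcba", n=32, table=table))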

SAX
SAX lets us do things that are difficult or impossible with other representations:
Finding motifs in time series (ICDM 02, SIGKDD 03)
Visualizing massive time series (SIGKDD 04, VLDB 04)
Clustering from streams (ICDM 03, KAIS 04)
Kolmogorov complexity data mining (SIGKDD 04)
Finding discords in time series (ICDM 05)

Motifs: previously unknown, frequently occurring [sequential] patterns within the same sequence.

HOT SAX
SAX has been used to discover discords in time series; the technique is termed HOT SAX.
Keogh, E., Lin, J. and Fu, A., HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence, 5th IEEE International Conference on Data Mining, New Orleans, LA, Nov 27-30, 2005.
Discords are the time series subsequences that are maximally different from the rest of the subsequences. HOT SAX is 3 to 4 times faster than the brute-force technique, which makes it a candidate for data streaming applications.

HOT SAX
The process starts by sliding a window of fixed size over the whole time series to generate subsequences. Each generated subsequence is approximated using SAX. The approximated subsequence is then inserted into an array indexed by its position in the original time series, together with the number of occurrences of its SAX word. The array is then transformed into a trie whose leaf nodes hold the array indices where each word appears. The two data structures (array and trie) complement each other; a sketch of the array-building step follows.
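A sketch of the first steps of this process; to_sax_word is an assumed helper that z-normalizes a window and applies PAA plus discretization as in the previous slides:

from collections import defaultdict

def build_word_index(series, window, to_sax_word):
    occurrences = defaultdict(list)            # SAX word -> positions in the series
    words = []                                 # array indexed by window position
    for start in range(len(series) - window + 1):
        word = to_sax_word(series[start:start + window])
        words.append(word)
        occurrences[word].append(start)
    return words, occurrences                  # rare words are discord candidates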

Discords: time series subsequences that are maximally different from the rest of the time series subsequences.


Resources
UCR Time-Series Data Sets, maintained by Eamonn Keogh, UCR, USA: http://www.cs.ucr.edu/~eamonn/time_series_data
Mining Data Streams Bibliography, maintained by Mohamed Gaber: http://www.csse.monash.edu.au/~mgaber/WResources.html

Master References
J. Gama, Knowledge Discovery from Data Streams, CRC Press, 2010.
J. Gama, M. Gaber (Eds), Learning from Data Streams: Processing Techniques in Sensor Networks, Springer, 2007.
J. Lin, E. Keogh, S. Lonardi, and B. Chiu, A Symbolic Representation of Time Series, with Implications for Streaming Algorithms, ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2003.
E. Keogh, M. Pazzani, Scaling up Dynamic Time Warping for Datamining Applications, ACM SIGKDD, 2000.
A. Camerra, J. Shieh, T. Palpanas, T. Rakthanmanon, E. Keogh, Beyond One Billion Time Series: Indexing and Mining Very Large Time Series Collections with iSAX2+, Knowledge and Information Systems 39(1): 123-151, 2014.
S. Salvador, P. Chan, FastDTW: Toward Accurate Dynamic Time Warping in Linear Time and Space, Intelligent Data Analysis, 2007.
E. Keogh, J. Lin, A. Fu, HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence, IEEE ICDM, 2005.
N. Begum, E. Keogh, Rare Time Series Motif Discovery from Unbounded Streams, PVLDB 8(2): 149-160, 2014.