Extracting Optimal Performance from Dynamic Time Warping


1 Extracting Optimal Performance from Dynamic Time Warping. Abdullah Mueen, Eamonn Keogh. While you are waiting, please download the slides at: ../dtw1.pptx and ../dtw2.pdf / DTW2.pptx

2 Structure. There will be a Q/A break before the coffee break and another Q/A session at the end; interrupt only if you need clarification. This segment has many high-level concepts without minute details; follow the embedded references for more information. Items in many slides are color-mapped; match the colors for better understanding.

3 The Second Act: How to do DTW fast. The first act established that DTW is GOOD. The general conception is that DTW is slow, and we have a never-ending need for speed: better performance in knowledge extraction, better scalability to process big data, and better interactivity in human-driven data analysis.

4 What can be made fast?
One-to-One comparisons
- Exact Implementation and Constraints
- Efficient Approximation
- Exploiting Sparsity
One-to-Many comparisons
- Nearest Neighbor Search: in a database of independent time series, and in subsequences of a long time series
- Density Estimation: in clustering
- Averaging Under Warping: in classification
Many-to-Many comparisons
- All-pair Distance Calculations

5 Speeding up DTW: one-to-one
One-to-One comparison
- Exact Implementation
- Efficient Constraints
- Exploiting Hardware
- Efficient Approximation
- Exploiting Sparsity

6 Simplest Exact Implementation
Input: x and y are time series of length n and m
Output: DTW distance d between x and y
O(nm) time, O(nm) space

D(1:n+1,1:m+1) = inf; D(1,1) = 0;
for i = 2 : n+1                      % for each row
    for j = 2 : m+1                  % for each column
        cost = (x(i-1) - y(j-1))^2;
        D(i,j) = cost + min([D(i-1,j), D(i,j-1), D(i-1,j-1)]);
    end
end
d = sqrt(D(n+1,m+1));
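To make this directly usable, the slide's dynamic program can be packaged as a callable MATLAB function; a minimal sketch (the name dtw_simple is ours, and later sketches in this deck reuse it):

function d = dtw_simple(x, y)
% Full O(nm)-time, O(nm)-space DTW, exactly as in the slide above.
n = numel(x); m = numel(y);
D = inf(n+1, m+1); D(1,1) = 0;
for i = 2 : n+1
    for j = 2 : m+1
        cost = (x(i-1) - y(j-1))^2;
        D(i,j) = cost + min([D(i-1,j), D(i,j-1), D(i-1,j-1)]);
    end
end
d = sqrt(D(n+1, m+1));

% Usage: dtw_simple([1 2 3 2 1], [1 1 2 3 2 1]) returns 0, since the two
% series are identical up to warping.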

7 Simplest Implementation (Constrained)
O(nw) time, O(nm) space, where w is the width of the Sakoe-Chiba band

D(1:n+1,1:m+1) = inf; D(1,1) = 0;
w = max(w, abs(n-m));                % the band must reach the corner cell
for i = 2 : n+1
    for j = max(2, i-w) : min(m+1, i+w)
        cost = (x(i-1) - y(j-1))^2;
        D(i,j) = cost + min([D(i-1,j), D(i,j-1), D(i-1,j-1)]);
    end
end
d = sqrt(D(n+1,m+1));

8 Memoization
O(nm) time, O(m) space: only the previous row (p) and the current row (c) of the matrix are kept.

D(1:2,1:m+1) = inf; D(1,1) = 0;
p = 1; c = 2;
for i = 2 : n+1
    D(c,1) = inf;                    % reset the border cell of the new row
    for j = 2 : m+1
        cost = (x(i-1) - y(j-1))^2;
        D(c,j) = cost + min([D(p,j), D(c,j-1), D(p,j-1)]);
    end
    [p, c] = deal(c, p);             % swap the roles of the two rows
end
d = sqrt(D(p,m+1));                  % after the final swap, row p holds the last row

9 Hardware Acceleration. On a Single Instruction Multiple Data (SIMD) architecture, all cells on an anti-diagonal are computed in parallel, since the values of a diagonal depend only on the previous two diagonals. O(n) time (given O(n) parallel lanes), O(n) space.
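The access pattern such a kernel would use can be sketched in vectorized MATLAB: each anti-diagonal is one data-parallel update over the two preceding diagonals. This is a sketch of the wavefront order, not actual SIMD intrinsics, and the name dtw_wavefront is ours:

function d = dtw_wavefront(x, y)
% Anti-diagonal ("wavefront") evaluation of the DTW matrix: the whole
% diagonal s = i + j is updated in one vectorized operation, since its
% cells depend only on diagonals s-1 and s-2.
n = numel(x); m = numel(y);
D = inf(n+1, m+1); D(1,1) = 0;
for s = 4 : n + m + 2                     % s = i + j indexes the diagonal
    i = max(2, s-(m+1)) : min(n+1, s-2);
    j = s - i;
    cost = (x(i-1) - y(j-1)).^2;          % whole diagonal at once
    here = sub2ind(size(D), i,   j);
    up   = sub2ind(size(D), i-1, j);
    lft  = sub2ind(size(D), i,   j-1);
    dgl  = sub2ind(size(D), i-1, j-1);
    D(here) = cost + min(min(D(up), D(lft)), D(dgl));
end
d = sqrt(D(n+1, m+1));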

10 PAA-based approximation. O((n/w)^2) time, O((n/w)^2) space, where w is the number of points averaged into each Piecewise Aggregate Approximation (PAA) segment. Selina Chu, Eamonn J. Keogh, David M. Hart, Michael J. Pazzani: Iterative Deepening Dynamic Time Warping for Time Series. SDM 2002.
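A minimal MATLAB sketch of the idea, reusing the dtw_simple function above (our name); the sqrt(w) rescaling follows the usual PAA convention, where each reduced cell stands for w original points:

function d = dtw_paa_approx(x, y, w)
% Approximate DTW on Piecewise Aggregate Approximations: each series is
% shrunk by a factor w (mean of consecutive length-w blocks), so the
% dynamic program costs O((n/w)^2) instead of O(n^2).
xp = mean(reshape(x(1:floor(numel(x)/w)*w), w, []), 1);   % PAA of x
yp = mean(reshape(y(1:floor(numel(y)/w)*w), w, []), 1);   % PAA of y
d  = sqrt(w) * dtw_simple(xp, yp);   % rescale to the original length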

11 Approximation by Length Encoding. To exploit sparsity, encode the lengths of the runs of zeros: for example, 1 (4) 1 denotes a one, a run of four zeros, then another one. A. Mueen, N. Chavoshi, N. Abu-El-Rub, H. Hamooni, A. Minnich: Fast Warping Distance for Sparse Time Series. Technical Report, UNM.
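One possible encoder, sketched in MATLAB: nonzero values are kept and each run of zeros is replaced by its length. Marking the run length with a negative sign is our own illustrative convention, standing in for the parenthesized counts on the slide:

function enc = encode_runs(x)
% Run-length encode a sparse series: keep nonzeros, replace each run of
% zeros by its (negated) length, e.g. encode_runs([1 0 0 0 0 1]) = [1 -4 1].
enc = [];
i = 1; n = numel(x);
while i <= n
    if x(i) ~= 0
        enc(end+1) = x(i); i = i + 1;
    else
        j = i;
        while j <= n && x(j) == 0, j = j + 1; end
        enc(end+1) = -(j - i);        % negative value marks a zero-run length
        i = j;
    end
end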

12 Exploiting Sparsity (1) [figure: two sparse time series x and y]

13 Exploiting Sparsity (2). Running the dynamic program on the original series x and y costs O(nm) time and space; running it on the run-length encoded series X (e.g. 1 (3) 1 (3) 1 1 (3) 1) and Y (e.g. 1 (4) 1 (4)) costs only O(NM) time and space, where N and M are the encoded lengths.

14 Exploiting Sparsity (2). Aligning encoded runs gives three cases: correct alignment, missing alignment, and extra alignment. A non-linear change in the cost yields the exact distance; no change yields a lower bound; a linear change yields an upper bound.

15 Exploiting Sparsity (3) [figure: four panels (normal, exponential, uniform, and binomial distributions) showing, versus sparsity factor, the percentage of cases where AWarp_UB <= 1.05*DTW, AWarp_UB = DTW, and AWarp_LB = DTW] A sparsity factor of s means 1/s of the time series is filled with nonzeros.

16 What can be made fast?
One-to-One comparisons
- Exact Implementation and Constraints
- Efficient Approximation
- Exploiting Sparsity
One-to-Many comparisons
- Nearest Neighbor Search: in a database of independent time series, and in subsequences of a long time series
- Density Estimation: in clustering
- Averaging Under Warping: in classification
Many-to-Many comparisons
- All-pair Distance Calculations

17 Nearest Neighbor Search. Given a query Q and n independent candidate time series C_1, C_2, ..., C_n, O(n) distance calculations are performed to find THE nearest neighbor of the query under DTW.

18 Brute Force Nearest Neighbor Search
Computational cost: O(nm^2) for n candidates of length m

Algorithm Sequential_Scan(Q)
  best_so_far = infinity;
  for all sequences C_i in the database
      true_dist = DTW(C_i, Q);
      if true_dist < best_so_far
          best_so_far = true_dist;
          index_of_best_match = i;
      endif
  endfor

19 Lower Bounding Nearest Neighbor Search
We can speed up similarity search under DTW by using a lower bounding function.

Algorithm Lower_Bounding_Sequential_Scan(Q)
  best_so_far = infinity;
  for all sequences C_i in the database
      LB_dist = lower_bound_distance(C_i, Q);
      if LB_dist < best_so_far
          true_dist = DTW(C_i, Q);
          if true_dist < best_so_far
              best_so_far = true_dist;
              index_of_best_match = i;
          endif
      endif
  endfor

Try to use a cheap lower bounding calculation as often as possible. Only do the expensive, full calculation when it is absolutely necessary.

20 Lower Bound of Kim (LB_Kim). The differences between the two sequences' first (A), last (D), minimum (B), and maximum (C) points are used as the lower bound. O(1) time if only the first and last points are considered; O(n) time for all four distances. Kim, S., Park, S., & Chu, W.: An index-based approach for similarity search supporting time warping in large sequence databases. ICDE 2001.
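A minimal MATLAB sketch, assuming the common formulation in which the largest of the four feature distances is returned (the name lb_kim is ours):

function lb = lb_kim(q, c)
% LB_Kim sketch: distances between the first, last, minimum, and maximum
% points of the two series; the largest of the four lower-bounds DTW.
feats = abs([q(1)-c(1), q(end)-c(end), min(q)-min(c), max(q)-max(c)]);
lb = max(feats);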

21 Lower Bound of Yi (LB_Yi). O(n) time. The sum of the squared lengths of the white lines represents the minimum contribution of the observations above max(Q) and below min(Q) (the yellow lines), whatever the warping. Yi, B., Jagadish, H., & Faloutsos, C.: Efficient retrieval of similar time sequences under time warping. ICDE 1998.
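A sketch in MATLAB, using the squared-cost formulation so the bound is comparable to the DTW of slide 6 (the name lb_yi is ours):

function lb = lb_yi(q, c)
% LB_Yi sketch: points of c above max(q) contribute at least their excess
% over max(q), points below min(q) at least their shortfall, regardless of
% how the warping path aligns them.
hi = max(q); lo = min(q);
lb = sqrt(sum((c(c > hi) - hi).^2) + sum((c(c < lo) - lo).^2));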

22 Lower Bound of Keogh (LB_Keogh). Build an envelope around the query Q using the Sakoe-Chiba band: U_i = max(q_{i-w} : q_{i+w}) and L_i = min(q_{i-w} : q_{i+w}). The envelope-based lower bound sums the squared excursions of the candidate C outside the envelope:
LB_Keogh(Q, C) = sqrt( sum over i = 1..n of: (c_i - U_i)^2 if c_i > U_i; (c_i - L_i)^2 if c_i < L_i; 0 otherwise )
O(n) time.
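A minimal MATLAB sketch (the name lb_keogh is ours; movmax/movmin require R2016a or later, and their truncated windows at the boundaries match the truncated envelope there):

function lb = lb_keogh(q, c, w)
% LB_Keogh for a Sakoe-Chiba band of half-width w: build q's envelope,
% then sum the squared excursions of c outside it.
U = movmax(q, 2*w + 1);              % U_i = max(q_{i-w} .. q_{i+w})
L = movmin(q, 2*w + 1);
above = max(c - U, 0);               % nonzero only where c_i > U_i
below = max(L - c, 0);               % nonzero only where c_i < L_i
lb = sqrt(sum(above.^2 + below.^2));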

23 Reversing the Query/Data Role in LB_Keogh. Build the envelope on the candidate as well as on the query and take max(LB_Keogh EQ, LB_Keogh EC). This makes LB_Keogh tighter while remaining much cheaper than DTW. The U/L envelopes on the candidates can be calculated online or pre-calculated.
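With the lb_keogh sketch above, the reversed-role bound is just the larger of the two applications; a one-line fragment, assuming q, c, and w are in scope:

% Tighter bound: envelope on the query AND on the candidate, keep the max.
lb = max(lb_keogh(q, c, w), lb_keogh(c, q, w));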

24 The tightness of the lower bound for each technique is proportional to the length of the lines used in the illustrations: LB_Kim, LB_Yi, LB_Keogh with a Sakoe-Chiba band, and LB_Keogh with an Itakura parallelogram.

25 Cascading Lower Bounds. At least 18 lower bounds of DTW have been proposed. Plotting tightness of the lower bound against computational cost, use only the bounds on the skyline, and apply them in a cascade from least expensive to most expensive: LB_KimFL in O(1), then LB_Keogh EQ in O(n), then max(LB_Keogh EQ, LB_Keogh EC) in O(n), then Early_abandoning_DTW in O(nR), and only then the full DTW. When unable to prune, use early abandoning techniques. In practice, 99.9% of the time the full DTW is not calculated. [figure: tightness-versus-cost scatter of the bounds, including LB_Yi, LB_Kim, LB_FTW, LB_Ecorner, and LB_PAA]
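A minimal MATLAB sketch of such a cascade, built from the bound sketches above (all names are ours); each bound runs only if the previous, cheaper one failed to prune:

function pruned = cascade_prune(q, c, w, best_so_far)
% Cascade of lower bounds, cheapest first. Returns true if the candidate
% can be discarded; the full DTW is computed only when this returns false.
pruned = true;
if lb_kim(q, c) >= best_so_far, return; end
lb1 = lb_keogh(q, c, w);
if lb1 >= best_so_far, return; end
if max(lb1, lb_keogh(c, q, w)) >= best_so_far, return; end
pruned = false;   % no bound pruned: fall through to the full DTW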

26 Early Abandoning Techniques. Abandon accumulating the error as soon as the current total is larger than best_so_far. Four techniques to abandon early: 1. Early Abandoning of LB_Keogh; 2. Early Abandoning of DTW; 3. Earlier Early Abandoning of DTW using LB_Keogh; 4. Reordering Early Abandoning.

27 Early Abandoning of LB_Keogh. Abandon the computation when the accumulated error is already larger than best_so_far. U and L are the upper and lower envelopes of Q.
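A minimal MATLAB sketch (the name lb_keogh_ea is ours); the running total is compared against best_so_far squared because the final bound takes a square root:

function lb = lb_keogh_ea(q, c, w, best_so_far)
% Early-abandoning LB_Keogh: accumulate the squared excursions of c
% outside q's envelope, and stop as soon as they exceed best_so_far.
U = movmax(q, 2*w + 1); L = movmin(q, 2*w + 1);
s = 0;
for i = 1 : numel(c)
    if     c(i) > U(i), s = s + (c(i) - U(i))^2;
    elseif c(i) < L(i), s = s + (c(i) - L(i))^2;
    end
    if s > best_so_far^2, lb = inf; return; end   % abandon: cannot be the NN
end
lb = sqrt(s);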

28 Early Abandoning of DTW. Abandon the computation when the partial dtw_dist (the smallest cumulative cost in the band computed so far) is already larger than best_so_far. R is the warping window width.
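A sketch of the banded DTW of slide 7 with this check added after each row (the name dtw_early_abandon is ours):

function d = dtw_early_abandon(x, y, w, best_so_far)
% Banded DTW that abandons when even the smallest entry in the current
% row of the band already exceeds best_so_far; returns inf in that case.
n = numel(x); m = numel(y);
w = max(w, abs(n-m));
D = inf(n+1, m+1); D(1,1) = 0;
for i = 2 : n+1
    lo = max(2, i-w); hi = min(m+1, i+w);
    for j = lo : hi
        cost = (x(i-1) - y(j-1))^2;
        D(i,j) = cost + min([D(i-1,j), D(i,j-1), D(i-1,j-1)]);
    end
    if min(D(i, lo:hi)) > best_so_far^2   % squared costs vs squared bound
        d = inf; return;                  % this candidate can never win
    end
end
d = sqrt(D(n+1, m+1));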

29 Earlier Early Abandoning of DTW using LB_Keogh. Abandon the computation when the (partial) dtw_dist of the prefix plus the (partial) lb_keogh of the remaining suffix is larger than best_so_far. U and L are the upper and lower envelopes of Q; R is the warping window width.

30 Reordering Early Abandoning. We don't have to compute the LB from left to right. Order the points by their expected contribution. Idea: order by the absolute height of the query point.
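A sketch combining this with the early-abandoning bound above (the name lb_keogh_reorder is ours): identical arithmetic, but points are visited in decreasing |q_i|, which on z-normalized data tends to front-load the largest contributions so abandonment happens sooner.

function lb = lb_keogh_reorder(q, c, w, best_so_far)
% LB_Keogh with reordered, early-abandoning accumulation.
U = movmax(q, 2*w + 1); L = movmin(q, 2*w + 1);
[~, order] = sort(abs(q), 'descend');   % likeliest big contributors first
s = 0;
for k = 1 : numel(order)
    i = order(k);
    if     c(i) > U(i), s = s + (c(i) - U(i))^2;
    elseif c(i) < L(i), s = s + (c(i) - L(i))^2;
    end
    if s > best_so_far^2, lb = inf; return; end   % abandon early
end
lb = sqrt(s);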

31 Summary of the techniques.
Group-1 Techniques: Early Abandoning of LB_Keogh; Early Abandoning of DTW; Earlier Early Abandoning of DTW using LB_Keogh.
Group-2 Techniques: Just-in-time Z-normalization; Reordering Early Abandoning; Reversing LB_Keogh; Cascading Lower Bounds.
The UCR Suite code and data are available online.
Thanawin Rakthanmanon, Bilson J. L. Campana, Abdullah Mueen, Gustavo E. A. P. A. Batista, M. Brandon Westover, Qiang Zhu, Jesin Zakaria, Eamonn J. Keogh: Searching and mining trillions of time series subsequences under dynamic time warping. KDD 2012.

32 Experimental Result: Random Walk. Varying the size of the data. [table: wall-clock time for nearest neighbor search in random walks of one million (seconds), one billion (minutes), and one trillion (hours) data points, comparing DTW-Naive, the Group-1 techniques, and Group-1 and 2 combined]

33 Experimental Result: Random Walk. Varying the size of the query. For query lengths of 4,096 (the rightmost part of the graph), the times in seconds are: Naive DTW 24,286; Group-1 5,078; Group-1 and 2: see figure. [figure: seconds versus query length (powers of two) for Naive DTW, the Group-1 techniques, and Group-1 and 2]

34 Experimental Result: Random Walk. Varying the size of the band. [figure: seconds versus Sakoe-Chiba band width (as a percentage of the query length) for DTW-Naive, Group-1, and Group-1 and 2]

35 Nearest Subsequence Search. A query Q and a long time series of length n are given. O(n) distance calculations are performed to find THE nearest subsequence to the given query under DTW.

36 Time Warping Subsequence Search. Reuses computation for subsequence matching: for every new observation, only one column is added on the right of the matrix. No need for any of the techniques. [figure: stream S, query Q, and matches 1-4] Yasushi Sakurai, Christos Faloutsos, Masashi Yamamuro: Stream Monitoring under the Time Warping Distance. ICDE 2007.

37 Normalization is required. If each window is normalized separately, reuse of computation is no longer possible. To take advantage of the bounding and abandoning techniques, we need just-in-time normalization with constant overhead per comparison.

38 Just-in-time Normalization. In one pass, calculate and store the cumulative sums over x and x^2: C = Σ x and C2 = Σ x^2. Subtract two cumulative sums to obtain the sums over a window: S_i = C_{i+w} - C_i and S2_i = C2_{i+w} - C2_i. Use the sums to calculate the means and standard deviations of all windows in linear time: μ_i = S_i / w and σ_i^2 = S2_i / w - μ_i^2. Dynamically normalize the observations when calculating the distance, and possibly abandon early: each cost term compares the on-the-fly z-normalized window sample (x_j - μ_i) / σ_i with the z-normalized query sample (q_j - μ_q) / σ_q.
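A minimal MATLAB sketch of the cumulative-sum trick, for x given as a row vector (the name window_stats is ours):

function [mu, sigma] = window_stats(x, w)
% Means and standard deviations of all length-w windows in O(n) time,
% from cumulative sums of x and x.^2 (just-in-time normalization).
C  = [0, cumsum(x)];
C2 = [0, cumsum(x.^2)];
S  = C(w+1:end)  - C(1:end-w);     % window sums
S2 = C2(w+1:end) - C2(1:end-w);    % window sums of squares
mu = S / w;
sigma = sqrt(S2 / w - mu.^2);

% Usage: the j-th window x(j:j+w-1) z-normalizes to
% (x(j:j+w-1) - mu(j)) / sigma(j), with no per-window pass over the data.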

39 Experimental Result: ECG. Data: one year of electrocardiograms, 8.5 billion data points. Query: an idealized premature ventricular contraction (PVC, a.k.a. skipped beat) of length 421 (R = 21 = 5%). Time to search the ECG data: Group-1, 49.2 hours; Group 1 & 2, 18.0 minutes, roughly 30,000X faster than real time!

40 Experimental Result: DNA. Query: a segment of Human Chromosome 2 of length 72,500 bps. Data: the Chimp genome, 2.9 billion bps. Time: UCR Suite 14.6 hours; SOTA 34.6 days (830 hours). [figure: primate phylogeny (rhesus macaque, gibbon, orangutan, gorilla, chimp, human; Catarrhines, Hominoidea, Hominidae, Homininae, Hominini) alongside the matched region of Chromosome 2]

41 What can be made fast?
One-to-One comparisons
- Exact Implementation and Constraints
- Efficient Approximation
- Exploiting Sparsity
One-to-Many comparisons
- Nearest Neighbor Search: in a database of independent time series, and in subsequences of a long time series
- Density Estimation: in clustering
- Averaging Under Warping: in classification
Many-to-Many comparisons
- All-pair Distance Calculations

42 Density-based clustering. The Density Peaks (DP)* algorithm: find the density of every point to pick the cluster centers, then connect every point to the nearest higher-density point. *Rodriguez, A., & Laio, A. (2014). Clustering by fast search and find of density peaks. Science, 344(6191).

43 Range Search / Density Estimation
Density is estimated as the number of points within a radius/threshold t.

Algorithm Bounding_Range_Search(Q, t)
  for all sequences C_i in the database
      LB_dist = lower_bound_distance(C_i, Q)
      if LB_dist < t
          UB_dist = upper_bound_distance(C_i, Q)
          if UB_dist < t
              output C_i
          else
              true_dist = DTW(C_i, Q)
              if true_dist < t
                  output C_i
              endif
          endif
      endif
  endfor

Try to use an upper bound to identify a point within the range without computing DTW.
Nurjahan Begum, Liudmila Ulanova, Jun Wang, Eamonn J. Keogh: Accelerating Dynamic Time Warping Clustering with a Novel Admissible Pruning Strategy. KDD 2015: 49-58.

44 Density Connectedness
The distance between a pair of points is an upper bound of the nearest neighbor (NN) distance from both of the points.

Algorithm Bounding_Scan(D, Q)
  best_so_far = min(upper_bound_NN_distance(D, Q))
  for all sequences C_i in D
      LB_dist = lower_bound_distance(C_i, Q);
      if LB_dist < best_so_far
          true_dist = DTW(C_i, Q);
          if true_dist < best_so_far
              best_so_far = true_dist;
              index_of_best_match = i;
          endif
      endif
  endfor

Try to use an upper bound on the NN distance as the initial best_so_far.
Nurjahan Begum, Liudmila Ulanova, Jun Wang, Eamonn J. Keogh: Accelerating Dynamic Time Warping Clustering with a Novel Admissible Pruning Strategy. KDD 2015: 49-58.

45 DTW Distance Upper Bounding. The Euclidean distance is a trivial upper bound of DTW. More generally, the DTW distance within a band of width w' is an upper bound of the DTW distance within any wider band w >= w'.
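Both observations translate directly into code; a sketch assuming a function dtw_constrained that wraps the banded DTW of slide 7 (both names are ours):

function ub = dtw_upper_bound(x, y)
% Two admissible upper bounds on banded DTW: the Euclidean distance (the
% band of width 0) and DTW under a narrower band, here width 2.
ub_euclid = sqrt(sum((x - y).^2));     % requires equal lengths
ub_narrow = dtw_constrained(x, y, 2);  % any band narrower than w works
ub = min(ub_euclid, ub_narrow);        % either, or their min, is admissible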

46 Speedup by upper bounds. Density Peak: 9 hours; TADPole: 9 minutes. [figure: number of distance calculations, as absolute counts and as percentages, versus number of objects (up to 3500), for brute force and TADPole on the StarLightCurves dataset]

47 What can be made fast?
One-to-One comparisons
- Exact Implementation and Constraints
- Efficient Approximation
- Exploiting Sparsity
One-to-Many comparisons
- Nearest Neighbor Search: in a database of independent time series, and in subsequences of a long time series
- Density Estimation: in clustering
- Averaging Under Warping: in classification
Many-to-Many comparisons
- All-pair Distance Calculations

48 Data Reduction for 1NN Classification. The training set is reduced to a smaller set that keeps a representative subset of the labeled instances. A smaller training set entails a performance gain, and may even gain accuracy if noisy instances are filtered effectively. Reduction methods: random selection; rank the instances and take the top-k; cluster the instances based on proximity and take a representative from each cluster.

49 Many clustering algorithms require finding a centroid of two or more instances. The issue is then: how do we average time series consistently with DTW? [figure: computing the average on the Trace dataset] François Petitjean, Germain Forestier, Geoffrey I. Webb, Ann E. Nicholson, Yanping Chen, Eamonn J. Keogh: Dynamic Time Warping Averaging of Time Series Allows Faster and More Accurate Classification. ICDM 2014.

50 Mathematically, the mean ō of a set of objects O embedded in a space induced by a distance d is: ō = argmin_{ō'} Σ_{o ∈ O} d^2(ō', o). The mean of a set minimizes the sum of the squared distances. If d is the Euclidean distance, the arithmetic mean solves the problem exactly: ō = (1/N) Σ_{o ∈ O} o.

51 To solve this optimization problem for the DTW distance, we need to perform a simultaneous alignment of many time series. But finding the optimal multiple alignment: 1. is NP-complete [a]; 2. requires O(L^N) operations, where L is the length of the sequences (~100) and N is the number of sequences (~1,000). Efficient solutions will therefore be heuristic: Pairwise Averaging, and DTW Barycenter Averaging (DBA). [a] F. Petitjean, A. Ketterlin and P. Gançarski: A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition, 44(3), 2011.

52 Pairwise averaging for DTW. Average each alignment between the two time series; this commonly increases the length. Chaining can produce an average over a set: Average(X_1, X_2) gives X_1,2, Average(X_3, X_4) gives X_3,4, and Average(X_1,2, X_3,4) gives X_1-4. The operation is not associative, so the average produced depends on the order. V. Niennattrakul and C. A. Ratanamahatana: On Clustering Multimedia Time Series Data Using K-Means and Dynamic Time Warping. IEEE International Conference on Multimedia and Ubiquitous Engineering, 2007.

53 DTW Barycenter Averaging (DBA)

Algorithm DBA(D, av)
  iterate until convergence
      for each series s_i in D
          A_i = GetAlignment(DTW(s_i, av))
      for each sample j in av
          av[j] = mean([A_1[j], A_2[j], A_3[j], ..., A_n[j]])

[figure: series s_1 and s_2 aligned to the evolving average av]
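One refinement pass can be sketched in MATLAB as follows; dtw_path is a hypothetical helper (not defined here) returning the optimal warping path of DTW(s, av) as index vectors I into s and J into av, and the name dba_iteration is ours:

function av = dba_iteration(S, av)
% One DBA pass over a cell array S of series: each coordinate of av is
% replaced by the mean of all samples aligned to it, as in the slide above.
bucket = cell(numel(av), 1);          % samples aligned to each av(j)
for k = 1 : numel(S)
    s = S{k};
    [I, J] = dtw_path(s, av);         % hypothetical helper
    for p = 1 : numel(I)
        bucket{J(p)}(end+1) = s(I(p));
    end
end
for j = 1 : numel(av)
    av(j) = mean(bucket{j});          % every av(j) is touched by every path
end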

54 Experimental Evaluation on Insect Data. Four rank-based competitors: Drop1, Drop2, Drop3, and Simple Rank (SR). Two average-based techniques: k-means and AHC, both using DBA. The minimum error-rate is 0.092, with 19 pairs of objects; the full-dataset error-rate is 0.14, with 100 pairs of objects. [figure: error-rate versus items per class in the reduced training set] Code available.

55 What can be made fast?
One-to-One comparisons
- Exact Implementation and Constraints
- Efficient Approximation
- Exploiting Sparsity
One-to-Many comparisons
- Nearest Neighbor Search: in a database of independent time series, and in subsequences of a long time series
- Density Estimation: in clustering
- Averaging Under Warping: in classification
Many-to-Many comparisons
- All-pair Distance Calculations

56 Speeding up DTW: many-to-many
Several variants:
- Self-join (within a threshold, top-k): use similarity search techniques as a subroutine. Applications: motif discovery [a], discord discovery.
- A/B join (within a threshold, top-k): use similarity search techniques as a subroutine. Application: motion stitching [b].
- All-pair distance matrix: use the techniques that speed up one-to-one comparisons. Application: hierarchical clustering.
[a] N. Chavoshi, H. Hamooni, A. Mueen: DeBot: Real-Time Bot Detection via Activity Correlation. UNM Technical Report.
[b] Y. Chen, G. Chen, K. Chen and B. C. Ooi: Efficient Processing of Warping Time Series Join of Motion Capture Data. ICDE 2009.

57 PrunedDTW: speeding up all-pair distance matrix calculation. Two types of pruning when calculating the DTW matrix: lower-triangle pruning and upper-triangle pruning. Cells whose cumulative cost already exceeds an upper bound UB cannot lie on the optimal warping path and are skipped; UB can be the Euclidean distance or a previously computed DTW distance, so the method remains exact. [figure: cells marked >UB pruned in the lower and upper triangles of the DTW matrix] Diego F. Silva, Gustavo E. A. P. A. Batista: Speeding Up All-Pairwise Dynamic Time Warping Matrix Calculation. SDM 2016.

58 Experiments. [figure: runtime in seconds versus warping window length for DTW, PrunedDTW, and Oracle DTW on Car, CinC ECG torso, Haptics, InlineSkate, Lighting-2, MALLAT, Non-Invasive Fetal ECG (1 and 2), Olive Oil, and Starlight Curves]

59 Conclusion. Nearest neighbor search under warping is fast enough for most practical purposes. New invariances (e.g., normalization) lead to challenging problems. Data reduction improves 1NN classification in both speed and accuracy. DTW is an extraordinarily powerful and useful tool; its uses are limited only by our imagination.
