Mining State Dependencies Between Multiple Sensor Data Sources C. Robardet Co-Authored with Marc Plantevit and Vasile-Marian Scuturici April 2013 1 / 27
Mining Sensor data A timely challenge? Why is there such keen interest? Sensor technology monitors many events in real time. It produces multiple heterogeneous data streams that generate very large volumes of information. Data stream challenges: handling infinite sequences of events that occur at a steady pace; providing actionable insights to end-users; the mining step has to be faster than the data acquisition process. New challenges! Identifying temporal dependencies between data sources. Events last for a period of time. What is the time lag between dependent events? 2 / 27
Our Proposal Identify temporal dependencies between data streams: data stream events change the internal state of the sensor; each state has a duration, represented as a set of disjoint time intervals; temporal relations between two interval sets infer dependencies between the corresponding sensors. Main idea: discover temporal dependencies and their associated time-delay intervals that are robust to the temporal variability of events, that characterize the time intervals during which the events are related, and that require no user-determined threshold. 3 / 27
Examples Organizing de-icing operations on the basis of freezing alert predictions. Sensor anomaly detection. 4 / 27
Overview of our Approach TEDDY: TEmporal Dependency DiscoverY. Inputs: state streams A1, A2, ..., Ak; B1, B2, ..., Bl; C1, C2, ..., Cn, coming from applications such as failure detection, Structural Health Monitoring, smart homes (camera indexing, object tracking) and smart environments. Output: discovery of valid and significant temporal state dependencies. 5 / 27
Step 1: Converting a data stream into a state stream Batch of events: to process the infinite set of events, we use batches of events defined by a time interval [t_begin, t_end). Example event stream: timestamp 1 open, 2 close, 4 open, 5 close, 8 open, 9 close. Active time period: the main property of a state is the period of time when it is active. The importance of a state A is evaluated by the sum of the lengths of its active time intervals: len(A). 6 / 27
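The conversion step above can be sketched in Python; this is an illustrative sketch, not the authors' implementation (function names and the open/close event encoding are assumptions), using the timestamps of the slide's example:

```python
def events_to_intervals(events):
    """Pair each 'open' event with the next 'close' event into a [start, end) interval."""
    intervals, start = [], None
    for t, kind in events:
        if kind == "open":
            start = t
        elif kind == "close" and start is not None:
            intervals.append((start, t))
            start = None
    return intervals

def length(intervals):
    """len(A): sum of the lengths of the state's active time intervals."""
    return sum(end - start for start, end in intervals)

events = [(1, "open"), (2, "close"), (4, "open"), (5, "close"), (8, "open"), (9, "close")]
A = events_to_intervals(events)
print(A)          # [(1, 2), (4, 5), (8, 9)]
print(length(A))  # 3
```

In a streaming setting this would run once per batch, on the events whose timestamps fall in [t_begin, t_end).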
Step 2: Measuring state dependency Active time period intersection: the dependency of two states A and B is evaluated on the basis of the intersection of their active time intervals: len(A ∩ B). 7 / 27
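The intersection measure len(A ∩ B) can be computed with a two-pointer sweep over the sorted, disjoint intervals of each state; a minimal sketch (names and example intervals are illustrative):

```python
def intersection_length(A, B):
    """len(A ∩ B): total overlap between two sorted lists of disjoint [start, end) intervals."""
    total, i, j = 0, 0, 0
    while i < len(A) and j < len(B):
        lo = max(A[i][0], B[j][0])
        hi = min(A[i][1], B[j][1])
        if lo < hi:
            total += hi - lo
        if A[i][1] <= B[j][1]:  # advance the interval that ends first
            i += 1
        else:
            j += 1
    return total

A = [(1, 2), (4, 5), (8, 9)]
B = [(1, 3), (4, 6)]
print(intersection_length(A, B))  # 2
```

The sweep runs in O(#A + #B), linear in the number of intervals.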
Step 2: Measuring time shift state dependency B can be transformed to maximize its intersection with A: B can be shifted into the past, or B can be slightly extended. Extending the intervals makes the temporal dependency measure more robust to the inherent variability of the data. Shifting of t time units (B^[t,t]); slight extension of t time units (B^[0,t]). 8 / 27
Step 2: Measuring the confidence of a time shift state dependency Time shift intervals: we explore all possible time shift intervals [α, β] included in [t_min, t_max]: B^[α,β] = { [b_j + α, b'_j + β) : j = 1 … #B }, where [b_j, b'_j) are the intervals of B, with α ≤ β, so that len(B) ≤ len(B^[α,β]). Confidence of the state dependency A →[α,β] B: the proportion of time where the two states are active over the active time period of A: conf(A →[α,β] B) = len(A ∩ B^[α,β]) / len(A). 9 / 27
Step 3: Statistical assessment of the confidence measure Pearson's chi-squared test of independence: are the occurrences of A and B^[α,β] statistically independent? (O)_ij is the 2×2 contingency table that crosses the observed outcomes of A and B^[α,β] (the active durations of A / ¬A versus B^[α,β] / ¬B^[α,β]); (E)_ij gives the expected outcomes under the null hypothesis, where A and B^[α,β] occur uniformly over T = [t_begin, t_end). X² = Σ_{i=1}^{2} Σ_{j=1}^{2} (O_ij − E_ij)² / E_ij ≥ χ²_{0.05} ⟺ conf(A →[α,β] B) ≥ minconf(len(B^[α,β])). 10 / 27
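Assuming the contingency cells are the active durations over a batch of length T, the test statistic can be sketched as follows (names and the example numbers are illustrative; 3.841 is the χ² critical value for 1 degree of freedom at p = 0.05):

```python
def chi2_stat(len_A, len_B, len_AB, T):
    """Pearson's X² for the 2x2 table crossing A / not-A with B' / not-B',
    where the observed cells O_ij are durations over the batch [t_begin, t_end)."""
    O = [[len_AB, len_A - len_AB],
         [len_B - len_AB, T - len_A - len_B + len_AB]]
    X2 = 0.0
    for i in range(2):
        for j in range(2):
            row = O[i][0] + O[i][1]  # marginal duration of row i
            col = O[0][j] + O[1][j]  # marginal duration of column j
            E = row * col / T        # expected duration under independence
            X2 += (O[i][j] - E) ** 2 / E
    return X2

CHI2_005 = 3.841  # χ²_{0.05}, 1 degree of freedom
# A and B^[α,β] perfectly overlap on 4 out of 20 time units: dependence is significant.
print(chi2_stat(4, 4, 4, 20) > CHI2_005)  # True
```
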
Step 4: Significant time shift intervals selection Limiting the redundancy between time shift intervals of a state dependency: a huge number of time shift intervals may result in valid temporal dependencies, and many of them are redundant, depicting the same phenomenon several times. Property: if [α1, β1] ⊆ [α2, β2], then conf(A →[α1,β1] B) ≤ conf(A →[α2,β2] B). 11 / 27
Step 4: Significant time shift intervals selection The most interesting time shift intervals (1) have a high confidence value and (2) are as specific as possible with respect to the inclusion relation. Dominance relationship on the time shift intervals: if [α1, β1] ⊆ [α2, β2] and conf(A →[α1,β1] B) / conf(A →[α2,β2] B) ≥ len(B^[α1,β1]) / len(B^[α2,β2]), then A →[α1,β1] B ≻ A →[α2,β2] B. 12 / 27
Step 4: Significant time shift intervals selection When A →[α1,β1] B dominates A →[α2,β2] B with [α1, β1] ⊆ [α2, β2], len(B^([α2,β2]\[α1,β1]) ∩ A) is almost 0: the additional time shifts contribute almost no intersection with A. 12 / 27
Step 4: Significant time shift intervals selection Significant temporal dependency: the most specific temporal dependency that dominates all its supersets, each of its supersets dominating its own supersets as well. Lattice of time shift intervals over [-4, 0], from the most general to the most specific: [-4,0] / [-4,-1] [-3,0] / [-4,-2] [-3,-1] [-2,0] / [-4,-3] [-3,-2] [-2,-1] [-1,0] / [-4,-4] [-3,-3] [-2,-2] [-1,-1] [0,0]. 13 / 27
TEDDY: TEmporal Dependency DiscoverY Enumeration principle of the time shift intervals A →[α,β] B: level-wise enumeration to compute the confidence value of each interval at most once. Uses the monotonic property of the confidence measure and a lower bound on minconf(len(B^[α,β])) to prune the search space early. Exploits an upper bound on the confidence measure, computable in O(1), to avoid unnecessary computations of the confidence. Exploits the transitivity of the dominance relationship in the identification of significant temporal dependencies. 14 / 27
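The level-wise enumeration can be sketched as follows, assuming integer time shifts and omitting the pruning steps for brevity: each level fixes the interval width β − α, going from the most general interval [t_min, t_max] down to the singletons.

```python
def enumerate_intervals(t_min, t_max):
    """Yield all time shift intervals [α, β] ⊆ [t_min, t_max], level by level,
    from the widest interval down to the singletons [x, x]."""
    for width in range(t_max - t_min, -1, -1):  # level = β − α
        for alpha in range(t_min, t_max - width + 1):
            yield (alpha, alpha + width)

print(list(enumerate_intervals(-2, 0)))
# [(-2, 0), (-2, -1), (-1, 0), (-2, -2), (-1, -1), (0, 0)]
```

Visiting the general intervals first is what allows the monotonicity of the confidence to prune whole sub-lattices: if an interval fails, all intervals included in it fail too.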
TEDDY: Candidate time shift intervals generation Level-wise enumeration over the lattice of time shift intervals, from [-4,0] down to the singletons (figure: the lattice with pruned candidates removed). 15 / 27
TEDDY: Pruning based on the confidence measure minconf(x) = (λx + sqrt(χ²_{0.05} · λ(T−λ)x(T−x) / T)) / (λT), with x = len(B^[α,β]) and λ = len(A). (Figure: minconf(x) and its lower bound for 0 ≤ x ≤ T.) Lower bound on minconf of A →[α,β] B: minconf(len(B^[α,β])) ≥ min(1, minconf(len(B^[0,0]))). 16 / 27
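A sketch of this minimum-confidence threshold (x = len(B^[α,β]), λ = len(A), T the batch length; 3.841 is χ²_{0.05} with 1 degree of freedom; the example values are illustrative):

```python
import math

def minconf(x, lam, T, chi2=3.841):
    """minconf(x) = (λx + sqrt(χ²_{0.05} · λ(T−λ)x(T−x) / T)) / (λT):
    the smallest confidence for which the χ² test rejects independence."""
    return (lam * x + math.sqrt(chi2 * lam * (T - lam) * x * (T - x) / T)) / (lam * T)

# With λ = len(A) = 4 over a batch of length T = 20, a dependency on a
# transformed state of length x = 4 needs confidence above roughly 0.55.
print(round(minconf(4, 4, 20), 3))  # 0.551
```
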
TEDDY: Avoiding unnecessary computations of the confidence Upper bound on confidence: conf(A →[α1,β1] B) ≤ conf(A →[α2,β2] B) + (|α1 − α2| + |β1 − β2|) · #B / len(A), where #B represents the number of intervals of B. Complexity in O(1) instead of O(#B). 17 / 27
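This O(1) bound can act as a cheap filter before the O(#B) confidence computation; a hedged sketch (the function name and example numbers are illustrative, not from the paper):

```python
def conf_upper_bound(conf_ref, a1, b1, a2, b2, num_intervals_B, len_A):
    """Upper bound on conf(A →[α1,β1] B) from an already computed conf(A →[α2,β2] B):
    changing the shifts by (|α1 − α2|, |β1 − β2|) can add at most that much extra
    overlap per interval of B, hence an O(1) bound instead of an O(#B) recomputation."""
    return conf_ref + (abs(a1 - a2) + abs(b1 - b2)) * num_intervals_B / len_A

# Given conf(A →[-4,-1] B) = 0.40, B with 2 intervals, len(A) = 10:
bound = conf_upper_bound(0.40, -3, -2, -4, -1, 2, 10.0)
print(bound)  # 0.8
# If this bound falls below minconf(len(B^[-3,-2])), the exact confidence
# for [-3,-2] need not be computed at all.
```
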
TEDDY: Pruning based on dominance relationships Dominance transitivity: for [α1, β1] ⊆ [α, β] ⊆ [α2, β2], if A →[α1,β1] B ≻ A →[α,β] B and A →[α,β] B ≻ A →[α2,β2] B, then A →[α1,β1] B ≻ A →[α2,β2] B. Pruning: if, for all x, y such that t_min ≤ x ≤ α and β ≤ y ≤ t_max, A →[x,y] B ≻ A →[x−1,y] B and A →[x,y] B ≻ A →[x,y+1] B, then A →[α,β] B also dominates all its ancestors. 18 / 27
TEDDY: Identification of valid and significant dependencies Final step: as we used a lower bound on the confidence threshold, in the final step we need to check whether conf(A →[α,β] B) ≥ minconf(len(B^[α,β])). If not, A →[α−1,β] B and A →[α,β+1] B are recursively considered. 19 / 27
Example M011.ON →[2,4] M014.ON: if the sensor M011 produces the event ON, then the sensor M014 will produce a similar event within 2 to 4 seconds. 20 / 27
Experiments: Evaluation of the pruning efficiency (Figures: ratio of the running time without pruning to TEDDY's running time, as a function of TMAX, left, and of the batch size in seconds, right, on the synthetic datasets SYNT02, SYNT04, SYNT08 and SYNT16, with the reference line f(x)=2.) 21 / 27
Experiments: impact of each pruning technique on search space exploration (Figures: percentage of the search space explored as a function of the batch size in minutes and of TMAX, for streams with an average of 2, 4 and 16 events per minute per stream.) 22 / 27
Results from Milan dataset Temporal dependencies that occur on October 26th (left) and January 1st (right), displayed over the home's floor plan: Master Bedroom, Office, Master Bathroom, Guest Bedroom, Living Room, Kitchen, Dining Room, Entrance. 23 / 27
Results from Foxstream 24 / 27
Results from Foxstream 25 / 27
Anomaly detection (Figure: Jaccard index over time T, with the reference line f(x)=0.6; annotated points include Tuesday, Spider and Wednesday.) 26 / 27
Conclusion and Future Work Temporal dependency mining: we investigated a new direction in data stream mining, namely mining relations between data sources that produce multiple heterogeneous data streams. Future directions: analyzing the dynamics of dependency graphs through time; studying the potential use of temporal dependencies from a database perspective: potential integration into continuous query engines as a basis for semantic indexing of data sources. 27 / 27