Mining State Dependencies Between Multiple Sensor Data Sources

Mining State Dependencies Between Multiple Sensor Data Sources
C. Robardet, co-authored with Marc Plantevit and Vasile-Marian Scuturici
April 2013
1 / 27

Mining Sensor Data: A Timely Challenge?

Why is there such keen interest? Sensor technology monitors many events in real time. It produces multiple heterogeneous data streams that generate very large volumes of information.

Data stream challenges:
- Handling infinite sequences of events that occur at a steady pace
- Providing actionable insights to end-users
- The mining step has to be faster than the data acquisition process.

New challenges:
- Identifying temporal dependencies between data sources
- Events last for a period of time
- What is the time lag between dependent events?

2 / 27

Our Proposal

Identify temporal dependencies between data streams:
- Data stream events change the internal state of the sensor.
- Each state has a duration, represented as a set of disjoint time intervals.
- The temporal relations between two interval sets reveal dependencies between the corresponding sensors.

Main idea: discover temporal dependencies and their associated time-delay intervals that
- are robust to the temporal variability of events,
- characterize the time intervals during which the events are related,
- require no user-determined threshold.

3 / 27

Examples
- Organizing de-icing operations on the basis of freezing alert prediction
- Sensor anomaly detection

4 / 27

Overview of our Approach

TEDDY: TEmporal Dependency DiscoverY.

(Diagram: data streams from applications such as failure detection, structural health monitoring, smart-home camera indexing, object tracking and smart environments are converted into state streams A_1..A_k, B_1..B_l, C_1..C_n, from which valid and significant temporal state dependencies are discovered.)

5 / 27

Step 1: Converting a Data Stream into a State Stream

Batch of events: to process the infinite set of events, we use batches of events defined by a time interval [t_begin, t_end).

Example batch (timestamp, event): (1, open), (2, close), (4, open), (5, close), (8, open), (9, close)

Active time period: the main property of a state is the period of time during which it is active. The importance of a state A is evaluated by the sum of the lengths of its active time intervals: len(A).

6 / 27
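As a minimal sketch (not the authors' code), the conversion of a batch of open/close events into a state's active time intervals, and the computation of len(A), could look as follows; the event names and their strict alternation are assumptions taken from the slide's example.

```python
# Sketch: build a state's active time intervals from a batch of events,
# then compute len(A). Assumes alternating open/close events per state.

def active_intervals(events):
    """events: list of (timestamp, 'open'|'close') sorted by time.
    Returns disjoint [start, end) intervals during which the state is active."""
    intervals, start = [], None
    for t, kind in events:
        if kind == "open" and start is None:
            start = t
        elif kind == "close" and start is not None:
            intervals.append((start, t))
            start = None
    return intervals

def length(intervals):
    """len(A): the sum of the lengths of the active time intervals."""
    return sum(end - start for start, end in intervals)

# Batch from the slide: the state is active on [1,2), [4,5) and [8,9).
batch = [(1, "open"), (2, "close"), (4, "open"), (5, "close"),
         (8, "open"), (9, "close")]
A = active_intervals(batch)
print(A)          # [(1, 2), (4, 5), (8, 9)]
print(length(A))  # len(A) = 3
```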

Step 2: Measuring State Dependency

Active time period intersection: the dependency between two states A and B is evaluated on the basis of the intersection of their active time intervals: len(A ∩ B).

7 / 27

Step 2: Measuring Time-Shift State Dependency

B can be transformed to maximize its intersection with A:
- B can be shifted into the past: a shift of t time units (B[−t,−t])
- B can be slightly extended: an extension of t time units (B[−t,0])

Extending the intervals makes the temporal dependency measure more robust to the inherent variability of the data.

8 / 27

Step 2: Measuring the Confidence of a Time-Shift State Dependency

Time shift intervals: we explore all possible time shift intervals [α, β] included in [t_min, t_max]:

B[α,β] = { [b_j + α, b'_j + β) | j = 1, ..., #B }

where [b_j, b'_j) denotes the j-th interval of B, with α ≤ β ≤ 0 and len(B) ≤ len(B[α,β]).

Confidence of the state dependency A →[α,β] B: the proportion of time during which the two states are both active, over the active time period of A:

conf(A →[α,β] B) = len(A ∩ B[α,β]) / len(A)

9 / 27
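The confidence measure can be sketched in a few lines (an illustration under the definitions above, not the authors' implementation): shift B by [α, β], intersect the two interval sets, and normalize by len(A).

```python
# Sketch of conf(A ->[α,β] B) = len(A ∩ B[α,β]) / len(A).
# Intervals are half-open (start, end) pairs, sorted and disjoint.

def shift(intervals, alpha, beta):
    """B[α,β]: each interval [b, b') becomes [b+α, b'+β)."""
    return [(b + alpha, e + beta) for b, e in intervals]

def inter_len(xs, ys):
    """len(X ∩ Y): total overlap between two sorted disjoint interval sets."""
    total, i, j = 0, 0, 0
    while i < len(xs) and j < len(ys):
        lo = max(xs[i][0], ys[j][0])
        hi = min(xs[i][1], ys[j][1])
        total += max(0, hi - lo)
        # advance the interval that ends first
        if xs[i][1] < ys[j][1]:
            i += 1
        else:
            j += 1
    return total

def conf(A, B, alpha, beta):
    """conf(A ->[α,β] B) = len(A ∩ B[α,β]) / len(A)."""
    return inter_len(A, shift(B, alpha, beta)) / sum(e - b for b, e in A)

A = [(2, 4), (8, 10)]
B = [(3, 5), (9, 11)]      # B lags A by one time unit
print(conf(A, B, 0, 0))    # 0.5: only half of A overlaps B as-is
print(conf(A, B, -1, -1))  # 1.0: shifting B one unit into the past aligns it with A
```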

Step 3: Statistical Assessment of the Confidence Measure

Pearson's chi-squared test of independence: are the occurrences of A and B[α,β] statistically independent?

(O)_ij is the 2x2 contingency table that crosses the observed outcomes of A and ¬A with those of B[α,β] and ¬B[α,β]; (E)_ij gives the expected outcomes under the null hypothesis, where A and B[α,β] occur uniformly over T = [t_begin, t_end):

X² = Σ_{i=1}^{2} Σ_{j=1}^{2} (O_ij − E_ij)² / E_ij ≥ χ²_{0.05}  ⟺  conf(A →[α,β] B) ≥ minconf(len(B[α,β]))

10 / 27
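One plausible reading of this test, sketched below, treats the durations of the four combinations (A active or not, B[α,β] active or not) over the batch T as the entries of the 2x2 contingency table; this interpretation and the example numbers are illustrative assumptions, and 3.841 is the χ²_0.05 critical value with one degree of freedom.

```python
# Sketch of the 2x2 chi-squared independence test between the active
# periods of A and B[α,β] over a batch of length T (durations as counts).

def chi2_stat(len_A, len_B, len_AB, T):
    """X² for the table crossing 'A active?' with 'B[α,β] active?'."""
    O = [[len_AB,         len_A - len_AB],
         [len_B - len_AB, T - len_A - len_B + len_AB]]
    rows = [sum(r) for r in O]
    cols = [O[0][j] + O[1][j] for j in range(2)]
    x2 = 0.0
    for i in range(2):
        for j in range(2):
            E = rows[i] * cols[j] / T  # expected duration under independence
            x2 += (O[i][j] - E) ** 2 / E
    return x2

# A and B[α,β] overlap far more than independence would predict:
x2 = chi2_stat(len_A=40, len_B=40, len_AB=35, T=100)
print(x2 > 3.841)  # True: the dependency is statistically significant
```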

Step 4: Selection of Significant Time Shift Intervals

Limiting the redundancy between the time shift intervals of a state dependency:
- A huge number of time shift intervals may result in valid temporal dependencies.
- Many of them are redundant, depicting the same phenomenon several times.

Property: if [α1, β1] ⊆ [α2, β2], then conf(A →[α1,β1] B) ≤ conf(A →[α2,β2] B).

11 / 27

Step 4: Selection of Significant Time Shift Intervals

The most interesting time shift intervals:
1. have a high confidence value
2. are as specific as possible with respect to the inclusion relation

Dominance relationship on the time shift intervals: if [α1, β1] ⊆ [α2, β2] and

1 − conf(A →[α1,β1] B) / conf(A →[α2,β2] B)  <  1 − len(B[α1,β1]) / len(B[α2,β2])

then A →[α1,β1] B ≻ A →[α2,β2] B.

Therefore len(B[α2,β2]\[α1,β1] ∩ A) is almost 0: the extra time shifts of the larger interval contribute almost nothing to the intersection with A.

12 / 27

Step 4: Selection of Significant Time Shift Intervals

Significant temporal dependency: the most specific temporal dependency that dominates all its supersets, each of whose supersets dominates its own supersets as well.

Lattice of time shift intervals over [−4, 0]:
[-4,0]
[-4,-1] [-3,0]
[-4,-2] [-3,-1] [-2,0]
[-4,-3] [-3,-2] [-2,-1] [-1,0]
[-4,-4] [-3,-3] [-2,-2] [-1,-1] [0,0]

13 / 27

TEDDY: TEmporal Dependency DiscoverY

Enumeration principle for the time shift intervals of A →[α,β] B:
- Level-wise enumeration, so that the confidence value of each interval is computed at most once.
- Uses the monotonic property of the confidence measure and a lower bound on minconf(len(B[α,β])) to prune the search space early.
- Exploits an upper bound on the confidence measure, computable in O(1), to avoid unnecessary computations of the confidence.
- Exploits the transitivity of the dominance relationship in the identification of significant temporal dependencies.

14 / 27
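The level-wise enumeration can be sketched as follows (a simplified illustration, not the authors' implementation): intervals of width w are generated from those of width w−1, so each candidate is produced exactly once; in TEDDY each level would additionally be filtered by the confidence bounds before the next level is generated.

```python
# Simplified sketch of level-wise candidate generation over the lattice of
# time shift intervals (assumes integer shifts; pruning is omitted here).

def levelwise_intervals(t_min, t_max):
    """Yield the time shift intervals [α, β] ⊆ [t_min, t_max] level by
    level, from the singletons [t, t] up to [t_min, t_max] itself."""
    level = [(t, t) for t in range(t_min, t_max + 1)]
    while level:
        yield level
        # widen each interval by one unit on the right; every [α, β] is
        # generated exactly once, from its unique child [α, β − 1]
        level = [(a, b + 1) for a, b in level if b + 1 <= t_max]

for level in levelwise_intervals(-2, 0):
    print(level)
# [(-2, -2), (-1, -1), (0, 0)]
# [(-2, -1), (-1, 0)]
# [(-2, 0)]
```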

TEDDY: Candidate Time Shift Interval Generation

(Diagram: level-wise enumeration over the lattice of time shift intervals, from the singletons [t, t] up to [t_min, t_max].)

15 / 27

TEDDY: Pruning Based on the Confidence Measure

minconf(x) = ( λx + sqrt( (χ²_{0.05} / T) · λ(T − λ) x (T − x) ) ) / (λT)

with x = len(B[α,β]) and λ = len(A).

(Plot: MinConfidence(x) and its lower bound as functions of x over [1, T].)

Lower bound on the minconf of A →[α,β] B:

minconf(len(B[α,β])) ≥ min( 1, minconf(len(B[0,0])) )

16 / 27

TEDDY: Avoiding Unnecessary Computations of the Confidence

Upper bound on the confidence:

conf(A →[α1,β1] B) ≤ conf(A →[α2,β2] B) + ( |α1 − α2| + |β1 − β2| ) · #B / len(A)

where #B represents the number of intervals of B. The bound is computable in O(1) instead of O(#B).

17 / 27
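A sketch of how this bound would be used for pruning (an illustration under the formula above; the function name and threshold handling are assumptions): if the O(1) bound already falls below the confidence threshold, the exact O(#B) intersection computation can be skipped.

```python
# Sketch: O(1) upper bound on conf(A ->[α1,β1] B) given a neighbour's
# confidence. Each of B's #B intervals can gain at most
# |α1-α2| + |β1-β2| time units of extra overlap with A.

def conf_upper_bound(conf_neighbour, a1, b1, a2, b2, n_B, len_A):
    return conf_neighbour + (abs(a1 - a2) + abs(b1 - b2)) * n_B / len_A

def can_prune(conf_neighbour, a1, b1, a2, b2, n_B, len_A, minconf):
    """True when the candidate [α1, β1] cannot reach minconf, so the
    exact confidence computation is unnecessary."""
    return conf_upper_bound(conf_neighbour, a1, b1, a2, b2, n_B, len_A) < minconf

# Neighbour [−1, 0] has confidence 0.3; candidate [−2, 0] can gain at
# most 1 extra unit per interval of B (here #B = 5, len(A) = 100):
print(conf_upper_bound(0.3, -2, 0, -1, 0, 5, 100))   # at most 0.35
print(can_prune(0.3, -2, 0, -1, 0, 5, 100, 0.5))     # True: skip it
```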

TEDDY: Pruning Based on Dominance Relationships

Dominance transitivity: for [α1, β1] ⊆ [α, β] ⊆ [α2, β2], if A →[α1,β1] B ≻ A →[α,β] B and A →[α,β] B ≻ A →[α2,β2] B, then A →[α1,β1] B ≻ A →[α2,β2] B.

Pruning: if, for all x, y such that t_min ≤ x ≤ α and β ≤ y ≤ t_max,

A →[x,y] B ≻ A →[x−1,y] B and A →[x,y] B ≻ A →[x,y+1] B

then A →[α,β] B also dominates all its ancestors.

18 / 27

TEDDY: Identification of Valid and Significant Dependencies

Final step: as we used a lower bound on the confidence threshold, in the final step we need to check whether

conf(A →[α,β] B) ≥ minconf(len(B[α,β]))

If not, A →[α−1,β] B and A →[α,β+1] B are recursively considered.

19 / 27

Example

M011.ON →[2,4] M014.ON: if the sensor M011 produces the event ON, then the sensor M014 will produce a similar event within 2 to 4 seconds.

20 / 27

Experiments: Evaluation of the Pruning Efficiency

(Plots: ratio of the running time without pruning to the TEDDY running time on the synthetic datasets SYNT02, SYNT04, SYNT08 and SYNT16, as a function of TMAX and of the batch size in seconds, against the reference line f(x) = 2.)

21 / 27

Experiments: Impact of Each Pruning Technique on Search Space Exploration

(Plots: percentage of the search space explored as a function of the batch size in minutes and of TMAX, for an average of 2, 4 and 16 events per minute and per stream.)

22 / 27

Results from the Milan Dataset

(Graphs: temporal dependencies that occur on the 26th of October (left) and the 1st of January (right), between the rooms Master Bedroom, Office, Master Bathroom, Guest Bedroom, Living Room, Kitchen, Dining Room and Entrance.)

23 / 27

Results from Foxstream 24 / 27

Results from Foxstream 25 / 27

Anomaly Detection

(Plot: Jaccard index over time T, from Tuesday to Wednesday, against the threshold f(x) = 0.6; the "Spider" anomaly is marked.)

26 / 27

Conclusion and Future Work

Temporal dependency mining: we investigate a new direction in data stream mining, namely mining relations between data sources that produce multiple heterogeneous data.

Future directions:
- Analyzing the dynamics of dependency graphs through time.
- Studying the potential use of temporal dependencies from a database perspective: potential integration into continuous query engines as a basis for the semantic indexing of data sources.

27 / 27