Online Appendix for Discovery of Periodic Patterns in Sequence Data: A Variance Based Approach

Yinghui (Catherine) Yang
Graduate School of Management, University of California, Davis, AOB IV, One Shields Ave., Davis, CA 95616, USA, yiyang@ucdavis.edu

Balaji Padmanabhan
ISDS Department, College of Business, University of South Florida, 4202 East Fowler Ave., CIS 1040, Tampa, FL 33620-7800, USA, bpadmana@coba.usf.edu

Hongyan Liu, Xiaoyu Wang
Department of Management Science and Engineering, School of Economics and Management, Tsinghua University, Beijing, China, 100084, {liuhy,wangxy3}@sem.tsinghua.edu.cn

More Related Work for Section 1

Most previous work on mining sequence data falls into two categories: discovering sequential patterns (Agrawal and Srikant 1995, Ayres et al. 2002, Garofalakis et al. 1999, Srikant and Agrawal 1996) and mining periodic patterns (Han et al. 1998, 1999; Ozden et al. 1998; Yang et al. 2003, 2004). Full cyclic patterns were first studied in Ozden et al. (1998). The input data in Ozden et al. (1998) is a set of transactions, each of which consists of a set of items. In addition, each transaction is tagged with an execution time. The goal is to find association rules that repeat themselves throughout the input data.

Han et al. (1998, 1999) presented algorithms for efficiently mining partial periodic patterns. In practice, not every portion of a time series contributes to the periodicity. For example, a company's stock may often gain a couple of points at the beginning of each trading session but show little regularity at later times. This type of periodicity is often referred to as partial periodicity (we will discuss this in greater detail in the next section). Han et al. focused on frequent periodic patterns. Yang et al. (2004) address the mining of surprising periodic patterns and also allow partial periodicity.

As pointed out in Ma and Hellerstein (2001) and Han et al. (1999), the fast Fourier transform (FFT) (Brigham 1988) can also be used to identify periodicity. There are two problems, though. First, the FFT does not cope well with random off-segments in periodic patterns. Second, the computational complexity of the FFT is O(T log T), where T is the number of time units. In most applications, T is large even though events are sparse.

Most of the research studying frequent or periodic sequential patterns used support as the measure of interestingness and addressed the discovery of frequent patterns. Yang et al. (2004) instead used an information gain metric to mine surprising periodic patterns. Some work treats the input data as one long sequence (Yang et al. 2003), and most work within the bioinformatics field belongs to this category. Other work considers the input data as a set of transactions, each of which consists of a set of items (Ozden et al. 1998, Han et al. 1998, 1999). While related to the broader topic of periodicity, Elfeky et al. (2004), Funda et al. (2004), Vlachos et al. (2005) and Yeh and Lin (2009) do not specifically study partial periodicity and thus are less related to our paper. For example, Elfeky et al. (2004) develop an algorithm that mines periodic patterns with unknown or obscure periods, and Funda et al. (2004) present algorithms that use fewer resources to discover periodicities in data streams.

References

Agrawal, R. and R. Srikant. 1995. Mining Sequential Patterns. Proc. 11th Int'l Conf. Data Eng.
Ayres, J., J. Gehrke, T. Yiu, and J. Flannick. 2002. Sequential Pattern Mining Using a Bitmap Representation. Proc. Eighth Int'l Conf. Knowledge Discovery and Data Mining.
Brigham, E. 1988. Fast Fourier Transform and Its Applications. Prentice Hall.
Elfeky, M. G., W. G. Aref, and A. K. Elmagarmid. 2004. Using Convolution to Mine Obscure Periodic Patterns in One Pass. Proc. 9th Int'l Conf. Extending Database Technology (EDBT).
Funda, E., S. Muthukrishnan, and S. C. Sahinalp. 2004. Sublinear Methods for Detecting Periodic Trends in Data Streams. Proc. of Latin American Symposium on Theoretical Informatics.
Garofalakis, M., R. Rastogi, and K. Shim. 1999. SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. Proc. 25th Int'l Conf. Very Large Data Bases.
Ozden, B., S. Ramaswamy, and A. Silberschatz. 1998. Cyclic Association Rules. Proc. ICDE 98, pp. 412-421.
Srikant, R. and R. Agrawal. 1996. Mining Sequential Patterns: Generalizations and Performance Improvements. Proc. Fifth Int'l Conf. Extending Data Base Technology.
Vlachos, M., P. S. Yu, and V. Castelli. 2005. On Periodicity Detection and Structural Periodic Similarity. Proc. of SIAM Conf. Data Mining.
Yeh, J. S. and S. C. Lin. 2009. A New Data Structure for Asynchronous Periodic Pattern Mining. Proc. 3rd Int'l Conf. Ubiquitous Information Management and Communication.

Formal Presentation of the Algorithms for Section 3

Inputs:
  1. Dataset D with specific time stamps 1 to T associated with each transaction
  2. Pattern discovery algorithm, R, that discovers patterns that can be evaluated to hold or not at each time stamp
  3. Threshold or_ratio c (e.g. 5%)
  4. Minimum length b
Output:
  1. A ranked list of periodic patterns

L = {}, output
Generate a set of patterns P = {P_1, P_2, ..., P_M} by applying R to D.
for each e in P do {
    Let Q be the inter-arrival time sequence of e in D.
    Compute F as the number of time stamps when pattern e holds in D
    V_0 = T^2/F^2    // variance of the exponential distribution
    V = variance of inter-arrival times of e in D
    if (V/V_0 < c) and Length(Q) > b, then L.append([e, or_ratio])
}
Print sorted list of patterns in L according to the or_ratio (i.e. V/V_0 score) for each pattern.

Figure A1. Basic Method - Identifying Type 1 Patterns
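The scoring step of Figure A1 can be sketched for a single pattern as follows. This is a minimal illustration, not the authors' implementation: it assumes the population variance for V and V_0 = (T/F)^2, and the function names (`inter_arrival_times`, `or_ratio_basic`, `is_periodic`) are hypothetical.

```python
from statistics import pvariance

def inter_arrival_times(occurrences):
    """Gaps between consecutive occurrence time stamps (assumed sorted)."""
    return [later - earlier for earlier, later in zip(occurrences, occurrences[1:])]

def or_ratio_basic(occurrences, total_time):
    """Return (V / V_0, Q) for one pattern, following the steps of Figure A1."""
    if len(occurrences) < 2:
        raise ValueError("need at least two occurrences to form an inter-arrival time")
    q = inter_arrival_times(occurrences)
    f = len(occurrences)            # F: number of time stamps at which the pattern holds
    v0 = (total_time / f) ** 2      # variance of an exponential distribution with mean T/F
    v = pvariance(q)                # assumption: population variance of the inter-arrival times
    return v / v0, q

def is_periodic(occurrences, total_time, c, b):
    """Apply the test (V/V_0 < c) and Length(Q) > b; return (flag, or_ratio)."""
    ratio, q = or_ratio_basic(occurrences, total_time)
    return ratio < c and len(q) > b, ratio
```

A pattern that holds at time stamps 10, 20, ..., 100 has zero variance in its inter-arrival times and is flagged for any positive c, whereas a bursty pattern with the same T is not.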

Inputs:
  1. Dataset D
  2. Pattern discovery algorithm, R
  3. Threshold or_ratio c
  4. Minimum length b
  5. Equal mean threshold q
  6. On-segment ratio r
Output: Periodic patterns with their type.

Define S, E as stacks
Generate a set of patterns P = {P_1, P_2, ..., P_M} by applying R to D.
L = {}, output
for each e in P do {
    Set S, E to be empty stacks
    node = sequence of inter-arrival times of e
    S.push(node)
    While not_empty(S) {
        node = S.pop()
        if or_ratio(node) <= c
            children = null
            E.push(node)
        ElseIf node is longer than b
            choose split point k such that p_L * or_ratio(left) + p_R * or_ratio(right) is minimized
            split(node, children, k)
            S.push(children(right))
            S.push(children(left))
    } // end while
    Get the maximum mean value m_max and minimum mean value m_min from all the subsequences in E.
    Leng = the sum of the lengths of all subsequences in E.
    LengO = the length of the original inter-arrival sequence.
    If m_max <= m_min*(1+q)
        If Leng = LengO
            Output e as periodic with equal periods
        Else if Leng/LengO >= r
            Output e as partially periodic with equal periods
    Else
        If Leng = LengO
            Output e as periodic with unequal periods
        Elseif Leng/LengO >= r
            Output e as partially periodic with unequal periods
} // end for

Figure A2. A Unified Approach: The Division Method
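The division loop and the final classification of Figure A2 can be sketched for one inter-arrival sequence as below. Several details are assumptions rather than statements from the figure: or_ratio of a subsequence is taken as its population variance over its squared mean, p_L and p_R are taken as the length fractions of the two children, a segment is accepted only if it has at least b elements (following the description of b as the minimum on-segment length), and the "not periodic" label for the fall-through case is added for completeness. All function names are hypothetical.

```python
from statistics import mean, pvariance

def or_ratio(seq):
    """Assumed form: population variance over squared mean of the subsequence."""
    return pvariance(seq) / mean(seq) ** 2

def best_split(seq):
    """Split index k minimizing p_L*or_ratio(left) + p_R*or_ratio(right)."""
    n = len(seq)
    def cost(k):
        return (k / n) * or_ratio(seq[:k]) + ((n - k) / n) * or_ratio(seq[k:])
    return min(range(1, n), key=cost)

def division_segments(seq, c, b):
    """Collect the accepted on-segments (stack E) by repeated division."""
    stack, accepted = [list(seq)], []
    while stack:
        node = stack.pop()
        if len(node) >= b and or_ratio(node) <= c:   # assumption: on-segments need length >= b
            accepted.append(node)
        elif len(node) > b:
            k = best_split(node)
            stack.append(node[k:])    # right child
            stack.append(node[:k])    # left child, popped and examined first
    return accepted

def classify(seq, segments, q, r):
    """Label the pattern from the accepted segments, as in the last block of Figure A2."""
    if not segments:
        return "not periodic"
    means = [mean(s) for s in segments]
    covered = sum(len(s) for s in segments)           # Leng
    suffix = "equal periods" if max(means) <= min(means) * (1 + q) else "unequal periods"
    if covered == len(seq):                           # Leng = LengO
        return "periodic with " + suffix
    if covered / len(seq) >= r:
        return "partially periodic with " + suffix
    return "not periodic"
```

For instance, six inter-arrival times of 10 followed by six of 30 split into two clean on-segments whose means differ by more than q, so the pattern is labelled periodic with unequal periods.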

Inputs:
  1. Dataset D
  2. Pattern discovery algorithm, R
  3. Threshold or_ratio c
  4. Minimum length b
  5. Equal mean threshold q
  6. On-segment ratio r
Output: Periodic patterns with their type.

Generate a set of patterns P = {P_1, P_2, ..., P_M} by applying R to D.
E = {}, subsequences
for each e in P do {
    Q = inter-arrival time sequence of e, with Q = {m_1, m_2, ..., m_N}, where N is the size of Q
    For i from 1 to N-b+1,
        For j from N to i+b-1,
            Q' = {m_i, m_{i+1}, ..., m_j}
            If length of Q' is smaller than b or Q' is a subsequence of any subsequence in E:
                Break
            If or_ratio(Q') <= c:
                E.append(Q')
                Break
    Get the maximum mean value m_max and minimum mean value m_min from all the subsequences in E.
    Leng = sum of the lengths of all subsequences in E
    LengO = length of the original sequence
    If m_max <= m_min*(1+q)
        If Leng = LengO
            Output e as periodic with equal periods
        Else if Leng/LengO >= r
            Output e as partially periodic with equal periods
    Else
        If Leng = LengO
            Output e as periodic with unequal periods
        Elseif Leng/LengO >= r
            Output e as partially periodic with unequal periods
} // end for

Figure A3. The Complete Method

A note on the complexity of the different methods. Given an inter-arrival time sequence S with N inter-arrival times, the method of Ma and Hellerstein (2001) needs a list of counters to record the frequency of each potential period. For each inter-arrival time in S, they first look for the counter for that inter-arrival time (as a potential period), and then either increase the counter or create a new counter. After all inter-arrival times in S have been read, they check the counters for all the possible periods, calculate the total frequency for each possible period subject to tolerance, and compare that with the corresponding threshold. Therefore, the complexity of Ma and Hellerstein's method is O(N). The Basic method needs to read all the inter-arrival times in S and calculate the variance ratio. Therefore, the complexity of the Basic method is also O(N). The Division method has log N levels of division, and each level takes O(N) to find the optimal division. Therefore, the complexity of the Division method is O(N log N). The Complete method checks at most N(N+1)/2 subsequences of S. Thus the complexity of the Complete method is O(N^2).

Proof of the Range Result for Section 3.4

Proof: By "at least as periodic" we mean or_ratio(Q') \le or_ratio(Q). Solving this inequality yields the range, as we show below.

Let A = \sum_{i=1}^{N} x_i^2 and B = \sum_{i=1}^{N} x_i, so that \bar{x} = B/N. Define

\[
r = \frac{V}{V_0}
  = \frac{\frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})^2}{\bar{x}^2}
  = \frac{\sum_{i=1}^{N} x_i^2 - 2\bar{x}\sum_{i=1}^{N} x_i + N\bar{x}^2}{N\bar{x}^2}
  = \frac{N\left(A - \frac{2B^2}{N} + \frac{B^2}{N}\right)}{B^2}
  = \frac{NA}{B^2} - 1 .
\]

Hence, or_ratio(Q') \le or_ratio(Q) is equivalent to

\[
\frac{(N+1)(A+u^2)}{(B+u)^2} - 1 \;\le\; \frac{NA}{B^2} - 1 ,
\]

where u is the next point (x_{N+1}) in the sequence. Solving for u in the quadratic inequality provides the bounds. Figure A4 shows how different values of the next point, u, affect the inequality; it plots f(u) = (N+1)(A+u^2)/(B+u)^2. Since inter-arrival times are positive, the range u > 0 (the right quadrant) is the one to focus on. Within this range, there is an interval around A/B where the new ratio is less than or equal to the old ratio in the sequence. The point A/B can be determined by calculating the derivative of f(u). The derivative is positive when u is greater than A/B, so the function is increasing in this range (and decreasing otherwise). The second derivative can also be used to determine the inflection point further right in the figure.
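The bounds on u can be made concrete by rearranging the inequality above into a quadratic in u. The rearrangement below, ((N+1)B^2 - NA)u^2 - 2NABu + AB^2 \le 0, is worked out here from that inequality rather than quoted from the paper, and the function names are hypothetical. A minimal sketch, assuming a sequence of positive inter-arrival times:

```python
import math

def or_ratio(seq):
    """or_ratio = N*A/B^2 - 1, with A the sum of squares and B the sum."""
    n, a, b = len(seq), sum(x * x for x in seq), sum(seq)
    return n * a / b ** 2 - 1

def next_point_range(seq):
    """Interval [low, high] of u for which appending u does not raise or_ratio."""
    n = len(seq)
    a = sum(x * x for x in seq)
    b = sum(seq)
    # (N+1)(A+u^2)/(B+u)^2 <= N*A/B^2 rearranges to
    # ((N+1)*B^2 - N*A) * u^2 - 2*N*A*B * u + A*B^2 <= 0
    lead = (n + 1) * b ** 2 - n * a              # positive for positive inputs, since A <= B^2
    disc = (n + 1) * a * (n * a - b ** 2)        # non-negative by the Cauchy-Schwarz inequality
    root = b * math.sqrt(disc)
    return (n * a * b - root) / lead, (n * a * b + root) / lead
```

For the sequence 10, 12, 9, 11 this gives an interval of roughly 9.39 to 11.91, which contains A/B (about 10.62), the minimizer of f(u) identified in the proof.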

Figure A4. The range result graph (plot of f(u) = (N+1)(A+u^2)/(B+u)^2 against u; marked levels: N+1, (N+1)A/B^2, NA/B^2 and (N+1)A/(A+B^2); marked points on the u axis: -B, 0 and A/B)

Summary Statistics for Section 4.1

Figures A5a-A5c plot the histograms of the percentage of periodic patterns among all patterns considered for each user when the variance threshold c takes three different values: 100%, 30% and 5%. For example, in Figure A5a (histogram on far left), the first bar shows that there are approximately 7 users for whom 3% or less of all their patterns are periodic. The second bar shows the number of users with 3% ~ 5% of their patterns being periodic. As expected, setting the variance threshold tighter results in a smaller percentage of user patterns being flagged as periodic. The histogram at the far right, for instance, shows that under a very tight threshold (5%), eighty users seem to have 0.0001% or less of their patterns periodic.

Figures A5a-A5c. Histogram of the % of periodic patterns (c = 100%, 30%, 5%)

Figures A6a-A6c plot the histograms of the period length among all periodic patterns. The first bar in Figure A6c represents the number of periodic patterns with periods between 0 and 1. The averages of the period under the three thresholds are 1.69, 5.39 and 1.0. This suggests that when the search is restricted to strictly periodic patterns (variance close to zero), the patterns identified tend to be those which hold in every session, such as a user's unique homepage. As the threshold is loosened, it becomes possible to identify patterns that hold across larger periods (as some of our examples in the next section will show).

Figures A6a-A6c. Histograms of period length (c = 100%, 30%, 5%)

Predictive Accuracy for Section 4.1.3

Figure A7. Predictive Accuracy Varying the Length of the Sequences

Synthetic Data Generator and Parameter Tables for Section 4.2

Inputs:
  The total time, T
  Maximum mean value of any segment, M
  Periodic type, TY (1-periodic, 2-partial, 3-unequal, 4-partial unequal)
  Threshold or_ratio c
  Minimum length b
  Equal mean threshold q

1. Randomly set the period value for an on-segment m < M, and randomly generate the first inter-arrival time.
2. While sum(inter-arrival times) < T:
3.     If TY=1, add a new inter-arrival time to satisfy c (similar to Theorem 1, we can calculate a range for the new inter-arrival time to satisfy c).
4.     If TY=2, randomly decide whether to switch to an off-segment or to continue on the current on-segment; if TY=3, randomly decide whether to continue on the current on-segment or to start a new on-segment; if TY=4, randomly decide whether to continue on the current on-segment, switch to an off-segment or start a new on-segment.
5.     If continue on the current on-segment, add a new inter-arrival time to satisfy c.
6.     If switch to an off-segment, generate an off-segment and go to step 1.
7.     If change to a new on-segment, go to step 1.
8. End-while.
9. Check the sequence generated to see if it satisfies c, b, and q. If not, abandon this sequence and go back to step 1 to generate the desired number of sequences.

Figure A8. Data Generator
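The fully periodic case (TY=1) of the generator can be sketched as follows. This is a simplified illustration under several assumptions: it uses rejection sampling around the current mean instead of the closed-form range mentioned in step 3, it draws the period m from the interval [1, M), it omits the final check of step 9, and the names `ratio_after` and `generate_periodic` are hypothetical.

```python
import random

def ratio_after(n, a, b, u):
    """or_ratio once u is appended: (N+1)(A+u^2)/(B+u)^2 - 1."""
    return (n + 1) * (a + u * u) / (b + u) ** 2 - 1

def generate_periodic(total_time, max_mean, c, seed=0, max_tries=100):
    """Generate inter-arrival times for one on-segment whose or_ratio stays <= c."""
    rng = random.Random(seed)
    m = rng.uniform(1, max_mean)              # step 1: period value of the on-segment
    first = rng.uniform(0.5 * m, 1.5 * m)     # step 1: first inter-arrival time
    seq, a, b = [first], first * first, first # running A (sum of squares) and B (sum)
    while b < total_time:                     # step 2
        n, current_mean = len(seq), b / len(seq)
        for _ in range(max_tries):            # step 3: find a new value that satisfies c
            u = rng.uniform(0.5 * current_mean, 1.5 * current_mean)
            if ratio_after(n, a, b, u) <= c:
                break
        else:
            u = current_mean                  # appending the mean never raises the ratio
        seq.append(u)
        a += u * u
        b += u
    return seq
```

Keeping the running sums A and B makes each candidate check constant time, using the same expression that appears in the range-result proof.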

Table A1. Notations

Notation    Description
E           A set of patterns
e           A pattern
t_i^e       The i-th occurrence time of pattern e
N           The number of times a pattern occurred in a sequence
τ_i^e       The i-th inter-arrival time of pattern e
T           Total time over which events arrive
p           Period
λ           Mean of the exponential distribution
V_0         The variance of the exponential distribution, which equals T^2/N^2
V_1         The observed variance of the inter-arrival times of a sequence
or_ratio    V_1 / V_0
D           Dataset
R           Pattern discovery technique
M           Maximum mean value of any segment
c           Threshold or_ratio
b           Minimum length of an on-segment
q           Equal mean threshold
r           On-segment ratio

Table A2. Parameters used in the Experiments

Parameter   Description                            Value
T           Total time over which events arrive    500
M           Maximum mean value of any segment      Varies across simulated data sets, as shown in a table in the main paper
c           Threshold or_ratio                     Varies across simulated data sets, as shown in a table in the main paper
b           Minimum length of an on-segment        10
q           Equal mean threshold                   0.1
r           On-segment ratio                       Not used in the synthetic data (i.e. set to r = 0)