Online Appendix for Discovery of Periodic Patterns in Sequence Data: A Variance Based Approach

Yinghui (Catherine) Yang
Graduate School of Management, University of California, Davis, AOB IV, One Shields Ave., Davis, CA 95616, USA, yiyang@ucdavis.edu

Balaji Padmanabhan
ISDS Department, College of Business, University of South Florida, 4202 East Fowler Ave., CIS 1040, Tampa, FL 33620-7800, USA, bpadmana@coba.usf.edu

Hongyan Liu, Xiaoyu Wang
Department of Management Science and Engineering, School of Economics and Management, Tsinghua University, Beijing, China, 100084, {liuhy,wangxy3}@sem.tsinghua.edu.cn

More Related Work for Section 1

Most previous work on mining sequence data falls into two categories: discovering sequential patterns (Agrawal and Srikant 1995, Ayres et al. 2002, Garofalakis et al. 1999, Srikant and Agrawal 1996) and mining periodic patterns (Han et al. 1998, 1999; Ozden et al. 1998; Yang et al. 2003, 2004). Full cyclic patterns were first studied in Ozden et al. (1998). The input data in Ozden et al. (1998) is a set of transactions, each of which consists of a set of items and is tagged with an execution time. The goal is to find association rules that repeat themselves throughout the input data. Han et al. (1998, 1999) presented algorithms for efficiently mining partial periodic patterns. In practice, not every portion of a time series contributes to the periodicity. For example, a company's stock may often gain a couple of points at the beginning of each trading session but show little regularity later in the session. This type of periodicity is often referred to as partial periodicity (we will discuss this in greater detail in the next section). Han et al. focused on frequent periodic patterns. Yang et al. (2004) address the mining of surprising periodic patterns and also allow partial periodicity.

As pointed out in Ma and Hellerstein (2001) and Han et al. (1999), the fast Fourier transform (FFT) (Brigham 1988) can also be used to identify periodicity. There are two problems, though. First, the FFT does not cope well with random off-segments in periodic patterns. Second, the computational complexity of the FFT is O(T log T), where T is the number of time units, and in most applications T is large even though events are sparse.

Most of the research studying frequent or periodic sequential patterns used support as the measure of interestingness and addressed the discovery of frequent patterns. Yang et al. (2004) instead used an information gain metric to mine surprising periodic patterns. Some work treats the input data as one long sequence (Yang et al. 2003), and most work within the bioinformatics field belongs to this category. Other work considers the input as a set of transactions, each of which consists of a set of items (Ozden et al. 1998, Han et al. 1998, 1999). While related to the broader topic of periodicity, Elfeky et al. (2004), Funda et al. (2004), Vlachos et al. (2005) and Yeh and Lin (2009) do not specifically study partial periodicity and thus are less related to our paper. For example, Elfeky et al. (2004) develop an algorithm that mines periodic patterns with unknown or obscure periods, and Funda et al. (2004) present algorithms that use fewer resources to discover periodicities in data streams.

References

Agrawal, R. and R. Srikant. 1995. Mining Sequential Patterns, Proc. 11th Int'l Conf. Data Eng.
Ayres, J., J. Gehrke, T. Yiu, and J. Flannick. 2002. Sequential Pattern Mining Using a Bitmap Representation, Proc. Eighth Int'l Conf. Knowledge Discovery and Data Mining.
Brigham, E. 1988. Fast Fourier Transform and Its Applications, Prentice Hall.
Elfeky, M.G., W.G. Aref, and A.K. Elmagarmid. 2004. Using Convolution to Mine Obscure Periodic Patterns in One Pass, Proc. 9th Int'l Conf. Extending Database Technology (EDBT).
Funda, E., S. Muthukrishnan, and S.C. Sahinalp. 2004. Sublinear Methods for Detecting Periodic Trends in Data Streams, Proc. of Latin American Symposium on Theoretical Informatics.
Garofalakis, M., R. Rastogi, and K. Shim. 1999. SPIRIT: Sequential Pattern Mining with Regular Expression Constraints, Proc. 25th Int'l Conf. Very Large Data Bases.
Ozden, B., S. Ramaswamy, and A. Silberschatz. 1998. Cyclic Association Rules, Proc. ICDE 98, pp. 412-421.
Srikant, R. and R. Agrawal. 1996. Mining Sequential Patterns: Generalizations and Performance Improvements, Proc. Fifth Int'l Conf. Extending Data Base Technology.
Vlachos, M., P.S. Yu, and V. Castelli. 2005. On Periodicity Detection and Structural Periodic Similarity, Proc. of SIAM Conf. Data Mining.
Yeh, J.S. and S.C. Lin. 2009. A New Data Structure for Asynchronous Periodic Pattern Mining, Proc. 3rd Int'l Conf. Ubiquitous Information Management and Communication.

Formal Presentation of the Algorithms for Section 3

Inputs:
  1. Dataset D with specific time stamps 1 to T associated with each transaction
  2. Pattern discovery algorithm, R, that discovers patterns that can be evaluated to hold or not at each time stamp
  3. Threshold or_ratio c (e.g. 5%)
  4. Minimum length b
Output:
  1. A ranked list of periodic patterns

L = {}  // output
Generate a set of patterns P = {P_1, P_2, ..., P_M} by applying R to D.
for each e in P do {
    Let Q be the inter-arrival time sequence of e in D.
    Compute F as the number of time stamps when pattern e holds in D.
    V_0 = (T/F)^2  // variance of the exponential distribution
    V = variance of the inter-arrival times of e in D
    or_ratio = V/V_0
    if (or_ratio < c) and Length(Q) > b, then L.append([e, or_ratio])
}
Print the list of patterns in L sorted by or_ratio (i.e. the V/V_0 score of each pattern).

Figure A1. Basic Method - Identifying Type 1 Patterns
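The Basic Method of Figure A1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the discovery algorithm R has already produced, for each pattern, the list of time stamps at which the pattern holds, and the names `inter_arrival_times` and `basic_method` are hypothetical. V_0 is taken as (T/F)^2, the variance of an exponential distribution with mean T/F, and V as the population variance of the inter-arrival times.

```python
from statistics import pvariance


def inter_arrival_times(times):
    """Gaps between consecutive time stamps at which a pattern holds."""
    ordered = sorted(times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def basic_method(occurrences, total_time, c, b):
    """Rank patterns by or_ratio = V / V_0, keeping those below threshold c.

    occurrences: dict mapping a pattern to its occurrence time stamps
                 (assumed to be the output of the discovery algorithm R).
    """
    ranked = []
    for pattern, times in occurrences.items():
        q = inter_arrival_times(times)
        f = len(times)                    # time stamps at which the pattern holds
        if len(q) <= b or len(q) == 0:    # Figure A1 requires Length(Q) > b
            continue
        v0 = (total_time / f) ** 2        # variance of an exponential with mean T/F
        v = pvariance(q)                  # observed (population) variance
        ratio = v / v0
        if ratio < c:
            ranked.append((pattern, ratio))
    ranked.sort(key=lambda item: item[1])  # smallest ratio (most periodic) first
    return ranked


occurrences = {
    "weekly": list(range(7, 71, 7)),                  # holds every 7 time units
    "bursty": [1, 2, 3, 30, 31, 50, 51, 52, 65, 70],  # irregular gaps
}
result = basic_method(occurrences, total_time=70, c=0.25, b=5)
```

With these illustrative inputs only the "weekly" pattern passes the threshold, and its ratio is zero because all of its gaps are equal.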

Inputs:
  1. Dataset D
  2. Pattern discovery algorithm, R
  3. Threshold or_ratio c
  4. Minimum length b
  5. Equal mean threshold q
  6. On-segment ratio r
Output: Periodic patterns with their type.

Define S, E as stacks
Generate a set of patterns P = {P_1, P_2, ..., P_M} by applying R to D.
L = {}  // output
for each e in P do {
    Set S, E to be empty stacks
    node = sequence of inter-arrival times of e
    S.push(node)
    while not_empty(S) {
        node = S.pop()
        if or_ratio(node) <= c
            children = null
            E.push(node)
        else if node is longer than b
            choose split point k such that p_L*or_ratio(left) + p_R*or_ratio(right) is minimized
            split(node, children, k)
            S.push(children(right))
            S.push(children(left))
    } // end while
    Get the maximum mean value m_max and the minimum mean value m_min from all the subsequences in E.
    Leng = the sum of the lengths of all subsequences in E
    LengO = the length of the original inter-arrival sequence
    if m_max <= m_min*(1+q)
        if Leng = LengO
            Output e as periodic with equal periods
        else if Leng/LengO >= r
            Output e as partially periodic with equal periods
    else
        if Leng = LengO
            Output e as periodic with unequal periods
        else if Leng/LengO >= r
            Output e as partially periodic with unequal periods
} // end for

Figure A2. A Unified Approach: The Division Method
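The Division Method can be illustrated with a short Python sketch for a single inter-arrival sequence. Several choices here are assumptions rather than part of the figure: or_ratio is computed as the population variance over the squared mean of the segment, the weights p_L and p_R are taken to be the fractions of the node's length that fall in the left and right children, a segment is accepted as an on-segment only if it has at least b elements (following the description of b as the minimum length of an on-segment), and the function names are hypothetical.

```python
def or_ratio(seq):
    """Variance of seq divided by its squared mean (the V / V_0 score)."""
    n = len(seq)
    mean = sum(seq) / n
    variance = sum((x - mean) ** 2 for x in seq) / n
    return variance / mean ** 2


def division_method(q, c, b):
    """Return on-segments of q as (start, end) index pairs, end exclusive."""
    on_segments = []
    stack = [(0, len(q))]
    while stack:
        start, end = stack.pop()
        node = q[start:end]
        if len(node) >= max(b, 1) and or_ratio(node) <= c:
            on_segments.append((start, end))        # periodic enough: keep whole
        elif len(node) > max(b, 1):
            def cost(k, node=node):
                left, right = node[:k], node[k:]
                # Assumed weights: p_L and p_R are the length fractions.
                return (len(left) * or_ratio(left)
                        + len(right) * or_ratio(right)) / len(node)
            k = min(range(1, len(node)), key=cost)  # best split point
            stack.append((start + k, end))          # right child
            stack.append((start, start + k))        # left child, popped first
    return sorted(on_segments)


def classify(q, on_segments, q_threshold, r):
    """Apply the type rules at the end of Figure A2; None if no rule fires."""
    if not on_segments:
        return None
    means = [sum(q[s:e]) / (e - s) for s, e in on_segments]
    covered = sum(e - s for s, e in on_segments)
    kind = "equal" if max(means) <= min(means) * (1 + q_threshold) else "unequal"
    if covered == len(q):
        return f"periodic with {kind} periods"
    if covered / len(q) >= r:
        return f"partially periodic with {kind} periods"
    return None


sequence = [5] * 10 + [20] * 10          # two regimes with different periods
segments = division_method(sequence, c=0.01, b=3)
label = classify(sequence, segments, q_threshold=0.1, r=0.5)
```

In this example the sequence as a whole is too variable, so it is split once at the regime change, and each half is then accepted as an on-segment with its own period.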

Inputs:
  1. Dataset D
  2. Pattern discovery algorithm, R
  3. Threshold or_ratio c
  4. Minimum length b
  5. Equal mean threshold q
  6. On-segment ratio r
Output: Periodic patterns with their type.

Generate a set of patterns P = {P_1, P_2, ..., P_M} by applying R to D.
E = {}  // subsequences
for each e in P do {
    Q = inter-arrival time sequence of e, Q = {m_1, m_2, ..., m_N}, where N is the size of Q
    for i from 1 to N-b+1
        for j from N down to i+b-1
            Q' = {m_i, m_{i+1}, ..., m_j}
            if the length of Q' is smaller than b or Q' is a subsequence of any subsequence in E:
                break
            if or_ratio(Q') <= c:
                E.append(Q')
                break
    Get the maximum mean value m_max and the minimum mean value m_min from all the subsequences in E.
    Leng = the sum of the lengths of all subsequences in E
    LengO = the length of the original sequence
    if m_max <= m_min*(1+q)
        if Leng = LengO
            Output e as periodic with equal periods
        else if Leng/LengO >= r
            Output e as partially periodic with equal periods
    else
        if Leng = LengO
            Output e as periodic with unequal periods
        else if Leng/LengO >= r
            Output e as partially periodic with unequal periods
} // end for

Figure A3. The Complete Method

A note on the complexity of the different methods. Given an inter-arrival time sequence S with N inter-arrival times, Ma and Hellerstein (2001) need a list of counters to record the frequency of each potential period. For each inter-arrival time in S, they first look for the counter for that inter-arrival time (as a potential period), and then either increment the counter or create a new one. After all inter-arrival times in S have been read, they check the counters for all the possible periods, calculate the total frequency for each possible period subject to tolerance, and compare that with the corresponding threshold. Therefore, the complexity of Ma and Hellerstein's method is O(N). The Basic method needs to read all the inter-arrival times in S and calculate the variance ratio, so its complexity is also O(N). The Division method has log N levels of division, and at each level it takes O(N) to find the optimal division, so its complexity is O(N log N). The Complete method checks at most N(N+1)/2 subsequences of S, so its complexity is O(N^2).

Proof of the Range Result for Section 3.4

Proof: By "at least as periodic" we mean $\text{or\_ratio}(Q') \le \text{or\_ratio}(Q)$, where $Q'$ is the sequence $Q$ extended by the next point. Solving this inequality yields the range, as we show below.

Let $A = \sum_{i=1}^{N} x_i^2$ and $B = \sum_{i=1}^{N} x_i$, so that $\bar{x} = B/N$. Define

$$r = \frac{V}{V_0} = \frac{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2}{\bar{x}^2} = \frac{\sum_{i=1}^{N} x_i^2 - 2\bar{x}\sum_{i=1}^{N} x_i + N\bar{x}^2}{N\bar{x}^2} = \frac{N\left(A - \frac{2B^2}{N} + \frac{B^2}{N}\right)}{B^2} = \frac{NA}{B^2} - 1.$$

Hence, $\text{or\_ratio}(Q') \le \text{or\_ratio}(Q)$ is equivalent to

$$\frac{(N+1)(A + u^2)}{(B + u)^2} - 1 \le \frac{NA}{B^2} - 1,$$

where $u$ is the next point ($x_{N+1}$) in the sequence. Solving for $u$ in this quadratic inequality provides the bounds.

Figure A4 shows how different values of the next point, $u$, affect the left-hand side of the inequality, $f(u) = (N+1)(A + u^2)/(B + u)^2$. Since inter-arrival times are positive, the range $u > 0$ (the right quadrant) is the one to focus on. Within it, there is a range around $A/B$ where the new ratio is less than or equal to the old ratio in the sequence. The point $A/B$ can be determined from the derivative of $f(u)$: the derivative is positive when $u$ is greater than $A/B$, so the function is increasing in this range (and decreasing otherwise). The second derivative can also be used to determine the inflection point further right in the figure.
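As a worked illustration of the range result, the quadratic inequality can be solved in closed form. Multiplying out (N+1)(A+u^2)/(B+u)^2 <= NA/B^2 gives ((N+1)B^2 - NA) u^2 - 2NAB u + AB^2 <= 0, whose leading coefficient is positive for positive inter-arrival times, so the admissible values of u lie between the two roots. The Python sketch below computes these roots; the function names are hypothetical and the sketch assumes all inter-arrival times are positive.

```python
import math


def or_ratio(xs):
    """N*A/B^2 - 1, i.e. population variance over squared mean."""
    n = len(xs)
    a = sum(x * x for x in xs)
    b = sum(xs)
    return n * a / (b * b) - 1


def next_point_range(xs):
    """Bounds on u such that appending u does not increase or_ratio."""
    n = len(xs)
    a = sum(x * x for x in xs)   # A in the proof
    b = sum(xs)                  # B in the proof
    # Clearing denominators in the inequality gives
    #   ((N+1)B^2 - NA) u^2 - 2NAB u + AB^2 <= 0.
    lead = (n + 1) * b * b - n * a
    # Square root of the discriminant, up to the common factor B.
    root = math.sqrt(max(a * (n + 1) * (n * a - b * b), 0.0))
    low = b * (n * a - root) / lead
    high = b * (n * a + root) / lead
    return low, high


gaps = [9, 10, 11, 10]
low, high = next_point_range(gaps)   # roughly 9.27 and 10.86
```

For the example sequence the interval contains A/B = 10.05, in line with the discussion of the minimum of f(u) above. For a perfectly periodic sequence the interval collapses to the period itself.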

[Figure A4: plot of f(u) = (N+1)(A + u^2)/(B + u)^2 against u. Values marked on the vertical axis: N+1, (N+1)A/B^2, NA/B^2 and (N+1)A/(A + B^2). Values marked on the horizontal axis: -B, 0 and A/B.]

Figure A4. The range result graph

Summary Statistics for Section 4.1

Figures A5a-A5c plot the histograms of the percentage of periodic patterns among all patterns considered for each user when the variance threshold c takes three different values: 100%, 30% and 5%. For example, in Figure A5a (the histogram on the far left), the first bar shows that there are approximately 7 users for whom 3% or less of all their patterns are periodic. The second bar shows the number of users with 3% ~ 5% of their patterns being periodic. As expected, a tighter variance threshold results in a smaller percentage of user patterns being flagged as periodic. The histogram on the far right, for instance, shows that under a very tight threshold (5%), eighty users have 0.0001% or less of their patterns periodic.

Figures A5a-A5c. Histogram of the % of periodic patterns (c = 100%, 30%, 5%)

Figures A6a-A6c plot the histograms of the period length among all periodic patterns. The first bar in Figure A6c represents the number of periodic patterns with periods between 0 and 1. The averages of the period under these three threshold values are 1.69, 5.39 and 1.0. This suggests that when the search is restricted to strictly periodic patterns (variance close to zero), the patterns identified tend to be those which hold in every session, such as a user's unique homepage. As the threshold is loosened, it becomes possible to identify patterns that hold across larger periods (as some of our examples in the next section will show).

Figures A6a-A6c. Histograms of period length (c = 100%, 30%, 5%)

Predictive Accuracy for Section 4.1.3

Figure A7. Predictive Accuracy Varying the Length of the Sequences

Synthetic Data Generator and Parameter Tables for Section 4.2

Inputs:
  The total time, T
  Maximum mean value of any segment, M
  Periodic type, TY (1-periodic, 2-partial, 3-unequal, 4-partial unequal)
  Threshold or_ratio c
  Minimum length b
  Equal mean threshold q

1. Randomly set the period value for an on-segment, m < M, and randomly generate the first inter-arrival time.
2. While sum(inter-arrival times) < T:
3.   If TY=1, add a new inter-arrival time to satisfy c. (Similar to Theorem 1, we can calculate a range for the new inter-arrival time to satisfy c.)
4.   If TY=2, randomly decide whether to switch to an off-segment or to continue on the current on-segment; if TY=3, randomly decide whether to continue on the current on-segment or to start a new on-segment; if TY=4, randomly decide whether to continue on the current on-segment, switch to an off-segment or start a new on-segment.
5.   If continuing on the current on-segment, add a new inter-arrival time to satisfy c.
6.   If switching to an off-segment, generate an off-segment and go to step 1.
7.   If changing to a new on-segment, go to step 1.
8. End-while.
9. Check whether the generated sequence satisfies c, b, and q. If not, abandon the sequence. Go back to step 1 until the desired number of sequences has been generated.

Figure A8. Data Generator
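The simplest branch of the generator (TY=1, steps 1 to 3 and 8) can be sketched in Python as follows. This is only an illustration under assumptions that are not in the figure: the period m is drawn uniformly from [1, M], the first inter-arrival time is set to m, and each new inter-arrival time u is drawn uniformly from the range that keeps the or_ratio of the extended sequence at or below c. That range follows from requiring (N+1)(A+u^2)/(B+u)^2 - 1 <= c, which rearranges to (N - c) u^2 - 2(1+c)B u + (N+1)A - (1+c)B^2 <= 0, with A the sum of squares and B the sum of the current inter-arrival times. The sketch assumes 0 < c < 1, and the function names are hypothetical.

```python
import math
import random


def or_ratio(xs):
    """N*A/B^2 - 1, i.e. population variance over squared mean."""
    n, a, b = len(xs), sum(x * x for x in xs), sum(xs)
    return n * a / (b * b) - 1


def feasible_range(xs, c):
    """Range of the next inter-arrival time u that keeps or_ratio <= c."""
    n, a, b = len(xs), sum(x * x for x in xs), sum(xs)
    # Solve (N - c) u^2 - 2(1 + c) B u + (N + 1) A - (1 + c) B^2 <= 0.
    lead = n - c
    const = (n + 1) * a - (1 + c) * b * b
    root = math.sqrt(max((1 + c) ** 2 * b * b - lead * const, 0.0))
    low = ((1 + c) * b - root) / lead
    high = ((1 + c) * b + root) / lead
    return max(low, 1e-9), high       # inter-arrival times must stay positive


def generate_periodic(total_time, max_mean, c, rng):
    """Generate one type-1 (fully periodic) inter-arrival sequence."""
    m = rng.uniform(1.0, max_mean)    # step 1: assumed way of picking the period
    seq = [m]                         # assumed: first inter-arrival time equals m
    while sum(seq) < total_time:      # step 2
        low, high = feasible_range(seq, c)
        seq.append(rng.uniform(low, high))   # step 3: new time satisfying c
    return seq


sequence = generate_periodic(total_time=500, max_mean=20, c=0.05,
                             rng=random.Random(0))
```

Because every appended value is drawn from the feasible range, the check against c in step 9 should pass by construction for this branch; the checks against b and q and the off-segment logic for the other types are not covered by this sketch.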

Table A1. Notations

Notation    Description
E           A set of patterns
e           A pattern
t_i^e       The i-th occurrence time of pattern e
N           The number of times a pattern occurred in a sequence
τ_i^e       The i-th inter-arrival time of pattern e
T           Total time over which events arrive
p           Period
λ           Mean of the exponential distribution
V_0         The variance of the exponential distribution, which equals (T/N)^2
V_1         The observed variance of the inter-arrival times of a sequence
or_ratio    V_1 / V_0
D           Dataset
R           Pattern discovery technique
M           Maximum mean value of any segment
c           Threshold or_ratio
b           Minimum length of an on-segment
q           Equal mean threshold
r           On-segment ratio

Table A2. Parameters used in the Experiments

Parameter   Description                            Value
T           Total time over which events arrive    500
M           Maximum mean value of any segment      Varies across simulated data sets, as shown in Table 2 in the main paper
c           Threshold or_ratio                     Varies across simulated data sets, as shown in Table 2 in the main paper
b           Minimum length of an on-segment        10
q           Equal mean threshold                   0.1
r           On-segment ratio                       Not used in the synthetic data (i.e. set to r = 0)