Online Appendix for Discovery of Periodic Patterns in Sequence Data: A Variance Based Approach
Yinghui (Catherine) Yang
Graduate School of Management, University of California, Davis, AOB IV, One Shields Ave., Davis, CA 95616, USA, yiyang@ucdavis.edu

Balaji Padmanabhan
ISDS Department, College of Business, University of South Florida, 4202 East Fowler Ave., CIS 1040, Tampa, FL, USA, bpadmana@coba.usf.edu

Hongyan Liu, Xiaoyu Wang
Department of Management Science and Engineering, School of Economics and Management, Tsinghua University, Beijing, China, {liuhy,wangxy3}@sem.tsinghua.edu.cn

More Related Work for Section 1

Most previous work on mining sequence data falls into two categories: discovering sequential patterns (Agrawal and Srikant 1995, Ayres et al. 2002, Garofalakis et al. 1999, Srikant and Agrawal 1996) and mining periodic patterns (Han et al. 1998, 1999; Ozden et al. 1998; Yang et al. 2003, 2004). Full cyclic patterns were first studied in Ozden et al. (1998). The input data to Ozden et al. (1998) is a set of transactions, each of which consists of a set of items. In addition, each transaction is tagged with an execution time. The goal is to find association rules that repeat themselves throughout the input data.

Han et al. (1998, 1999) presented algorithms for efficiently mining partial periodic patterns. In practice, not every portion of the time series may contribute to the periodicity. For example, a company's stock may often gain a couple of points at the beginning of each trading session but show little regularity later in the session. This type of periodicity is often referred to as partial periodicity (we will discuss this in greater detail in the next section). Han et al. focused on frequent periodic patterns. Yang et al. (2004) addresses the mining of surprising periodic patterns and also allows partial periodicity.

As pointed out in Ma and Hellerstein (2001) and Han et al. (1999), the fast Fourier transform (FFT) (Brigham 1988) can also be used to identify periodicity. There are two problems though. First, the FFT does not cope well with random off-segments in periodic patterns. Second, the computational complexity of the FFT is O(T log T), where T is the number of time units. In most applications, T is large even though events are sparse.
Most of the research studying frequent or periodic sequential patterns used support as the measure of interestingness and addressed the discovery of frequent patterns. Yang et al. (2004) instead used an information gain metric to mine surprising periodic patterns. Some work treats the input data as one long sequence (Yang et al. 2003), and most work within the bioinformatics field belongs to this category. Others consider the input data as a set of transactions, each of which consists of a set of items (Ozden et al. 1998, Han et al. 1998, 1999).

While related to the broader topic of periodicity, Elfeky et al. (2004), Funda et al. (2004), Vlachos et al. (2005) and Yeh and Lin (2009) do not specifically study partial periodicity and thus are less related to our paper. For example, Elfeky et al. (2004) develops an algorithm that mines periodic patterns with unknown or obscure periods, and Funda et al. (2004) presents algorithms that use fewer resources to discover periodicities in data streams.

References

Agrawal, R. and R. Srikant. 1995. Mining Sequential Patterns, Proc. 11th Int'l Conf. Data Eng.

Ayres, J., J. Gehrke, T. Yiu, and J. Flannick. 2002. Sequential Pattern Mining Using a Bitmap Representation, Proc. Eighth Int'l Conf. Knowledge Discovery and Data Mining.

Brigham, E. 1988. Fast Fourier Transform and Its Applications, Prentice Hall.

Elfeky, M. G., W. G. Aref, and A. K. Elmagarmid. 2004. Using Convolution to Mine Obscure Periodic Patterns in One Pass, Proc. 9th Int'l Conf. Extending Database Technology (EDBT).

Funda, E., S. Muthukrishnan, and S. C. Sahinalp. 2004. Sublinear methods for detecting periodic trends in data streams, Proc. of Latin American Symposium on Theoretical Informatics.

Garofalakis, M., R. Rastogi, and K. Shim. 1999. SPIRIT: Sequential Pattern Mining with Regular Expression Constraints, Proc. 25th Int'l Conf. Very Large Data Bases.

Ozden, B., S. Ramaswamy, and A. Silberschatz. 1998. Cyclic association rules, Proc. ICDE '98.

Srikant, R. and R. Agrawal. 1996. Mining Sequential Patterns: Generalizations and Performance Improvements, Proc. Fifth Int'l Conf. Extending Data Base Technology.

Vlachos, M., P. S. Yu, and V. Castelli. 2005. On Periodicity Detection and Structural Periodic Similarity, Proc. of SIAM Conf. Data Mining.

Yeh, J. S. and S. C. Lin. 2009. A new data structure for asynchronous periodic pattern mining, Proc. 3rd Int'l Conf. Ubiquitous Information Management and Communication.
Formal Presentation of the Algorithms for Section 3

Inputs:
  1. Dataset D with specific time stamps 1 to T associated with each transaction.
  2. Pattern discovery algorithm, R, that discovers patterns that can be evaluated to hold or not at each time stamp
  3. Threshold or_ratio c (e.g. 5%)
  4. Minimum length b
Output:
  1. A ranked list of periodic patterns

L = {}, output
Generate a set of patterns P = {P_1, P_2, ..., P_M} by applying R to D.
for each e ∈ P do {
    Let Q be the inter-arrival time sequence of e in D.
    Compute F as the number of time stamps when pattern e holds in D
    V_0 = T^2 / F^2    // variance of the exponential distribution
    V = variance of inter-arrival times of e in D
    if (V/V_0 < c) and Length(Q) > b, then
        L.append([e, or_ratio])
}
Print sorted list of patterns in L according to the or_ratio (i.e. V/V_0 score) for each pattern.

Figure A1. Basic Method - Identifying Type 1 Patterns
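The Basic Method of Figure A1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the pattern discovery step R is assumed to have already produced a dictionary mapping each pattern to the time stamps at which it holds, the variance is taken as the population variance, and the names `inter_arrival_times`, `or_ratio` and `basic_method` are hypothetical.

```python
def inter_arrival_times(timestamps):
    """Gaps between consecutive occurrences of a pattern."""
    ts = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ts, ts[1:])]


def or_ratio(timestamps, total_time):
    """V / V_0: observed variance over the variance of an exponential
    distribution with mean T / F (assumption: population variance)."""
    q = inter_arrival_times(timestamps)
    f = len(timestamps)               # number of time stamps where the pattern holds
    v0 = (total_time / f) ** 2        # V_0 = T^2 / F^2
    mean = sum(q) / len(q)
    v = sum((x - mean) ** 2 for x in q) / len(q)
    return v / v0


def basic_method(occurrences, total_time, c, b):
    """Return (pattern, or_ratio) pairs passing both tests, most periodic first.

    `occurrences` stands in for the output of the pattern discovery step R.
    """
    ranked = []
    for pattern, timestamps in occurrences.items():
        q = inter_arrival_times(timestamps)
        if len(q) > b:
            ratio = or_ratio(timestamps, total_time)
            if ratio < c:
                ranked.append((pattern, ratio))
    return sorted(ranked, key=lambda item: item[1])


occurrences = {
    "a": list(range(10, 101, 10)),    # holds every 10 time units
    "b": [1, 2, 50, 51, 99, 100],     # irregular bursts
}
print(basic_method(occurrences, total_time=100, c=0.25, b=3))
```

With these toy inputs only pattern "a" is reported, because its inter-arrival times are all equal and its or_ratio is therefore zero.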
Inputs:
  1. Dataset D
  2. Pattern discovery algorithm, R
  3. Threshold or_ratio c
  4. Minimum length b
  5. Equal mean threshold q
  6. On-segment ratio r
Output: Periodic patterns with their type.

Define S, E as stacks
Generate a set of patterns P = {P_1, P_2, ..., P_M} by applying R to D.
L = {}, output
for each e ∈ P do {
    Set S, E to be empty stacks
    node = sequence of inter-arrival times of e
    S.push(node)
    While not_empty(S) {
        node = S.pop()
        if or_ratio(node) <= c
            children = null
            E.push(node)
        ElseIf node is longer than b
            choose split point k such that p_L * or_ratio(left) + p_R * or_ratio(right) is minimized
            split(node, children, k)
            S.push(children(right))
            S.push(children(left))
    } // end while
    Get the max mean value m_max and minimum mean value m_min from all the subsequences in E.
    Leng = the sum of the length of all subsequences in E.
    LengO = the length of the original inter-arrival sequence.
    If m_max <= m_min * (1 + q)
        If Leng = LengO
            Output e as periodic with equal periods
        Else if Leng/LengO >= r
            Output e as partially periodic with equal periods
    Else
        If Leng = LengO
            Output e as periodic with unequal periods
        Elseif Leng/LengO >= r
            Output e as partially periodic with unequal periods
} // end for

Figure A2. A Unified Approach: The Division Method
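The stack-based splitting in Figure A2 can be sketched for a single inter-arrival sequence as below. This is an illustrative sketch under stated assumptions: the or_ratio of a segment is taken as its population variance divided by its squared mean, the weights p_L and p_R are assumed to be the fractions of the node falling on each side of the split, and the function names are hypothetical.

```python
def or_ratio(seq):
    """Variance of the inter-arrival times divided by their squared mean."""
    n = len(seq)
    mean = sum(seq) / n
    return sum((x - mean) ** 2 for x in seq) / n / mean ** 2


def division_segments(seq, c, b):
    """Return accepted on-segments as (start, end) index pairs, end exclusive."""

    def split_cost(lo, k, hi):
        # Assumption: p_L and p_R are the length fractions of the two children.
        p_left = (k - lo) / (hi - lo)
        p_right = (hi - k) / (hi - lo)
        return p_left * or_ratio(seq[lo:k]) + p_right * or_ratio(seq[k:hi])

    stack, accepted = [(0, len(seq))], []
    while stack:
        lo, hi = stack.pop()
        if or_ratio(seq[lo:hi]) <= c:
            accepted.append((lo, hi))           # node is periodic enough: keep it
        elif hi - lo > b:
            k = min(range(lo + 1, hi), key=lambda k: split_cost(lo, k, hi))
            stack.append((k, hi))               # right child
            stack.append((lo, k))               # left child, processed first
    return accepted


def classify(seq, segments, q, r):
    """Apply the equal-mean and coverage tests from the end of Figure A2."""
    if not segments:
        return None
    means = [sum(seq[lo:hi]) / (hi - lo) for lo, hi in segments]
    covered = sum(hi - lo for lo, hi in segments)
    kind = "equal periods" if max(means) <= min(means) * (1 + q) else "unequal periods"
    if covered == len(seq):
        return "periodic with " + kind
    if covered / len(seq) >= r:
        return "partially periodic with " + kind
    return None


seq = [5] * 6 + [20] * 6
segments = division_segments(seq, c=0.01, b=3)
print(segments, classify(seq, segments, q=0.1, r=0.5))
```

In this toy sequence the period changes from 5 to 20 halfway through, so the whole sequence fails the threshold but the two halves pass, and the pattern is labelled periodic with unequal periods. Like the pseudocode, the sketch does not impose a minimum length on accepted children.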
Inputs:
  1. Dataset D
  2. Pattern discovery algorithm, R
  3. Threshold or_ratio c
  4. Minimum length b
  5. Equal mean threshold q
  6. On-segment ratio r
Output: Periodic patterns with their type.

Generate a set of patterns P = {P_1, P_2, ..., P_M} by applying R to D.
E = {}, Subsequences
for each e ∈ P do {
    Q = inter-arrival time sequence of e, and Q = {m_1, m_2, ..., m_N}, N is the size of Q
    For i from 1 to N-b+1,
        For j from N to i+b-1,
            Q' = {m_i, m_{i+1}, ..., m_j}
            If length of Q' is smaller than b or Q' is the subsequence of any subsequence in E:
                Break
            If or_ratio(Q') <= c:
                E.append(Q')
                Break
    Get the maximum mean value m_max and minimum mean value m_min from all the subsequences in E.
    Leng = sum of length of all subsequences in E
    LengO = length of the original sequence
    If m_max <= m_min * (1 + q)
        If Leng = LengO
            Output e as periodic with equal periods
        Else if Leng/LengO >= r
            Output e as partially periodic with equal periods
    Else
        If Leng = LengO
            Output e as periodic with unequal periods
        Elseif Leng/LengO >= r
            Output e as partially periodic with unequal periods
} // end for

Figure A3. The Complete Method

A note on the complexity of the different methods. Given an inter-arrival time sequence S with N inter-arrival times, Ma and Hellerstein (2001) needs a list of counters to record the frequency of each potential period. For each inter-arrival time in S, they first look for the right counter for that inter-arrival time (as a potential period), and then either increase the counter or create a new counter. After all inter-arrival times in S have been read, they check all the counters for all the possible periods, calculate the total frequency for each possible period subject to tolerance and compare that with the corresponding threshold. Therefore, the complexity of Ma and Hellerstein's method is O(N). The Basic method needs to read all the inter-arrival times in S and calculate the variance ratio. Therefore, the complexity of the Basic method is also O(N). The Division method has log N levels of division, and each level takes O(N) to find the optimal division. Therefore, the complexity of the Division method is O(N log N). The Complete method checks at most N(N+1)/2 subsequences of S. Thus the complexity of the Complete method is O(N^2).

Proof of the Range Result for Section 3.4

Proof: By "at least as periodic" we mean $\mathrm{or\_ratio}(Q') \le \mathrm{or\_ratio}(Q)$. Hence, solving this will reach the range as we show below.

Let $A = \sum_{i=1}^{N} x_i^2$ and $B = \sum_{i=1}^{N} x_i$, so that $\bar{x} = B/N$. Define

$$r = \frac{V}{V_0} = \frac{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2}{\bar{x}^2} = \frac{\sum_{i=1}^{N} x_i^2 - 2\bar{x}\sum_{i=1}^{N} x_i + N\bar{x}^2}{N\bar{x}^2} = \frac{N\left(A - \frac{B^2}{N}\right)}{B^2} = \frac{NA}{B^2} - 1.$$

Hence, $\mathrm{or\_ratio}(Q') \le \mathrm{or\_ratio}(Q)$ is equivalent to:

$$\frac{(N+1)(A + u^2)}{(B + u)^2} - 1 \le \frac{NA}{B^2} - 1,$$

where $u$ is the next point ($x_{N+1}$) in the sequence. Solving for $u$ in the quadratic inequality will provide the bounds.

The graph in Figure A4 shows how different values of the next point, $u$, affect the inequality. Since inter-arrival times are positive, the range $u > 0$ (the right half of the graph) is the one to focus on. Within this, there is a range around $A/B$ where the new ratio is less than or equal to the old ratio in the sequence. The point $A/B$ can be determined by calculating the derivative of $f(u) = (N+1)(A+u^2)/(B+u)^2$. The derivative is positive when $u$ is greater than $A/B$ and the function is therefore increasing in this range (else it is decreasing). The second derivative can also be used to determine the inflection point further right in the figure.
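As a numerical check of the range result, the inequality above can be rearranged into a quadratic in u and solved directly. The rearrangement below is our own algebra rather than a formula quoted from the paper: multiplying out gives ((N+1)B^2 - NA) u^2 - 2NAB u + AB^2 <= 0, whose two roots bound the admissible next point. The function names are hypothetical.

```python
import math


def or_ratio(seq):
    """or_ratio written as N*A / B^2 - 1, as in the proof."""
    n = len(seq)
    a = sum(x * x for x in seq)
    b = sum(seq)
    return n * a / (b * b) - 1


def next_point_range(seq):
    """Roots of ((N+1)B^2 - NA) u^2 - 2NAB u + AB^2 = 0.

    Any next inter-arrival time u between the two roots leaves the
    or_ratio no larger than it was. Assumes positive inter-arrival times,
    which makes the leading coefficient positive.
    """
    n = len(seq)
    a = sum(x * x for x in seq)
    b = sum(seq)
    lead = (n + 1) * b * b - n * a
    # Quarter of the discriminant, with the common factor B^2 pulled out.
    inner = a * (n + 1) * (n * a - b * b)
    root = b * math.sqrt(max(inner, 0.0))
    return (n * a * b - root) / lead, (n * a * b + root) / lead


seq = [10, 12, 8, 10]
low, high = next_point_range(seq)
print(low, high)
print(or_ratio(seq), or_ratio(seq + [10]), or_ratio(seq + [12]))
```

For this sequence the admissible interval contains A/B = 10.2; appending 10 lowers the ratio, while appending 12 falls outside the interval and raises it.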
Figure A4. The range result graph (plot of $(N+1)(A+u^2)/(B+u)^2$ against $u$; marked values on the vertical axis: $N+1$, $(N+1)A/B^2$, $NA/B^2$ and $(N+1)A/(A+B^2)$; marked values on the horizontal axis: $-B$, $0$ and $A/B$)

Summary Statistics for Section 4.1

Figures A5a-A5c plot the histograms of the percentage of periodic patterns among all patterns considered for each user when the variance threshold c takes three different values: 100%, 30% and 5%. For example, in Figure A5a (histogram on far left), the first bar shows that there are approximately 7 users for whom 3% or less of all their patterns represent periodic patterns. The second bar shows the number of users with 3% ~ 5% of their patterns being periodic. As expected, setting the variance threshold tighter results in a smaller percentage of user patterns being flagged as periodic. The histogram at the far right, for instance, shows that, under a very tight threshold (5%), eighty users seem to have % or less of their patterns periodic.

Figures A5a-A5c. Histogram of the % of periodic patterns (c = 100%, 30%, 5%)

Figures A6a-A6c plot the histograms of the period length among all periodic patterns. The first bar in Figure A6c represents the number of periodic patterns with periods between 0 and 1. The averages of the period under these values are 1.69, 5.39 and 1.0. This suggests that when the search is restricted to strictly periodic patterns (variance close to zero), the patterns identified tend to be those which hold in every session, such as a user's unique homepage or any other user pattern which holds every session. As the threshold is loosened, it is possible to identify patterns that hold across larger periods (as some of our examples in the next section will show).

Figures A6a-A6c. Histograms of period length (c = 100%, 30%, 5%)

Predictive Accuracy for Section

Figure A7. Predictive Accuracy Varying the Length of the Sequences
Synthetic Data Generator and Parameter Tables for Section 4.

Inputs:
  The total time, T
  Maximum mean value of any segment, M
  Periodic type, TY (1-periodic, 2-partial, 3-unequal, 4-partial unequal)
  Threshold or_ratio c
  Minimum length b
  Equal mean threshold q

1. Randomly set the period value for an on-segment m < M, and randomly generate the first inter-arrival time.
2. While sum(inter-arrival times) < T:
3.     If TY=1, add a new inter-arrival time to satisfy c (similar to Theorem 1, we can calculate a range for the new inter-arrival time to satisfy c).
4.     If TY=2, randomly decide whether to switch to an off-segment or to continue on the current on-segment; if TY=3, randomly decide whether to continue on the current on-segment or to start a new on-segment; if TY=4, randomly decide whether to continue on the current on-segment, switch to an off-segment or start a new on-segment.
5.     If continue on the current on-segment, add a new inter-arrival time to satisfy c.
6.     If switch to an off-segment, generate an off-segment and go to step
7.     If change to a new on-segment, go to step
8. End-while.
9. Check the sequence generated to see if it satisfies c, b, and q. If not, abandon this sequence and go back to step 1 to generate the desired number of sequences.

Figure A8. Data Generator
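The fully periodic case (TY=1) of the generator in Figure A8 can be sketched as below. This is only a sketch under several assumptions that are not taken from the paper: candidates are drawn uniformly within a hypothetical `spread` around the chosen period and accepted by rejection sampling rather than by computing the admissible range in closed form; the period is drawn between 1 and M; and if no candidate is accepted after `max_tries` attempts the current mean is appended, which never increases the or_ratio.

```python
import random


def or_ratio(seq):
    """Variance of the inter-arrival times divided by their squared mean."""
    n = len(seq)
    mean = sum(seq) / n
    return sum((x - mean) ** 2 for x in seq) / n / mean ** 2


def generate_type1(total_time, max_mean, c, rng, spread=0.2, max_tries=50):
    """Generate one fully periodic inter-arrival sequence (TY=1 only).

    `spread` and `max_tries` are illustrative parameters, not part of
    the generator described in Figure A8.
    """
    period = rng.uniform(1.0, max_mean)                     # step 1: pick m < M
    seq = [period * rng.uniform(1 - spread, 1 + spread)]    # first inter-arrival time
    while sum(seq) < total_time:                            # step 2
        for _ in range(max_tries):                          # step 3: satisfy c
            candidate = period * rng.uniform(1 - spread, 1 + spread)
            if or_ratio(seq + [candidate]) <= c:
                break
        else:
            # Fallback: appending the current mean cannot raise the or_ratio.
            candidate = sum(seq) / len(seq)
        seq.append(candidate)
    return seq


rng = random.Random(0)
seq = generate_type1(total_time=500, max_mean=20, c=0.05, rng=rng)
print(len(seq), round(or_ratio(seq), 4))
```

A complete generator would add the off-segment and new on-segment branches for TY=2 to TY=4 and the final check against b and q from step 9.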
Table A1. Notations

Notation    Description
E           A set of patterns
e           A pattern
t_i^e       The i-th occurrence time of pattern e
N           The number of times a pattern occurred in a sequence
τ_i^e       The i-th inter-arrival time of pattern e
T           Total time over which events arrive
p           Period
λ           Mean of the exponential distribution
V_0         The variance of the exponential distribution, which equals T^2/N^2
V_1         The observed variance of the inter-arrival times of a sequence
or_ratio    V_1 / V_0
D           Dataset
R           Pattern discovery technique
M           Maximum mean value of any segment
c           Threshold or_ratio
b           Minimum length of an on-segment
q           Equal mean threshold
r           On-segment ratio

Table A2. Parameters used in the Experiments

Parameter   Description                            Value
T           Total time over which events arrive    500
M           Maximum mean value of any segment      Varies across simulated data sets as shown in Table in the main paper.
c           Threshold or_ratio                     Varies across simulated data sets as shown in Table in the main paper.
b           Minimum length of an on-segment        10
q           Equal mean threshold                   0.1
r           On-segment ratio                       Not used in the synthetic data (i.e. set to r = 0).
130 Important Questions for XI E T V A 1 130 Important Questions for XI PREFACE Have you ever seen a plane taking off from a runway and going up and up, and crossing the clouds but just think again that
More informationStatistics for Managers Using Microsoft Excel Chapter 9 Two Sample Tests With Numerical Data
Statistics for Managers Using Microsoft Excel Chapter 9 Two Sample Tests With Numerical Data 999 Prentice-Hall, Inc. Chap. 9 - Chapter Topics Comparing Two Independent Samples: Z Test for the Difference
More informationDegenerate Expectation-Maximization Algorithm for Local Dimension Reduction
Degenerate Expectation-Maximization Algorithm for Local Dimension Reduction Xiaodong Lin 1 and Yu Zhu 2 1 Statistical and Applied Mathematical Science Institute, RTP, NC, 27709 USA University of Cincinnati,
More informationSearching Dimension Incomplete Databases
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 6, NO., JANUARY 3 Searching Dimension Incomplete Databases Wei Cheng, Xiaoming Jin, Jian-Tao Sun, Xuemin Lin, Xiang Zhang, and Wei Wang Abstract
More informationPrime Analysis in Binary
Prime Analysis in Binary Brandynne Cho Saint Mary s College of California September 17th, 2012 The best number is 73. [...] 73 is the 21st prime number. Its mirror, 37, is the 12th, and its mirror, 21,
More informationPart 1: Hashing and Its Many Applications
1 Part 1: Hashing and Its Many Applications Sid C-K Chau Chi-Kin.Chau@cl.cam.ac.u http://www.cl.cam.ac.u/~cc25/teaching Why Randomized Algorithms? 2 Randomized Algorithms are algorithms that mae random
More informationRare Event Discovery And Event Change Point In Biological Data Stream
Rare Event Discovery And Event Change Point In Biological Data Stream T. Jagadeeswari 1 M.Tech(CSE) MISTE, B. Mahalakshmi 2 M.Tech(CSE)MISTE, N. Anusha 3 M.Tech(CSE) Department of Computer Science and
More informationAlgorithms for Characterization and Trend Detection in Spatial Databases
Published in Proceedings of 4th International Conference on Knowledge Discovery and Data Mining (KDD-98) Algorithms for Characterization and Trend Detection in Spatial Databases Martin Ester, Alexander
More informationUser-Driven Ranking for Measuring the Interestingness of Knowledge Patterns
User-Driven Ranking for Measuring the Interestingness of Knowledge s M. Baumgarten Faculty of Informatics, University of Ulster, Newtownabbey, BT37 QB, UK A.G. Büchner Faculty of Informatics, University
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Mining Frequent Patterns and Associations: Basic Concepts (Chapter 6) Huan Sun, CSE@The Ohio State University 10/17/2017 Slides adapted from Prof. Jiawei Han @UIUC, Prof.
More informationCS570 Data Mining. Anomaly Detection. Li Xiong. Slide credits: Tan, Steinbach, Kumar Jiawei Han and Micheline Kamber.
CS570 Data Mining Anomaly Detection Li Xiong Slide credits: Tan, Steinbach, Kumar Jiawei Han and Micheline Kamber April 3, 2011 1 Anomaly Detection Anomaly is a pattern in the data that does not conform
More informationRobust Inverse Covariance Estimation under Noisy Measurements
.. Robust Inverse Covariance Estimation under Noisy Measurements Jun-Kun Wang, Shou-De Lin Intel-NTU, National Taiwan University ICML 2014 1 / 30 . Table of contents Introduction.1 Introduction.2 Related
More informationCPSC 518 Introduction to Computer Algebra Schönhage and Strassen s Algorithm for Integer Multiplication
CPSC 518 Introduction to Computer Algebra Schönhage and Strassen s Algorithm for Integer Multiplication March, 2006 1 Introduction We have now seen that the Fast Fourier Transform can be applied to perform
More informationSession-Based Queueing Systems
Session-Based Queueing Systems Modelling, Simulation, and Approximation Jeroen Horters Supervisor VU: Sandjai Bhulai Executive Summary Companies often offer services that require multiple steps on the
More informationMulti-scale anomaly detection algorithm based on infrequent pattern of time series
Journal of Computational and Applied Mathematics 214 (2008) 227 237 www.elsevier.com/locate/cam Multi-scale anomaly detection algorithm based on infrequent pattern of time series Xiao-yun Chen, Yan-yan
More informationAnalysis of Variance and Co-variance. By Manza Ramesh
Analysis of Variance and Co-variance By Manza Ramesh Contents Analysis of Variance (ANOVA) What is ANOVA? The Basic Principle of ANOVA ANOVA Technique Setting up Analysis of Variance Table Short-cut Method
More informationHierarchies of sustainability in a catchment
Sustainable Development and Planning IV, Vol. 2 635 Hierarchies of sustainability in a catchment N. Dunstan School of Science and Technology, University of New England, Australia Abstract This paper investigates
More informationApproximate counting: count-min data structure. Problem definition
Approximate counting: count-min data structure G. Cormode and S. Muthukrishhan: An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55 (2005) 58-75. Problem
More informationStatistics 3 WEDNESDAY 21 MAY 2008
ADVANCED GCE 4768/01 MATHEMATICS (MEI) Statistics 3 WEDNESDAY 1 MAY 008 Additional materials: Answer Booklet (8 pages) Graph paper MEI Examination Formulae and Tables (MF) Afternoon Time: 1 hour 30 minutes
More informationMining Rank Data. Sascha Henzgen and Eyke Hüllermeier. Department of Computer Science University of Paderborn, Germany
Mining Rank Data Sascha Henzgen and Eyke Hüllermeier Department of Computer Science University of Paderborn, Germany {sascha.henzgen,eyke}@upb.de Abstract. This paper addresses the problem of mining rank
More informationData Mining Concepts & Techniques
Data Mining Concepts & Techniques Lecture No. 05 Sequential Pattern Mining Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology Jamshoro
More informationSUFFIX TREE. SYNONYMS Compact suffix trie
SUFFIX TREE Maxime Crochemore King s College London and Université Paris-Est, http://www.dcs.kcl.ac.uk/staff/mac/ Thierry Lecroq Université de Rouen, http://monge.univ-mlv.fr/~lecroq SYNONYMS Compact suffix
More informationOn Multi-Class Cost-Sensitive Learning
On Multi-Class Cost-Sensitive Learning Zhi-Hua Zhou and Xu-Ying Liu National Laboratory for Novel Software Technology Nanjing University, Nanjing 210093, China {zhouzh, liuxy}@lamda.nju.edu.cn Abstract
More informationReal-time Sentiment-Based Anomaly Detection in Twitter Data Streams
Real-time Sentiment-Based Anomaly Detection in Twitter Data Streams Khantil Patel, Orland Hoeber, and Howard J. Hamilton Department of Computer Science University of Regina, Canada patel26k@uregina.ca,
More informationScalable Hierarchical Recommendations Using Spatial Autocorrelation
Scalable Hierarchical Recommendations Using Spatial Autocorrelation Ayushi Dalmia, Joydeep Das, Prosenjit Gupta, Subhashis Majumder, Debarshi Dutta Ayushi Dalmia, JoydeepScalable Das, Prosenjit Hierarchical
More informationConstructing comprehensive summaries of large event sequences
Constructing comprehensive summaries of large event sequences JERRY KIERNAN IBM Silicon Valley Lab and EVIMARIA TERZI IBM Almaden Research Center Event sequences capture system and user activity over time.
More informationComprehensive Evaluation of Social Benefits of Mineral Resources Development in Ordos Basin
Studies in Sociology of Science Vol. 4, No. 1, 2013, pp. 25-29 DOI:10.3968/j.sss.1923018420130401.2909 ISSN 1923-0176 [Print] ISSN 1923-0184 [Online] www.cscanada.net www.cscanada.org Comprehensive Evaluation
More informationCPT+: A Compact Model for Accurate Sequence Prediction
CPT+: A Compact Model for Accurate Sequence Prediction Ted Gueniche 1, Philippe Fournier-Viger 1, Rajeev Raman 2, Vincent S. Tseng 3 1 University of Moncton, Canada 2 University of Leicester, UK 3 National
More informationAnomaly Detection for the CERN Large Hadron Collider injection magnets
Anomaly Detection for the CERN Large Hadron Collider injection magnets Armin Halilovic KU Leuven - Department of Computer Science In cooperation with CERN 2018-07-27 0 Outline 1 Context 2 Data 3 Preprocessing
More informationMining Approximate Top-K Subspace Anomalies in Multi-Dimensional Time-Series Data
Mining Approximate Top-K Subspace Anomalies in Multi-Dimensional -Series Data Xiaolei Li, Jiawei Han University of Illinois at Urbana-Champaign VLDB 2007 1 Series Data Many applications produce time series
More informationCPSC 518 Introduction to Computer Algebra Asymptotically Fast Integer Multiplication
CPSC 518 Introduction to Computer Algebra Asymptotically Fast Integer Multiplication 1 Introduction We have now seen that the Fast Fourier Transform can be applied to perform polynomial multiplication
More informationModels, Data, Learning Problems
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Models, Data, Learning Problems Tobias Scheffer Overview Types of learning problems: Supervised Learning (Classification, Regression,
More informationCompression in the Space of Permutations
Compression in the Space of Permutations Da Wang Arya Mazumdar Gregory Wornell EECS Dept. ECE Dept. Massachusetts Inst. of Technology Cambridge, MA 02139, USA {dawang,gww}@mit.edu University of Minnesota
More information