Mining Interesting Infrequent and Frequent Itemsets Based on Minimum Correlation Strength
Xiangjun Dong
School of Information, Shandong Polytechnic University, Jinan, China

Abstract. The IMLMS (Interesting MLMS (Multiple Level Minimum Supports)) model, proposed in our previous work, is designed to prune uninteresting infrequent and frequent itemsets discovered by the MLMS model. One of the pruning measures used in the IMLMS model, interest, can be described as follows: for two disjoint itemsets A, B, if interest(A, B) = |s(A ∪ B) − s(A)s(B)| < mi, then A ∪ B is recognized as an uninteresting itemset and is pruned, where s(·) is the support and mi a minimum interestingness threshold. With this measure, however, it is a bit difficult for users to set the value of mi because interest(A, B) depends heavily on the values of s(·). So in this paper we propose a new measure, MCS (minimum correlation strength), as a substitute. MCS, which is based on the correlation coefficient, performs better than interest, and its value is very easy for users to set. Theoretical analysis and experimental results show the validity of the new measure.

Keywords: infrequent itemset, frequent itemset, negative association rule, multiple minimum supports, prune, correlation coefficient.

H. Deng et al. (Eds.): AICI 2011, Part I, LNAI 7002, pp. 437-443. Springer-Verlag Berlin Heidelberg 2011

1 Introduction

As is well known, a traditional association rule has the form A ⇒ B, where the support (s) and confidence (c) of the rule meet a minimum support (ms) threshold and a minimum confidence (mc) threshold respectively. This is the famous support-confidence framework [1]. Recently, mining negative association rules (NARs) of the forms A ⇒ ¬B, ¬A ⇒ B and ¬A ⇒ ¬B has attracted attention, and mining infrequent itemsets has also attracted much attention because many valuable NARs can be discovered from infrequent itemsets [2, 3, 4, 5, 6]. In ref. [5], the MLMS (Multiple Level Minimum Supports) model, which assigns different minimum supports to itemsets of different lengths, was proposed to discover infrequent and frequent itemsets simultaneously. To prune the uninteresting itemsets discovered by the MLMS model, ref. [6] later proposed an IMLMS (Interesting MLMS) model using pruning measures similar to those in [2]. One of the pruning measures used in the IMLMS model, interest, can be described as follows: for two disjoint itemsets A, B, if interest(A, B) = |s(A ∪ B) − s(A)s(B)| < mi, then A ∪ B is recognized as an uninteresting itemset and is pruned, where s(·) is the support and mi a minimum interestingness threshold.
This measure, however, makes it a bit difficult for users to set the value of mi, because interest(A, B) depends heavily on the values of s(·). In fact, many interestingness measures have been proposed in association rule mining, such as interestingness, the chi-squared test, the correlation coefficient, Laplace, the Gini index, Piatetsky-Shapiro, conviction and so on, and many studies have discussed how to select the right measure [7, 8, 9, 10, 11]. Among these measures, the correlation coefficient is a good one, and the authors of [11] used it to mine negative association rules. In this paper, we also use the correlation coefficient, replacing the measure interest used in the IMLMS model to improve performance, and we call the new measure minimum correlation strength (MCS), denoted ρ_MCS. That is, for two disjoint itemsets A, B, if the correlation coefficient of A and B, ρ(A, B), is less than a given minimum correlation strength ρ_MCS, then A ∪ B is recognized as an uninteresting itemset and is pruned. The discussion below will show that the measure MCS performs better and is easier to set than interest.

The main contributions of this paper are as follows:
1. We propose a new pruning measure named MCS to improve the performance of the IMLMS model.
2. We demonstrate the validity of the measure MCS by theoretical analysis and experiments.

The rest of the paper is organized as follows: Section 2 discusses the MCS pruning method. Section 3 presents the experiments and Section 4 concludes.

2 MCS Pruning Method

2.1 Review of the IMLMS Model

Let I = {i1, i2, ..., in} be a set of n distinct literals called items, and TD a transaction database of variable-length transactions over I; the number of transactions in TD is denoted |TD|. Each transaction contains a set of items i1, i2, ..., im ∈ I and is associated with a unique identifier TID. A set of distinct items from I is called an itemset. The number of items in an itemset A is the length of the itemset, denoted len(A). An itemset of length k is referred to as a k-itemset. Each itemset has an associated statistical measure called support, denoted s. For an itemset A ⊆ I, s(A) = A.count / |TD|, where A.count is the number of transactions containing itemset A in TD. The support of a rule A ⇒ B is denoted s(A ⇒ B) or s(A ∪ B), where A, B ⊆ I and A ∩ B = ∅. The confidence of the rule A ⇒ B is defined as the ratio of s(A ∪ B) to s(A), i.e., c(A ⇒ B) = s(A ∪ B) / s(A).

In the MLMS model, different minimum supports are assigned to itemsets of different lengths. Let ms(k) be the minimum support of k-itemsets (k = 1, 2, ..., n) and ms(0) a threshold for infrequent itemsets, with ms(1) ≥ ms(2) ≥ ... ≥ ms(n) ≥ ms(0) > 0. For any itemset A, if s(A) ≥ ms(len(A)), then A is a frequent itemset; if s(A) < ms(len(A)) and s(A) ≥ ms(0), then A is an infrequent itemset. The IMLMS model uses a modified version of the pruning method of [2] to prune uninteresting frequent itemsets by Equations (1)-(3) and uninteresting infrequent itemsets by Equations (4)-(6).
M is a frequent itemset of potential interest in the MLMS model if

fipi(M) = s(M) ≥ ms(len(M)) ∧ (∃A, B: A ∪ B = M) ∧ fipis(A, B),   (1)

where fipis(A, B) = (A ∩ B = ∅) ∧ (f(A, B, ms(len(A ∪ B)), mi) = 1),   (2)

f(A, B, ms(len(A ∪ B)), mi) = [s(A ∪ B) + interest(A, B) − (ms(len(A ∪ B)) + mi) + 1] / [|s(A ∪ B) − ms(len(A ∪ B))| + |interest(A, B) − mi| + 1].   (3)

N is an infrequent itemset of potential interest if

iipi(N) = s(N) < ms(len(N)) ∧ s(N) ≥ ms(0) ∧ (∃A, B: A ∪ B = N) ∧ iipis(A, B),   (4)

where iipis(A, B) = (A ∩ B = ∅) ∧ (f(A, B, ms(0), mi) = 1),   (5)

f(A, B, ms(0), mi) = [s(A ∪ B) + interest(A, B) − (ms(0) + mi) + 1] / [|s(A ∪ B) − ms(0)| + |interest(A, B) − mi| + 1].   (6)

2.2 MCS Pruning Method

In Equations (1)-(6), interest(A, B) is an interestingness measure, interest(A, B) = |s(A ∪ B) − s(A)s(B)|, and mi is a minimum interestingness threshold given by users or experts. This measure was first proposed by Piatetsky-Shapiro [12]. It is a good measure for pruning uninteresting itemsets in some cases, but it is a bit difficult for users to set the value of mi because interest(A, B) changes with the values of s(·). Take the data in Table 1 for example: the maximum value of interest(A, B) is 0.0099 when s(A), s(B) = 0.01, while it is 0.09 when s(A), s(B) = 0.1. How, then, should the value of mi be chosen? In a database, as the length of an itemset increases, its support decreases, so a single minimum interest threshold is unfair to itemsets with different supports. One might suggest a variable mi for itemsets with different supports; this approach might work, but the problem then becomes how to vary the value of mi fairly.

In fact, the essence of the Piatetsky-Shapiro measure interest(A, B) = |s(A ∪ B) − s(A)s(B)| is that the itemsets A, B are uninteresting if s(A ∪ B) − s(A)s(B) ≈ 0; that is, if the correlation of itemsets A and B is not strong enough, A, B are not interesting and can be pruned. So we can use the correlation coefficient as a substitute. The correlation coefficient measures the degree of linear dependency between a pair of random variables. The correlation coefficient between A and B can be written as

ρ(A, B) = (s(A ∪ B) − s(A)s(B)) / √(s(A)(1 − s(A)) · s(B)(1 − s(B))),   (7)

where s(·) ∈ (0, 1) [3, 11]. The range of ρ(A, B) is between −1 and +1. The correlation coefficient and its strength are discussed in [13]. Following that book, a value α (0 ≤ α ≤ 1) is used to express the correlation strength: at α = 0.5 the strength is large, at 0.3 moderate, and at 0.1 small. This means that itemsets whose correlation is less than 0.1 are of little value. In real applications, the value of α can be given by users or experts.
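For concreteness, the pruning condition encoded by f can be checked numerically. The following minimal Python sketch (the function names are mine, not from the paper's implementation) evaluates Equations (3)/(6) and shows that f equals 1 exactly when both the support and interest thresholds are met:

```python
def interest(s_ab, s_a, s_b):
    """Piatetsky-Shapiro measure |s(A ∪ B) - s(A)s(B)|."""
    return abs(s_ab - s_a * s_b)

def f(s_ab, intr, ms, mi):
    """Equations (3)/(6): equals 1 iff s(A ∪ B) >= ms and interest >= mi."""
    numerator = s_ab + intr - (ms + mi) + 1
    denominator = abs(s_ab - ms) + abs(intr - mi) + 1
    return numerator / denominator

# Both thresholds met -> f = 1; either one violated -> f < 1.
print(f(0.03, interest(0.03, 0.1, 0.1), ms=0.02, mi=0.01))    # 1.0
print(f(0.015, interest(0.015, 0.1, 0.1), ms=0.02, mi=0.01))  # ~0.98
```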
Table 1. Data for comparison between interest(A, B) and ρ(A, B): panel (a) fixes s(A) = s(B) = 0.01 and panel (b) fixes s(A) = s(B) = 0.1; each panel has columns s(A), s(B), s(A ∪ B), interest(A, B) and ρ(A, B)

Now let us compare interest(A, B) and ρ(A, B). The data in Table 1 cover the cases in which s(A) and s(B) equal 0.01 and 0.1. The range of interest(A, B) is [0.0001, 0.0099] in Table 1(a), where s(A), s(B) = 0.01, and [0, 0.09] in Table 1(b), where s(A), s(B) = 0.1. This means the range of interest(A, B) is greatly influenced by the values of s(A) and s(B). In contrast, the range of ρ(A, B) is [0, 1] in both Table 1(a) and Table 1(b); that is, the range of ρ(A, B) is influenced only by the correlation strength of A and B, not by the values of s(A) and s(B). Fig. 1(a), (b) plots the data of Table 1(a), (b) respectively, showing the changes in the ranges of interest(A, B) and ρ(A, B) more clearly.

Fig. 1. The changes of the data in Table 1 (a), (b) (plotted against s(A), s(B) and s(A ∪ B); curves: interest(A, B) and |ρ(A, B)|)

Although the data in Table 1 only show cases where s(A) and s(B) are small, we get the same result when s(A) and s(B) are large, or when one is small and the other is large. So we can set a minimum correlation strength ρ_MCS (0 ≤ ρ_MCS ≤ 1) as a constraint to prune weakly correlated itemsets. In detail, for itemsets A, B ⊆ I with A ∩ B = ∅, if |ρ(A, B)| < ρ_MCS, then A ∪ B is uninteresting and is pruned. This is the MCS pruning method. We do not need to modify Equations (1)-(6); the only changes are: 1) replace interest(A, B) = |s(A ∪ B) − s(A)s(B)| with interest(A, B) = |ρ(A, B)|; and 2) replace mi with ρ_MCS. We do not need to modify the algorithm Apriori_IMLMS either, but we rename it Apriori_IMLMS_MCS to distinguish the two.
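The scale dependence illustrated by Table 1 is easy to reproduce (a hypothetical numerical check, not the paper's experiment): fixing s(A) = s(B) = s and letting A and B always co-occur, the maximal attainable interest grows with s while ρ stays at 1 in both cases:

```python
for s in (0.01, 0.1):
    s_ab = s                              # maximal overlap: s(A ∪ B) = s
    intr = abs(s_ab - s * s)              # Piatetsky-Shapiro interest
    r = (s_ab - s * s) / (s * (1 - s))    # Equation (7) with s(A) = s(B) = s
    print(f"s={s}: max interest={intr:.4f}, rho={r:.1f}")
# s=0.01: max interest=0.0099, rho=1.0
# s=0.1: max interest=0.0900, rho=1.0
```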
2.3 Algorithm

Algorithm Apriori_IMLMS_MCS
Input: TD: transaction database; ms(k) (k = 0, 1, ..., n): minimum support thresholds; ρ_MCS: minimum correlation strength;
Output: FIS: set of interesting frequent itemsets; infis: set of interesting infrequent itemsets;
(1) FIS = ∅; infis = ∅;
(2) temp_1 = {A | A ∈ 1-itemsets, s(A) ≥ ms(0)};
    FIS_1 = {A | A ∈ temp_1 ∧ s(A) ≥ ms(1)};
    infis_1 = temp_1 − FIS_1;
(3) for (k = 2; temp_{k−1} ≠ ∅; k++) do begin
    (3.1) C_k = apriori_gen(temp_{k−1}, ms(0));
    (3.2) for each transaction t ∈ TD do begin /* scan transaction database TD */
              C_t = subset(C_k, t);
              for each candidate c ∈ C_t
                  c.count++;
          end
    (3.3) temp_k = {c | c ∈ C_k ∧ (c.count / |TD|) ≥ ms(0)};
          FIS_k = {A | A ∈ temp_k ∧ A.count / |TD| ≥ ms(k)};
          infis_k = temp_k − FIS_k;
    (3.4) /* prune all uninteresting itemsets in FIS_k */
          for each itemset M in FIS_k do
              if NOT fipi(M) then FIS_k = FIS_k − {M};
    (3.5) /* prune all uninteresting itemsets in infis_k */
          for each itemset N in infis_k do
              if NOT iipi(N) then infis_k = infis_k − {N};
    end
(4) FIS = ∪_k FIS_k; infis = ∪_k infis_k;
(5) return FIS and infis;

For further explanation of the algorithm Apriori_IMLMS_MCS, see [6].
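As a complement to the pseudocode, here is a compact, runnable Python sketch of the whole procedure (my own reconstruction under stated assumptions: the set-join below stands in for apriori_gen, the MCS-based split test stands in for the fipi/iipi checks, and the toy transactions are illustrative, not the paper's dataset):

```python
from itertools import combinations

def support(itemset, td):
    """s(A) = fraction of transactions in TD containing A."""
    return sum(itemset <= t for t in td) / len(td)

def rho(a, b, td):
    """Equation (7): correlation coefficient of disjoint itemsets A, B."""
    sa, sb, sab = support(a, td), support(b, td), support(a | b, td)
    den = (sa * (1 - sa) * sb * (1 - sb)) ** 0.5
    return (sab - sa * sb) / den if den else 0.0

def potential_interest(m, td, rho_mcs):
    """MCS version of the fipi/iipi split test: keep M if some split
    A ∪ B = M with A ∩ B = ∅ has |rho(A, B)| >= rho_mcs."""
    for r in range(1, len(m)):
        for a in map(frozenset, combinations(sorted(m), r)):
            if abs(rho(a, m - a, td)) >= rho_mcs:
                return True
    return False

def apriori_imlms_mcs(td, ms, rho_mcs):
    """ms maps k -> ms(k); ms[0] is the floor for infrequent itemsets."""
    fis, infis = set(), set()
    items = sorted({i for t in td for i in t})
    temp = [frozenset([i]) for i in items
            if support(frozenset([i]), td) >= ms[0]]   # temp_1
    k = 1
    while temp:
        for c in temp:   # split temp_k into FIS_k / infis_k, prune by MCS
            keep = k == 1 or potential_interest(c, td, rho_mcs)
            if support(c, td) >= ms.get(k, ms[0]):
                if keep:
                    fis.add(c)
            elif keep:
                infis.add(c)
        k += 1
        # join step of apriori_gen, keeping only candidates above ms(0)
        cand = {a | b for a in temp for b in temp if len(a | b) == k}
        temp = [c for c in cand if support(c, td) >= ms[0]]
    return fis, infis

# Toy run on five hypothetical transactions:
td = [frozenset(t) for t in
      (["a", "b"], ["a", "b", "c"], ["a", "c"], ["b", "d"], ["a", "b"])]
fis, infis = apriori_imlms_mcs(td, ms={0: 0.2, 1: 0.4, 2: 0.4, 3: 0.4},
                               rho_mcs=0.1)
print(sorted(map(sorted, fis)))    # interesting frequent itemsets
print(sorted(map(sorted, infis)))  # interesting infrequent itemsets
```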
3 Experiments

The real dataset records the areas of www.microsoft.com that each user visited in a one-week timeframe in February 1998. Summary statistics of the dataset: 32,711 training instances, 5,000 testing instances, 294 attributes, and a mean of 3.0 area visits per case (.../CIS788_dm_proj.htm#datasets).

Table 2. The number of interesting infis and interesting FIS with different ρ_MCS (ms(1) = 0.025, ms(2) = 0.02, ms(3) = 0.017, ms(4) = 0.013, ms(0) = 0.01): rows ρ_MCS = 0, 0.05, 0.1, 0.2, 0.25, 0.3; columns FIS and infis counts for k = 1, 2, 3, 4 and in total

Table 2 shows the number of interesting infrequent itemsets and interesting frequent itemsets with different ρ_MCS when ms(1) = 0.025, ms(2) = 0.02, ms(3) = 0.017, ms(4) = 0.013 and ms(0) = 0.01. From Table 2 we can see that the total number of FIS and infis is 150, 121, 95, 53, 39 and 25 when ρ_MCS = 0, 0.05, 0.1, 0.2, 0.25 and 0.3 respectively. As ρ_MCS increases, the total number decreases markedly. Table 2 also gives the number of FIS and infis for each k. These data show that the MCS pruning method prunes uninteresting itemsets as effectively as the IMLMS model.

4 Conclusions

To prune the uninteresting itemsets discovered by the MLMS model, the IMLMS model was proposed using the pruning measure interest(A, B) = |s(A ∪ B) − s(A)s(B)| < mi. This measure, however, is not easy for users to set because interest(A, B) depends heavily on the values of s(·), as discussed in Section 2.2. So in this paper a new measure, minimum correlation strength (MCS), has been proposed as a substitute: if |ρ(A, B)| < ρ_MCS, then A ∪ B is uninteresting and is pruned. Theoretical analysis and experimental results show that MCS performs better than interest and that its value is very easy to set.

Acknowledgements. This work was partly supported by the Excellent Young Scientist Foundation of Shandong Province of China under Grant No. 2006BS

References

1. Agrawal, R., Imielinski, T., Swami, A.: Mining Association Rules between Sets of Items in Large Databases. In: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pp. 207-216. ACM Press, New York (1993)
2. Wu, X., Zhang, C., Zhang, S.: Efficient Mining of both Positive and Negative Association Rules. ACM Transactions on Information Systems 22(3), 381-405 (2004)
3. Dong, X., Niu, Z., Shi, X., Zhang, X., Zhu, D.: Mining Both Positive and Negative Association Rules from Frequent and Infrequent Itemsets. In: Alhajj, R., Gao, H., Li, X., Li, J., Zaïane, O.R. (eds.) ADMA 2007. LNCS (LNAI), vol. 4632. Springer, Heidelberg (2007)
4. Dong, X., Wang, S., Song, H.: 2-level Support based Approach for Mining Positive & Negative Association Rules. Computer Engineering (2005)
5. Dong, X., Zheng, Z., Niu, Z., Jia, Q.: Mining Infrequent Itemsets based on Multiple Level Minimum Supports. In: Proceedings of the Second International Conference on Innovative Computing, Information and Control (ICICIC 2007), Kumamoto (2007)
6. Dong, X., Niu, Z., Zhu, D., Zheng, Z., Jia, Q.: Mining Interesting Infrequent and Frequent Itemsets based on MLMS Model. In: Tang, C., Ling, C.X., Zhou, X., Cercone, N.J., Li, X. (eds.) ADMA 2008. LNCS (LNAI), vol. 5139. Springer, Heidelberg (2008)
7. Tan, P.-N., Kumar, V., Srivastava, J.: Selecting the Right Interestingness Measure for Association Patterns. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton (2002)
8. Tan, P.-N., Kumar, V., Srivastava, J.: Selecting the right objective measure for association analysis. Information Systems 29(4), 293-313 (2004)
9. Hilderman, R.J., Hamilton, H.J.: Applying Objective Interestingness Measures in Data Mining Systems. In: Zighed, D.A., Komorowski, J., Żytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910. Springer, Heidelberg (2000)
10. Geng, L., Hamilton, H.J.: Interestingness Measures for Data Mining: A Survey. ACM Computing Surveys 38(3), Article 9 (2006)
11. Antonie, M.-L., Zaïane, O.: Mining Positive and Negative Association Rules: An Approach for Confined Rules. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) PKDD 2004. LNCS (LNAI), vol. 3202. Springer, Heidelberg (2004)
12. Piatetsky-Shapiro, G.: Discovery, analysis, and presentation of strong rules. In: Knowledge Discovery in Databases, pp. 229-248. AAAI Press/MIT Press (1991)
13. Cohen, J.: Statistical Power Analysis for the Behavioral Sciences, 2nd edn. Lawrence Erlbaum, New Jersey (1988)