Integrating Classification and Association Rules by proposing adaptations to the CBA Algorithm


Davy Janssens, Geert Wets, Tom Brijs, Koen Vanhoof
Limburg University Centre, Universitaire Campus, gebouw D, B-3590 Diepenbeek, Belgium

Abstract

In recent years, extensive research has been carried out by focusing on association rules to build more accurate classifiers. These integrated approaches mainly focus on a limited subset of association rules, i.e. those rules where the consequent of the rule is restricted to the classification class attribute. This paper aims to contribute to this integrated framework by adapting the CBA (Classification Based on Associations) algorithm. CBA was adapted by coupling it with another measure of the quality of association rules: intensity of implication. The new algorithm has been implemented and empirically tested on an authentic financial dataset for purposes of bankruptcy prediction. We validated our results with an association ruleset, with C4.5, with original CBA and with CART by statistically comparing its performance via the area under the ROC-curve. The adapted CBA algorithm presented in this paper proved to generate significantly better results than the other classifiers at the 5% level of significance.

1. Introduction

Classification and association-rule discovery are undoubtedly the two tasks most addressed in data mining literature. Association rules have received significant attention for extracting knowledge from large databases. Their study is focused on using exhaustive search to find all rules in data that satisfy the user-specified minimum support and minimum confidence criteria. The Apriori algorithm is the best known algorithm in this field [1]. Probably an even more popular technique is classification rule mining. It aims to discover a small set of rules to form an accurate classifier. Given a set of cases with class labels as a training set, classification is to build a model (called classifier) to predict future data objects for which the class label is unknown.
Quinlan's C4.5 classification system [2] is known as the state-of-the-art method in classification rule mining. In recent years, extensive research has been carried out to integrate both approaches. By focusing on a limited subset of association rules, i.e. those rules where the consequent of the rule is restricted to the classification class attribute, it is possible to build more accurate classifiers. Several publications ([3], [4], [5] and [6]) have shown that association-based classification in general generates better accuracy than state-of-the-art classification algorithms such as C4.5. The reasons for the good performance are obvious. Association rules will search globally for all rules that satisfy minimum support and minimum confidence norms. They will therefore contain the full set of rules, which may incorporate important information. The richness of the rules gives this technique the potential of reflecting the true classification structure in the data [6]. Associative classification is therefore gaining increasing popularity.

However, the comprehensiveness and complexity of dealing with the often large number of association rules also brings weaknesses and difficulties, which are the subject of much ongoing research. Contributions to tackle a number of these difficulties can be found in [4], in [6] and in [7]. Liu, Ma & Wong even proposed an improvement of their original CBA (classification based on associations) system [3] in [8] to cope with those weaknesses. In spite of the fact that the presented adaptations of CBA are valuable, some important issues still remain unsolved. Our goal is to address them in this paper.

The potential weakness which we were able to determine is situated in the way CBA sorts its (class) association rules. As will be explained in section 2, the sorting in CBA is quite important because the rules for the final classifier will be selected by following the sorted sequence. CBA sorts its rules by using the conditional probability (confidence). This is a good measure when classes are equally distributed.
However, as we will show, when class distributions differ significantly, and especially for classes whose frequency is low, this is not the most adequate approach to follow. For this reason, we propose intensity of implication as a better measure to sort the class association rules. Section 3.1 elaborates on this issue in more detail. Apart from this, the CBA algorithm which we have implemented also traces the evolution of

the number of false positives (FP) and false negatives (FN), and not only of the total number of errors, as the original CBA algorithm does. The potential advantages of this are discussed in section 3.2. Section 4 presents the results of our empirical evaluation, and finally conclusions and recommendations for further research are presented in section 5.

2. Classification Based on Associations

Before we elaborate further on the changes we made to CBA, this section gives the reader a brief overview of the details of the original algorithm.

Agrawal, Imielinski, and Swami [9] introduced the concepts behind association rules and suggested algorithms for finding such rules. They provided the following formal description of this technique. Let I = {i_1, i_2, ..., i_k} be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. We say that a transaction T contains X, a set of items in I, if X ⊆ T. An association rule is an implication of the form X => Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. The rule X => Y holds in the transaction set D with confidence c if c% of transactions in D that contain X also contain Y. The rule X => Y has support s in the transaction set D if s% of transactions in D contain X ∪ Y. Given a set of transactions D, the problem of mining association rules is to generate all association rules that have support and confidence greater than a user-specified minimum support (minsup) and minimum confidence (minconf).

To make associations suitable for the classification task, the CBA method focuses on a special subset of association rules, i.e. association rules with a consequent limited to class label values only, called class association rules (CARs). Thus, we only need to generate those rules of the form A => c_i, where c_i is a possible class. Therefore, the Apriori algorithm, which is widely used for generating association rules, was modified to build the CARs. Details about how this is done can be found in [3].
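The definitions of support and confidence above can be illustrated with a minimal Python sketch. The function name `support_confidence`, the item strings and the toy transactions are hypothetical and only serve the illustration; the class label is simply stored as one more item, so that a class association rule is an ordinary rule whose consequent is a class item.

```python
from typing import FrozenSet, List, Tuple

Transaction = FrozenSet[str]


def support_confidence(
    transactions: List[Transaction],
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
) -> Tuple[float, float]:
    """Return (support, confidence) of antecedent => consequent as fractions."""
    n = len(transactions)
    # Transactions that contain X, and transactions that contain X union Y.
    covers_x = sum(1 for t in transactions if antecedent <= t)
    covers_xy = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = covers_xy / n if n else 0.0
    confidence = covers_xy / covers_x if covers_x else 0.0
    return support, confidence


# Hypothetical toy transactions; the class label is stored as an ordinary item.
transactions = [
    frozenset({"service=rarely", "age=young", "class=dissatisfied"}),
    frozenset({"service=rarely", "age=young", "class=satisfied"}),
    frozenset({"service=always", "age=young", "class=satisfied"}),
    frozenset({"service=always", "age=old", "class=satisfied"}),
]
sup, conf = support_confidence(
    transactions,
    antecedent=frozenset({"service=rarely", "age=young"}),
    consequent=frozenset({"class=dissatisfied"}),
)
print(f"support={sup:.2f}, confidence={conf:.2f}")
```

In this toy set the antecedent covers two transactions, one of which also carries the class item, so the rule has a support of 25% and a confidence of 50%.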
To reduce the number of rules generated, the algorithm performs two types of pruning. The first type is the pessimistic error rate used in [2]. The second type of pruning is known as database coverage pruning [7]. Building a classifier in CBA is therefore also largely based on this coverage pruning method, which is applied after all the CARs have been generated. The original algorithm which is used in CBA is shown in figure 1.

Before the pruning, the algorithm will first rank all the CARs. Given two rules r_i and r_j, r_i > r_j (r_i is said to have a higher rank than r_j) if (1) conf(r_i) > conf(r_j); or (2) conf(r_i) = conf(r_j), but sup(r_i) > sup(r_j); or (3) conf(r_i) = conf(r_j) and sup(r_i) = sup(r_j), but r_i is generated before r_j. Following this sorted descending order, if at least one case among all the cases covered by the rule is classified correctly by the rule, the rule is inserted into the classifier and all the cases it covers are removed from the database. The rule insertion stops when either all of the rules are used or no cases are left in the database. The majority class among all cases left in the database is selected as the default class. The default class is used when there are no covering rules. Then, the algorithm computes the total number of errors, which is the sum of the number of errors that have been made by the selected rules in the current classifier and the number of errors made by the default class in the training data. After this process, the first rule which has the least number of errors is identified as the cutoff rule. All the rules after this rule are not included in the final classifier since they only produce more errors [3].
R = sort(R);
for each rule r ∈ R in sequence do
    temp = ø;
    for each case d ∈ D do
        if d satisfies the conditions of r then
            store d.id in temp and mark r if it correctly classifies d;
    if r is marked then
        insert r at the end of C;
        delete all the cases with the ids in temp from D;
        select a default class for the current C;
        compute the total number of errors of C;
    end
end
Find the first rule p in C with the lowest total number of errors and drop all the rules after p in C;
Add the default class associated with p to the end of C and return C (our classifier)

Figure 1: Building a classifier in CBA (Liu, Hsu, Ma, 1998)
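The procedure of figure 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the CBA implementation itself: the `Rule` dataclass, the function names `build_classifier` and `predict`, and the toy data are hypothetical, the rules are assumed to be given in generation order with confidence and support already computed, and the fallback used for the default class when no cases remain is our own choice.

```python
from collections import Counter
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Rule:
    antecedent: FrozenSet[str]
    label: str
    confidence: float
    support: float


Case = Tuple[FrozenSet[str], str]  # (items, true class label)


def build_classifier(rules: List[Rule], cases: List[Case]) -> Tuple[List[Rule], str]:
    """Database coverage selection in the spirit of figure 1."""
    if not cases:
        raise ValueError("at least one training case is required")
    # sorted() is stable, so full ties keep generation order (criterion 3).
    ranked = sorted(rules, key=lambda r: (-r.confidence, -r.support))
    remaining = list(cases)
    selected = []  # (rule, default class, total errors) after each insertion
    rule_errors = 0
    for rule in ranked:
        if not remaining:
            break
        covered = [c for c in remaining if rule.antecedent <= c[0]]
        if not any(label == rule.label for _, label in covered):
            continue  # rule is not marked: it classifies no covered case correctly
        rule_errors += sum(1 for _, label in covered if label != rule.label)
        remaining = [c for c in remaining if not rule.antecedent <= c[0]]
        counts = Counter(label for _, label in remaining)
        # Assumption: if no cases are left, reuse the label of the last rule.
        default = counts.most_common(1)[0][0] if counts else rule.label
        default_errors = len(remaining) - counts[default]
        selected.append((rule, default, rule_errors + default_errors))
    if not selected:
        majority = Counter(label for _, label in cases).most_common(1)[0][0]
        return [], majority
    # min() returns the first minimum, i.e. the first rule with the fewest errors.
    best = min(range(len(selected)), key=lambda i: selected[i][2])
    return [r for r, _, _ in selected[: best + 1]], selected[best][1]


def predict(classifier: List[Rule], default: str, items: FrozenSet[str]) -> str:
    """Apply the first covering rule, otherwise the default class."""
    for rule in classifier:
        if rule.antecedent <= items:
            return rule.label
    return default


# Hypothetical toy data.
cases = [
    (frozenset({"a"}), "pos"),
    (frozenset({"a"}), "pos"),
    (frozenset({"a", "b"}), "neg"),
    (frozenset({"b"}), "neg"),
    (frozenset({"c"}), "neg"),
]
rules = [
    Rule(frozenset({"a"}), "pos", confidence=2 / 3, support=0.4),
    Rule(frozenset({"b"}), "neg", confidence=1.0, support=0.4),
]
classifier, default_class = build_classifier(rules, cases)
```

On the toy data the rule on item "b" is ranked first because of its higher confidence, both rules are kept, and the default class is "neg".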

3. Identifying weaknesses and proposing adaptations to CBA

3.1 Using intensity of implication to sort the CARs

A close examination of the algorithm shows that a potential weakness is the way the rules are sorted. Since rules are inserted in the classifier following the sorted confidence order, this will determine to a large extent the accuracy of our final classifier. Confidence is a good measure for the quality of (class) association rules but also suffers from weaknesses. When for a particular class the minsup parameter is set to 1% or even lower, it might very well happen that some rules have a high confidence but are confirmed by a very limited number of instances, and that those rules stem from noise only. This is why it is always dangerous to look for implications with small support even though these rules might look very interesting. This danger seems to exist all the more in CBA because the application which implements the algorithm even offers a possibility to include rules with high confidence that do not satisfy the minimum support threshold in the final classifier. As a result, choosing the most confident rules may not always be the best selection criterion. Therefore, a measure which was introduced by Gras & Lahrer [10], i.e. intensity of implication, was used in the adjustment of the original CBA algorithm. Intensity of implication measures the distance to random choices of small, even non statistically significant, subsets. In other words, it measures the statistical surprise of having so few counterexamples to a rule as compared with a random draw [11]. Intensity of implication can easily be derived from [12] as:

$$\varphi(X \Rightarrow Y) = 1 - \sum_{k=0}^{K = a\bar{b}} \frac{\left(\frac{a\,(n-b)}{n}\right)^{k}}{k!}\; e^{-\frac{a\,(n-b)}{n}} \qquad (*)$$

where n is the number of cases, a is the number of cases covered by the antecedent and b is the number of cases covered by the consequent of the rule.
The coefficient $ab$ represents the number of cases which are covered by the antecedent and the consequent of the rule, while $a\bar{b}$ stands for the number of cases which are covered by the antecedent but not by the consequent of the rule. Since confidence and support are standard measures for determining the quality of association rules, it is convenient to incorporate them in the formula above. The straightforward procedure for doing this is shown in Appendix A and the final result is given in the formula below, where cases corresponds to n and abssupcons to b, the absolute support of the consequent.

$$\varphi(X \Rightarrow Y) = 1 - \sum_{k=0}^{K} \frac{\left(\frac{\text{support}\cdot\text{cases}}{\text{confidence}} \cdot \frac{\text{cases}-\text{abssupcons}}{\text{cases}}\right)^{k}}{k!}\; e^{-\frac{\text{support}\cdot\text{cases}}{\text{confidence}} \cdot \frac{\text{cases}-\text{abssupcons}}{\text{cases}}}, \qquad K = \text{support}\cdot\text{cases}\cdot\left(\frac{1}{\text{confidence}} - 1\right)$$

Guillaume et al. [11] claim that the relevance of the discovered association rules can be significantly improved by using intensity of implication. The measure is also clearly noise-resistant, and for those reasons we expect a more appropriate sorting and, as a consequence, a better performance of the final classifier. Our experimental evaluation in section 4 will verify whether this is the case.

3.2 Tracing FP and FN separately

As pointed out in figure 1, the CBA algorithm computes the total number of errors. In our implementation, we have chosen to trace the evolution of the number of false positives (FP) and false negatives (FN) separately. The final result is the same but our implementation tends to be more transparent since our program generates a confusion matrix for every rule which is added to the classifier. As a result, the evolution of both types of errors can be analysed in detail and, by means of visual inspection, an appropriate cutpoint for our classifier can be indicated. More specifically, the point at which the number of false positives surpasses the number of true positives is taken as the cutpoint, since the performance of the classifier can no longer be improved by adding more rules to the classifier. We will come back to this in our empirical section, described below.
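The intensity of implication of section 3.1 can be computed from the four counts n, a, b and ab. The following Python sketch is one possible implementation under our own assumptions: the function names are hypothetical, the Poisson terms are evaluated in log space (via `math.lgamma`) purely for numerical stability, and the two example rules are invented. They share the same confidence of 75% on a dataset with the class distribution reported in section 4.1 (445 positive cases out of 7264), but differ in the number of cases that confirm them.

```python
import math


def poisson_cdf(k_max: int, lam: float) -> float:
    """P(N <= k_max) for N ~ Poisson(lam), summed term by term in log space."""
    if lam == 0.0:
        return 1.0  # a Poisson variable with mean 0 is always 0
    total = sum(
        math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
        for k in range(k_max + 1)
    )
    return min(1.0, total)


def intensity_of_implication(n: int, a: int, b: int, ab: int) -> float:
    """Intensity of implication from counts.

    n: number of cases, a: cases covered by the antecedent,
    b: cases covered by the consequent, ab: cases covered by both.
    """
    counterexamples = a - ab      # K: antecedent holds, consequent does not
    lam = a * (n - b) / n         # expected counterexamples under a random draw
    return 1.0 - poisson_cdf(counterexamples, lam)


# Two hypothetical rules with the same confidence (0.75) but different support.
n_cases, n_positive = 7264, 445
phi_small = intensity_of_implication(n_cases, a=4, b=n_positive, ab=3)
phi_large = intensity_of_implication(n_cases, a=40, b=n_positive, ab=30)
print(phi_small, phi_large)
```

Sorting by confidence cannot separate these two rules, whereas intensity of implication ranks the rule confirmed by 30 cases above the rule confirmed by only 3.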

4. Empirical Section

4.1. Description of the data

The training data used for this study concern a satisfaction survey that was conducted among customers of a major bank in Belgium. Nationwide, 7264 customers of the bank filled out a questionnaire. This questionnaire includes questions probing for the level of satisfaction with respect to specific service aspects of the bank, questions on socio-demographic characteristics of the customers and a question probing for the overall level of satisfaction. Customers were asked to indicate to what extent they could agree with the statements presented in the questionnaire. All statements related to the bank's service aspects were measured on a 5-level ordinal scale with responses ranging from always (5), most often (4), sometimes (3), rarely (2), to never (1), plus no opinion, the latter indicating a missing value. In consultation with bank management, the response values for the target attribute (overall level of satisfaction) were recoded into 2 groups, combining the responses always and most often into satisfied, and sometimes, rarely and never into dissatisfied. Eventually, a total of 7264 instances were obtained, of which only 445 (6.1%) were classified in the group of dissatisfied customers, illustrating the skewness of the class frequency distribution. The dissatisfied customers are arbitrarily defined as the positive class; the satisfied customers represent the negative class. The test data comprise the same satisfaction survey conducted by the same bank, but carried out in a different period.

4.2. Building the classifier

First, all the class association rules were generated by using multiple minimum support criteria. This means that in our experimental study, 6 different models were built by varying the minimum support threshold for the class of satisfied customers from 10% to 35%, while for the class of dissatisfied customers the threshold was varied from 0.5% to 3%. Those multiple minimum support criteria reflect the skewed class frequency distribution.
Minimum confidence norms were kept relatively low at 10% to exploit the effectiveness of sorting the class association rules by means of intensity of implication. After this, the modifications of the original CBA algorithm were implemented. As explained above, this resulted in a confusion matrix for every rule which was added to the classifier. The evolution of the number of false negatives, true positives and false positives is depicted in figure 2. The figure shows that on the left-hand side of the first vertical line, the number of TP lies above the number of FP. In other words, the accuracy of the classifier can still be improved in this region by adding more rules to the classifier. However, as both curves slowly grow towards each other, the arrow indicates the point where FP surpasses TP. This point is taken as the cutoff point, and adding more rules to the classifier would not result in a better classifier for our training data. In the next section, the constructed classifier is compared to other classifiers on our independent test set by means of ROC curve analysis.

Figure 2: The evolution of the number of FP, TP and FN for every rule which is added to the classifier (x-axis: number of rules in the potential classifier; y-axis: number of FP, TP, FN)
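The tracing of TP, FP and FN per added rule, and the choice of the cutoff point, can be sketched as follows. This is an illustrative Python sketch and not the program used in the study: the names `trace_confusion` and `cutpoint`, the item strings and the toy cases are hypothetical, the rules are assumed to be already sorted, and we interpret the cutoff as keeping the rules added before FP first exceeds TP.

```python
from typing import FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], str]  # (antecedent items, predicted class)
Case = Tuple[FrozenSet[str], str]  # (items, true class)


def trace_confusion(
    rules: List[Rule], cases: List[Case], positive: str, default: str
) -> List[Tuple[int, int, int]]:
    """Return (TP, FP, FN) for the classifier made of the first m rules, m = 1..len(rules)."""
    trace = []
    for m in range(1, len(rules) + 1):
        tp = fp = fn = 0
        for items, truth in cases:
            # First covering rule decides; otherwise the default class applies.
            predicted = next(
                (label for ante, label in rules[:m] if ante <= items), default
            )
            if predicted == positive and truth == positive:
                tp += 1
            elif predicted == positive:
                fp += 1
            elif truth == positive:
                fn += 1
        trace.append((tp, fp, fn))
    return trace


def cutpoint(trace: List[Tuple[int, int, int]]) -> int:
    """Number of rules to keep: those added before FP first exceeds TP."""
    for m, (tp, fp, _fn) in enumerate(trace, start=1):
        if fp > tp:
            return m - 1
    return len(trace)


# Hypothetical toy data: "dissatisfied" is the positive class.
toy_cases = [
    (frozenset({"q1=rarely"}), "dissatisfied"),
    (frozenset({"q1=rarely"}), "dissatisfied"),
    (frozenset({"q1=rarely", "q2=sometimes"}), "satisfied"),
    (frozenset({"q2=sometimes"}), "satisfied"),
    (frozenset({"q2=sometimes"}), "satisfied"),
    (frozenset({"q2=sometimes"}), "satisfied"),
    (frozenset({"q2=sometimes"}), "dissatisfied"),
    (frozenset({"q3=always"}), "satisfied"),
]
toy_rules = [
    (frozenset({"q1=rarely"}), "dissatisfied"),
    (frozenset({"q2=sometimes"}), "dissatisfied"),
]
toy_trace = trace_confusion(toy_rules, toy_cases, "dissatisfied", "satisfied")
keep = cutpoint(toy_trace)
```

In the toy example the second rule removes the last false negative but adds more false positives than true positives, so only the first rule is kept.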

4.3. Results: Using ROC-curve analysis to compare different classifiers

ROC analysis uses what is called a ROC space to give a graphical representation of a classifier's performance independently of class distributions or error costs. This ROC space is a coordinate system where the rate of true positives is plotted on the Y-axis and the rate of false positives is plotted on the X-axis. The true positive rate is defined as the fraction of positive cases classified correctly relative to the total number of positive examples. The false positive rate is defined as the fraction of negative cases classified erroneously relative to the number of all negative examples. From a visual perspective, one point in the ROC curve (representing one classifier with given parameters) is better than another if it is located more to the north-west (TP is higher, FP is lower, or both) on the ROC graph [13]. To be able to compare the performance of different classifiers with ROC curves measured on the same data, a single-number measure which reflects the performance of the classifiers is needed. The area under the ROC curve (AUC) is generally accepted as the preferred single-number measure. Trapezoidal integration was used to calculate the AUC, according to the formula given in [14]. Furthermore, to assess whether the differences between the AUCs computed from the same data set are statistically significant, hypothesis testing can be employed. The method for doing this is explained in [15]. The null hypothesis that both areas are equal was rejected when the statistical test showed a p-value below 0.05.

In our experimental study, six different classifiers of different types were evaluated. The classifiers which are evaluated are different association rule models (partial classification), C4.5, C4.5 with grouping of symbolic values in the tree (C4.5 GSV), CART, original CBA and the modified CBA presented in this paper. For a discussion of the experimental design of the first four types of classifiers, we refer to [16].
For the original CBA, the different points on the ROC graph correspond to different minimum support norms. For the adapted CBA, 6 models were built, with multiple minimum support norms and minimum confidence norms as mentioned above. The ROC curves for those different classifiers are depicted in figure 3. When pairwise comparisons between the adapted CBA algorithm and the other five classifiers were conducted, the differences turned out to be statistically significant. All the differences have p-values below 0.05. This is shown in table 1. The difference in performance between the adapted CBA algorithm and both C4.5 and C4.5 GSV was highly significant, even at a 1% level of significance. The same could not be said with respect to the AR ruleset, CART and the original CBA, but as mentioned, the modified algorithm presented in this paper proved to generate significantly better results at the 5% level of significance.

Figure 3: ROC curves comparing the performance of different classifiers on an independent test set (curves: AR ruleset, CART, C4.5, C4.5 GSV, Original CBA, Adapted CBA; x-axis ticks from 0.0% to 8.0%, y-axis ticks from 0.0% to 62.0%)
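The ROC coordinates and the trapezoidal AUC described in section 4.3 can be sketched in Python as follows. The function names and the example numbers are hypothetical, and closing the curve with the points (0, 0) and (1, 1) is our assumption rather than a detail taken from [14].

```python
from typing import Iterable, Tuple


def roc_point(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float]:
    """Return (false positive rate, true positive rate) for one confusion matrix."""
    return fp / (fp + tn), tp / (tp + fn)


def trapezoidal_auc(points: Iterable[Tuple[float, float]]) -> float:
    """Area under the piecewise-linear ROC curve through the given (FPR, TPR) points.

    Assumption: the curve is anchored at (0, 0) and (1, 1).
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    )


# Hypothetical operating points of one classifier type (one point per model).
example_points = [(0.02, 0.30), (0.05, 0.50)]
example_auc = trapezoidal_auc(example_points)
print(round(example_auc, 4))
```

Each model built with a different minimum support threshold contributes one (FPR, TPR) point, and the area is accumulated trapezoid by trapezoid between consecutive points.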

Table 1: p-values of the Adapted CBA algorithm versus other classifiers

p-values       AR ruleset   CART   C4.5     C4.5 GSV   Original CBA
Adapted CBA    …            …      <<0.01   <<0.01     …

5. Conclusion

The algorithm presented in this paper is a modified version of the CBA algorithm, which can be used to build classifiers based on association rules. CBA was adapted by coupling it with intensity of implication and by tracing the evolution of FP and FN separately. The results proved to be significantly better than the other classifiers at the 5% level of significance. Further research is still needed to verify whether similarly good results can be achieved on other datasets.

References

[1] Agrawal, R., Srikant, R. (1994). Fast algorithms for mining association rules. In Proc. of the 20th International Conference on Very Large Databases (VLDB), Santiago, Chile.
[2] Quinlan, J.R. C4.5: Programs for Machine Learning. Los Altos: Morgan Kaufmann.
[3] Liu, B., Hsu, W., Ma, Y. (1998). Integrating Classification and Association Rule Mining. In Proc. of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), New York.
[4] Dong, G., Zhang, X., Wong, L., Li, J. (1999). CAEP: Classification by aggregating emerging patterns. In Proc. of the Second International Conference on Discovery Science, Tokyo, Japan.
[5] Lent, B., Swami, A.N., Widom, J. (1997). Clustering association rules. In Proc. of the Thirteenth International Conference on Data Engineering, Birmingham, U.K.
[6] Wang, K., Zhou, S., He, Y. (2000). Growing decision tree on support-less association rules. In Proc. of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'00), Boston.
[7] Li, W., Han, J., Pei, J. (2001). CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. In Proc. of the 1st IEEE International Conference on Data Mining (ICDM 2001), San Jose, California.
[8] Liu, B., Ma, Y., Wong, C. (2001). Classification using Association Rules: Weaknesses and Enhancements. To appear in Vipin Kumar, et al. (eds), Data mining for scientific and engineering applications.
[9] Agrawal, R., Imielinski, T., Swami, A. (1993). Mining Association Rules between Sets of Items in Large Databases. In Proc. of the ACM SIGMOD Conference on Management of Data, Washington D.C.
[10] Gras, R., Lahrer, A. (1993). L'implication statistique: une nouvelle méthode d'analyse des données. Mathématiques, Informatique et Sciences Humaines 120.
[11] Guillaume, S., Guillet, F., Philippé, J. (1998). Improving the discovery of association rules with intensity of implication. In Principles of Data Mining and Knowledge Discovery, volume 1510 of Lecture Notes in Artificial Intelligence.
[12] Suzuki, E., Kodratoff, Y. (1998). Discovery of surprising exception rules based on intensity of implication. In Principles of Data Mining and Knowledge Discovery (PKDD). Berlin: Springer.
[13] Provost, F., Fawcett, T. (1997). Analysis and Visualization of Classifier Performance: Comparison under Imprecise Class and Cost Distributions. In Proc. of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, California.
[14] Bradley, A.P. (1997). The Use of the Area Under the ROC Curve in the Evaluation of Machine Learning Algorithms. Pattern Recognition, Vol. 30, Number 7.
[15] Hanley, J.A., McNeil, B.J. (1983). A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology, 148.
[16] Brijs, T., Swinnen, G., Vanhoof, K., Wets, G. (2000). Comparing complete and partial classification for identifying latently dissatisfied customers. 11th European Conference on Machine Learning, Barcelona, May 31 to June 2.

Appendix A

This appendix describes how intensity of implication can be rewritten in terms of support and confidence. Note that support = ab / n and confidence = ab / a, where n is the number of cases (cases) and b is the absolute support of the consequent (abssupcons).

Rewriting $K = a\bar{b}$ gives

$$K = a\bar{b} = a - ab = ab\cdot\frac{a}{ab} - ab = ab\left(\frac{a}{ab} - 1\right) = \frac{ab}{n}\cdot n\cdot\left(\frac{1}{ab/a} - 1\right) = \text{support}\cdot\text{cases}\cdot\left(\frac{1}{\text{confidence}} - 1\right)$$

Rewriting $\frac{a\,(n-b)}{n}$ gives

$$\frac{a\,(n-b)}{n} = \frac{ab}{n}\cdot n\cdot\frac{a}{ab}\cdot\frac{n-b}{n} = \frac{\text{support}\cdot\text{cases}}{\text{confidence}}\cdot\frac{\text{cases}-\text{abssupcons}}{\text{cases}}$$
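The two rewritings of this appendix can be checked numerically. The short Python sketch below uses hypothetical function names and invented counts; it computes K and the Poisson parameter once from the raw counts and once from support and confidence, and confirms that both routes agree up to floating-point rounding.

```python
import math
from typing import Tuple


def from_counts(n: int, a: int, b: int, ab: int) -> Tuple[float, float]:
    """Return (K, lambda) directly from the counts n, a, b and ab."""
    return float(a - ab), a * (n - b) / n


def from_support_confidence(
    cases: int, support: float, confidence: float, abssupcons: int
) -> Tuple[float, float]:
    """Return (K, lambda) from support, confidence and the consequent's absolute support."""
    k = support * cases * (1.0 / confidence - 1.0)
    lam = (support * cases / confidence) * (cases - abssupcons) / cases
    return k, lam


# Hypothetical counts: 7264 cases, 445 in the class, rule covers 40 cases, 30 correctly.
n, a, b, ab = 7264, 40, 445, 30
support, confidence = ab / n, ab / a

k_counts, lam_counts = from_counts(n, a, b, ab)
k_sc, lam_sc = from_support_confidence(n, support, confidence, b)
print(math.isclose(k_counts, k_sc), math.isclose(lam_counts, lam_sc))
```

Because K is used as the upper bound of a sum, an implementation that starts from support and confidence would typically round it to the nearest integer before summing.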
