ROC 'n' Rule Learning - Towards a Better Understanding of Covering Algorithms


Johannes Fürnkranz (juffi@oefai.at)
Austrian Research Institute for Artificial Intelligence, Schottengasse 3, A-1010 Wien, Austria

Peter A. Flach (peter.flach@bristol.ac.uk)
Dept. of Computer Science, University of Bristol, Woodland Road, Bristol BS8 1UB, UK

Abstract. This paper provides an analysis of the behavior of separate-and-conquer or covering rule learning algorithms by visualizing their evaluation metrics and their dynamics in PN-space, a variant of ROC-space. Our results show that most commonly used search heuristics, including accuracy, weighted relative accuracy, entropy, and Gini index, are equivalent to one of two fundamental prototypes: precision, which tries to optimize the area under the ROC curve for unknown costs, and a cost-weighted difference between covered positive and negative examples, which tries to find the optimal point under known or assumed costs. We also show that a straightforward generalization of the m-estimate trades off these two prototypes. Furthermore, our results show that stopping and filtering criteria like CN2's significance test focus on identifying significant deviations from random classification, which does not necessarily avoid overfitting. We also identify a problem with Foil's MDL-based encoding length restriction, which proves to be largely equivalent to a variable threshold on the recall of the rule. In general, we interpret these results as evidence that, contrary to common conception, pre-pruning heuristics are not very well understood and deserve more investigation.

1. Introduction

Most rule learning algorithms for classification problems follow the so-called separate-and-conquer or covering strategy, i.e., they learn one rule at a time, each of them explaining (covering) a part of the training examples. The examples covered by the last learned rule are removed from the training set (separated) before subsequent rules are learned (before the remaining training examples are conquered). Typically, these algorithms operate in a concept learning framework, i.e., they expect positive and negative examples for an unknown concept. From this training data, they learn a set of rules that describe the underlying concept, i.e., that explain all (or most) of the positive examples and (almost) none of the negative examples. Various approaches that adhere to this framework differ in the way single rules are learned. Common to all algorithms is that they have to use a metric or search heuristic for evaluating the quality of a candidate rule. A survey of the rule learning literature yields a vast number of rule learning metrics that have been proposed for evaluating the quality of a learned rule. Each of these metrics has its justification based in statistics, information theory, or related fields, but their relation to each other is not very well understood.

In this paper, we attempt to shed some light upon this issue by investigating commonly used evaluation metrics. Our basic tool for analysis will be visualization in PN-spaces. PN-space is quite similar to ROC-space, the two-dimensional plane in which the operating characteristics of classifiers are visualized. Our analysis will show that most commonly used heuristics, including accuracy, weighted relative accuracy, entropy, and Gini index, are equivalent to one of two fundamental prototypes: precision, which tries to optimize the area under the ROC curve for unknown costs, and a cost-weighted difference between covered positive and negative examples, which tries to find the optimal point under known or assumed costs. We also show that a straightforward generalization of the m-estimate trades off these two prototypes.

Many rule learning algorithms apply additional criteria for filtering out uninteresting rules or for stopping the refinement process at an appropriate point. Stopping and filtering criteria have received considerably less attention in the literature, mostly because their task is more frequently addressed by pruning. The few existing proposals, most notably simple thresholding of the evaluation metric, significance testing, and MDL-based criteria, turn out to be quite diverse approaches to this problem. Although our analysis will show some interesting correspondences between some of these techniques, it is also clear that stopping criteria are far from being well understood.

The outline of the paper is as follows: We begin by setting up the formal framework of our analysis, most notably the introduction of PN-spaces and isometrics (Section 2), and a brief review of covering algorithms in this context (Section 3). In Sections 4 and 5, we analyze some of the most commonly used evaluation metrics, and show that those with a linear isometric landscape all fit into a common framework. Subsequently, we discuss a few issues related to learning rule sets with the covering algorithm (Section 6). In Section 7, we will look at the most commonly used stopping and filtering criteria. Finally, we discuss a few open questions (Section 8) and related work (Section 9) before we conclude (Section 10). Parts of this paper have previously appeared in (Fürnkranz and Flach, 2003).

2. Formal Framework

In the following, we define the formal framework in which we perform our analysis. In particular, we will define PN-spaces and discuss their relation to ROC-spaces (Section 2.1), and provide formal definitions of the equivalence of evaluation metrics and point out the importance of isometrics in PN-space for identifying such equivalences (Section 2.2).

Note that in the remainder of the paper, we will use the terms search heuristic and evaluation metric interchangeably, because most algorithms do not differentiate between them. This is further discussed in Section 3.3.

2.1. ROC-SPACE AND PN-SPACE

In the following, we assume some basic knowledge of ROC analysis as it is typically used for selecting the best classifier for unknown classification costs (Provost and Fawcett, 2001). In brief, ROC analysis compares different classifiers according to their true positive rate (TPR) and false positive rate (FPR), i.e., the percentage of correctly classified positive examples and incorrectly classified negative examples. This allows classifiers to be plotted in a two-dimensional space, one dimension for each of these two measurements. The ideal classifier is at the upper left corner, at point (0, 1), while the origin (0, 0) and the point (1, 1) correspond to the classifiers that classify all examples as negative and positive, respectively. If all classifiers are plotted in this ROC-space, the best classifier is the one that is closest to (0, 1), where the distance measure used depends on the relative misclassification costs of the two classes. One can also show that all classifiers that are optimal under some (linear) cost model lie on the convex hull of these points in ROC-space. Thus, all classifiers that do not lie on this convex hull can be ruled out.

In the particular context of rule learning, we believe it is more convenient to think in terms of absolute numbers of covered examples. Thus, instead of plotting TPR over FPR, we plot the absolute number of covered positive examples over the absolute number of covered negative examples. We call such graphs PN-graphs. Figure 1 shows two examples, one for the case where the number of positive examples P exceeds the number of negative examples N, and one for the opposite case.

Figure 1. PN-curves are ROC-curves based on absolute numbers of covered examples.

In all subsequent figures, we will assume P ≥ N (the left graph of Figure 1), but this does not affect our results. A PN-graph can be turned into a ROC graph by simply normalizing the P and N axes to the scale [0, 1] × [0, 1]. Consequently, the isometrics of a function in a PN-graph can be mapped one-to-one to its isometrics in ROC-space. Nevertheless, PN-graphs have several properties that may be of interest depending on the purpose of the visualization. Most notably, the absolute numbers of covered positive and negative examples allow the covering strategy to be mapped into a sequence of nested PN-graphs. This is further discussed in Section 6.1. Table I compares some of the properties of PN-curves to those of ROC-curves.

Table I. PN-spaces vs. ROC-spaces.

    property                 ROC-space      PN-space
    x-axis                   FPR = n/N      n
    y-axis                   TPR = p/P      p
    empty theory R_0         (0, 0)         (0, 0)
    correct theory           (0, 1)         (0, P)
    universal theory R_⊤     (1, 1)         (N, P)
    resolution               (1/N, 1/P)     (1, 1)
    slope of diagonal        1              P/N
    slope of p = n line      N/P            1

2.2. ISOMETRICS AND EQUIVALENCE

In the remainder of the paper, we use capital letters to denote the total number of positive (P) and negative (N) examples in the training set, whereas p(r) and n(r) are used for the respective numbers of examples covered by a rule r. Heuristics are two-dimensional functions of the form h(p, n). We use subscripts to the letter h to differentiate between different heuristics. For brevity and readability, we will abridge h(p(r), n(r)) with h(r), and omit the argument (r) from the functions p, n, and h when it is clear from the context. Table II shows a summary of our notational conventions in the form of a four-field confusion matrix.

DEFINITION 2.1 (compatible). Two search heuristics h_1 and h_2 are compatible iff for all rules r, s: h_1(r) > h_1(s) ⇔ h_2(r) > h_2(s).
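Since h(p, n) may be regarded as a lookup table over the integer grid, Definition 2.1 can be tested by brute force. The following sketch is ours, not part of the paper; it simply compares the orderings induced by two heuristics on all pairs of grid points:

```python
from itertools import combinations, product

def compatible(h1, h2, P, N):
    """True iff h1 and h2 order all integer points (p, n) identically (Def. 2.1)."""
    points = list(product(range(P + 1), range(N + 1)))   # all (p, n) pairs
    for (p1, n1), (p2, n2) in combinations(points, 2):
        try:
            d1 = h1(p1, n1) - h1(p2, n2)
            d2 = h2(p1, n1) - h2(p2, n2)
        except ZeroDivisionError:   # e.g. precision is undefined at (0, 0)
            continue
        if (d1 > 0) != (d2 > 0) or (d1 < 0) != (d2 < 0):
            return False
    return True
```

Antagonicity (Definition 2.2 below) can be tested the same way by negating one of the heuristics, in line with Lemma 2.3.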

Table II. Notational conventions for a confusion matrix of positive and negative examples covered or not covered by a rule.

                         covered by rule    not covered by rule
    positive examples    p                  P − p                  P
    negative examples    n                  N − n                  N
                         p + n              P + N − p − n          P + N

DEFINITION 2.2 (antagonistic). Two search heuristics h_1 and h_2 are antagonistic iff for all rules r, s: h_1(r) > h_1(s) ⇔ h_2(r) < h_2(s).

Obviously, compatibility and antagonicity are dual concepts:

LEMMA 2.3. h_1 and h_2 are antagonistic iff h_1 and −h_2 are compatible.
Proof. Follows immediately from h_2(r) > h_2(s) ⇔ −h_2(r) < −h_2(s).

Note that compatible or antagonistic heuristics have identical regions of equality:

DEFINITION 2.4 (equality-preserving). Two search heuristics h_1 and h_2 are equality-preserving iff for all rules r, s: h_1(r) = h_1(s) ⇔ h_2(r) = h_2(s).

THEOREM 2.5. Compatible or antagonistic search heuristics are equality-preserving.
Proof. Assume they were not equality-preserving. This means there exist rules r and s with h_1(r) = h_1(s) but h_2(r) ≠ h_2(s). Without loss of generality assume h_2(r) > h_2(s). This implies h_1(r) > h_1(s) (for compatibility) or h_1(r) < h_1(s) (for antagonicity). Both cases contradict the assumption.

Note that equality-preserving search heuristics are not necessarily compatible or antagonistic; this implication only holds if we make some straightforward continuity assumptions. Although all subsequently discussed functions are continuous, we note that this does not have to be the case. For all practical purposes, h(p, n) may be regarded as a lookup table that defines a value h for each integer-valued pair (n, p).

Based on the above, we can define the equivalence of search heuristics. The basic idea is that search heuristics are equivalent if they order a set of candidate rules in an identical way. The definition should also capture that some learning systems try to maximize a certain heuristic function, while others try to minimize an equivalent function (e.g., maximizing information gain is equivalent to minimizing entropy).

DEFINITION 2.6 (equivalence). Two search heuristics h_1 and h_2 are equivalent (h_1 ∼ h_2) if they are either compatible or antagonistic.

This definition also underlines the importance of isometrics for the analysis of search heuristics:

DEFINITION 2.7 (isometric). An isometric of a heuristic h is a line (or curve) in PN-space that connects, for some value c, all points (n, p) for which h(p, n) = c.

Equality-preserving search heuristics can be recognized by examining their isometrics and establishing that for each isometric line of h_1 there is a corresponding isometric of h_2. Compatible and antagonistic search heuristics can be recognized by investigating corresponding isometrics and establishing that their associated heuristic values are in the same (the opposite) order.

3. Covering Rule Learning Algorithms

The most popular strategy for learning classification rules is the covering or separate-and-conquer strategy. It has its roots in the early days of machine learning in the AQ family of rule learning algorithms (Michalski, 1969; Kaufman and Michalski, 2000). It is fairly popular in propositional rule learning (cf. CN2 (Clark and Niblett, 1989; Clark and Boswell, 1991), Ripper (Cohen, 1995), or CBA (Liu et al., 1998)) as well as in inductive logic programming (ILP) (cf. Foil (Quinlan, 1990) and its successors (Džeroski and Bratko, 1992; Fürnkranz, 1994; De Raedt and Van Laer, 1995; Quinlan and Cameron-Jones, 1995), or Progol (Muggleton, 1995)). In the following, we will briefly recapitulate the key issues of this learning strategy, with a particular focus on its behavior in PN-space. For a detailed survey of this large family of algorithms we refer to (Fürnkranz, 1999).

3.1. SEPARATE-AND-CONQUER LEARNING

The defining characteristic of this family of algorithms is that they successively learn rules that explain part of the available training examples. Examples that are covered by previously learned rules are separated from the training set, and the remaining examples are conquered by recursively calling the learner to find the next rule. This is repeated until all examples are explained by a rule, or until some other, external stopping criterion specifies that the current rule set is satisfactory. Most algorithms of this family operate in a concept learning framework, i.e., they learn the definition of an unknown concept from positive examples (examples for the concept) and negative examples (counter-examples). Each of the learned rules covers some (as many as possible) of the available positive examples, and possibly some (as few as possible) of the negative examples as well. Thus, the learned rule set describes the hidden concept.

Figure 2. Schematic depiction of the paths through PN-space that are described by (left) the covering strategy of learning a theory by adding one rule at a time and (right) greedy specialization of a single rule.

For classifying new examples, each rule is tried in order. If any of the learned rules fires for a given example, the example is classified as positive. If none of them fires, the example is classified as negative. This corresponds to the closed-world assumption in the semantics of rule sets (theories) and rules (clauses) in the PROLOG programming language.

3.2. COVERING IN PN-SPACE

Adding a rule to a rule set means that more examples are classified as positive, i.e., it increases the coverage of the rule set. All positive examples that are uniquely covered by the newly added rule contribute to an increase of the true positive rate on the training data. Conversely, covering additional negative examples may be viewed as increasing the false positive rate on the training data. Therefore, adding rule r_{i+1} to rule set R_i effectively moves from point R_i = (n_i, p_i) (corresponding to the numbers of negative and positive examples that are covered by previous rules) to a new point R_{i+1} = (n_{i+1}, p_{i+1}) (corresponding to the examples covered by the new rule set). Moreover, R_{i+1} will typically be closer to (N, P) and farther away from (0, 0) than R_i. Consequently, learning a rule set one rule at a time may be viewed as a path through PN-space, where each point on the path corresponds to the addition of a rule to the theory. Such a PN-path starts at (0, 0), which corresponds to the empty theory that does not cover any examples. Adding a rule moves to a new point in PN-space, which corresponds to a theory consisting of all rules that have been learned so far. After the final rule has been learned, one can imagine adding yet another rule with a body that is always true. Adding such a rule has the effect that the theory now classifies all examples as positive, i.e., it will take us to the point R_⊤ = (N, P).
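The covering loop and the PN-path it traces can be summarized in a few lines. This is a minimal sketch of the strategy described above, not the paper's algorithm; find_best_rule and covers are hypothetical stand-ins for a concrete learner's rule refinement and matching procedures:

```python
def separate_and_conquer(pos, neg, find_best_rule, covers):
    """Learn a rule set and return it together with its PN-path [(n_i, p_i), ...]."""
    P, N = len(pos), len(neg)
    theory, path = [], [(0, 0)]        # R_0, the empty theory, sits at (0, 0)
    while pos:                         # conquer the remaining positive examples
        rule = find_best_rule(pos, neg)
        if rule is None:               # external stopping criterion (Section 7)
            break
        theory.append(rule)
        # separate: remove all examples covered by the new rule
        pos = [e for e in pos if not covers(rule, e)]
        neg = [e for e in neg if not covers(rule, e)]
        path.append((N - len(neg), P - len(pos)))   # the new point R_{i+1}
    return theory, path
```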

Figure 2 shows the PN-path for a theory with three rules. Each point R_i represents the rule set consisting of the first i rules. The final rule set learned by a classifier will typically correspond to the last point on the curve before it reaches (N, P). Note, however, that this need not be the case. This will be briefly discussed in Section 6.3.

3.3. GREEDY SPECIALIZATION

While the covering loop is essentially the same for all members of the separate-and-conquer family of rule learning algorithms, the individual members differ in the way they learn single rules. Algorithms may use stochastic (Mladenić, 1993) or genetic search (Giordana and Sale, 1992), exhaustive search (Webb, 1995; Muggleton, 1995), or employ association rule algorithms for finding a candidate set of rules (Liu et al., 1998; Jovanoski and Lavrač, 2001). Again, we refer to (Fürnkranz, 1999) for a survey. The vast majority of algorithms uses a heuristic top-down hill-climbing or beam search strategy, i.e., they search the space of possible rules by successively specializing the current best rule. Rules are specialized by greedily adding the condition which promises the highest gain according to some evaluation metric. As with adding rules to a rule set, this successive refinement describes a path through PN-space (see Figure 2, right). However, in this case, the path starts at the upper right corner (covering all positive and negative examples), and successively proceeds towards the origin (which would be a rule that is too specific to cover any example).

Note that rule learning algorithms that are based on iterative refinement of candidate rules typically use the same metric for evaluating complete and incomplete rules, i.e., rules that can be added to the theory and rules that are only interior nodes in the search graph. While the evaluation of complete rules should measure the rule's potential for classifying unseen test cases, the evaluation of an incomplete rule should capture its potential to be refined into a high-quality complete rule. In this case, the evaluation metric is used as a search heuristic. We note that, in principle, different types of search heuristics are possible (cf. also Section 8), but, like all refinement-based rule learning algorithms, we will not further differentiate between evaluation metrics and search heuristics, and use the terms interchangeably in this paper.

3.4. OVERFITTING AVOIDANCE

In the simplest case, both for refining a rule set and refining a rule, all points on its PN-path are evaluated with an evaluation metric, and the rule set or rule that corresponds to the point with the highest evaluation is selected. However, such an approach may often be too simplistic, in particular because of the danger of overfitting.

A rule learner typically evaluates a large number of candidate rules, which makes it quite likely that one of them fits the characteristics of the training set by chance. For example, simply using the fraction of positive examples covered by the rule suffers from overfitting, because one can always find a rule that covers a single positive example and no negative examples by simply picking a unique positive example and transforming it into a rule. Such a rule has an optimal value of 1, but it is unlikely to be useful, because the obtained heuristic estimate is not reliable enough to generalize to unseen data. A simple way of countering overfitting is to explicitly specify the region of PN-space for which it is believed that the provided heuristic evaluations are too weak or too unreliable to be of interest. There are two principal approaches for dealing with this kind of problem: pre-pruning algorithms employ additional evaluation heuristics for filtering out unpromising rules or for stopping the refinement process, whereas post-pruning approaches deliberately learn an overfitting rule or theory and correct it in a post-processing phase. For a discussion of the two alternatives and some proposals for combining them we refer to (Fürnkranz, 1997). In this paper, we will confine ourselves to pre-pruning heuristics (Section 7).

4. Linear Rule Evaluation Metrics

The ultimate goal of learning is to reach point (0, P) in PN-space, i.e., to learn a correct theory that covers all positive examples, but none of the negative examples. This will rarely ever be achieved in a single step; a set of rules will be needed to meet this objective. The purpose of a rule evaluation metric is to estimate how close a rule takes you to this ideal point. In the following, we analyze the most commonly used metrics for evaluating the quality of a rule in covering algorithms.

4.1. BASIC HEURISTICS

A straightforward strategy for finding a rule that covers some positive and as few negative examples as possible is to minimize the number of covered negative examples for each individual rule (or maximize their negation):

    h_n = −n

This, however, does not capture the intuition that we want to cover as many positive examples as possible. For this, h_p, the number of covered positive examples, could be used:

    h_p = p

Figure 3. Isometrics for minimizing false positives and for maximizing true positives.

Figure 3 shows the isometrics for h_n and h_p: vertical and horizontal lines. All rules that cover the same number of negative (positive) examples are evaluated equally, irrespective of the number of positive (negative) examples they cover. However, it is trivial to find theories that maximize either of them: h_n is maximal for the empty theory R_0, which does not cover any negative examples, but also no positive examples, and h_p is maximal for the universal theory R_⊤, which covers all positive but also all negative examples. Ideally, one would like to achieve both goals simultaneously.

4.2. ACCURACY, WEIGHTED RELATIVE ACCURACY, GENERAL COSTS

A straightforward way of trading off covering many positives and excluding many negatives is to simply add up h_n and h_p:

    h_acc = p − n

The isometrics for this function are shown in the left graph of Figure 4. Note that the isometrics all have a 45° angle. Thus this search heuristic basically optimizes accuracy:

THEOREM 4.1. h_acc is equivalent to accuracy.
Proof. The accuracy of a theory (which may be a single rule) is the proportion of correctly explained examples, i.e., positive examples that are covered (p) and negative examples that are not covered (N − n), among all examples (P + N). Thus the isometrics are of the form (p + (N − n))/(P + N) = c. As P and N are constant, these can be transformed into the isometrics of h_acc: p − n = c_acc = c(P + N) − N.

Figure 4. Isometrics for accuracy and weighted relative accuracy.

Accuracy has been used as a pruning heuristic in I-REP (Fürnkranz and Widmer, 1994); a modification of h_acc, which also subtracts the length of a rule, has been used in PROGOL (Muggleton, 1995). Optimizing accuracy gives equal weight to covering a single positive example and excluding a single negative example. There are cases where this choice is arbitrary, for example when misclassification costs are not known in advance or when the samples of the two classes are not representative. In such cases, it may be advisable to normalize with the sample sizes:

    h_wra = p/P − n/N = TPR − FPR

The isometrics of this heuristic are shown in the right half of Figure 4. The main difference with accuracy is that the isometrics are parallel to the diagonal, which reflects that we now give equal weight to increasing the true positive rate (TPR) or to decreasing the false positive rate (FPR). Note that the diagonal encompasses all classifiers that make random predictions. h_wra may be viewed as a simplification of weighted relative accuracy, which has been proposed as a rule learning heuristic by Lavrač et al. (1999) and Todorovski et al. (2000):

THEOREM 4.2. h_wra is equivalent to weighted relative accuracy.
Proof. Weighted relative accuracy is defined as

    h'_wra = (p + n)/(P + N) · ( p/(p + n) − P/(P + N) )

Using equivalence-preserving transformations (multiplications with constant values like P + N), we obtain

    h'_wra = 1/(P + N) · ( p − (p + n)·P/(P + N) ) ∼ p·N/(P + N) − n·P/(P + N) ∼ pN − nP ∼ p/P − n/N = h_wra

The two PN-graphs of Figure 4 are special cases of a function that allows the incorporation of arbitrary cost ratios between false negatives and false positives. The general form of this linear cost metric is

    h_costs = a·p − b·n ∼ c·p − (1 − c)·n ∼ p − d·n

Obviously, the accuracy isometrics can be obtained with a = b = d = 1 or c = 1/2, and the isometrics of weighted relative accuracy can be obtained by setting a = 1/P and b = 1/N, or c = N/(P + N), or d = P/N. In general, the slope of the parallel isometrics in the PN-graph is (1 − c)/c.
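In code, the linear heuristics of this section are one-liners, and all of them are instances of h_costs for a suitable c (a sketch in the notation above):

```python
def h_costs(p, n, c):
    """Linear cost metric c*p - (1-c)*n; its isometrics have slope (1-c)/c."""
    return c * p - (1 - c) * n

def h_acc(p, n):          # equivalent to accuracy (Theorem 4.1): c = 1/2
    return p - n

def h_wra(p, n, P, N):    # simplified weighted relative accuracy: c = N/(P+N)
    return p / P - n / N  # = TPR - FPR
```

For example, compatible(h_acc, lambda p, n: h_costs(p, n, 0.5), 10, 15) with the checker sketched in Section 2.2 returns True; the two functions differ only by the constant positive factor 1/2.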

Figure 5. Isometrics for precision.

4.3. RECALL AND PRECISION, SUPPORT AND CONFIDENCE

The most commonly used heuristic for evaluating single rules is to look at the proportion of positive examples among all examples covered by the rule. This metric is known under many different names, e.g., confidence in association rule mining, or precision in information retrieval. We will use the latter term:

    h_prec = p / (p + n)

Figure 5 shows the isometrics for this heuristic. Like h_p, precision considers all rules that cover only positive examples to be equally good (the P-axis), and like h_n, it considers all rules that cover only negative examples as equally bad (the N-axis). All other isometrics are obtained by rotation around the origin (0, 0), for which the heuristic value is undefined. Uses of the pure precision heuristic in rule learning include (Pagallo and Haussler, 1990; Weiss and Indurkhya, 1991). However, several other, seemingly more complex heuristics can be shown to be equivalent to precision, for example the heuristic that is used for pruning in Ripper (Cohen, 1995):

THEOREM 4.3. Ripper's pruning heuristic h_rip = (p − n)/(p + n) is equivalent to precision.
Proof. h_rip = (p − n)/(p + n) = (2p − (p + n))/(p + n) = 2·h_prec − 1.

In subsequent sections, we will see that more complex heuristics, such as entropy and Gini index, are also essentially equivalent to precision. On the other hand, seemingly minor modifications like the Laplace and m-estimates are not.
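As a quick numerical illustration of Theorem 4.3, the compatibility checker sketched in Section 2.2 confirms that Ripper's pruning heuristic and precision induce identical orderings (the grid size is arbitrary):

```python
h_prec = lambda p, n: p / (p + n)             # undefined at (0, 0)
h_rip  = lambda p, n: (p - n) / (p + n)       # Ripper's pruning heuristic
assert compatible(h_prec, h_rip, P=20, N=30)  # identical orderings on the grid
```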

Figure 6. Isometrics for entropy (left) and Gini index (right).

Precision and confidence are more frequently used together with their respective counterparts, recall and support:

    h_rec = p / P = TPR

As P is constant, h_rec is obviously equivalent to h_p discussed above. Support is usually defined as the fraction of examples that satisfy both the head and the body of the rule, i.e., h_supp = p/(P + N) ∼ h_rec.

4.4. INFORMATION CONTENT, ENTROPY AND GINI INDEX

Some algorithms (e.g., PRISM (Cendrowska, 1987)) measure the information content

    h_info = −log₂( p / (p + n) )

THEOREM 4.4. h_info and h_prec are antagonistic and thus equivalent.
Proof. h_info = −log₂(h_prec), thus h_info(r) > h_info(s) ⇔ h_prec(r) < h_prec(s).

The use of entropy (in the form of information gain) is very common in decision tree learning (Quinlan, 1986), but it has also been suggested for rule learning in the original version of CN2 (Clark and Niblett, 1989):

    h_ent = −( p/(p + n) · log₂(p/(p + n)) + n/(p + n) · log₂(n/(p + n)) )

Entropy is not equivalent to information content and precision, even though it seems to have the same isometrics as these heuristics (see Figure 6). The difference is that the isometrics of entropy go through the undefined point (0, 0) and continue on the other side of the 45° diagonal.

The motivation for this is that the original version of CN2 did not assume a positive class, but labeled its rules with the majority class among the examples covered. Thus rules r = (n, p) and r' = (p, n) can be considered to be of equal quality, because one of them will be used for predicting the positive class and the other for predicting the negative class. Based on this, we can, however, prove the following:

THEOREM 4.5. h_ent and h_prec are antagonistic for p ≥ n and compatible for p ≤ n.
Proof. h_ent = −h_prec·log₂(h_prec) − (1 − h_prec)·log₂(1 − h_prec) with h_prec ∈ [0, 1]. This function has its maximum at h_prec = 1/2, i.e., p = n. From the fact that it is strictly monotonically increasing for p ≤ n follows that h_prec(x) < h_prec(y) ⇒ h_ent(x) < h_ent(y) in this region. Analogously, h_prec(x) < h_prec(y) ⇒ h_ent(x) > h_ent(y) for p ≥ n, where h_ent is monotonically decreasing in h_prec.

We will say that entropy is a class-neutral version of precision. In general, a class-neutral heuristic has isometrics that are line-mirrored across a symmetry line, typically the 45° line or the diagonal. A heuristic that is not class-neutral can be made so by re-scaling it to [−1, 1] and taking the absolute value. For instance, |p − n|/(p + n) is a class-neutral version of precision, and therefore (by construction) equivalent to entropy.

In decision tree learning, the Gini index is also a very popular heuristic (Breiman et al., 1984). To our knowledge, it has not been used in rule learning, but we list it for completeness:

    h_gini = 1 − ( p/(p + n) )² − ( n/(p + n) )² ∼ p·n/(p + n)²

As can be seen from Figure 6, the Gini index has the same isometric landscape as entropy; it only differs in the distribution of the values (hence the lines of the contour plot are a little denser near the axes and less dense near the diagonal). This, however, does not change the ordering of the rules.

THEOREM 4.6. h_gini and h_ent are equivalent.
Proof. Like entropy, the Gini index can be formulated in terms of h_prec (h_gini ∼ h_prec·(1 − h_prec)), and both functions have essentially the same shape, i.e., both functions grow and fall with h_prec in the same way.
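Both heuristics can be written as functions of precision alone, which is exactly what Theorems 4.5 and 4.6 exploit; a minimal sketch:

```python
from math import log2

def h_ent(p, n):
    q = p / (p + n)                    # precision of the rule
    if q in (0.0, 1.0):                # boundary convention: 0 * log(0) = 0
        return 0.0
    return -(q * log2(q) + (1 - q) * log2(1 - q))

def h_gini(p, n):
    q = p / (p + n)
    return 2 * q * (1 - q)             # ~ p*n/(p+n)^2, same ordering as h_ent
```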

Figure 7. Isometrics for the Laplace and m-estimates.

4.5. LAPLACE, m-ESTIMATE, F- AND g-MEASURE

The Laplace and m-estimates (Cestnik, 1990) are very common modifications of h_prec, which have, e.g., been used in the second version of CN2 (Clark and Boswell, 1991), m-Foil (Džeroski and Bratko, 1992), and ICL (De Raedt and Van Laer, 1995):

    h_lap = (p + 1) / (p + n + 2)

    h_m = ( p + m·P/(P + N) ) / (p + n + m)

The basic idea of these estimates is to assume that each rule covers a certain number of examples a priori. They compute a precision estimate, but start to count covered positive or negative examples at a number > 0. With the Laplace estimate, both the positive and negative coverage of a rule are initialized with 1 (thus assuming an equal prior distribution), while the m-estimate assumes a prior total coverage of m examples, which are distributed according to the distribution of positive and negative examples in the training set. In the PN-graphs, this modification results in a shift of the origin of the precision isometrics to the point (−n_m, −p_m), where n_m = p_m = 1 in the case of the Laplace heuristic, and p_m = m·P/(P + N) and n_m = m − p_m for the m-estimate (see Figure 7). The resulting isometric landscape is symmetric around the line that goes through (−n_m, −p_m) and (0, 0). Thus, the Laplace estimate is symmetric around the 45° line, while the m-estimate is symmetric around the diagonal of the PN-graph. Another noticeable effect of the transformation is that the farther the origin (−n_m, −p_m) moves away from (0, 0), the more the isometrics in the relevant window (0, 0)-(N, P) approach parallel lines.
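In code, both estimates are one-line modifications of precision (our sketch, using the notation above):

```python
def h_lap(p, n):
    """Laplace estimate: precision with prior coverage (1, 1)."""
    return (p + 1) / (p + n + 2)

def h_m(p, n, P, N, m):
    """m-estimate: prior coverage of m examples, split by the class distribution."""
    return (p + m * P / (P + N)) / (p + n + m)
```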

Figure 8. Isometrics of h_g for increasing values of g.

For example, the isometrics of the m-estimate converge towards the isometrics of weighted relative accuracy for m → ∞ (see Theorem 4.8 below). This is maybe best illustrated if we set p_m = 0, i.e., if we assume the rotation point to be on the N-axis. We will call this the g-measure:

    h_g = p / (p + n + g)

Figure 8 shows how, for g → ∞, the slopes of h_g's isometrics become flatter and flatter and converge towards the axis-parallel lines of h_rec. On the other hand, for g = 0, the measure is obviously equivalent to precision. Thus, the g-measure may be regarded as a simple way of trading off precision and recall, or support and confidence. In fact, it was already shown in (Flach, 2003) that for g = P, h_g is equivalent to the F-measure, a frequently used trade-off between recall and precision (van Rijsbergen, 1979).

Flach (2003) also proposed an equivalent simplification of the F-measure, which he called the G-measure (h_G = p/(n + P)). This, in turn, is a special case of a heuristic p/(n + g) that was recently proposed for subgroup discovery (Gamberger and Lavrač, 2002), and which is equivalent to the g-measure:

THEOREM 4.7. For g ≥ 0, h_g ∼ p/(n + g).
Proof. We show equivalence via compatibility:

    h_g(r_1) > h_g(r_2)
    ⇔ p_1/(p_1 + n_1 + g) > p_2/(p_2 + n_2 + g)
    ⇔ p_1·(p_2 + n_2 + g) > p_2·(p_1 + n_1 + g)
    ⇔ p_1·(n_2 + g) > p_2·(n_1 + g)
    ⇔ p_1/(n_1 + g) > p_2/(n_2 + g)

4.6. THE GENERALIZED m-ESTIMATE

The above discussion leads us to the following straightforward generalization of the m-estimate, which takes the rotation point of the precision isometrics as a parameter:

    h_gm = (p + m·c) / (p + n + m) = (p + a) / ( (p + a) + (n + b) )

The second version of the heuristic basically defines the rotation point by specifying its co-ordinates (−b, −a) in PN-space (a, b ∈ [0, ∞)). The first version uses m as a measure of how far from the origin the rotation point lies, using the sum of the co-ordinates as a distance measure. Hence, all points with distance m lie on the line that connects (0, −m) with (−m, 0), and c specifies where on this line the rotation point lies. For example, c = 1 means (0, −m), whereas c = 0 denotes (−m, 0), resulting in the g-measure. The line that connects the rotation point and (0, 0) has a slope of c/(1 − c). Obviously, both versions of h_gm can be transformed into each other by choosing m = a + b and c = a/(a + b), or a = m·c and b = m·(1 − c).

THEOREM 4.8. For m = 0, h_gm is equivalent to h_prec, while for m → ∞, its isometrics converge to those of h_costs.
Proof. m = 0: trivial. m → ∞: By construction, an isometric of h_gm through the point (n, p) connects this point with the rotation point (−(1 − c)·m, −c·m) and has the slope (p + c·m)/(n + (1 − c)·m). For m → ∞, this slope converges to c/(1 − c) for all points (n, p). Thus all isometrics converge towards parallel lines with the slope c/(1 − c).
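The generalized m-estimate is just as short in code; a sketch, with the special cases noted in comments:

```python
def h_gm(p, n, m, c):
    """Generalized m-estimate; rotation point (-(1-c)*m, -c*m) in PN-space."""
    return (p + m * c) / (p + n + m)

# Special cases: m = 0 gives precision; c = 0 gives the g-measure with g = m;
# m = 2 and c = 1/2 give the Laplace estimate; for m -> infinity the
# isometrics approach those of the linear cost metric (Theorem 4.8).
```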

Theorem 4.8 shows that h_gm may be considered as a general model of heuristic functions with linear isometrics that has two parameters: c ∈ [0, 1] for trading off the misclassification costs between the two classes, and m ∈ [0, ∞) for trading off between precision h_prec and the linear cost metric h_costs. [1] Therefore, all heuristics discussed in this section may be viewed as equivalent to some instantiation of this general model.

[1] The reader may have noted that for m → ∞, h_gm → c for all p and n. Thus for m = ∞, the function does not have isometrics, because all evaluations are constant. However, this is not a problem for the above construction, because we are not concerned with the isometrics of the function h_gm at the point m = ∞, but with the convergence of the isometrics of h_gm for m → ∞. In other words, the isometrics of h_costs are not equivalent to the isometrics of h_gm for m = ∞, but they are equivalent to the limits to which the isometrics of h_gm converge as m → ∞.

5. Non-Linear Rule Evaluation Metrics

All heuristics considered so far have linear isometrics. This is a reasonable assumption, as concepts are typically evaluated with linear cost metrics (e.g., accuracy or cost matrices). However, these evaluation metrics are concerned with evaluating complete rules. As the value of an incomplete rule lies not in its ability to discriminate between positive and negative examples, but in its potential of being refinable into a high-quality rule, it might well be the case that different types of heuristics are useful for evaluating incomplete candidate rules. One could, e.g., argue that for candidate rules that cover many positive examples it is less important to exclude negatives than for rules with low coverage, because high-coverage candidates may still be refined accordingly. Similarly, overfitting could be countered by penalizing regions with low coverage. Possibly, these and similar problems could be better addressed with a non-linear isometric landscape, even though the learned rules will eventually be used under linear cost models. In this section, we will look at two non-linear metrics: Foil's information gain, and the correlation heuristic used in Fossil.

5.1. FOIL'S INFORMATION GAIN

Foil's version of information gain (Quinlan, 1990), unlike ID3's and C4.5's version (Quinlan, 1986), is tailored to rule learning, where one only needs to optimize one successor branch, as opposed to multiple successor nodes in decision tree learning. It differs from the heuristics mentioned so far in that it does not evaluate an entire rule, but only the effect of specializing a rule by adding a condition. More precisely, it computes the difference in information content of the current rule and its predecessor r', weighted by the number of covered positive examples (as a bias for generality).

Figure 9. Isometrics for information gain as used in Foil. The curves show different values c for the precision of the parent rule.

The exact formula is [2]

    h_foil = p · ( log₂(p/(p + n)) − log₂ c )

where c = h_prec(r') is the precision of the parent rule. For the following analysis, we will view c as a parameter taking values in the interval [0, 1]. Figure 9 shows the isometrics of h_foil for four different settings of c. Although the isometrics are non-linear, they appear to be linear in the region above the isometric that goes through (0, 0). Note that this isometric, which we will call the base line, has a slope of c/(1 − c): in the first graph (c = P/(P + N)) it is the diagonal, in the second graph (c = 1/2) it has a 45° slope, and in the lower two graphs (c = 1 and c = 10⁻⁶) it coincides with the vertical and horizontal axes, respectively.

[2] This formulation assumes that we are learning in a propositional setting. For relational learning, Foil does not estimate the precision from the number of covered instances, but from the number of proofs for those instances.
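A sketch of the formula in code; treating the gain as zero for p = 0 is our convention for the undefined logarithm, not Foil's:

```python
from math import log2

def h_foil(p, n, c):
    """Foil's information gain for a refinement covering (p, n), parent precision c."""
    if p == 0:
        return 0.0                     # convention: no positives covered, no gain
    return p * (log2(p / (p + n)) - log2(c))
```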

Note that the base line represents the classification performance of the parent rule. The area below the base line contains the cases where the precision of the rule is smaller than the precision of its predecessor. Such a refinement of a rule is usually not considered to be relevant. In fact, this is also the region where the information gain is negative, i.e., where we have an information loss. The points on the base line have information gain 0, and all points above it have a positive gain. For this reason, we will focus our analysis on the region above the base line, which is the area where the refined rule improves upon its parent. In this region, the isometrics appear to be linear and parallel to the base line, just like the isometrics of h_costs. In fact, in (Fürnkranz and Flach, 2003), we conjectured that in this part of the ROC space, h_foil is equivalent to h_costs with cost parameter 1 − c. However, closer investigation reveals that this is not the case. The farther the isometrics move away from the base line, the steeper they get. Also, they are not exactly linear, but slightly bent towards the P-axis, i.e., purer rules are preferred. Nevertheless, it is obvious that both heuristics are closely related, and in fact we can show the following:

THEOREM 5.1. Let c = h_prec(r') be the precision of r', the predecessor of rule r. In the relevant region p/(p + n) > c, h_costs(r) with cost parameter 1 − c is equivalent to the first-order Taylor approximation of h_foil(r).
Proof.

    h_foil = p · ( log₂(p/(p + n)) − log₂ c ) = (p/ln 2) · ln( p/(c·(p + n)) )

The Taylor expansion of ln(1 + x) converges for −1 < x ≤ 1, and the first-degree approximation is ln(1 + x) ≈ x. Recall that the interesting region of h_foil is where p/(p + n) ≥ c, i.e., where the precision of the current rule r exceeds the precision of the parent rule r' (otherwise there would be no point in refining r' to r). In this region, 0 < c ≤ c·(p + n)/p ≤ 1. Thus

    h_foil = −(p/ln 2) · ln( c·(p + n)/p ) = −(p/ln 2) · ln( 1 + (c·(p + n)/p − 1) )
           ≈ −(p/ln 2) · ( c·(p + n)/p − 1 ) = (1/ln 2) · ( p − c·(p + n) ) = (1/ln 2) · ( (1 − c)·p − c·n ) ∼ (1 − c)·p − c·n

It should be pointed out that Theorem 5.1 should not be interpreted as saying that h_costs is a good approximation of h_foil. In fact, while the first-order Taylor approximation of ln(1 + x) is quite good near x = 0, it can become arbitrarily bad for x → −1.

Figure 10. Isometrics for the four-field correlation coefficient.

In our case, we can expect a good approximation if c is close to 1, and a rather bad one if c approaches 0, because 0 < c ≤ x + 1 = c·(p + n)/p ≤ 1. On the other hand, it is not strictly necessary to have a good approximation in order to maintain the isometric landscape of a function, as we have seen on several examples of equivalent heuristics in the previous section. There is a difference between one heuristic approximating another, and its isometrics approximating the other's. A full analysis of Foil's information gain requires a fuller understanding of what it means to approximate an isometric landscape, which we leave for future work.

5.2. CORRELATION AND χ² STATISTIC

The correlation heuristic has been proposed in (Fürnkranz, 1994) for the rule learner Fossil, a variant of Foil. It computes the correlation coefficient between the predictions made by the rule and the true labels from a four-field confusion matrix. The formula is

    h_corr = ( p·(N − n) − (P − p)·n ) / √( P·N·(p + n)·(P + N − p − n) ) = ( pN − Pn ) / √( P·N·(p + n)·(P + N − p − n) )

Recall that p and N − n denote the fields on the diagonal of the confusion matrix (the true positives and the true negatives), whereas n and P − p denote the false positives and false negatives, respectively. The terms under the square root in the denominator are the sums of the rows and columns of the confusion matrix.

Figure 10 shows the isometric plot of the correlation heuristic. The isometrics are clearly non-linear and symmetric around the base line on the diagonal (rules that have correlation 0). This symmetry around the diagonal is also characteristic of the linear weighted relative accuracy heuristic h_wra, so it is interesting to compare the properties of these two metrics.
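A direct transcription of the formula (our sketch; note that the expression is undefined for rules covering no examples or all examples):

```python
from math import sqrt

def h_corr(p, n, P, N):
    """Four-field correlation coefficient (Fossil); zero on the diagonal pN = Pn."""
    return (p * N - P * n) / sqrt(P * N * (p + n) * (P + N - p - n))
```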

Above that base line, the isometrics are bent towards the P and N axes, respectively. Thus, h_corr has a tendency to prefer purer rules (covering fewer negative examples) and more complete rules (covering more positive examples). Moreover, the linear isometrics that result from connecting the points where the isometrics intersect the n = 0 axis (on the left) and the p = P axis (on the top) are not parallel to the diagonal (as they are for h_wra). In fact, it can be shown that the slope of these lines is (1 − p₀/N)/(1 − p₀/P), where (0, p₀) is the intersection of the isometric with the P-axis. Obviously, these lines are only parallel for a class-balanced problem (P = N). For P < N, the slope of the isometrics will increase for increasing p₀ and approach infinity for p₀ → P. Thus, the closer a rule gets to the target point (0, P), the more important it will become to exclude remaining covered negative examples, and the less important to cover additional positive examples. For example, the rule (0, P − 1), which covers all but one positive example and no negatives, and the rule (1, P), which covers all positive examples, have practically identical evaluations. On the other hand, the rule (0, 1), covering one positive and no negatives, has an evaluation proportional to 1/√P and is therefore preferred over the rule (N − 1, P) that covers all positives but excludes only a single negative example (that rule has an evaluation proportional to 1/√N). Keep in mind that we assumed P < N here; the opposite preference would be the case for P > N.

In summary, the correlation heuristic has a tendency to prefer purer or more complete rules. Moreover, for rules that are far from the target (0, P), it gives higher preference to handling the minority class (i.e., covering positive examples for P < N and excluding negative examples for N < P), whereas the priorities for these two objectives become more and more equal the closer a rule is to the target. [3]

[3] The preferences would even reverse if fractional examples (i.e., coverage counts smaller than 1) were possible. We do not consider this case here.

Note that the four-field correlation coefficient is basically a normalized version of the χ² statistic over the four events in the table (see, e.g., Mittenecker, 1983, or other textbooks on applied statistics).

THEOREM 5.2. h_chi2 = (P + N) · h_corr².
Proof. The χ² statistic for a four-field confusion matrix (cf. Table II) is the sum of the squared differences between the expected and the observed values of the four fields in the matrix, divided by the expected values. The expected values for the four fields in the 2×2 confusion matrix are

    P·(p + n)/(P + N)    P·(P + N − p − n)/(P + N)
    N·(p + n)/(P + N)    N·(P + N − p − n)/(P + N)

Subtracting these from the actual values (Table II) shows that all four fields deviate from their expectation by the same absolute amount:

    p − P·(p + n)/(P + N) = (pN − Pn)/(P + N)

and analogously ±(pN − Pn)/(P + N) for the three remaining fields. Squaring these differences, dividing them by the expected values, and summing up the four fields yields

    h_chi2 = (pN − Pn)²/(P + N)² · ( (P + N)/(P·(p + n)) + (P + N)/(N·(p + n)) + (P + N)/(P·(P + N − p − n)) + (P + N)/(N·(P + N − p − n)) )
           = (pN − Pn)²/(P + N) · ( N·(P + N − p − n) + P·(P + N − p − n) + N·(p + n) + P·(p + n) ) / ( P·N·(p + n)·(P + N − p − n) )
           = (P + N) · (pN − Pn)² / ( P·N·(p + n)·(P + N − p − n) ) = (P + N)·h_corr²

Thus, h_chi2 may be viewed as a class-neutral version of h_corr. Its behavior has previously been analyzed in ROC space: Bradley (1996) argued that the non-linear isometrics of the χ² statistic help to discriminate classifiers that do little work from classifiers that achieve the same accuracy (or cost line) but are preferable in terms of other metrics like sensitivity and specificity. The above-mentioned paper also shows how h_chi2 can be modified to incorporate other cost models.
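Theorem 5.2 is easy to spot-check numerically; this sketch assumes the h_corr function from the previous sketch is in scope, and the concrete numbers are arbitrary:

```python
def h_chi2(p, n, P, N):
    """Chi-squared statistic of the four-field confusion matrix (Table II)."""
    obs = (p, P - p, n, N - n)
    exp = (P * (p + n) / (P + N), P * (P + N - p - n) / (P + N),
           N * (p + n) / (P + N), N * (P + N - p - n) / (P + N))
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

P, N, p, n = 40, 60, 10, 5
assert abs(h_chi2(p, n, P, N) - (P + N) * h_corr(p, n, P, N) ** 2) < 1e-9
```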

Finally, we draw attention to the fact that the Gini splitting criterion, defined as the impurity decrease from parent to children where impurity is measured by the Gini index (see Section 4.4), is equivalent to χ². Here, we restrict attention to binary splits and two classes: in this case, (P, N) can be interpreted as the coverage of the unsplit parent, and (p, n) and (P − p, N − n) denote the coverage of the two children. Gini-split is then defined as

    h_gini-split = P·N/(P + N)² − (p + n)/(P + N) · p·n/(p + n)² − (P + N − p − n)/(P + N) · (P − p)·(N − n)/(P + N − p − n)²

where the first term is the Gini index of the parent, and the second and third terms are the Gini indices of the children, weighted with their coverage. This expression can be simplified to

    h_gini-split = 1/(P + N) · ( P·N/(P + N) − p·n/(p + n) − (P − p)·(N − n)/(P + N − p − n) )        (1)

THEOREM 5.3. h_gini-split ∼ h_chi2.
Proof. The expression inside the brackets in Equation (1) can be rewritten as a fraction with denominator (P + N)·(p + n)·(P + N − p − n) and numerator

    P·N·(p + n)·(P + N − p − n) − p·n·(P + N)·(P + N − p − n) − (P − p)·(N − n)·(P + N)·(p + n)

Multiplying out all products, all terms cancel except those of (pN − Pn)², so we have

    h_gini-split = 1/(P + N) · (pN − Pn)² / ( (P + N)·(p + n)·(P + N − p − n) )
                 = P·N/(P + N)² · (pN − Pn)² / ( P·N·(p + n)·(P + N − p − n) ) = P·N/(P + N)³ · h_chi2

We conjecture that this result can be generalised to non-binary splits and more than two classes, but this is currently left as an open problem.

6. Evaluation of Rule Sets

So far, we have been considering heuristics for evaluating single rules, which map to points in PN-space. In this section we look at the evaluation of sets of rules. We show that the sequence of training sets and intermediate hypotheses can be viewed as a trajectory through PN-space, resulting in a set of nested PN-spaces (Section 6.1). We then proceed by investigating the behavior of precision and the linear cost metric in this context, and show that precision aims at (locally) optimizing the area under the curve (ignoring costs), while the cost metric tries to directly find a (global) optimum under known or assumed costs (Section 6.2). Finally, we briefly discuss the effect of re-ordering the rules in a rule set (Section 6.3).

Figure 11. Isometrics for accuracy and precision in nested PN-spaces.

6.1. COVERING AND NESTED PN-SPACES

Of particular interest for the covering approach is the property that PN-graphs reflect a change in the total number or proportion of positive (P) and negative (N) training examples via a corresponding change in the relative sizes of the P and N axes. ROC analysis, on the other hand, would rescale the new dimensions to the range [0, 1], which has the effect of changing the slope of all lines that depend on the relative sizes of p and n. As a consequence, the PN-graph for a subset of a training set can be drawn directly into the PN-graph of the entire set. In particular, the sequence of training sets that are produced by the recursive calls of the covering strategy (after each new rule, all training examples that are covered by this rule are removed from the training set, and the learner calls itself on the remaining examples) can be visualized by a nested sequence of PN-graphs.

This is illustrated in Figure 11, which shows a nested sequence of PN-spaces. Each space PN_i has its origin in the point R_i = (n_i, p_i), which corresponds to the theory learned so far. Thus, its learning task consists of learning a theory for the remaining P − p_i positive and N − n_i negative examples that are not yet covered by theory R_i. Each new rule will cover a few additional examples and thus reduce the corresponding PN-space.

Note that some separate-and-conquer algorithms only remove the covered positive examples, but not the covered negative examples. This situation corresponds to reducing only the height of the PN-graph, but not its width. Obviously, this breaks the nesting property, because rules are locally evaluated in a PN-graph with full width (all negative examples), but for getting their global evaluation in the context of previously learned rules, all previously covered negative examples have to be removed. While we do not expect a big practical difference between the two approaches, removing all examples seems to be the conceptually cleaner solution.

6.2. GLOBAL AND LOCAL OPTIMA

The nesting property implies that evaluating a single rule in the reduced dataset (N − n_i, P − p_i) amounts to evaluating the point in the subspace PN_i that has the point (n_i, p_i) as its origin. Moreover, the rule that maximizes the chosen evaluation function in PN_i is also a maximum point in the full PN-space if the evaluation metric does not depend on the size or the shape of the PN-graph. The linear cost metric h_costs is such a heuristic: a local optimum in the subspace PN_i is also optimal in the global PN-space, because all isometrics are parallel lines with the same angle, and nested PN-spaces (unlike nested ROC-spaces) leave angles invariant.

Precision, on the other hand, cannot be nested in this way. The evaluation of a given rule depends on its location relative to the origin of the current subspace PN_i. This is illustrated in Figure 11. The subspaces PN_i correspond to the situation after removing all examples covered by the rule set R_i. The left graph of Figure 11 shows the case for accuracy: the accuracy isometrics are all parallel lines with a 45° slope. The right graph shows the situation for precision: each subspace PN_i evaluates the rules relative to its origin, i.e., h_prec always rotates around (n_i, p_i) in the full PN-space, which is (0, 0) in the local space. Note that global optimization for precision (or for h_gm in general) could be realized by making the generalized m-estimate h_gm adaptive, choosing a = p_i and b = n_i in subspace PN_i.

However, local optimization is not necessarily bad. At each point (n, p), the slope of the line connecting (0, 0) with (n, p) is (1 − c)/c = p/n. [4] h_prec picks the rule that promises the steepest ascent from the origin. Thus, if we assume that the origin of the current subspace is a point on a ROC-curve (or a PN-path), we may interpret h_prec as making a locally optimal choice for continuing the ROC-curve. The choice is optimal in the sense that picking the rule with the steepest ascent locally maximizes the area under the ROC-curve. Note, however, that if we know the cost model, this choice may not necessarily be globally optimal. For example, if the point R_2 in Figure 11 could be reached in one step, h_acc would directly go there, because it has the better global value under the chosen cost model (accuracy), whereas h_prec would nevertheless first learn R_1, because it promises a greater area under the ROC curve.

[4] One may say that h_prec assumes a different cost model for each point in the space, depending on the relative frequencies of the covered positive and negative examples. Such local changes of cost models are investigated in more detail by Flach (2003).

Figure 12. A concavity in the path of a rule learner (left), and the result of a successful fix by swapping the order of the second and third rule (right).

In brief, we may say that h_prec aims at optimizing under unknown costs by (locally) maximizing the area under the ROC curve, whereas h_costs tries to directly find a (global) optimum under known (or assumed) costs.

6.3. REORDERING RULE SETS

In the previous sections we have seen that learning a set of rules describes a path through PN-space, starting at the point R_0 = (0, 0) (the theory covering no examples) and ending at R_⊤ = (N, P) (the theory covering all examples). Each intermediate rule set R_i is a potential classifier, and one of them has to be selected as the final classifier. However, the curve described by the learner will frequently not be convex. The left graph of Figure 12 shows a case where the rule set R_2, consisting of the rules {r_1, r_2}, is not on the convex hull of the classifiers. Thus, only R_1 = {r_1} and R_3 = {r_1, r_2, r_3} are potential candidates for the final classifier.

Interestingly, in some cases it may be quite easy to obtain better classifiers by simply re-ordering the rules. The right graph of Figure 12 shows the best result that can be obtained by swapping the order of rules r_2 and r_3, resulting in the classifier R'_2 = {r_1, r_3}. Note, however, that this graph is drawn under the assumption that rules r_2 and r_3 cover disjoint sets of examples, in which case the quadrangle R_1 R_2 R_3 R'_2 is a parallelogram. In general, however, r_3 may cover some of the examples that were previously already covered by r_2, which may change the shape of the quadrangle considerably, depending on whether the majority of the overlap are positive or negative examples. Thus, it is not guaranteed that swapping the rules will indeed increase the area under the ROC-curve or make the curve convex.

In this context it is interesting to make a connection to the work by Ferri et al. (2002). They proposed a method to relabel decision trees in the following way. First, all k leaves (or branches) are sorted in decreasing order according to h_prec. Then, a valid labelling is obtained by splitting the ordering in two, and labelling all leaves before the split point positive and the rest negative. This results in k + 1 labelings, and the corresponding ROC curve is necessarily convex. An optimal point can be chosen in the usual way, once the cost model is known. In our context, ordering branches corresponds to ordering rules, and relabeling corresponds to picking the right subset of the learned rules: labeling a rule as positive corresponds to including the rule in the theory, while labeling a rule as negative effectively deletes the rule, because it will be subsumed by the final default rule that always predicts the negative class. However, the chosen subset is only guaranteed to be optimal if the rules are mutually exclusive. Ferri et al. (2002) use this technique to derive a novel splitting criterion for decision tree learning. Presumably, a similar criterion could be derived for rule learning, which we regard as a promising topic for future work.

Finally, we note that Flach and Wu (2003) have addressed the problem of repairing concavities in ROC curves in the context of the naive Bayesian classifier. We are currently investigating whether similar techniques are applicable to rule learners.

7. Stopping and Filtering Criteria

In addition to their regular evaluation metric, many rule learning algorithms employ separate criteria to filter out uninteresting candidates and/or to fight overfitting. There are two slightly different approaches: stopping criteria determine when the refinement process should stop and the current best candidate should be returned, whereas filtering criteria determine regions of acceptable performance. Both concepts are closely related. In particular, filtering criteria are often also used as stopping criteria: if no further rule can be found within the acceptable region of a filtering criterion, the learned theory is considered to be complete. Basically the same technique is also used for refining single rules: if no refinement is in the acceptable region, the rule is considered to be complete, and the specialization process stops. For this reason, we will often use the term stopping criterion instead of filtering criterion, because this is the more established terminology.

The most popular approach is to impose a threshold upon one or more evaluation metrics. Thus, this type of filtering criterion consists of a heuristic function, possibly but not necessarily the same one that is used as a search heuristic, and a threshold value specifying that only rules with an evaluation above this value are of interest. The threshold may be chosen by the user or be automatically adapted to certain characteristics of the rule (e.g., its encoding length).

7. Stopping and Filtering Criteria

In addition to their regular evaluation metric, many rule learning algorithms employ separate criteria to filter out uninteresting candidates and/or to fight overfitting. There are two slightly different approaches: stopping criteria determine when the refinement process should stop and the current best candidate should be returned, whereas filtering criteria determine regions of acceptable performance. Both concepts are closely related. In particular, filtering criteria are often also used as stopping criteria: if no further rule can be found within the acceptable region of a filtering criterion, the learned theory is considered to be complete. Basically the same technique is also used for refining single rules: if no refinement is in the acceptable region, the rule is considered to be complete, and the specialization process stops. For this reason, we will often use the term stopping criterion instead of filtering criterion, because this is the more established terminology.

The most popular approach is to impose a threshold upon one or more evaluation metrics. Thus, this type of filtering criterion consists of a heuristic function, possibly but not necessarily the same one that is used as a search heuristic, and a threshold value specifying that only rules with an evaluation above this value are of interest. The threshold may be chosen by the user or be adapted automatically to certain characteristics of the rule (e.g., its encoding length).

In the following, we analyze a few popular filtering and stopping criteria for greedy specialization: minimum coverage constraints, support and confidence, significance tests, encoding length restrictions, and Fossil's correlation cutoff. We use PN-space to analyze stopping criteria, by visualizing regions of acceptable hypotheses.

7.1. MINIMUM COVERAGE CONSTRAINTS

Figure 13. Thresholds on minimum coverage of positive examples (left) and total number of examples (right).

The simplest form of overfitting avoidance is to disregard rules with low coverage. For example, one could require that a rule covers a certain minimum number of examples or a minimum number of positive examples. These two cases are illustrated in Figure 13. The graph on the left shows the requirement that a minimum fraction (here 20%) of the positive examples in the training set is covered by the rule. All rules in the gray area are thus excluded from consideration. The right graph illustrates the case where a minimum fraction (here 20%) of all examples needs to be covered by the rule, regardless of whether they are positive or negative. Changing the size of the fraction will cut out different slices of PN-space, each delimited by a coverage isometric (a line with slope -45 degrees). Clearly, in both cases the goal is to fight overfitting by filtering out rules whose quality cannot be reliably estimated because of the small number of training examples they cover.

Notice that we can include a cost model in calculating the coverage of positive and negative examples, thereby changing the slope of the coverage isometrics. The two cases in Figure 13 are then special cases of this general cost-based coverage metric.
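A minimal sketch of these filters; the weight c is our own parameter implementing the cost-based generalization just mentioned, and is not notation from the paper.

```python
def min_positive_coverage(p, P, min_frac=0.2):
    """Left graph of Figure 13: cover at least a fraction of all positives."""
    return p >= min_frac * P

def min_weighted_coverage(p, n, P, N, min_frac=0.2, c=0.5):
    """Right graph, generalized: minimum cost-weighted total coverage.

    c = 0.5 weights both classes equally (plain coverage, -45 degree
    isometrics); other values of c tilt the isometrics as described above.
    """
    return (1 - c) * p + c * n >= min_frac * ((1 - c) * P + c * N)

P, N = 100, 200
print(min_positive_coverage(15, P))            # False: 15 < 20% of 100
print(min_weighted_coverage(30, 40, P, N))     # True: 35 >= 30
```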

7.2. SUPPORT AND CONFIDENCE

Figure 14. Filtering rules with minimum support and minimum confidence.

There is no reason why a single measure should be used for filtering out unpromising rules. The most prominent example of combining multiple estimates are the thresholds on support and confidence that are used mostly in association rule mining algorithms, but also in classification algorithms that obtain the candidate rules for the covering loop in an association rule learning framework (Liu et al., 1998; Liu et al., 2000; Jovanoski and Lavrač, 2001).

Figure 14 illustrates the effect of thresholds on support and confidence in PN-space. Together, they specify an area of valid rules around the (0, P)-point of PN-space. Rules in the grey areas will be filtered out. The dark grey region shows a less restrictive combination of the two thresholds, the light grey region a more restrictive setting. In effect, confidence constrains the quality of the rules, whereas support aims at ensuring a minimum reliability by filtering out rules whose confidence estimate originates from too few positive examples. The combined effect is quite similar to a threshold on the g- or F-measures (Section 4.5), but it allows for more rules in the region near the intersection of the two threshold lines.

7.3. CN2'S SIGNIFICANCE TEST

CN2 (Clark and Niblett, 1989; Clark and Boswell, 1991) filters out rules for which the distribution of the covered examples is not statistically significantly different from the distribution of examples in the full data set. To this end, it computes the likelihood ratio statistic

$$h_{lrs} = 2\left(p \log\frac{p}{e_p} + n \log\frac{n}{e_n}\right)$$

where $e_p = (p+n)\frac{P}{P+N}$ and $e_n = (p+n)\frac{N}{P+N} = (p+n) - e_p$ are the numbers of positive and negative examples one would expect if the p + n examples covered by the rule were distributed in the same way as the P + N examples in the full data set.
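A small sketch of this test, assuming natural logarithms with the convention 0 log 0 = 0, and taking the critical values of the χ² distribution with one degree of freedom from scipy:

```python
from math import log
from scipy.stats import chi2

def h_lrs(p, n, P, N):
    """Likelihood ratio statistic of a rule covering p positives, n negatives."""
    e_p = (p + n) * P / (P + N)   # expected positives under random coverage
    e_n = (p + n) * N / (P + N)   # expected negatives
    term = lambda o, e: o * log(o / e) if o > 0 else 0.0
    return 2 * (term(p, e_p) + term(n, e_n))

def significant(p, n, P, N, confidence=0.95):
    """Keep the rule only if h_lrs exceeds the chi-squared critical value."""
    return h_lrs(p, n, P, N) > chi2.ppf(confidence, df=1)

P, N = 100, 100
print(significant(10, 2, P, N))          # True: clearly non-random coverage
print(significant(6, 4, P, N, 0.99))     # False: too close to the diagonal
```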

Figure 15. Illustration of CN2's significance test. Shown are the regions that would not pass a 95% (dark grey) and a 99% (light grey) significance test.

Figure 15 illustrates CN2's filtering criterion. The dark grey area shows the location of the rules that will be filtered out because it cannot be established with 95% confidence that their distribution is different from the distribution in the full data set. The light grey area shows the set of rules that will be filtered out if 99% confidence in the difference is required. Consequently, the area is symmetric around the diagonal, which corresponds to random guessing.

Other significance tests than the likelihood ratio could be used. For instance, we have already discussed h_chi2 in Section 5.2. In the case of a two-by-two contingency table, both h_chi2 and h_lrs are distributed as χ² with one degree of freedom, but that does not mean that they are equivalent as heuristics. In fact, there are some interesting differences between the isometric landscapes in Figures 10 and 15. More specifically, the likelihood ratio isometrics in Figure 15 do not depend on the size of the training set. Any graph corresponding to a bigger training set with the same class distribution (a proportion of P/(P+N) positive examples) will contain this graph in its lower left corner. In other words, whether a rule covering p positive and n negative examples is significant according to h_lrs does not depend on the size of the training set, but only on the class distribution of the examples it covers.

In contrast, h_chi2 always fits the same isometric landscape into PN-space. As a result, the evaluation of a point (n, p) depends on its relative location (n/N, p/P). In this case, the same rule will always be evaluated in the same way, as long as it covers the same fraction of the training data. Only the location of the significance cutoff isometric (e.g., 95%) depends on P and N in the case of h_chi2.
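This sketch (restating h_lrs from above, and assuming h_chi2 is the standard χ² statistic of the two-by-two covered/uncovered versus positive/negative table, which we take to match the paper's h_chi2 of Section 5.2) confirms both behaviors numerically. Scaling the entire problem by a factor k multiplies h_chi2 by k; this monotone transformation preserves the isometric landscape, but it moves the fixed critical value relative to it, which is why the cutoff isometric depends on P and N.

```python
from math import log

def h_lrs(p, n, P, N):
    e_p, e_n = (p + n) * P / (P + N), (p + n) * N / (P + N)
    term = lambda o, e: o * log(o / e) if o > 0 else 0.0
    return 2 * (term(p, e_p) + term(n, e_n))

def h_chi2(p, n, P, N):
    # Standard chi-squared statistic of the 2x2 contingency table.
    num = (P + N) * (p * (N - n) - n * (P - p)) ** 2
    return num / ((p + n) * (P + N - p - n) * P * N)

p, n, P, N = 30, 10, 100, 100
# Same rule, ten times more data, same class distribution: h_lrs unchanged.
print(h_lrs(p, n, P, N) == h_lrs(p, n, 10 * P, 10 * N))             # True
# Scaling rule and data set together: h_chi2 is multiplied by the factor,
# so rankings depend only on the relative location (n/N, p/P).
print(h_chi2(10 * p, 10 * n, 10 * P, 10 * N) / h_chi2(p, n, P, N))  # 10.0
```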

7.4. FOIL'S ENCODING LENGTH RESTRICTION

Figure 16. Illustration of Foil's encoding length restriction for domains with P < N (left) and P > N (right). Lighter shades of gray correspond to larger encoding lengths for the rule.

Foil (Quinlan, 1990) uses a criterion based on the minimum description length (MDL) principle for deciding when to stop refining the current rule. For explicitly indicating the p positive examples covered by the rule, one needs $h_{MDL}$ bits:

$$h_{MDL} = \log_2(P+N) + \log_2\binom{P+N}{p}$$

This number is then compared to the number of bits needed to encode the rule, denoted by l(r). If h_MDL(r) < l(r), i.e., if the encoding of the rule is longer than the encoding of the examples themselves, the rule is rejected. As l(r) depends solely on the encoding length (in bits) of the current rule, Foil's stopping criterion, like CN2's significance test, also depends on the size of the training set: the same rule that is too long for a smaller training set might be good enough for a larger training set, in which it covers more examples.

For the purposes of our analysis, we can neglect the rule length as a parameter and interpret h_MDL as a heuristic that is compared to a variable threshold whose size depends on the length of the rule. Figure 16 shows the behavior of h_MDL in PN-space. The isometric landscape is equivalent to the one for the minimum coverage criterion, namely lines parallel to the N-axis.
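A sketch of the resulting stopping rule; the rule length l(r) enters only as a given number of bits, matching the treatment above:

```python
from math import comb, log2

def h_mdl(p, P, N):
    """Bits needed to indicate which p of the P + N examples are covered."""
    return log2(P + N) + log2(comb(P + N, p))

def keep_rule(p, rule_length_bits, P, N):
    """Reject the rule if its encoding is longer than that of the examples."""
    return h_mdl(p, P, N) >= rule_length_bits

P, N = 50, 100                      # P < N: h_mdl is monotone in p here
print(h_mdl(10, P, N))              # about 57.3 bits ...
print(h_mdl(30, P, N))              # ... about 111.9 bits: grows with p
print(keep_rule(10, 60.0, P, N))    # False: acts as a p-threshold per length
```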

This is not surprising, considering that h_MDL is independent of n, the number of covered negative examples. For P < N, h_MDL is monotonically increasing with p. As a consequence, for every rule encoding length l(r) there exists a threshold $\bar{p}(l(r))$ such that for all rules r: $h_{MDL}(r) > l(r) \Leftrightarrow h_p(r) > \bar{p}(l(r))$. Thus, in this case, h_MDL and h_p are essentially equivalent. The more complicated formulation in the MDL framework basically serves the purpose of automatically mapping a rule to an appropriate threshold. It remains an open question whether there is a simpler mapping $r \mapsto \bar{p}(r)$ that allows one to obtain equivalent (or similar) thresholds without the expensive computation of l(r).

Particularly interesting is the case P > N (right graph of Figure 16): the isometric landscape is still the same, but the labels are no longer monotonically increasing. In fact, h_MDL has a maximum at the point p = (P+N)/2. Below this line (shown dashed in Figure 16), the function is monotonically increasing (as above), but above this line it starts to decrease again. Thus, we can formulate the following theorem.

THEOREM 7.1. h_MDL is compatible with h_p iff p ≤ (P+N)/2, and it is antagonistic to h_p for p > (P+N)/2.

Proof. a) p ≤ (P+N)/2:

$$h_{MDL} = \log_2(P+N) + \log_2\binom{P+N}{p} = \log_2(P+N) + \log_2\frac{\prod_{i=1}^{p}(P+N+1-i)}{\prod_{i=1}^{p} i} = \log_2(P+N) + \sum_{i=1}^{p}\left(\log_2(P+N+1-i) - \log_2 i\right)$$

The terms inside the sum are all constant and, for i ≤ (P+N)/2, positive, because then P+N+1-i > i. Thus, for two rules r_1 and r_2 with 0 ≤ h_p(r_1) < h_p(r_2) ≤ (P+N)/2, the corresponding sums only differ in the number of terms. As all terms are > 0,

$$h_p(r_1) < h_p(r_2) \Rightarrow h_{MDL}(r_1) < h_{MDL}(r_2)$$

b) p > (P+N)/2:

$$h_{MDL} - \log_2(P+N) = \log_2\binom{P+N}{p} = \log_2\binom{P+N}{P+N-p}$$

Therefore, each rule r with coverage p has the same evaluation as some rule r' with coverage p' = P+N-p < (P+N)/2. As the transformation $p \mapsto P+N-p$ is monotonically decreasing,

$$h_{MDL}(r_1) > h_{MDL}(r_2) \Leftrightarrow h_{MDL}(r_1') > h_{MDL}(r_2') \Leftrightarrow h_p(r_1') > h_p(r_2') \Leftrightarrow h_p(r_1) < h_p(r_2)$$
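The theorem is easy to verify numerically (restating h_mdl from the sketch above): for P > N the values peak at p = (P+N)/2 and decrease beyond it.

```python
from math import comb, log2

def h_mdl(p, P, N):
    return log2(P + N) + log2(comb(P + N, p))

P, N = 100, 20                  # P > N: the maximum lies inside PN-space
values = [h_mdl(p, P, N) for p in range(P + 1)]
peak = max(range(P + 1), key=lambda p: values[p])
print(peak, (P + N) // 2)       # 60 60: maximum at p = (P + N)/2
# Compatibility with h_p below the peak, antagonism above it:
print(values[30] < values[40])  # True: more positives, longer encoding
print(values[80] > values[90])  # True: more positives, *shorter* encoding
```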

COROLLARY 7.2. h_MDL is equivalent to h_p iff P ≤ N.

Proof. h_MDL reaches its maximum at p = (P+N)/2, which lies inside PN-space only if P > N; for P ≤ N, h_MDL is therefore monotonically increasing in p over the whole space, and the claim follows from Theorem 7.1.

Thus, for skewed class distributions with many positive examples, there might be cases where a rule r is acceptable, while a rule r' that has the same encoding length (l(r') = l(r)) and covers the same number or fewer negative examples (n(r') ≤ n(r)) but more positive examples (p(r') > p(r)) is not acceptable. For example, in the case shown on the right of Figure 16, for a certain rule length l, only rules that cover between 65% and 80% of the positive examples are acceptable. A rule of the same length that covers all positive and no negative examples would not be acceptable. This is very counter-intuitive and sheds some doubt upon the suitability of Foil's encoding length restriction for such domains.

7.5. FOSSIL'S CORRELATION CUTOFF

Figure 17. Illustration of Fossil's cutoff criterion.

Fossil (Fürnkranz, 1994) imposes a threshold upon its correlation heuristic (Section 5.2). Only rules that evaluate above this threshold are admitted. Figure 17 shows the effect of this threshold for the stopping value 0.3, which appears to perform fairly well over a variety of domains (Fürnkranz, 1997).
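A minimal sketch of this cutoff, assuming the correlation heuristic takes the usual four-field correlation (φ) form for the covered/uncovered versus positive/negative table:

```python
from math import sqrt

def h_corr(p, n, P, N):
    """Four-field (phi) correlation of rule coverage with the class."""
    denom = sqrt(P * N * (p + n) * (P + N - p - n))
    return (p * (N - n) - n * (P - p)) / denom if denom > 0 else 0.0

def admissible(p, n, P, N, cutoff=0.3):
    """Fossil admits only rules whose correlation exceeds the cutoff."""
    return h_corr(p, n, P, N) > cutoff

P, N = 100, 100
print(h_corr(40, 5, P, N), admissible(40, 5, P, N))   # ~0.42: kept
print(h_corr(12, 8, P, N), admissible(12, 8, P, N))   # ~0.07: cut off
```

If h_corr is indeed of this form, it is invariant under a joint scaling of p, n, P, and N, and h_chi2 = (P+N) · h_corr², which ties the cutoff back to the size-dependence discussion of Section 7.3.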
