ROC 'n' Rule Learning - Towards a Better Understanding of Covering Algorithms


Johannes Fürnkranz (juffi@oefai.at)
Austrian Research Institute for Artificial Intelligence, Schottengasse 3, A-1010 Wien, Austria

Peter A. Flach (peter.flach@bristol.ac.uk)
Dept. of Computer Science, University of Bristol, Woodland Road, Bristol BS8 1UB, UK

Abstract. This paper provides an analysis of the behavior of separate-and-conquer or covering rule learning algorithms by visualizing their evaluation metrics and their dynamics in PN-space, a variant of ROC-space. Our results show that most commonly used search heuristics, including accuracy, weighted relative accuracy, entropy, and Gini index, are equivalent to one of two fundamental prototypes: precision, which tries to optimize the area under the ROC curve for unknown costs, and a cost-weighted difference between covered positive and negative examples, which tries to find the optimal point under known or assumed costs. We also show that a straightforward generalization of the m-estimate trades off these two prototypes. Furthermore, our results show that stopping and filtering criteria like CN2's significance test focus on identifying significant deviations from random classification, which does not necessarily avoid overfitting. We also identify a problem with Foil's MDL-based encoding length restriction, which proves to be largely equivalent to a variable threshold on the recall of the rule. In general, we interpret these results as evidence that, contrary to common conception, pre-pruning heuristics are not very well understood and deserve more investigation.

1. Introduction

Most rule learning algorithms for classification problems follow the so-called separate-and-conquer or covering strategy, i.e., they learn one rule at a time, each of them explaining (covering) a part of the training examples. The examples covered by the last learned rule are removed from the training set (separated) before subsequent rules are learned (before the remaining training examples are conquered). Typically, these algorithms operate in a concept learning framework, i.e., they expect positive and negative examples for an unknown concept. From this training data, they learn a set of rules that describe the underlying concept, i.e., that explain all (or most) of the positive examples and (almost) none of the negative examples. Various approaches that adhere to this framework differ in the way single rules are learned. Common to all algorithms is that they have to use a metric or search heuristic for evaluating the quality of a candidate rule. A survey of the rule learning literature yields a vast number of rule learning metrics that have been proposed for evaluating the quality of a learned rule. Each of these metrics has its justification based in statistics, information theory, or related fields, but their relation to each other is not very well understood.

In this paper, we attempt to shed some light upon this issue by investigating commonly used evaluation metrics. Our basic tool for analysis will be visualization in PN-spaces. PN-space is quite similar to ROC-space, the two-dimensional plane in which the operating characteristics of classifiers are visualized. Our analysis will show that most commonly used heuristics, including accuracy, weighted relative accuracy, entropy, and Gini index, are equivalent to one of two fundamental prototypes: precision, which tries to optimize the area under the ROC curve for unknown costs, and a cost-weighted difference between covered positive and negative examples, which tries to find the optimal point under known or assumed costs. We also show that a straightforward generalization of the m-estimate trades off these two prototypes.

Many rule learning algorithms apply additional criteria for filtering out uninteresting rules or for stopping the refinement process at an appropriate point. Stopping and filtering criteria have received considerably less attention in the literature, mostly because their task is more frequently addressed by pruning. The few existing proposals, most notably simple thresholding of the evaluation metric, significance testing, and MDL-based criteria, turn out to be quite diverse approaches to this problem. Although our analysis will show some interesting correspondences between some of these techniques, it is also clear that stopping criteria are far from being well understood.

The outline of the paper is as follows: We begin by setting up the formal framework of our analysis, most notably the introduction of PN-spaces and isometrics (Section 2), and a brief review of covering algorithms in this context (Section 3). In Sections 4 and 5, we analyze some of the most commonly used evaluation metrics, and show that those with a linear isometric landscape all fit into a common framework. Subsequently, we discuss a few issues related to learning rule sets with the covering algorithm (Section 6). In Section 7, we will look at the most commonly used stopping and filtering criteria. Finally, we discuss a few open questions (Section 8) and related work (Section 9) before we conclude (Section 10). Parts of this paper have previously appeared in (Fürnkranz and Flach, 2003).

2. Formal Framework

In the following, we define the formal framework in which we perform our analysis. In particular, we will define PN-spaces and discuss their relation to ROC-spaces (Section 2.1), and provide formal definitions of the equivalence of evaluation metrics and point out the importance of isometrics in PN-space for identifying such equivalences (Section 2.2).

Note that in the remainder of the paper, we will use the terms search heuristic and evaluation metric interchangeably, because most algorithms do not differentiate between them. This is further discussed in Section 3.3.

2.1. ROC-SPACE AND PN-SPACE

In the following, we assume some basic knowledge of ROC analysis as it is typically used for selecting the best classifier for unknown classification costs (Provost and Fawcett, 2001). In brief, ROC analysis compares different classifiers according to their true positive rate (TPR) and false positive rate (FPR), i.e., the percentage of correctly classified positive examples and incorrectly classified negative examples. This allows classifiers to be plotted in a two-dimensional space, one dimension for each of these two measurements. The ideal classifier is at the upper left corner, at point (0, 1), while the origin (0, 0) and the point (1, 1) correspond to the classifiers that classify all examples as negative and positive, respectively. If all classifiers are plotted in this ROC-space, the best classifier is the one that is closest to (0, 1), where the distance measure used depends on the relative misclassification costs of the two classes. One can also show that all classifiers that are optimal under some (linear) cost model lie on the convex hull of these points in ROC-space. Thus, all classifiers that do not lie on this convex hull can be ruled out.

In the particular context of rule learning, we believe it is more convenient to think in terms of absolute numbers of covered examples. Thus, instead of plotting TPR over FPR, we plot the absolute number of covered positive examples over the absolute number of covered negative examples. We call such graphs PN-graphs. Figure 1 shows two examples, one for the case where the number of positive examples P exceeds the number of negative examples N, and one for the opposite case.

Figure 1. PN-curves are ROC-curves based on absolute numbers of covered examples.

In all subsequent figures, we will assume P ≥ N (the left graph of Figure 1), but this does not affect our results. A PN-graph can be turned into a ROC graph by simply normalizing the P and N axes to the scale [0, 1] × [0, 1]. Consequently, the isometrics of a function in a PN-graph can be mapped one-to-one to its isometrics in ROC-space. Nevertheless, PN-graphs have several properties that may be of interest depending on the purpose of the visualization. Most notably, the absolute numbers of covered positive and negative examples allow the covering strategy to be mapped into a sequence of nested PN-graphs. This is further discussed in Section 6.1. Table I compares some of the properties of PN-curves to those of ROC-curves.

Table I. PN-spaces vs. ROC-spaces.

    property                 ROC-space      PN-space
    x-axis                   FPR = n/N      n
    y-axis                   TPR = p/P      p
    empty theory R_0         (0, 0)         (0, 0)
    correct theory           (0, 1)         (0, P)
    universal theory R_⊤     (1, 1)         (N, P)
    resolution               (1/N, 1/P)     (1, 1)
    slope of diagonal        1              P/N
    slope of p = n line      N/P            1

2.2. ISOMETRICS AND EQUIVALENCE

In the remainder of the paper, we use capital letters to denote the total number of positive (P) and negative (N) examples in the training set, whereas p(r) and n(r) are used for the respective numbers of examples covered by a rule r. Heuristics are two-dimensional functions of the form h(p, n). We use subscripts to the letter h to differentiate between different heuristics. For brevity and readability, we will abridge h(p(r), n(r)) with h(r), and omit the argument (r) from the functions p, n, and h when it is clear from the context. Table II shows a summary of our notational conventions in the form of a four-field confusion matrix.

DEFINITION 2.1 (compatible). Two search heuristics h_1 and h_2 are compatible iff for all rules r, s: h_1(r) > h_1(s) ⇔ h_2(r) > h_2(s).
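Since h(p, n) may be regarded as a lookup table over the integer grid, Definition 2.1 can be tested by brute force. The following sketch is ours, not part of the paper; it simply compares the orderings induced by two heuristics on all pairs of grid points:

```python
from itertools import combinations, product

def compatible(h1, h2, P, N):
    """True iff h1 and h2 order all integer points (p, n) identically (Def. 2.1)."""
    points = list(product(range(P + 1), range(N + 1)))   # all (p, n) pairs
    for (p1, n1), (p2, n2) in combinations(points, 2):
        try:
            d1 = h1(p1, n1) - h1(p2, n2)
            d2 = h2(p1, n1) - h2(p2, n2)
        except ZeroDivisionError:   # e.g. precision is undefined at (0, 0)
            continue
        if (d1 > 0) != (d2 > 0) or (d1 < 0) != (d2 < 0):
            return False
    return True
```

Antagonicity (Definition 2.2 below) can be tested the same way by negating one of the heuristics, in line with Lemma 2.3.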

Table II. Notational conventions for a confusion matrix of positive and negative examples covered or not covered by a rule.

                         covered by rule    not covered by rule
    positive examples    p                  P − p                  P
    negative examples    n                  N − n                  N
                         p + n              P + N − p − n          P + N

DEFINITION 2.2 (antagonistic). Two search heuristics h_1 and h_2 are antagonistic iff for all rules r, s: h_1(r) > h_1(s) ⇔ h_2(r) < h_2(s).

Obviously, compatibility and antagonicity are dual concepts:

LEMMA 2.3. h_1 and h_2 are antagonistic iff h_1 and −h_2 are compatible.
Proof. Follows immediately from h_2(r) > h_2(s) ⇔ −h_2(r) < −h_2(s).

Note that compatible or antagonistic heuristics have identical regions of equality:

DEFINITION 2.4 (equality-preserving). Two search heuristics h_1 and h_2 are equality-preserving iff for all rules r, s: h_1(r) = h_1(s) ⇔ h_2(r) = h_2(s).

THEOREM 2.5. Compatible or antagonistic search heuristics are equality-preserving.
Proof. Assume they were not equality-preserving. This means there exist rules r and s with h_1(r) = h_1(s) but h_2(r) ≠ h_2(s). Without loss of generality assume h_2(r) > h_2(s). This implies h_1(r) > h_1(s) (for compatibility) or h_1(r) < h_1(s) (for antagonicity). Both cases contradict the assumption.

Note that equality-preserving search heuristics are not necessarily compatible or antagonistic; this implication only holds if we make some straightforward continuity assumptions. Although all subsequently discussed functions are continuous, we note that this does not have to be the case. For all practical purposes, h(p, n) may be regarded as a lookup table that defines a value h for each integer-valued pair (n, p).

Based on the above, we can define the equivalence of search heuristics. The basic idea is that search heuristics are equivalent if they order a set of candidate rules in an identical way. The definition should also capture that some learning systems try to maximize a certain heuristic function, while others try to minimize an equivalent function (e.g., maximizing information gain is equivalent to minimizing entropy).

DEFINITION 2.6 (equivalence). Two search heuristics h_1 and h_2 are equivalent (h_1 ∼ h_2) if they are either compatible or antagonistic.

This definition also underlines the importance of isometrics for the analysis of search heuristics:

DEFINITION 2.7 (isometric). An isometric of a heuristic h is a line (or curve) in PN-space that connects, for some value c, all points (n, p) for which h(p, n) = c.

Equality-preserving search heuristics can be recognized by examining their isometrics and establishing that for each isometric line of h_1 there is a corresponding isometric of h_2. Compatible and antagonistic search heuristics can be recognized by investigating corresponding isometrics and establishing that their associated heuristic values are in the same (the opposite) order.

3. Covering Rule Learning Algorithms

The most popular strategy for learning classification rules is the covering or separate-and-conquer strategy. It has its roots in the early days of machine learning in the AQ family of rule learning algorithms (Michalski, 1969; Kaufman and Michalski, 2000). It is fairly popular in propositional rule learning (cf. CN2 (Clark and Niblett, 1989; Clark and Boswell, 1991), Ripper (Cohen, 1995), or CBA (Liu et al., 1998)) as well as in inductive logic programming (ILP) (cf. Foil (Quinlan, 1990) and its successors (Džeroski and Bratko, 1992; Fürnkranz, 1994; De Raedt and Van Laer, 1995; Quinlan and Cameron-Jones, 1995), or Progol (Muggleton, 1995)). In the following, we will briefly recapitulate the key issues of this learning strategy, with a particular focus on its behavior in PN-space. For a detailed survey of this large family of algorithms we refer to (Fürnkranz, 1999).

3.1. SEPARATE-AND-CONQUER LEARNING

The defining characteristic of this family of algorithms is that they successively learn rules that explain part of the available training examples. Examples that are covered by previously learned rules are separated from the training set, and the remaining examples are conquered by recursively calling the learner to find the next rule. This is repeated until all examples are explained by a rule, or until some other, external stopping criterion specifies that the current rule set is satisfactory. Most algorithms of this family operate in a concept learning framework, i.e., they learn the definition of an unknown concept from positive examples (examples for the concept) and negative examples (counter-examples). Each of the learned rules covers some (as many as possible) of the available positive examples, and possibly some (as few as possible) of the negative examples as well. Thus, the learned rule set describes the hidden concept.

Figure 2. Schematic depiction of the paths through PN-space that are described by (left) the covering strategy of learning a theory by adding one rule at a time and (right) greedy specialization of a single rule.

For classifying new examples, each rule is tried in order. If any of the learned rules fires for a given example, the example is classified as positive. If none of them fires, the example is classified as negative. This corresponds to the closed-world assumption in the semantics of rule sets (theories) and rules (clauses) in the PROLOG programming language.

3.2. COVERING IN PN-SPACE

Adding a rule to a rule set means that more examples are classified as positive, i.e., it increases the coverage of the rule set. All positive examples that are uniquely covered by the newly added rule contribute to an increase of the true positive rate on the training data. Conversely, covering additional negative examples may be viewed as increasing the false positive rate on the training data. Therefore, adding rule r_{i+1} to rule set R_i effectively moves from point R_i = (n_i, p_i) (corresponding to the numbers of negative and positive examples that are covered by previous rules) to a new point R_{i+1} = (n_{i+1}, p_{i+1}) (corresponding to the examples covered by the new rule set). Moreover, R_{i+1} will typically be closer to (N, P) and farther away from (0, 0) than R_i. Consequently, learning a rule set one rule at a time may be viewed as a path through PN-space, where each point on the path corresponds to the addition of a rule to the theory. Such a PN-path starts at (0, 0), which corresponds to the empty theory that does not cover any examples. Adding a rule moves to a new point in PN-space, which corresponds to a theory consisting of all rules that have been learned so far. After the final rule has been learned, one can imagine adding yet another rule with a body that is always true. Adding such a rule has the effect that the theory now classifies all examples as positive, i.e., it will take us to the point R_⊤ = (N, P).
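The covering loop and the PN-path it traces can be summarized in a few lines. This is a minimal sketch of the strategy described above, not the paper's algorithm; find_best_rule and covers are hypothetical stand-ins for a concrete learner's rule refinement and matching procedures:

```python
def separate_and_conquer(pos, neg, find_best_rule, covers):
    """Learn a rule set and return it together with its PN-path [(n_i, p_i), ...]."""
    P, N = len(pos), len(neg)
    theory, path = [], [(0, 0)]        # R_0, the empty theory, sits at (0, 0)
    while pos:                         # conquer the remaining positive examples
        rule = find_best_rule(pos, neg)
        if rule is None:               # external stopping criterion (Section 7)
            break
        theory.append(rule)
        # separate: remove all examples covered by the new rule
        pos = [e for e in pos if not covers(rule, e)]
        neg = [e for e in neg if not covers(rule, e)]
        path.append((N - len(neg), P - len(pos)))   # the new point R_{i+1}
    return theory, path
```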

Figure 2 shows the PN-path for a theory with three rules. Each point R_i represents the rule set consisting of the first i rules. The final rule set learned by a classifier will typically correspond to the last point on the curve before it reaches (N, P). Note, however, that this need not be the case. This will be briefly discussed in Section 6.3.

3.3. GREEDY SPECIALIZATION

While the covering loop is essentially the same for all members of the separate-and-conquer family of rule learning algorithms, the individual members differ in the way they learn single rules. Algorithms may use stochastic (Mladenić, 1993) or genetic search (Giordana and Sale, 1992), exhaustive search (Webb, 1995; Muggleton, 1995), or employ association rule algorithms for finding a candidate set of rules (Liu et al., 1998; Jovanoski and Lavrač, 2001). Again, we refer to (Fürnkranz, 1999) for a survey. The vast majority of algorithms uses a heuristic top-down hill-climbing or beam search strategy, i.e., they search the space of possible rules by successively specializing the current best rule. Rules are specialized by greedily adding the condition which promises the highest gain according to some evaluation metric. As with adding rules to a rule set, this successive refinement describes a path through PN-space (see Figure 2, right). However, in this case, the path starts at the upper right corner (covering all positive and negative examples), and successively proceeds towards the origin (which would be a rule that is too specific to cover any example).

Note that rule learning algorithms that are based on iterative refinement of candidate rules typically use the same metric for evaluating complete and incomplete rules, i.e., rules that can be added to the theory and rules that are only interior nodes in the search graph. While the evaluation of complete rules should measure the rule's potential for classifying unseen test cases, the evaluation of an incomplete rule should capture its potential to be refined into a high-quality complete rule. In this case, the evaluation metric is used as a search heuristic. We note that, in principle, different types of search heuristics are possible (cf. also Section 8), but, like all refinement-based rule learning algorithms, we will not further differentiate between evaluation metrics and search heuristics, and use the terms interchangeably in this paper.

3.4. OVERFITTING AVOIDANCE

In the simplest case, both for refining a rule set and refining a rule, all points on its PN-path are evaluated with an evaluation metric, and the rule set or rule that corresponds to the point with the highest evaluation is selected. However, such an approach may often be too simplistic, in particular because of the danger of overfitting.

A rule learner typically evaluates a large number of candidate rules, which makes it quite likely that one of them fits the characteristics of the training set by chance. For example, simply using the fraction of positive examples covered by the rule suffers from overfitting, because one can always find a rule that covers a single positive example and no negative examples by simply picking a unique positive example and transforming it into a rule. Such a rule has an optimal value of 1, but it is unlikely to be useful, because the obtained heuristic estimate is not reliable enough to generalize to unseen data. A simple way of countering overfitting is to explicitly specify the region of PN-space for which it is believed that the provided heuristic evaluations are too weak or too unreliable to be of interest. There are two principal approaches for dealing with this kind of problem: pre-pruning algorithms employ additional evaluation heuristics for filtering out unpromising rules or for stopping the refinement process, whereas post-pruning approaches deliberately learn an overfitting rule or theory and correct it in a post-processing phase. For a discussion of the two alternatives and some proposals for combining them we refer to (Fürnkranz, 1997). In this paper, we will confine ourselves to pre-pruning heuristics (Section 7).

4. Linear Rule Evaluation Metrics

The ultimate goal of learning is to reach point (0, P) in PN-space, i.e., to learn a correct theory that covers all positive examples, but none of the negative examples. This will rarely ever be achieved in a single step; a set of rules will be needed to meet this objective. The purpose of a rule evaluation metric is to estimate how close a rule takes you to this ideal point. In the following, we analyze the most commonly used metrics for evaluating the quality of a rule in covering algorithms.

4.1. BASIC HEURISTICS

A straightforward strategy for finding a rule that covers some positive and as few negative examples as possible is to minimize the number of covered negative examples for each individual rule (or maximize their negation):

    h_n = −n

This, however, does not capture the intuition that we want to cover as many positive examples as possible. For this, h_p, the number of covered positive examples, could be used:

    h_p = p

Figure 3. Isometrics for minimizing false positives and for maximizing true positives.

Figure 3 shows the isometrics for h_n and h_p: vertical and horizontal lines. All rules that cover the same number of negative (positive) examples are evaluated equally, irrespective of the number of positive (negative) examples they cover. However, it is trivial to find theories that maximize either of them: h_n is maximal for the empty theory R_0, which does not cover any negative examples, but also no positive examples, and h_p is maximal for the universal theory R_⊤, which covers all positive but also all negative examples. Ideally, one would like to achieve both goals simultaneously.

4.2. ACCURACY, WEIGHTED RELATIVE ACCURACY, GENERAL COSTS

A straightforward way of trading off covering many positives and excluding many negatives is to simply add up h_n and h_p:

    h_acc = p − n

The isometrics for this function are shown in the left graph of Figure 4. Note that the isometrics all have a 45° angle. Thus this search heuristic basically optimizes accuracy:

THEOREM 4.1. h_acc is equivalent to accuracy.
Proof. The accuracy of a theory (which may be a single rule) is the proportion of correctly explained examples, i.e., positive examples that are covered (p) and negative examples that are not covered (N − n), among all examples (P + N). Thus the isometrics are of the form (p + (N − n))/(P + N) = c. As P and N are constant, these can be transformed into the isometrics of h_acc: p − n = c_acc = c(P + N) − N.

Figure 4. Isometrics for accuracy and weighted relative accuracy.

Accuracy has been used as a pruning heuristic in I-REP (Fürnkranz and Widmer, 1994); a modification of h_acc, which also subtracts the length of a rule, has been used in PROGOL (Muggleton, 1995). Optimizing accuracy gives equal weight to covering a single positive example and excluding a single negative example. There are cases where this choice is arbitrary, for example when misclassification costs are not known in advance or when the samples of the two classes are not representative. In such cases, it may be advisable to normalize with the sample sizes:

    h_wra = p/P − n/N = TPR − FPR

The isometrics of this heuristic are shown in the right half of Figure 4. The main difference with accuracy is that the isometrics are parallel to the diagonal, which reflects that we now give equal weight to increasing the true positive rate (TPR) or to decreasing the false positive rate (FPR). Note that the diagonal encompasses all classifiers that make random predictions. h_wra may be viewed as a simplification of weighted relative accuracy, which has been proposed as a rule learning heuristic by Lavrač et al. (1999) and Todorovski et al. (2000):

THEOREM 4.2. h_wra is equivalent to weighted relative accuracy.
Proof. Weighted relative accuracy is defined as

    h'_wra = (p + n)/(P + N) · ( p/(p + n) − P/(P + N) )

Using equivalence-preserving transformations (multiplications with constant values like P + N), we obtain

    h'_wra = 1/(P + N) · ( p − (p + n)·P/(P + N) ) ∼ p·N/(P + N) − n·P/(P + N) ∼ pN − nP ∼ p/P − n/N = h_wra

The two PN-graphs of Figure 4 are special cases of a function that allows the incorporation of arbitrary cost ratios between false negatives and false positives. The general form of this linear cost metric is

    h_costs = a·p − b·n ∼ c·p − (1 − c)·n ∼ p − d·n

Obviously, the accuracy isometrics can be obtained with a = b = d = 1 or c = 1/2, and the isometrics of weighted relative accuracy can be obtained by setting a = 1/P and b = 1/N, or c = N/(P + N), or d = P/N. In general, the slope of the parallel isometrics in the PN-graph is (1 − c)/c.
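In code, the linear heuristics of this section are one-liners, and all of them are instances of h_costs for a suitable c (a sketch in the notation above):

```python
def h_costs(p, n, c):
    """Linear cost metric c*p - (1-c)*n; its isometrics have slope (1-c)/c."""
    return c * p - (1 - c) * n

def h_acc(p, n):          # equivalent to accuracy (Theorem 4.1): c = 1/2
    return p - n

def h_wra(p, n, P, N):    # simplified weighted relative accuracy: c = N/(P+N)
    return p / P - n / N  # = TPR - FPR
```

For example, compatible(h_acc, lambda p, n: h_costs(p, n, 0.5), 10, 15) with the checker sketched in Section 2.2 returns True; the two functions differ only by the constant positive factor 1/2.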

Figure 5. Isometrics for precision.

4.3. RECALL AND PRECISION, SUPPORT AND CONFIDENCE

The most commonly used heuristic for evaluating single rules is to look at the proportion of positive examples among all examples covered by the rule. This metric is known under many different names, e.g., confidence in association rule mining, or precision in information retrieval. We will use the latter term:

    h_prec = p / (p + n)

Figure 5 shows the isometrics for this heuristic. Like h_p, precision considers all rules that cover only positive examples to be equally good (the P-axis), and like h_n, it considers all rules that cover only negative examples as equally bad (the N-axis). All other isometrics are obtained by rotation around the origin (0, 0), for which the heuristic value is undefined. Uses of the pure precision heuristic in rule learning include (Pagallo and Haussler, 1990; Weiss and Indurkhya, 1991). However, several other, seemingly more complex heuristics can be shown to be equivalent to precision, for example the heuristic that is used for pruning in Ripper (Cohen, 1995):

THEOREM 4.3. Ripper's pruning heuristic h_rip = (p − n)/(p + n) is equivalent to precision.
Proof. h_rip = (p − n)/(p + n) = (2p − (p + n))/(p + n) = 2·h_prec − 1.

In subsequent sections, we will see that more complex heuristics, such as entropy and Gini index, are also essentially equivalent to precision. On the other hand, seemingly minor modifications like the Laplace and m-estimates are not.
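As a quick numerical illustration of Theorem 4.3, the compatibility checker sketched in Section 2.2 confirms that Ripper's pruning heuristic and precision induce identical orderings (the grid size is arbitrary):

```python
h_prec = lambda p, n: p / (p + n)             # undefined at (0, 0)
h_rip  = lambda p, n: (p - n) / (p + n)       # Ripper's pruning heuristic
assert compatible(h_prec, h_rip, P=20, N=30)  # identical orderings on the grid
```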

Figure 6. Isometrics for entropy (left) and Gini index (right).

Precision and confidence are more frequently used together with their respective counterparts, recall and support:

    h_rec = p / P = TPR

As P is constant, h_rec is obviously equivalent to h_p discussed above. Support is usually defined as the fraction of examples that satisfy both the head and the body of the rule, i.e., h_supp = p/(P + N) ∼ h_rec.

4.4. INFORMATION CONTENT, ENTROPY AND GINI INDEX

Some algorithms (e.g., PRISM (Cendrowska, 1987)) measure the information content

    h_info = −log₂( p / (p + n) )

THEOREM 4.4. h_info and h_prec are antagonistic and thus equivalent.
Proof. h_info = −log₂(h_prec), thus h_info(r) > h_info(s) ⇔ h_prec(r) < h_prec(s).

The use of entropy (in the form of information gain) is very common in decision tree learning (Quinlan, 1986), but it has also been suggested for rule learning in the original version of CN2 (Clark and Niblett, 1989):

    h_ent = −( p/(p + n) · log₂(p/(p + n)) + n/(p + n) · log₂(n/(p + n)) )

Entropy is not equivalent to information content and precision, even though it seems to have the same isometrics as these heuristics (see Figure 6). The difference is that the isometrics of entropy go through the undefined point (0, 0) and continue on the other side of the 45° diagonal.

The motivation for this is that the original version of CN2 did not assume a positive class, but labeled its rules with the majority class among the examples covered. Thus rules r = (n, p) and r' = (p, n) can be considered to be of equal quality, because one of them will be used for predicting the positive class and the other for predicting the negative class. Based on this, we can, however, prove the following:

THEOREM 4.5. h_ent and h_prec are antagonistic for p ≥ n and compatible for p ≤ n.
Proof. h_ent = −h_prec·log₂(h_prec) − (1 − h_prec)·log₂(1 − h_prec) with h_prec ∈ [0, 1]. This function has its maximum at h_prec = 1/2, i.e., p = n. From the fact that it is strictly monotonically increasing for p ≤ n follows that h_prec(x) < h_prec(y) ⇒ h_ent(x) < h_ent(y) in this region. Analogously, h_prec(x) < h_prec(y) ⇒ h_ent(x) > h_ent(y) for p ≥ n, where h_ent is monotonically decreasing in h_prec.

We will say that entropy is a class-neutral version of precision. In general, a class-neutral heuristic has isometrics that are line-mirrored across a symmetry line, typically the 45° line or the diagonal. A heuristic that is not class-neutral can be made so by re-scaling it to [−1, 1] and taking the absolute value. For instance, |p − n|/(p + n) is a class-neutral version of precision, and therefore (by construction) equivalent to entropy.

In decision tree learning, the Gini index is also a very popular heuristic (Breiman et al., 1984). To our knowledge, it has not been used in rule learning, but we list it for completeness:

    h_gini = 1 − ( p/(p + n) )² − ( n/(p + n) )² ∼ p·n/(p + n)²

As can be seen from Figure 6, the Gini index has the same isometric landscape as entropy; it only differs in the distribution of the values (hence the lines of the contour plot are a little denser near the axes and less dense near the diagonal). This, however, does not change the ordering of the rules.

THEOREM 4.6. h_gini and h_ent are equivalent.
Proof. Like entropy, the Gini index can be formulated in terms of h_prec (h_gini ∼ h_prec·(1 − h_prec)), and both functions have essentially the same shape, i.e., both functions grow and fall with h_prec in the same way.
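Both heuristics can be written as functions of precision alone, which is exactly what Theorems 4.5 and 4.6 exploit; a minimal sketch:

```python
from math import log2

def h_ent(p, n):
    q = p / (p + n)                    # precision of the rule
    if q in (0.0, 1.0):                # boundary convention: 0 * log(0) = 0
        return 0.0
    return -(q * log2(q) + (1 - q) * log2(1 - q))

def h_gini(p, n):
    q = p / (p + n)
    return 2 * q * (1 - q)             # ~ p*n/(p+n)^2, same ordering as h_ent
```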

Figure 7. Isometrics for the Laplace and m-estimates.

4.5. LAPLACE, m-ESTIMATE, F- AND g-MEASURE

The Laplace and m-estimates (Cestnik, 1990) are very common modifications of h_prec, which have, e.g., been used in the second version of CN2 (Clark and Boswell, 1991), m-Foil (Džeroski and Bratko, 1992), and ICL (De Raedt and Van Laer, 1995):

    h_lap = (p + 1) / (p + n + 2)

    h_m = ( p + m·P/(P + N) ) / (p + n + m)

The basic idea of these estimates is to assume that each rule covers a certain number of examples a priori. They compute a precision estimate, but start to count covered positive or negative examples at a number > 0. With the Laplace estimate, both the positive and negative coverage of a rule are initialized with 1 (thus assuming an equal prior distribution), while the m-estimate assumes a prior total coverage of m examples, which are distributed according to the distribution of positive and negative examples in the training set. In the PN-graphs, this modification results in a shift of the origin of the precision isometrics to the point (−n_m, −p_m), where n_m = p_m = 1 in the case of the Laplace heuristic, and p_m = m·P/(P + N) and n_m = m − p_m for the m-estimate (see Figure 7). The resulting isometric landscape is symmetric around the line that goes through (−n_m, −p_m) and (0, 0). Thus, the Laplace estimate is symmetric around the 45° line, while the m-estimate is symmetric around the diagonal of the PN-graph. Another noticeable effect of the transformation is that the farther the origin (−n_m, −p_m) moves away from (0, 0), the more the isometrics in the relevant window (0, 0)-(N, P) approach parallel lines.
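In code, both estimates are one-line modifications of precision (our sketch, using the notation above):

```python
def h_lap(p, n):
    """Laplace estimate: precision with prior coverage (1, 1)."""
    return (p + 1) / (p + n + 2)

def h_m(p, n, P, N, m):
    """m-estimate: prior coverage of m examples, split by the class distribution."""
    return (p + m * P / (P + N)) / (p + n + m)
```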

Figure 8. Isometrics of h_g for increasing values of g.

For example, the isometrics of the m-estimate converge towards the isometrics of weighted relative accuracy for m → ∞ (see Theorem 4.8 below). This is maybe best illustrated if we set p_m = 0, i.e., if we assume the rotation point to be on the N-axis. We will call this the g-measure:

    h_g = p / (p + n + g)

Figure 8 shows how, for g → ∞, the slopes of h_g's isometrics become flatter and flatter and converge towards the axis-parallel lines of h_rec. On the other hand, for g = 0, the measure is obviously equivalent to precision. Thus, the g-measure may be regarded as a simple way of trading off precision and recall, or support and confidence. In fact, it was already shown in (Flach, 2003) that for g = P, h_g is equivalent to the F-measure, a frequently used trade-off between recall and precision (van Rijsbergen, 1979).

Flach (2003) also proposed an equivalent simplification of the F-measure, which he called the G-measure (h_G = p/(n + P)). This, in turn, is a special case of a heuristic p/(n + g) that was recently proposed for subgroup discovery (Gamberger and Lavrač, 2002), and which is equivalent to the g-measure:

THEOREM 4.7. For g ≥ 0, h_g ∼ p/(n + g).
Proof. We show equivalence via compatibility:

    h_g(r_1) > h_g(r_2)
    ⇔ p_1/(p_1 + n_1 + g) > p_2/(p_2 + n_2 + g)
    ⇔ p_1·(p_2 + n_2 + g) > p_2·(p_1 + n_1 + g)
    ⇔ p_1·(n_2 + g) > p_2·(n_1 + g)
    ⇔ p_1/(n_1 + g) > p_2/(n_2 + g)

4.6. THE GENERALIZED m-ESTIMATE

The above discussion leads us to the following straightforward generalization of the m-estimate, which takes the rotation point of the precision isometrics as a parameter:

    h_gm = (p + m·c) / (p + n + m) = (p + a) / ( (p + a) + (n + b) )

The second version of the heuristic basically defines the rotation point by specifying its co-ordinates (−b, −a) in PN-space (a, b ∈ [0, ∞)). The first version uses m as a measure of how far from the origin the rotation point lies, using the sum of the co-ordinates as a distance measure. Hence, all points with distance m lie on the line that connects (0, −m) with (−m, 0), and c specifies where on this line the rotation point lies. For example, c = 1 means (0, −m), whereas c = 0 denotes (−m, 0), resulting in the g-measure. The line that connects the rotation point and (0, 0) has a slope of c/(1 − c). Obviously, both versions of h_gm can be transformed into each other by choosing m = a + b and c = a/(a + b), or a = m·c and b = m·(1 − c).

THEOREM 4.8. For m = 0, h_gm is equivalent to h_prec, while for m → ∞, its isometrics converge to those of h_costs.
Proof. m = 0: trivial. m → ∞: By construction, an isometric of h_gm through the point (n, p) connects this point with the rotation point (−(1 − c)·m, −c·m) and has the slope (p + c·m)/(n + (1 − c)·m). For m → ∞, this slope converges to c/(1 − c) for all points (n, p). Thus all isometrics converge towards parallel lines with the slope c/(1 − c).
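The generalized m-estimate is just as short in code; a sketch, with the special cases noted in comments:

```python
def h_gm(p, n, m, c):
    """Generalized m-estimate; rotation point (-(1-c)*m, -c*m) in PN-space."""
    return (p + m * c) / (p + n + m)

# Special cases: m = 0 gives precision; c = 0 gives the g-measure with g = m;
# m = 2 and c = 1/2 give the Laplace estimate; for m -> infinity the
# isometrics approach those of the linear cost metric (Theorem 4.8).
```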

Theorem 4.8 shows that h_gm may be considered as a general model of heuristic functions with linear isometrics that has two parameters: c ∈ [0, 1] for trading off the misclassification costs between the two classes, and m ∈ [0, ∞) for trading off between precision h_prec and the linear cost metric h_costs. [1] Therefore, all heuristics discussed in this section may be viewed as equivalent to some instantiation of this general model.

[1] The reader may have noted that for m → ∞, h_gm → c for all p and n. Thus for m = ∞, the function does not have isometrics, because all evaluations are constant. However, this is not a problem for the above construction, because we are not concerned with the isometrics of the function h_gm at the point m = ∞, but with the convergence of the isometrics of h_gm for m → ∞. In other words, the isometrics of h_costs are not equivalent to the isometrics of h_gm for m = ∞, but they are equivalent to the limits to which the isometrics of h_gm converge as m → ∞.

5. Non-Linear Rule Evaluation Metrics

All heuristics considered so far have linear isometrics. This is a reasonable assumption, as concepts are typically evaluated with linear cost metrics (e.g., accuracy or cost matrices). However, these evaluation metrics are concerned with evaluating complete rules. As the value of an incomplete rule lies not in its ability to discriminate between positive and negative examples, but in its potential of being refinable into a high-quality rule, it might well be the case that different types of heuristics are useful for evaluating incomplete candidate rules. One could, e.g., argue that for candidate rules that cover many positive examples it is less important to exclude negatives than for rules with low coverage, because high-coverage candidates may still be refined accordingly. Similarly, overfitting could be countered by penalizing regions with low coverage. Possibly, these and similar problems could be better addressed with a non-linear isometric landscape, even though the learned rules will eventually be used under linear cost models. In this section, we will look at two non-linear metrics: Foil's information gain, and the correlation heuristic used in Fossil.

5.1. FOIL'S INFORMATION GAIN

Foil's version of information gain (Quinlan, 1990), unlike ID3's and C4.5's version (Quinlan, 1986), is tailored to rule learning, where one only needs to optimize one successor branch, as opposed to multiple successor nodes in decision tree learning. It differs from the heuristics mentioned so far in that it does not evaluate an entire rule, but only the effect of specializing a rule by adding a condition. More precisely, it computes the difference in information content of the current rule and its predecessor r', weighted by the number of covered positive examples (as a bias for generality).

Figure 9. Isometrics for information gain as used in Foil. The curves show different values c for the precision of the parent rule.

The exact formula is [2]

    h_foil = p · ( log₂(p/(p + n)) − log₂ c )

where c = h_prec(r') is the precision of the parent rule. For the following analysis, we will view c as a parameter taking values in the interval [0, 1]. Figure 9 shows the isometrics of h_foil for four different settings of c. Although the isometrics are non-linear, they appear to be linear in the region above the isometric that goes through (0, 0). Note that this isometric, which we will call the base line, has a slope of c/(1 − c): in the first graph (c = P/(P + N)) it is the diagonal, in the second graph (c = 1/2) it has a 45° slope, and in the lower two graphs (c = 1 and c = 10⁻⁶) it coincides with the vertical and horizontal axes, respectively.

[2] This formulation assumes that we are learning in a propositional setting. For relational learning, Foil does not estimate the precision from the number of covered instances, but from the number of proofs for those instances.
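A sketch of the formula in code; treating the gain as zero for p = 0 is our convention for the undefined logarithm, not Foil's:

```python
from math import log2

def h_foil(p, n, c):
    """Foil's information gain for a refinement covering (p, n), parent precision c."""
    if p == 0:
        return 0.0                     # convention: no positives covered, no gain
    return p * (log2(p / (p + n)) - log2(c))
```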

Note that the base line represents the classification performance of the parent rule. The area below the base line contains the cases where the precision of the rule is smaller than the precision of its predecessor. Such a refinement of a rule is usually not considered to be relevant. In fact, this is also the region where the information gain is negative, i.e., where we have an information loss. The points on the base line have information gain 0, and all points above it have a positive gain. For this reason, we will focus our analysis on the region above the base line, which is the area where the refined rule improves upon its parent. In this region, the isometrics appear to be linear and parallel to the base line, just like the isometrics of h_costs. In fact, in (Fürnkranz and Flach, 2003), we conjectured that in this part of the ROC space, h_foil is equivalent to h_costs with cost parameter 1 − c. However, closer investigation reveals that this is not the case. The farther the isometrics move away from the base line, the steeper they get. Also, they are not exactly linear, but slightly bent towards the P-axis, i.e., purer rules are preferred. Nevertheless, it is obvious that both heuristics are closely related, and in fact we can show the following:

THEOREM 5.1. Let c = h_prec(r') be the precision of r', the predecessor of rule r. In the relevant region p/(p + n) > c, h_costs(r) with cost parameter 1 − c is equivalent to the first-order Taylor approximation of h_foil(r).
Proof.

    h_foil = p · ( log₂(p/(p + n)) − log₂ c ) = (p/ln 2) · ln( p/(c·(p + n)) )

The Taylor expansion of ln(1 + x) converges for −1 < x ≤ 1, and the first-degree approximation is ln(1 + x) ≈ x. Recall that the interesting region of h_foil is where p/(p + n) ≥ c, i.e., where the precision of the current rule r exceeds the precision of the parent rule r' (otherwise there would be no point in refining r' to r). In this region, 0 < c ≤ c·(p + n)/p ≤ 1. Thus

    h_foil = −(p/ln 2) · ln( c·(p + n)/p ) = −(p/ln 2) · ln( 1 + (c·(p + n)/p − 1) )
           ≈ −(p/ln 2) · ( c·(p + n)/p − 1 ) = (1/ln 2) · ( p − c·(p + n) ) = (1/ln 2) · ( (1 − c)·p − c·n ) ∼ (1 − c)·p − c·n

It should be pointed out that Theorem 5.1 should not be interpreted as saying that h_costs is a good approximation of h_foil. In fact, while the first-order Taylor approximation of ln(1 + x) is quite good near x = 0, it can become arbitrarily bad for x → −1.

Figure 10. Isometrics for the four-field correlation coefficient.

In our case, we can expect a good approximation if c is close to 1, and a rather bad one if c approaches 0, because 0 < c ≤ x + 1 = c·(p + n)/p ≤ 1. On the other hand, it is not strictly necessary to have a good approximation in order to maintain the isometric landscape of a function, as we have seen on several examples of equivalent heuristics in the previous section. There is a difference between one heuristic approximating another, and its isometrics approximating the other's. A full analysis of Foil's information gain requires a fuller understanding of what it means to approximate an isometric landscape, which we leave for future work.

5.2. CORRELATION AND χ² STATISTIC

The correlation heuristic has been proposed in (Fürnkranz, 1994) for the rule learner Fossil, a variant of Foil. It computes the correlation coefficient between the predictions made by the rule and the true labels from a four-field confusion matrix. The formula is

    h_corr = ( p·(N − n) − (P − p)·n ) / √( P·N·(p + n)·(P + N − p − n) ) = ( pN − Pn ) / √( P·N·(p + n)·(P + N − p − n) )

Recall that p and N − n denote the fields on the diagonal of the confusion matrix (the true positives and the true negatives), whereas n and P − p denote the false positives and false negatives, respectively. The terms under the square root in the denominator are the sums of the rows and columns of the confusion matrix.

Figure 10 shows the isometric plot of the correlation heuristic. The isometrics are clearly non-linear and symmetric around the base line on the diagonal (rules that have correlation 0). This symmetry around the diagonal is also characteristic of the linear weighted relative accuracy heuristic h_wra, so it is interesting to compare the properties of these two metrics.
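A direct transcription of the formula (our sketch; note that the expression is undefined for rules covering no examples or all examples):

```python
from math import sqrt

def h_corr(p, n, P, N):
    """Four-field correlation coefficient (Fossil); zero on the diagonal pN = Pn."""
    return (p * N - P * n) / sqrt(P * N * (p + n) * (P + N - p - n))
```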

Above that base line, the isometrics are bent towards the P and N axes, respectively. Thus, h_corr has a tendency to prefer purer rules (covering fewer negative examples) and more complete rules (covering more positive examples). Moreover, the linear isometrics that result from connecting the points where the isometrics intersect the n = 0 axis (on the left) and the p = P axis (on the top) are not parallel to the diagonal (as they are for h_wra). In fact, it can be shown that the slope of these lines is (1 − p₀/N)/(1 − p₀/P), where (0, p₀) is the intersection of the isometric with the P-axis. Obviously, these lines are only parallel for a class-balanced problem (P = N). For P < N, the slope of the isometrics will increase for increasing p₀ and approach infinity for p₀ → P. Thus, the closer a rule gets to the target point (0, P), the more important it will become to exclude remaining covered negative examples, and the less important to cover additional positive examples. For example, the rule (0, P − 1), which covers all but one positive example and no negatives, and the rule (1, P), which covers all positive examples, have practically identical evaluations. On the other hand, the rule (0, 1), covering one positive and no negatives, has an evaluation proportional to 1/√P and is therefore preferred over the rule (N − 1, P) that covers all positives but excludes only a single negative example (that rule has an evaluation proportional to 1/√N). Keep in mind that we assumed P < N here; the opposite preference would be the case for P > N.

In summary, the correlation heuristic has a tendency to prefer purer or more complete rules. Moreover, for rules that are far from the target (0, P), it gives higher preference to handling the minority class (i.e., covering positive examples for P < N and excluding negative examples for N < P), whereas the priorities for these two objectives become more and more equal the closer a rule is to the target. [3]

[3] The preferences would even reverse if fractional examples (i.e., coverage counts smaller than 1) were possible. We do not consider this case here.

Note that the four-field correlation coefficient is basically a normalized version of the χ² statistic over the four events in the table (see, e.g., Mittenecker, 1983, or other textbooks on applied statistics).

THEOREM 5.2. h_chi2 = (P + N) · h_corr².
Proof. The χ² statistic for a four-field confusion matrix (cf. Table II) is the sum of the squared differences between the expected and the observed values of the four fields in the matrix, divided by the expected values. The expected values for the four fields in the 2×2 confusion matrix are

    P·(p + n)/(P + N)    P·(P + N − p − n)/(P + N)
    N·(p + n)/(P + N)    N·(P + N − p − n)/(P + N)

Subtracting these from the actual values (Table II) shows that all four fields deviate from their expectation by the same absolute amount:

    p − P·(p + n)/(P + N) = (pN − Pn)/(P + N)

and analogously ±(pN − Pn)/(P + N) for the three remaining fields. Squaring these differences, dividing them by the expected values, and summing up the four fields yields

    h_chi2 = (pN − Pn)²/(P + N)² · ( (P + N)/(P·(p + n)) + (P + N)/(N·(p + n)) + (P + N)/(P·(P + N − p − n)) + (P + N)/(N·(P + N − p − n)) )
           = (pN − Pn)²/(P + N) · ( N·(P + N − p − n) + P·(P + N − p − n) + N·(p + n) + P·(p + n) ) / ( P·N·(p + n)·(P + N − p − n) )
           = (P + N) · (pN − Pn)² / ( P·N·(p + n)·(P + N − p − n) ) = (P + N)·h_corr²

Thus, h_chi2 may be viewed as a class-neutral version of h_corr. Its behavior has previously been analyzed in ROC space: Bradley (1996) argued that the non-linear isometrics of the χ² statistic help to discriminate classifiers that do little work from classifiers that achieve the same accuracy (or cost line) but are preferable in terms of other metrics like sensitivity and specificity. The above-mentioned paper also shows how h_chi2 can be modified to incorporate other cost models.
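Theorem 5.2 is easy to spot-check numerically; this sketch assumes the h_corr function from the previous sketch is in scope, and the concrete numbers are arbitrary:

```python
def h_chi2(p, n, P, N):
    """Chi-squared statistic of the four-field confusion matrix (Table II)."""
    obs = (p, P - p, n, N - n)
    exp = (P * (p + n) / (P + N), P * (P + N - p - n) / (P + N),
           N * (p + n) / (P + N), N * (P + N - p - n) / (P + N))
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

P, N, p, n = 40, 60, 10, 5
assert abs(h_chi2(p, n, P, N) - (P + N) * h_corr(p, n, P, N) ** 2) < 1e-9
```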

Finally, we draw attention to the fact that the Gini splitting criterion, defined as the impurity decrease from parent to children where impurity is measured by the Gini index (see Section 4.4), is equivalent to χ². Here, we restrict attention to binary splits and two classes: in this case, (P, N) can be interpreted as the coverage of the unsplit parent, and (p, n) and (P − p, N − n) denote the coverage of the two children. Gini-split is then defined as

    h_gini-split = P·N/(P + N)² − (p + n)/(P + N) · p·n/(p + n)² − (P + N − p − n)/(P + N) · (P − p)·(N − n)/(P + N − p − n)²

where the first term is the Gini index of the parent, and the second and third terms are the Gini indices of the children, weighted with their coverage. This expression can be simplified to

    h_gini-split = 1/(P + N) · ( P·N/(P + N) − p·n/(p + n) − (P − p)·(N − n)/(P + N − p − n) )        (1)

THEOREM 5.3. h_gini-split ∼ h_chi2.
Proof. The expression inside the brackets in Equation (1) can be rewritten as a fraction with denominator (P + N)·(p + n)·(P + N − p − n) and numerator

    P·N·(p + n)·(P + N − p − n) − p·n·(P + N)·(P + N − p − n) − (P − p)·(N − n)·(P + N)·(p + n)

Multiplying out all products, all terms cancel except those of (pN − Pn)², so we have

    h_gini-split = 1/(P + N) · (pN − Pn)² / ( (P + N)·(p + n)·(P + N − p − n) )
                 = P·N/(P + N)² · (pN − Pn)² / ( P·N·(p + n)·(P + N − p − n) ) = P·N/(P + N)³ · h_chi2

We conjecture that this result can be generalised to non-binary splits and more than two classes, but this is currently left as an open problem.

6. Evaluation of Rule Sets

So far, we have been considering heuristics for evaluating single rules, which map to points in PN-space. In this section we look at the evaluation of sets of rules. We show that the sequence of training sets and intermediate hypotheses can be viewed as a trajectory through PN-space, resulting in a set of nested PN-spaces (Section 6.1). We then proceed by investigating the behavior of precision and the linear cost metric in this context, and show that precision aims at (locally) optimizing the area under the curve (ignoring costs), while the cost metric tries to directly find a (global) optimum under known or assumed costs (Section 6.2). Finally, we briefly discuss the effect of re-ordering the rules in a rule set (Section 6.3).

Figure 11. Isometrics for accuracy and precision in nested PN-spaces.

6.1. COVERING AND NESTED PN-SPACES

Of particular interest for the covering approach is the property that PN-graphs reflect a change in the total number or proportion of positive (P) and negative (N) training examples via a corresponding change in the relative sizes of the P and N axes. ROC analysis, on the other hand, would rescale the new dimensions to the range [0, 1], which has the effect of changing the slope of all lines that depend on the relative sizes of p and n. As a consequence, the PN-graph for a subset of a training set can be drawn directly into the PN-graph of the entire set. In particular, the sequence of training sets that are produced by the recursive calls of the covering strategy (after each new rule, all training examples that are covered by this rule are removed from the training set, and the learner calls itself on the remaining examples) can be visualized by a nested sequence of PN-graphs.

This is illustrated in Figure 11, which shows a nested sequence of PN-spaces. Each space PN_i has its origin in the point R_i = (n_i, p_i), which corresponds to the theory learned so far. Thus, its learning task consists of learning a theory for the remaining P − p_i positive and N − n_i negative examples that are not yet covered by theory R_i. Each new rule will cover a few additional examples and thus reduce the corresponding PN-space.

Note that some separate-and-conquer algorithms only remove the covered positive examples, but not the covered negative examples. This situation corresponds to reducing only the height of the PN-graph, but not its width. Obviously, this breaks the nesting property, because rules are locally evaluated in a PN-graph with full width (all negative examples), but for getting their global evaluation in the context of previously learned rules, all previously covered negative examples have to be removed. While we do not expect a big practical difference between the two approaches, removing all examples seems to be the conceptually cleaner solution.

6.2. GLOBAL AND LOCAL OPTIMA

The nesting property implies that evaluating a single rule in the reduced dataset (N − n_i, P − p_i) amounts to evaluating the point in the subspace PN_i that has the point (n_i, p_i) as its origin. Moreover, the rule that maximizes the chosen evaluation function in PN_i is also a maximum point in the full PN-space if the evaluation metric does not depend on the size or the shape of the PN-graph. The linear cost metric h_costs is such a heuristic: a local optimum in the subspace PN_i is also optimal in the global PN-space, because all isometrics are parallel lines with the same angle, and nested PN-spaces (unlike nested ROC-spaces) leave angles invariant.

Precision, on the other hand, cannot be nested in this way. The evaluation of a given rule depends on its location relative to the origin of the current subspace PN_i. This is illustrated in Figure 11. The subspaces PN_i correspond to the situation after removing all examples covered by the rule set R_i. The left graph of Figure 11 shows the case for accuracy: the accuracy isometrics are all parallel lines with a 45° slope. The right graph shows the situation for precision: each subspace PN_i evaluates the rules relative to its origin, i.e., h_prec always rotates around (n_i, p_i) in the full PN-space, which is (0, 0) in the local space. Note that global optimization for precision (or for h_gm in general) could be realized by making the generalized m-estimate h_gm adaptive, choosing a = p_i and b = n_i in subspace PN_i.

However, local optimization is not necessarily bad. At each point (n, p), the slope of the line connecting (0, 0) with (n, p) is (1 − c)/c = p/n. [4] h_prec picks the rule that promises the steepest ascent from the origin. Thus, if we assume that the origin of the current subspace is a point on a ROC-curve (or a PN-path), we may interpret h_prec as making a locally optimal choice for continuing the ROC-curve. The choice is optimal in the sense that picking the rule with the steepest ascent locally maximizes the area under the ROC-curve. Note, however, that if we know the cost model, this choice may not necessarily be globally optimal. For example, if the point R_2 in Figure 11 could be reached in one step, h_acc would directly go there, because it has the better global value under the chosen cost model (accuracy), whereas h_prec would nevertheless first learn R_1, because it promises a greater area under the ROC curve.

[4] One may say that h_prec assumes a different cost model for each point in the space, depending on the relative frequencies of the covered positive and negative examples. Such local changes of cost models are investigated in more detail by Flach (2003).

Figure 12. A concavity in the path of a rule learner (left), and the result of a successful fix by swapping the order of the second and third rule (right).

In brief, we may say that h_prec aims at optimizing under unknown costs by (locally) maximizing the area under the ROC curve, whereas h_costs tries to directly find a (global) optimum under known (or assumed) costs.

6.3. REORDERING RULE SETS

In the previous sections we have seen that learning a set of rules describes a path through PN-space, starting at the point R_0 = (0, 0) (the theory covering no examples) and ending at R_⊤ = (N, P) (the theory covering all examples). Each intermediate rule set R_i is a potential classifier, and one of them has to be selected as the final classifier. However, the curve described by the learner will frequently not be convex. The left graph of Figure 12 shows a case where the rule set R_2, consisting of the rules {r_1, r_2}, is not on the convex hull of the classifiers. Thus, only R_1 = {r_1} and R_3 = {r_1, r_2, r_3} are potential candidates for the final classifier.

Interestingly, in some cases it may be quite easy to obtain better classifiers by simply re-ordering the rules. The right graph of Figure 12 shows the best result that can be obtained by swapping the order of rules r_2 and r_3, resulting in the classifier R'_2 = {r_1, r_3}. Note, however, that this graph is drawn under the assumption that rules r_2 and r_3 cover disjoint sets of examples, in which case the quadrangle R_1 R_2 R_3 R'_2 is a parallelogram. In general, however, r_3 may cover some of the examples that were previously already covered by r_2, which may change the shape of the quadrangle considerably, depending on whether the majority of the overlap are positive or negative examples. Thus, it is not guaranteed that swapping the rules will indeed increase the area under the ROC-curve or make the curve convex.

In this context it is interesting to make a connection to the work by Ferri et al. (2002). They proposed a method to relabel decision trees in the following way. First, all k leaves (or branches) are sorted in decreasing order according to h_prec. Then, a valid labelling is obtained by splitting the ordering in two, and labelling all leaves before the split point positive and the rest negative. This results in k + 1 labelings, and the corresponding ROC curve is necessarily convex. An optimal point can be chosen in the usual way, once the cost model is known. In our context, ordering branches corresponds to ordering rules, and relabeling corresponds to picking the right subset of the learned rules: labeling a rule as positive corresponds to including the rule in the theory, while labeling a rule as negative effectively deletes the rule, because it will be subsumed by the final default rule that always predicts the negative class. However, the chosen subset is only guaranteed to be optimal if the rules are mutually exclusive. Ferri et al. (2002) use this technique to derive a novel splitting criterion for decision tree learning. Presumably, a similar criterion could be derived for rule learning, which we regard as a promising topic for future work.

Finally, we note that Flach and Wu (2003) have addressed the problem of repairing concavities in ROC curves in the context of the naive Bayesian classifier. We are currently investigating whether similar techniques are applicable to rule learners.

7. Stopping and Filtering Criteria

In addition to their regular evaluation metric, many rule learning algorithms employ separate criteria to filter out uninteresting candidates and/or to fight overfitting. There are two slightly different approaches: stopping criteria determine when the refinement process should stop and the current best candidate should be returned, whereas filtering criteria determine regions of acceptable performance. Both concepts are closely related. In particular, filtering criteria are often also used as stopping criteria: if no further rule can be found within the acceptable region of a filtering criterion, the learned theory is considered to be complete. Basically the same technique is also used for refining single rules: if no refinement is in the acceptable region, the rule is considered to be complete, and the specialization process stops. For this reason, we will often use the term stopping criterion instead of filtering criterion, because this is the more established terminology.

The most popular approach is to impose a threshold upon one or more evaluation metrics. Thus, this type of filtering criterion consists of a heuristic function, possibly but not necessarily the same one that is used as a search heuristic, and a threshold value specifying that only rules with an evaluation above this value are of interest. The threshold may be chosen by the user or be automatically adapted to certain characteristics of the rule (e.g., its encoding length).

7. Stopping and Filtering Criteria

In addition to their regular evaluation metric, many rule learning algorithms employ separate criteria to filter out uninteresting candidates and/or to fight overfitting. There are two slightly different approaches: stopping criteria determine when the refinement process should stop and the current best candidate should be returned, whereas filtering criteria determine regions of acceptable performance. Both concepts are closely related. In particular, filtering criteria are often also used as stopping criteria: if no further rule can be found within the acceptable region of a filtering criterion, the learned theory is considered to be complete. Basically the same technique is also used for refining single rules: if no refinement is in the acceptable region, the rule is considered to be complete, and the specialization process stops. For this reason, we will often use the term stopping criterion instead of filtering criterion, because this is the more established terminology.

The most popular approach is to impose a threshold upon one or more evaluation metrics. Thus, this type of filtering criterion consists of a heuristic function, possibly but not necessarily the same one that is used as a search heuristic, and a threshold value specifying that only rules with an evaluation above this value are of interest. The threshold may be chosen by the user or be adapted automatically to certain characteristics of the rule (e.g., its encoding length).

In the following, we analyze a few popular filtering and stopping criteria for greedy specialization: minimum coverage constraints, support and confidence, significance tests, encoding length restrictions, and Fossil's correlation cutoff. We use PN-space to analyze stopping criteria, by visualizing regions of acceptable hypotheses.

7.1. MINIMUM COVERAGE CONSTRAINTS

Figure 13. Thresholds on minimum coverage of positive examples (left) and total number of examples (right).

The simplest form of overfitting avoidance is to disregard rules with low coverage. For example, one could require that a rule covers a certain minimum number of examples or a minimum number of positive examples. These two cases are illustrated in Figure 13. The graph on the left shows the requirement that a minimum fraction (here 20%) of the positive examples in the training set is covered by the rule. All rules in the gray area are thus excluded from consideration. The right graph illustrates the case where a minimum fraction (here 20%) of all examples needs to be covered by the rule, regardless of whether they are positive or negative. Changing the size of the fraction will cut out different slices of PN-space, each delimited by a coverage isometric (a line with slope -45 degrees). Clearly, in both cases the goal is to fight overfitting by filtering out rules whose quality cannot be reliably estimated because of the small number of training examples they cover.

Notice that we can include a cost model in calculating the coverage of positive and negative examples, thereby changing the slope of the coverage isometrics. The two cases in Figure 13 are then special cases of this general cost-based coverage metric.
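A minimal sketch of these filters; the weight c is our own parameter implementing the cost-based generalization just mentioned, and is not notation from the paper.

```python
def min_positive_coverage(p, P, min_frac=0.2):
    """Left graph of Figure 13: cover at least a fraction of all positives."""
    return p >= min_frac * P

def min_weighted_coverage(p, n, P, N, min_frac=0.2, c=0.5):
    """Right graph, generalized: minimum cost-weighted total coverage.

    c = 0.5 weights both classes equally (plain coverage, -45 degree
    isometrics); other values of c tilt the isometrics as described above.
    """
    return (1 - c) * p + c * n >= min_frac * ((1 - c) * P + c * N)

P, N = 100, 200
print(min_positive_coverage(15, P))            # False: 15 < 20% of 100
print(min_weighted_coverage(30, 40, P, N))     # True: 35 >= 30
```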

7.2. SUPPORT AND CONFIDENCE

Figure 14. Filtering rules with minimum support and minimum confidence.

There is no reason why a single measure should be used for filtering out unpromising rules. The most prominent example of combining multiple estimates are the thresholds on support and confidence that are used mostly in association rule mining algorithms, but also in classification algorithms that obtain the candidate rules for the covering loop in an association rule learning framework (Liu et al., 1998; Liu et al., 2000; Jovanoski and Lavrač, 2001).

Figure 14 illustrates the effect of thresholds on support and confidence in PN-space. Together, they specify an area of valid rules around the (0, P)-point of PN-space. Rules in the grey areas will be filtered out. The dark grey region shows a less restrictive combination of the two thresholds, the light grey region a more restrictive setting. In effect, confidence constrains the quality of the rules, whereas support aims at ensuring a minimum reliability by filtering out rules whose confidence estimate originates from too few positive examples. The combined effect is quite similar to a threshold on the g- or F-measures (Section 4.5), but it allows for more rules in the region near the intersection of the two threshold lines.

7.3. CN2'S SIGNIFICANCE TEST

CN2 (Clark and Niblett, 1989; Clark and Boswell, 1991) filters out rules for which the distribution of the covered examples is not statistically significantly different from the distribution of examples in the full data set. To this end, it computes the likelihood ratio statistic

$$h_{lrs} = 2\left(p \log\frac{p}{e_p} + n \log\frac{n}{e_n}\right)$$

where $e_p = (p+n)\frac{P}{P+N}$ and $e_n = (p+n)\frac{N}{P+N} = (p+n) - e_p$ are the numbers of positive and negative examples one would expect if the p + n examples covered by the rule were distributed in the same way as the P + N examples in the full data set.
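A small sketch of this test, assuming natural logarithms with the convention 0 log 0 = 0, and taking the critical values of the χ² distribution with one degree of freedom from scipy:

```python
from math import log
from scipy.stats import chi2

def h_lrs(p, n, P, N):
    """Likelihood ratio statistic of a rule covering p positives, n negatives."""
    e_p = (p + n) * P / (P + N)   # expected positives under random coverage
    e_n = (p + n) * N / (P + N)   # expected negatives
    term = lambda o, e: o * log(o / e) if o > 0 else 0.0
    return 2 * (term(p, e_p) + term(n, e_n))

def significant(p, n, P, N, confidence=0.95):
    """Keep the rule only if h_lrs exceeds the chi-squared critical value."""
    return h_lrs(p, n, P, N) > chi2.ppf(confidence, df=1)

P, N = 100, 100
print(significant(10, 2, P, N))          # True: clearly non-random coverage
print(significant(6, 4, P, N, 0.99))     # False: too close to the diagonal
```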

Figure 15. Illustration of CN2's significance test. Shown are the regions that would not pass a 95% (dark grey) and a 99% (light grey) significance test.

Figure 15 illustrates CN2's filtering criterion. The dark grey area shows the location of the rules that will be filtered out because it cannot be established with 95% confidence that their distribution is different from the distribution in the full data set. The light grey area shows the set of rules that will be filtered out if 99% confidence in the difference is required. Consequently, the area is symmetric around the diagonal, which corresponds to random guessing.

Other significance tests than the likelihood ratio could be used. For instance, we have already discussed h_chi2 in Section 5.2. In the case of a two-by-two contingency table, both h_chi2 and h_lrs are distributed as χ² with one degree of freedom, but that does not mean that they are equivalent as heuristics. In fact, there are some interesting differences between the isometric landscapes in Figures 10 and 15. More specifically, the likelihood ratio isometrics in Figure 15 do not depend on the size of the training set. Any graph corresponding to a bigger training set with the same class distribution (a proportion of P/(P+N) positive examples) will contain this graph in its lower left corner. In other words, whether a rule covering p positive and n negative examples is significant according to h_lrs does not depend on the size of the training set, but only on the class distribution of the examples it covers.

In contrast, h_chi2 always fits the same isometric landscape into PN-space. As a result, the evaluation of a point (n, p) depends on its relative location (n/N, p/P). In this case, the same rule will always be evaluated in the same way, as long as it covers the same fraction of the training data. Only the location of the significance cutoff isometric (e.g., 95%) depends on P and N in the case of h_chi2.
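This sketch (restating h_lrs from above, and assuming h_chi2 is the standard χ² statistic of the two-by-two covered/uncovered versus positive/negative table, which we take to match the paper's h_chi2 of Section 5.2) confirms both behaviors numerically. Scaling the entire problem by a factor k multiplies h_chi2 by k; this monotone transformation preserves the isometric landscape, but it moves the fixed critical value relative to it, which is why the cutoff isometric depends on P and N.

```python
from math import log

def h_lrs(p, n, P, N):
    e_p, e_n = (p + n) * P / (P + N), (p + n) * N / (P + N)
    term = lambda o, e: o * log(o / e) if o > 0 else 0.0
    return 2 * (term(p, e_p) + term(n, e_n))

def h_chi2(p, n, P, N):
    # Standard chi-squared statistic of the 2x2 contingency table.
    num = (P + N) * (p * (N - n) - n * (P - p)) ** 2
    return num / ((p + n) * (P + N - p - n) * P * N)

p, n, P, N = 30, 10, 100, 100
# Same rule, ten times more data, same class distribution: h_lrs unchanged.
print(h_lrs(p, n, P, N) == h_lrs(p, n, 10 * P, 10 * N))             # True
# Scaling rule and data set together: h_chi2 is multiplied by the factor,
# so rankings depend only on the relative location (n/N, p/P).
print(h_chi2(10 * p, 10 * n, 10 * P, 10 * N) / h_chi2(p, n, P, N))  # 10.0
```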

7.4. FOIL'S ENCODING LENGTH RESTRICTION

Figure 16. Illustration of Foil's encoding length restriction for domains with P < N (left) and P > N (right). Lighter shades of gray correspond to larger encoding lengths for the rule.

Foil (Quinlan, 1990) uses a criterion based on the minimum description length (MDL) principle for deciding when to stop refining the current rule. For explicitly indicating the p positive examples covered by the rule, one needs $h_{MDL}$ bits:

$$h_{MDL} = \log_2(P+N) + \log_2\binom{P+N}{p}$$

This number is then compared to the number of bits needed to encode the rule, denoted by l(r). If h_MDL(r) < l(r), i.e., if the encoding of the rule is longer than the encoding of the examples themselves, the rule is rejected. As l(r) depends solely on the encoding length (in bits) of the current rule, Foil's stopping criterion, like CN2's significance test, also depends on the size of the training set: the same rule that is too long for a smaller training set might be good enough for a larger training set, in which it covers more examples.

For the purposes of our analysis, we can neglect the rule length as a parameter and interpret h_MDL as a heuristic that is compared to a variable threshold whose size depends on the length of the rule. Figure 16 shows the behavior of h_MDL in PN-space. The isometric landscape is equivalent to the one for the minimum coverage criterion, namely lines parallel to the N-axis.
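A sketch of the resulting stopping rule; the rule length l(r) enters only as a given number of bits, matching the treatment above:

```python
from math import comb, log2

def h_mdl(p, P, N):
    """Bits needed to indicate which p of the P + N examples are covered."""
    return log2(P + N) + log2(comb(P + N, p))

def keep_rule(p, rule_length_bits, P, N):
    """Reject the rule if its encoding is longer than that of the examples."""
    return h_mdl(p, P, N) >= rule_length_bits

P, N = 50, 100                      # P < N: h_mdl is monotone in p here
print(h_mdl(10, P, N))              # about 57.3 bits ...
print(h_mdl(30, P, N))              # ... about 111.9 bits: grows with p
print(keep_rule(10, 60.0, P, N))    # False: acts as a p-threshold per length
```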

This is not surprising, considering that h_MDL is independent of n, the number of covered negative examples. For P < N, h_MDL is monotonically increasing with p. As a consequence, for every rule encoding length l(r) there exists a threshold $\bar{p}(l(r))$ such that for all rules r: $h_{MDL}(r) > l(r) \Leftrightarrow h_p(r) > \bar{p}(l(r))$. Thus, in this case, h_MDL and h_p are essentially equivalent. The more complicated formulation in the MDL framework basically serves the purpose of automatically mapping a rule to an appropriate threshold. It remains an open question whether there is a simpler mapping $r \mapsto \bar{p}(r)$ that allows one to obtain equivalent (or similar) thresholds without the expensive computation of l(r).

Particularly interesting is the case P > N (right graph of Figure 16): the isometric landscape is still the same, but the labels are no longer monotonically increasing. In fact, h_MDL has a maximum at the point p = (P+N)/2. Below this line (shown dashed in Figure 16), the function is monotonically increasing (as above), but above this line it starts to decrease again. Thus, we can formulate the following theorem.

THEOREM 7.1. h_MDL is compatible with h_p iff p ≤ (P+N)/2, and it is antagonistic to h_p for p > (P+N)/2.

Proof. a) p ≤ (P+N)/2:

$$h_{MDL} = \log_2(P+N) + \log_2\binom{P+N}{p} = \log_2(P+N) + \log_2\frac{\prod_{i=1}^{p}(P+N+1-i)}{\prod_{i=1}^{p} i} = \log_2(P+N) + \sum_{i=1}^{p}\left(\log_2(P+N+1-i) - \log_2 i\right)$$

The terms inside the sum are all constant and, for i ≤ (P+N)/2, positive, because then P+N+1-i > i. Thus, for two rules r_1 and r_2 with 0 ≤ h_p(r_1) < h_p(r_2) ≤ (P+N)/2, the corresponding sums only differ in the number of terms. As all terms are > 0,

$$h_p(r_1) < h_p(r_2) \Rightarrow h_{MDL}(r_1) < h_{MDL}(r_2)$$

b) p > (P+N)/2:

$$h_{MDL} - \log_2(P+N) = \log_2\binom{P+N}{p} = \log_2\binom{P+N}{P+N-p}$$

Therefore, each rule r with coverage p has the same evaluation as some rule r' with coverage p' = P+N-p < (P+N)/2. As the transformation $p \mapsto P+N-p$ is monotonically decreasing,

$$h_{MDL}(r_1) > h_{MDL}(r_2) \Leftrightarrow h_{MDL}(r_1') > h_{MDL}(r_2') \Leftrightarrow h_p(r_1') > h_p(r_2') \Leftrightarrow h_p(r_1) < h_p(r_2)$$
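The theorem is easy to verify numerically (restating h_mdl from the sketch above): for P > N the values peak at p = (P+N)/2 and decrease beyond it.

```python
from math import comb, log2

def h_mdl(p, P, N):
    return log2(P + N) + log2(comb(P + N, p))

P, N = 100, 20                  # P > N: the maximum lies inside PN-space
values = [h_mdl(p, P, N) for p in range(P + 1)]
peak = max(range(P + 1), key=lambda p: values[p])
print(peak, (P + N) // 2)       # 60 60: maximum at p = (P + N)/2
# Compatibility with h_p below the peak, antagonism above it:
print(values[30] < values[40])  # True: more positives, longer encoding
print(values[80] > values[90])  # True: more positives, *shorter* encoding
```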

COROLLARY 7.2. h_MDL is equivalent to h_p iff P ≤ N.

Proof. h_MDL reaches its maximum at p = (P+N)/2, which lies inside PN-space only if P > N; for P ≤ N, h_MDL is therefore monotonically increasing in p over the whole space, and the claim follows from Theorem 7.1.

Thus, for skewed class distributions with many positive examples, there might be cases where a rule r is acceptable, while a rule r' that has the same encoding length (l(r') = l(r)) and covers the same number or fewer negative examples (n(r') ≤ n(r)) but more positive examples (p(r') > p(r)) is not acceptable. For example, in the case shown on the right of Figure 16, for a certain rule length l, only rules that cover between 65% and 80% of the positive examples are acceptable. A rule of the same length that covers all positive and no negative examples would not be acceptable. This is very counter-intuitive and sheds some doubt upon the suitability of Foil's encoding length restriction for such domains.

7.5. FOSSIL'S CORRELATION CUTOFF

Figure 17. Illustration of Fossil's cutoff criterion.

Fossil (Fürnkranz, 1994) imposes a threshold upon its correlation heuristic (Section 5.2). Only rules that evaluate above this threshold are admitted. Figure 17 shows the effect of this threshold for the stopping value 0.3, which appears to perform fairly well over a variety of domains (Fürnkranz, 1997).
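A minimal sketch of this cutoff, assuming the correlation heuristic takes the usual four-field correlation (φ) form for the covered/uncovered versus positive/negative table:

```python
from math import sqrt

def h_corr(p, n, P, N):
    """Four-field (phi) correlation of rule coverage with the class."""
    denom = sqrt(P * N * (p + n) * (P + N - p - n))
    return (p * (N - n) - n * (P - p)) / denom if denom > 0 else 0.0

def admissible(p, n, P, N, cutoff=0.3):
    """Fossil admits only rules whose correlation exceeds the cutoff."""
    return h_corr(p, n, P, N) > cutoff

P, N = 100, 100
print(h_corr(40, 5, P, N), admissible(40, 5, P, N))   # ~0.42: kept
print(h_corr(12, 8, P, N), admissible(12, 8, P, N))   # ~0.07: cut off
```

If h_corr is indeed of this form, it is invariant under a joint scaling of p, n, P, and N, and h_chi2 = (P+N) · h_corr², which ties the cutoff back to the size-dependence discussion of Section 7.3.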
