An asymmetric entropy measure for decision trees
Simon Marcellin
Laboratoire ERIC, Université Lumière Lyon 2
5 av. Pierre Mendès-France, Bron Cedex, France
simon.marcellin@univ-lyon2.fr

Djamel A. Zighed
Laboratoire ERIC, Université Lumière Lyon 2
5 av. Pierre Mendès-France, Bron Cedex, France
zighed@univ-lyon2.fr

Gilbert Ritschard
THEMES, Department of Econometrics
Uni-Mail, 40 bd du Pont-d'Arve, room 5232, CH-1211 Geneva 4, Switzerland
Gilbert.Ritschard@themes.unige.ch

Abstract

In this paper we present a new entropy measure for growing decision trees. Its particularity is to be asymmetric, allowing the user to grow trees that better match his expectations in terms of recall and precision on each class. We then propose decision rules adapted to such trees. Experiments were carried out on real medical data from breast cancer screening units.

Keywords: decision trees, entropy, class imbalance, asymmetric learning, decision rules.

1 Introduction

Classical decision trees verify two properties linked to each other: on the one hand, the symmetry of the splitting criterion (Shannon entropy for ID3 [12] or the Gini measure for CART [4]), meaning that maximal uncertainty is reached in equiprobability situations; on the other hand, the application of the majority rule to assign a class to a leaf. But in many problems the distributions are not balanced, and the different classes do not have the same importance for the user [8]. This happens for example in marketing [5], medical computer-aided diagnosis or fraud detection [], where some classes are much more important than others. The users' requirements towards a classification model are thus different. To illustrate this problem, consider the example of computer-aided diagnosis of breast cancer. The aim is to label regions of digitized films as cancer or non-cancer. In this domain the cancer class is less represented in the datasets, but it is imperative to classify all of its examples correctly.
A standard decision tree considers that maximum uncertainty is reached in leaves containing 50% of cancer and 50% of non-cancer. Likewise, a leaf is classified as cancer as soon as the cancer proportion exceeds 50%, and as non-cancer otherwise. This example shows that such behaviour is not suitable for this kind of problem. Indeed, as a non-cancer decision may have serious consequences, we could decide to classify in this category only the leaves whose non-cancer proportion exceeds 90%. On the other hand, finding all cancers is so important that we could decide to classify leaves as cancer from a proportion of 20%. Several approaches deal with this problem; they can be grouped into four categories []. First, the cost-sensitive approaches penalize some types of errors by reweighting the examples of the concerned class in the confusion matrix. Sampling methods over-represent the minority class or under-represent the majority class [, 4]. Wrapper methods such as MetaCost, which produces several instances of a classifier by bootstrap, re-labels each example by vote and builds another model using the new labels, have also been proposed [7]. Finally, other methods, closer to ours, include a bias directly in the classifier, in particular by using a cost function instead of an entropy criterion [9, 6].

Figure 1: Examples of standard entropy measures (Shannon and quadratic) for a two-class problem

Concerning decision trees more specifically, it is also possible to modify the decision rules at the leaves. This does not change the produced tree, only the decision made at each leaf. In order to obtain an entirely coherent model, we propose to adapt the splitting criterion to these rules, so that the situation of maximum uncertainty corresponds to the situation of indecision in the decision rules. For example, in our two-class problem, if we classify a leaf as cancer when the frequency of cancers exceeds 20%, and as non-cancer when their frequency exceeds 90%, maximal uncertainty must be reached when the proportion of the cancer class lies between 10% and 20%. Section 2 shows how we determined an asymmetric uncertainty measure. Section 3 presents our asymmetric entropy. Section 4 explains how to adapt the decision rules to this asymmetric entropy measure. Section 5 details the results achieved on a real medical dataset, within the framework of a computer-aided breast cancer diagnosis system, and on two standard datasets from the UCI repository [9]. Finally, Section 6 concludes and proposes extensions of our work.

2 Asymmetric uncertainty measure

The different splitting criteria that have been proposed are symmetric, meaning that maximal uncertainty is reached in equiprobability situations [5]. Yet we need maximal uncertainty to be reached at a given probability, noted w. This criterion should verify the properties of an entropy, especially concavity. More formally, we look for a function h_w of p (the frequency of the considered class in a leaf) with a parameter w (the value of p for which the uncertainty is maximal), where p ∈ [0, 1] and w ∈ [0, 1]. This function should respect the entropy properties, except that the maximum should be reached at p = w instead of p = 1/2.
So the constraints are [5]:

Minimality:
  h_w(p = 0) = 0    (1)
  h_w(p = 1) = 0    (2)

Strict concavity:
  d²h_w(p)/dp² < 0    (3)

Maximality:
  dh_w/dp (p = w) = 0    (4)

We assume that there exists a rational function verifying those conditions:

  h_w(p) = (ap² + bp + c) / (dp + e)

To remove a degree of freedom we may add a constraint on the maximum value:

  h_w(p = w) = 1    (5)

Those constraints lead to the following function:

  h_w(p) = (−p² + p) / ((−2w + 1)p + w²) = p(1 − p) / ((1 − 2w)p + w²)

With this function we can represent the contribution of one class to the global uncertainty of a node. It verifies hypotheses (1) to (5).
Figure 2: Asymmetric uncertainty measure for a given w
Figure 3: Two-class asymmetric entropy

3 Asymmetric entropy

We have thus determined an uncertainty function. To obtain the asymmetric entropy of a node in a k-class problem, we propose to use the sum

  H_W(P) = Σ_{i=1..k} h_{w_i}(p_i),  with P = [p_1, ..., p_k] and W = [w_1, ..., w_k],

and then to normalize it between 0 and 1. This function has the following properties:

Minimality: H_W(P) = 0 if ∃i such that p_i = 1: the leaf is pure.

Strict concavity: as a sum of strictly concave functions, H_W(P) is strictly concave.

Maximality: the unique solution P* = [p*_1, ..., p*_k] maximizing H_W(P) under the constraint Σ_i p_i = 1 exists. We only give here an approximation of its position. The maximum of the asymmetric entropy function is located in the region defined as follows:

  p*_i > w_i ∀i  if Σ_{i=1..k} w_i < 1
  p*_i < w_i ∀i  if Σ_{i=1..k} w_i > 1
  p*_i = w_i ∀i  if Σ_{i=1..k} w_i = 1

Figure 4 shows the value of the entropy for a three-class problem, as a function of p_1 and p_2 (the third class probability is implicit because p_3 = 1 − p_1 − p_2). We can now grow a decision tree taking into account the user's requirements in terms of recall and precision for each class. At each new split of the decision tree, we try to get rid of the uncertainty situations represented by our asymmetric entropy measure. The aim of the asymmetric entropy is to produce a tree tailored as well as possible to the decision rules provided by the user. To be coherent, the parameters should be set so as to reflect those rules. The next section describes the way those rules are managed.

4 Impact on decision rules

How should we classify a leaf having the distribution P = [p_1, ..., p_k]? Standard trees set the class C with the majority rule:

  (R1) C = i if p_i > p_j ∀j ≠ i

According to the construction of our entropy measure, we propose the rule:

  (R2) C = i if p_i > w_i
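The node entropy can be sketched in a few lines of Python. The paper states that the sum is normalized between 0 and 1 without specifying the scheme; dividing by k is one simple assumption, since each term h_{w_i}(p_i) is at most 1 (the names below are ours):

```python
def h_w(p, w):
    # Per-class asymmetric uncertainty: p(1-p) / ((1-2w)p + w^2).
    return p * (1.0 - p) / ((1.0 - 2.0 * w) * p + w * w)

def asymmetric_entropy(P, W):
    """H_W(P): sum of h_{w_i}(p_i), brought back into [0, 1] by dividing by k."""
    assert abs(sum(P) - 1.0) < 1e-9, "P must be a probability distribution"
    return sum(h_w(p, w) for p, w in zip(P, W)) / len(P)

# Minimality: a pure leaf has zero entropy.
print(asymmetric_entropy([1.0, 0.0], [0.2, 0.8]))   # 0.0
# When sum(W) = 1, the normalized maximum 1 is reached exactly at P = W.
print(asymmetric_entropy([0.2, 0.8], [0.2, 0.8]))   # 1.0
```

With w_i = 1/2 for every class, the criterion behaves like a standard symmetric impurity, maximal at equiprobability.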
Figure 4: Three-class asymmetric entropy
Figure 5: Decision rules for a two-class problem

Nevertheless, this rule is not sufficient: conflicts may occur when more than one class satisfies rule R2 at the same time (see Figure 6). Moreover, if Σ_{i=1..k} w_i > 1, we face situations where rule R2 is satisfied for no class at all (see Figure 5). When R2 is satisfied for several classes, we must determine the winning class. Different rules may be proposed, according to the considered problem: assign the leaf to the class i having the smallest constraint (min(W) = w_i, to privilege the recall rate on the minority class), or to the class having the largest ratio p_i/w_i. Nevertheless, it seems important to us that the rules and the entropy be coherent, i.e. that the frontier point between the choice of one class or another corresponds to maximal entropy. We therefore propose the following method. A leaf is litigious if more than one class satisfies rule R2: p_i > w_i. Those classes form the set of candidates S. To decide, we compare all pairs of distinct classes a and b of S. For each pair, we determine the winning class. As we will see, the choice is based on the entropy. Its concavity implies that, except in indeterminate cases, one class dominates all the others.

Figure 6: Decision rules for a three-class problem
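Rule R2 and its two failure modes (no candidate when Σw_i > 1, several candidates otherwise) can be sketched as follows; the function names are ours, and litigious leaves are left to the crest-based tie-break described in the text:

```python
def candidates(P, W):
    """Classes satisfying rule R2: p_i > w_i."""
    return [i for i, (p, w) in enumerate(zip(P, W)) if p > w]

def classify_leaf(P, W):
    S = candidates(P, W)
    if not S:
        return None          # no class satisfies R2: leave unclassified
    if len(S) == 1:
        return S[0]          # a single candidate: assign it
    return S                 # litigious leaf: needs tie-breaking

# With sum(W) > 1, some distributions satisfy R2 for no class at all.
W = [0.2, 0.9]
print(classify_leaf([0.15, 0.85], W))  # None (0.15 <= 0.2 and 0.85 <= 0.9)
print(classify_leaf([0.30, 0.70], W))  # 0

# With sum(W) < 1, several classes can satisfy R2 simultaneously.
print(classify_leaf([0.40, 0.60], [0.2, 0.3]))  # [0, 1]
```

The `None` case is exactly the abstention region of Figure 5, and the list case corresponds to the conflict regions of Figure 6.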
We look for the distribution over p_a and p_b which provides maximal entropy, setting all other probabilities to their observed values p_i = α_i:

  H = Σ_{i≠a,b} h_{w_i}(α_i) + h_{w_a}(p_a) + h_{w_b}(p_b)

If we note δ = Σ_{i≠a,b} p_i, so that p_b = 1 − p_a − δ, we obtain:

  H = Σ_{i≠a,b} h_{w_i}(α_i) + h_{w_a}(p_a) + h_{w_b}(1 − p_a − δ)

This expression is a function of p_a only. We look for its maximum, i.e.:

  dH/dp_a = d[h_{w_a}(p_a) + h_{w_b}(1 − p_a − δ)]/dp_a = 0

The single solution p*_a exists. We can prove that p*_a ∈ (w_a, 1 − δ − w_b) (see the maximality property of the asymmetric entropy). Then we consider three situations:

  If p_a > p*_a, class a wins.
  If p_a < p*_a, class b wins.
  If p_a = p*_a: indeterminate.

Unless we are in an indeterminate situation, one and only one class dominates all the others within the set S. This class is assigned to the leaf. Figures 7 and 8 illustrate this rule. In Figure 8, to decide in the areas where R2 is satisfied for several classes, we consider the indecision curves given by the crests of the entropy as defined above. The choice depends on the side of the crest where the considered distribution is located. Thus, using the asymmetric entropy can produce leaves for which no decision can be made, and hence leave some examples unclassified. This asymmetric entropy may be compared to cost measures [], but our measure respects the expected properties of an entropy function. Moreover, its parameters are set intelligibly.

Figure 7: Decision rules for a three-class problem
Figure 8: Indecision curves ("crests") of the entropy for a three-class problem
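The pairwise tie-break can be implemented numerically: locate the maximizer p*_a of h_{w_a}(p_a) + h_{w_b}(1 − p_a − δ) and compare it with the observed p_a. The sketch below (names are ours) uses a ternary search as a numerical stand-in for solving dH/dp_a = 0, which is valid because the restricted objective is strictly concave:

```python
def h_w(p, w):
    # Per-class asymmetric uncertainty: p(1-p) / ((1-2w)p + w^2).
    return p * (1.0 - p) / ((1.0 - 2.0 * w) * p + w * w)

def pairwise_winner(P, W, a, b):
    """Winner among classes a and b of a litigious leaf (entropy-crest rule)."""
    delta = sum(p for i, p in enumerate(P) if i not in (a, b))

    def f(pa):  # entropy restricted to the (a, b) pair
        return h_w(pa, W[a]) + h_w(1.0 - pa - delta, W[b])

    # Ternary search for the maximizer p*_a on [0, 1 - delta];
    # f is strictly concave, so it converges to the unique maximum.
    lo, hi = 0.0, 1.0 - delta
    for _ in range(200):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    pa_star = (lo + hi) / 2

    if abs(P[a] - pa_star) < 1e-9:
        return None                     # indeterminate: exactly on the crest
    return a if P[a] > pa_star else b

# Both classes satisfy R2 (0.4 > 0.2 and 0.6 > 0.3); the crest decides.
print(pairwise_winner([0.4, 0.6], [0.2, 0.3], 0, 1))  # 1
```

For a litigious leaf with more than two candidates, this comparison is applied to every pair of classes in S; by concavity, a single class dominates unless an indeterminate case occurs.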
5 Experiments

We experimented our method on real data from breast cancer screening units. The aim is to detect tumours on digitized mammograms. Several hundred films were annotated by radiologists. Those films were then segmented with imaging methods in order to obtain regions of interest. Those zones were labelled as cancer if they correspond to a radiologist's annotation, and as non-cancer otherwise. Finally, a large number of features (based on grey-level histogram, shape and texture) were computed. The aim is to build a model able to separate cancers from non-cancers (see the description of the mammo dataset in Table 1).

Table 1: Description of the mammo dataset
  Dataset   # of features   # of examples   % of cancer
  Mammo                                     %

To allow comparisons with other approaches, we also tested our method on two standard machine learning datasets [9, 6]. On these imbalanced datasets we tried to obtain the best recall on the minority class while keeping a correct precision on this class.

Table 2: Description of the UCI datasets
  Dataset       # of feat.   # of ex.   % of min. class
  Hypothyroid                           %
  Satimage                              %

With each dataset we built three series of classifiers using 10-fold cross-validation: standard C4.5 [13] and Random Forest [3, 2]; C4.5 and Random Forest using a cost matrix with the WEKA software [14]; and a tree and a Random Forest with the asymmetric entropy criterion we propose (on each dataset, the different Random Forests are built with the same number of trees and features). We also compare our results with those obtained by Chen and Liaw [6]. For each test we report the recall and precision rates on the minority class. As we only present two-class problems, the results on the majority class are implicit, so we do not give them here.

Table 3: Parameters for mammo
  Method               Parameters
  C4.5                 -
  Random Forest        trees, 28 features
  C4.5 with cost (1)   cost :
  C4.5 with cost (2)   cost :2
  RF with cost         cost :35
  Asymmetric tree      w_1 = , w_2 =
  Asymmetric RF        w_1 = , w_2 = .
Table 4: Results for mammo
  Method               Recall / Precision
  C4.5                 3
  Random Forest        . .
  C4.5 with cost (1)   .8
  C4.5 with cost (2)   . 5
  RF with cost         7 .4
  Asymmetric tree      .
  Asymmetric RF        8 .

For the asymmetric methods and the cost matrix we used several sets of parameters. Tables 4, 5 and 6 present the best results for each method. The best results are always obtained with an asymmetric Random Forest. For the real-data mammo dataset, asymmetric entropy with a single decision tree does not improve the results (cost-sensitive C4.5 is better), but it brings a real enhancement with the Random Forest.

Table 5: Parameters for hypothyroid
  Method            Parameters
  C4.5              -
  Random Forest     3 trees, 4 features
  C4.5 with cost    cost :5
  RF with cost      cost :5
  Asymmetric tree   w_1 = , w_2 = .
  Asymmetric RF     w_1 = , w_2 = .
  BRF               cutoff =
  WRF               weight = :5

For the hypothyroid dataset, we also present the results obtained with Balanced Random Forest and Weighted Random Forest [6].

Table 6: Results for hypothyroid
  Method            Recall / Precision
  C4.5
  Random Forest     6 6
  C4.5 with cost    8 6
  RF with cost      9 6
  Asymmetric tree   8 7
  Asymmetric RF     . 8
  BRF               5 3
  WRF               3 3

On this dataset, the asymmetric Random Forest dominates all other methods, whatever the parameters.

Table 7: Parameters for satimage
  Method                 Parameters
  C4.5                   -
  Random Forest          3 trees, 6 features
  C4.5 with cost (1)     cost :5
  C4.5 with cost (2)     cost :5
  RF with cost (1)       cost :2
  RF with cost (2)       cost :5
  Asymmetric tree (1)    w_1 = 8, w_2 = .
  Asymmetric tree (2)    w_1 = , w_2 =
  Asymmetric RF (1)      w_1 = 8, w_2 = .
  Asymmetric RF (2)      w_1 = , w_2 =
  BRF (1)                cutoff =
  BRF (2)                cutoff =
  WRF (1)                (not specified)
  WRF (2)                (not specified)

Table 8: Results for satimage
  Method                 Recall / Precision
  C4.5
  Random Forest          4 4
  C4.5 with cost (1)     2 9
  C4.5 with cost (2)     2
  RF with cost (1)       5 3
  RF with cost (2)       4 3
  Asymmetric tree (1)    3
  Asymmetric tree (2)    9
  Asymmetric RF (1)      9 9
  Asymmetric RF (2)      6
  BRF (1)                7 4
  BRF (2)                7 6
  WRF (1)                9 9
  WRF (2)                7

With the satimage dataset, we chose the minority class and merged the five others to create the majority class. We also present BRF and WRF results. We can observe in Figure 9 that, for the satimage dataset, the asymmetric Random Forest again dominates the other methods. The Random Forest with a cost matrix is very close to WRF and BRF. Finally, we notice that on this dataset the asymmetric tree is slightly better than C4.5 with a cost matrix.

Figure 9: Recall and precision for satimage with different parameters

6 Conclusion and future works

We propose an asymmetric entropy measure for decision trees. By using it with adapted decision rules, we can grow trees that take into account the user's specifications in terms of recall and precision for each class, and thus obtain better results on imbalanced datasets. Note that this method entails no additional computational complexity, and that its parameters can be set intelligibly by the user. In the future we plan to improve our work on several points.
In particular, the entropy that we propose may be enhanced, notably the aggregation of the per-class uncertainties (we currently use a sum). Experiments on problems with more than two classes will be carried out. We will also conduct a theoretical comparison between this approach and those that introduce costs into the splitting criterion. Indeed, since those two approaches can be expressed in the same way, it seems to us that using costs introduces ruptures in the concavity of the splitting criterion, which could produce sub-optimal trees.

Acknowledgements

This work was carried out within a thesis co-financed by the French Ministry of Research and Industry. We would like to thank the ARDOC breast cancer screening unit and the senology unit of Centre-République (Clermont-Ferrand, France) for their expertise and the mammography data. We also thank Pierre-Emmanuel Jouve, Julien Thomas and Jérémy Clech (Fenics company, Lyon, France) for their help and advice.

References

[1] R. Barandela, J. S. Sanchez, et al. (2003). Strategies for learning in class imbalance problems. Pattern Recognition, 36(3).
[2] L. Breiman (2001). Random Forests. Machine Learning, 45(1).
[3] L. Breiman (2002). Looking inside the black box. WALD Lectures, 277th meeting of the Institute of Mathematical Statistics, Banff, Alberta, Canada.
[4] L. Breiman, J. H. Friedman, et al. (1984). Classification and Regression Trees. Belmont, Wadsworth.
[5] J.-H. Chauchat, R. Rakotomalala, et al. (2001). Targeting Customer Groups using Gain and Cost Matrix: a Marketing Application. In Proceedings of the Data Mining and Marketing Workshop, 5th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'01).
[6] C. Chen, A. Liaw, et al. (2004). Using Random Forest to Learn Imbalanced Data. Technical Report, Department of Statistics, University of California, Berkeley.
[7] P. Domingos (1999). MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining (KDD-99).
[8] C. Elkan (2001). The Foundations of Cost-Sensitive Learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI'01).
[9] S. Hettich and S. D. Bay (1999). The UCI KDD Archive. University of California, Department of Information and Computer Science, Irvine, California, USA.
[10] C. X. Ling, Q. Yang, et al. (2004). Decision trees with minimal costs. In Proceedings of the Twenty-First International Conference on Machine Learning, Banff, Alberta, Canada.
[11] F. Provost (2000). Learning with Imbalanced Data Sets. Invited paper for the AAAI 2000 Workshop on Imbalanced Data Sets.
[12] J. R. Quinlan (1986). Induction of Decision Trees. Machine Learning, 1(1), pages 81-106.
[13] J. R. Quinlan (1993). C4.5: Programs for Machine Learning. San Francisco, Morgan Kaufmann.
[14] I. H. Witten and E. Frank (2005). Data Mining: Practical Machine Learning Tools and Techniques. San Francisco, Morgan Kaufmann.
[15] D. Zighed and R. Rakotomalala (2000). Graphes d'induction: Apprentissage et Data Mining. Paris, Hermès.
More informationA Bayesian approach to estimate probabilities in classification trees
A Bayesian approach to estimate probabilities in classification trees Andrés Cano and Andrés R. Masegosa and Serafín Moral Department of Computer Science and Artificial Intelligence University of Granada
More informationRandom Forests: Finding Quasars
This is page i Printer: Opaque this Random Forests: Finding Quasars Leo Breiman Michael Last John Rice Department of Statistics University of California, Berkeley 0.1 Introduction The automatic classification
More informationLecture 3: Decision Trees
Lecture 3: Decision Trees Cognitive Systems - Machine Learning Part I: Basic Approaches of Concept Learning ID3, Information Gain, Overfitting, Pruning last change November 26, 2014 Ute Schmid (CogSys,
More informationMachine Learning on temporal data
Machine Learning on temporal data Classification rees for ime Series Ahlame Douzal (Ahlame.Douzal@imag.fr) AMA, LIG, Université Joseph Fourier Master 2R - MOSIG (2011) Plan ime Series classification approaches
More informationEVALUATING RISK FACTORS OF BEING OBESE, BY USING ID3 ALGORITHM IN WEKA SOFTWARE
EVALUATING RISK FACTORS OF BEING OBESE, BY USING ID3 ALGORITHM IN WEKA SOFTWARE Msc. Daniela Qendraj (Halidini) Msc. Evgjeni Xhafaj Department of Mathematics, Faculty of Information Technology, University
More informationLBR-Meta: An Efficient Algorithm for Lazy Bayesian Rules
LBR-Meta: An Efficient Algorithm for Lazy Bayesian Rules Zhipeng Xie School of Computer Science Fudan University 220 Handan Road, Shanghai 200433, PR. China xiezp@fudan.edu.cn Abstract LBR is a highly
More informationDyadic Classification Trees via Structural Risk Minimization
Dyadic Classification Trees via Structural Risk Minimization Clayton Scott and Robert Nowak Department of Electrical and Computer Engineering Rice University Houston, TX 77005 cscott,nowak @rice.edu Abstract
More informationIntroduction to Machine Learning CMU-10701
Introduction to Machine Learning CMU-10701 23. Decision Trees Barnabás Póczos Contents Decision Trees: Definition + Motivation Algorithm for Learning Decision Trees Entropy, Mutual Information, Information
More informationAn Overview of Alternative Rule Evaluation Criteria and Their Use in Separate-and-Conquer Classifiers
An Overview of Alternative Rule Evaluation Criteria and Their Use in Separate-and-Conquer Classifiers Fernando Berzal, Juan-Carlos Cubero, Nicolás Marín, and José-Luis Polo Department of Computer Science
More informationPattern-Based Decision Tree Construction
Pattern-Based Decision Tree Construction Dominique Gay, Nazha Selmaoui ERIM - University of New Caledonia BP R4 F-98851 Nouméa cedex, France {dominique.gay, nazha.selmaoui}@univ-nc.nc Jean-François Boulicaut
More informationCS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING. Santiago Ontañón
CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING Santiago Ontañón so367@drexel.edu Summary so far: Rational Agents Problem Solving Systematic Search: Uninformed Informed Local Search Adversarial Search
More informationData Mining und Maschinelles Lernen
Data Mining und Maschinelles Lernen Ensemble Methods Bias-Variance Trade-off Basic Idea of Ensembles Bagging Basic Algorithm Bagging with Costs Randomization Random Forests Boosting Stacking Error-Correcting
More informationSelected Algorithms of Machine Learning from Examples
Fundamenta Informaticae 18 (1993), 193 207 Selected Algorithms of Machine Learning from Examples Jerzy W. GRZYMALA-BUSSE Department of Computer Science, University of Kansas Lawrence, KS 66045, U. S. A.
More informationChapter 6: Classification
Chapter 6: Classification 1) Introduction Classification problem, evaluation of classifiers, prediction 2) Bayesian Classifiers Bayes classifier, naive Bayes classifier, applications 3) Linear discriminant
More informationShort Note: Naive Bayes Classifiers and Permanence of Ratios
Short Note: Naive Bayes Classifiers and Permanence of Ratios Julián M. Ortiz (jmo1@ualberta.ca) Department of Civil & Environmental Engineering University of Alberta Abstract The assumption of permanence
More informationIndex of Balanced Accuracy: A Performance Measure for Skewed Class Distributions
Index of Balanced Accuracy: A Performance Measure for Skewed Class Distributions V. García 1,2, R.A. Mollineda 2, and J.S. Sánchez 2 1 Lab. Reconocimiento de Patrones, Instituto Tecnológico de Toluca Av.
More informationAbstract. 1 Introduction. Ian H. Witten Department of Computer Science University of Waikato Hamilton, New Zealand
Í Ò È ÖÑÙØ Ø ÓÒ Ì Ø ÓÖ ØØÖ ÙØ Ë Ð Ø ÓÒ Ò ÓÒ ÌÖ Eibe Frank Department of Computer Science University of Waikato Hamilton, New Zealand eibe@cs.waikato.ac.nz Ian H. Witten Department of Computer Science University
More informationFrom statistics to data science. BAE 815 (Fall 2017) Dr. Zifei Liu
From statistics to data science BAE 815 (Fall 2017) Dr. Zifei Liu Zifeiliu@ksu.edu Why? How? What? How much? How many? Individual facts (quantities, characters, or symbols) The Data-Information-Knowledge-Wisdom
More informationEnsembles of Classifiers.
Ensembles of Classifiers www.biostat.wisc.edu/~dpage/cs760/ 1 Goals for the lecture you should understand the following concepts ensemble bootstrap sample bagging boosting random forests error correcting
More informationUnsupervised Classification via Convex Absolute Value Inequalities
Unsupervised Classification via Convex Absolute Value Inequalities Olvi L. Mangasarian Abstract We consider the problem of classifying completely unlabeled data by using convex inequalities that contain
More informationClassification: Decision Trees
Classification: Decision Trees Outline Top-Down Decision Tree Construction Choosing the Splitting Attribute Information Gain and Gain Ratio 2 DECISION TREE An internal node is a test on an attribute. A
More informationFinite Mixture Model of Bounded Semi-naive Bayesian Networks Classifier
Finite Mixture Model of Bounded Semi-naive Bayesian Networks Classifier Kaizhu Huang, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin,
More informationQuestion of the Day. Machine Learning 2D1431. Decision Tree for PlayTennis. Outline. Lecture 4: Decision Tree Learning
Question of the Day Machine Learning 2D1431 How can you make the following equation true by drawing only one straight line? 5 + 5 + 5 = 550 Lecture 4: Decision Tree Learning Outline Decision Tree for PlayTennis
More informationMachine Learning Recitation 8 Oct 21, Oznur Tastan
Machine Learning 10601 Recitation 8 Oct 21, 2009 Oznur Tastan Outline Tree representation Brief information theory Learning decision trees Bagging Random forests Decision trees Non linear classifier Easy
More informationEnsemble Methods and Random Forests
Ensemble Methods and Random Forests Vaishnavi S May 2017 1 Introduction We have seen various analysis for classification and regression in the course. One of the common methods to reduce the generalization
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Intelligent Data Analysis. Decision Trees
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Intelligent Data Analysis Decision Trees Paul Prasse, Niels Landwehr, Tobias Scheffer Decision Trees One of many applications:
More informationClassification Based on Logical Concept Analysis
Classification Based on Logical Concept Analysis Yan Zhao and Yiyu Yao Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2 E-mail: {yanzhao, yyao}@cs.uregina.ca Abstract.
More informationData Mining and Knowledge Discovery: Practice Notes
Data Mining and Knowledge Discovery: Practice Notes dr. Petra Kralj Novak Petra.Kralj.Novak@ijs.si 7.11.2017 1 Course Prof. Bojan Cestnik Data preparation Prof. Nada Lavrač: Data mining overview Advanced
More informationStatistics and learning: Big Data
Statistics and learning: Big Data Learning Decision Trees and an Introduction to Boosting Sébastien Gadat Toulouse School of Economics February 2017 S. Gadat (TSE) SAD 2013 1 / 30 Keywords Decision trees
More informationP leiades: Subspace Clustering and Evaluation
P leiades: Subspace Clustering and Evaluation Ira Assent, Emmanuel Müller, Ralph Krieger, Timm Jansen, and Thomas Seidl Data management and exploration group, RWTH Aachen University, Germany {assent,mueller,krieger,jansen,seidl}@cs.rwth-aachen.de
More informationVoting Massive Collections of Bayesian Network Classifiers for Data Streams
Voting Massive Collections of Bayesian Network Classifiers for Data Streams Remco R. Bouckaert Computer Science Department, University of Waikato, New Zealand remco@cs.waikato.ac.nz Abstract. We present
More informationMachine Learning 2nd Edition
ETHEM ALPAYDIN, modified by Leonardo Bobadilla and some parts from http://wwwcstauacil/~apartzin/machinelearning/ and wwwcsprincetonedu/courses/archive/fall01/cs302 /notes/11/emppt The MIT Press, 2010
More informationML techniques. symbolic techniques different types of representation value attribute representation representation of the first order
MACHINE LEARNING Definition 1: Learning is constructing or modifying representations of what is being experienced [Michalski 1986], p. 10 Definition 2: Learning denotes changes in the system That are adaptive
More informationI D I A P. Online Policy Adaptation for Ensemble Classifiers R E S E A R C H R E P O R T. Samy Bengio b. Christos Dimitrakakis a IDIAP RR 03-69
R E S E A R C H R E P O R T Online Policy Adaptation for Ensemble Classifiers Christos Dimitrakakis a IDIAP RR 03-69 Samy Bengio b I D I A P December 2003 D a l l e M o l l e I n s t i t u t e for Perceptual
More informationTheoretical comparison between the Gini Index and Information Gain criteria
Annals of Mathematics and Artificial Intelligence 41: 77 93, 2004. 2004 Kluwer Academic Publishers. Printed in the Netherlands. Theoretical comparison between the Gini Index and Information Gain criteria
More informationRandomized Decision Trees
Randomized Decision Trees compiled by Alvin Wan from Professor Jitendra Malik s lecture Discrete Variables First, let us consider some terminology. We have primarily been dealing with real-valued data,
More informationNotes on Machine Learning for and
Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Learning = improving with experience Improve over task T (e.g, Classification, control tasks) with respect
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Decision Trees Tobias Scheffer Decision Trees One of many applications: credit risk Employed longer than 3 months Positive credit
More informationClassification using stochastic ensembles
July 31, 2014 Topics Introduction Topics Classification Application and classfication Classification and Regression Trees Stochastic ensemble methods Our application: USAID Poverty Assessment Tools Topics
More informationAdvanced Statistical Methods: Beyond Linear Regression
Advanced Statistical Methods: Beyond Linear Regression John R. Stevens Utah State University Notes 3. Statistical Methods II Mathematics Educators Worshop 28 March 2009 1 http://www.stat.usu.edu/~jrstevens/pcmi
More informationDecision Tree. Decision Tree Learning. c4.5. Example
Decision ree Decision ree Learning s of systems that learn decision trees: c4., CLS, IDR, ASSISA, ID, CAR, ID. Suitable problems: Instances are described by attribute-value couples he target function has
More informationDecision Trees. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. February 5 th, Carlos Guestrin 1
Decision Trees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February 5 th, 2007 2005-2007 Carlos Guestrin 1 Linear separability A dataset is linearly separable iff 9 a separating
More informationDecision Tree Learning
Decision Tree Learning Goals for the lecture you should understand the following concepts the decision tree representation the standard top-down approach to learning a tree Occam s razor entropy and information
More informationDecision Tree Learning
Decision Tree Learning Goals for the lecture you should understand the following concepts the decision tree representation the standard top-down approach to learning a tree Occam s razor entropy and information
More informationDepartment of Computer Science, University of Waikato, Hamilton, New Zealand. 2
Predicting Apple Bruising using Machine Learning G.Holmes 1, S.J.Cunningham 1, B.T. Dela Rue 2 and A.F. Bollen 2 1 Department of Computer Science, University of Waikato, Hamilton, New Zealand. 2 Lincoln
More informationModels, Data, Learning Problems
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Models, Data, Learning Problems Tobias Scheffer Overview Types of learning problems: Supervised Learning (Classification, Regression,
More information