Use of Dempster-Shafer theory to combine classifiers which use different class boundaries

Mohammad Reza Ahmadzadeh* and Maria Petrou

Centre for Vision, Speech and Signal Processing, School of Electronics, Computing and Mathematics, University of Surrey, Guildford GU2 7XH, UK
Tel: +44 1483 6901, Fax: +44 1483 66031, E-mail: M.Petrou@surrey.ac.uk

*Currently with the School of Electronic Technology, Shiraz University, Shiraz, Iran. Tel: +98 711 7266262, Fax: +98 711 7262102, E-mail: Ahmadzadeh@sanat.shirazu.ac.ir

Abstract

In this paper we present the Dempster-Shafer theory as a framework within which the results of a Bayesian network classifier and a fuzzy logic-based classifier are combined to produce a better final classification. We deal with the case where the two original classifiers use different classes for the outcome. The problem of different classes is solved by using a superset of finer classes which can be combined to produce the classes of either of the two classifiers. Within the Dempster-Shafer formalism, not only can the problem of different numbers of classes be solved, but the relative reliability of the classifiers can also be taken into consideration.

Keywords: Classifier Combination, Dempster-Shafer Theory, Bayesian Networks, Fuzzy Logic, Expert Rules

1 Introduction

It has been established recently that combining classifiers improves the classification accuracy for many problems. This has been established both theoretically, mainly within the framework of probability theory [7], and experimentally by many researchers. In addition, in the neural network field several approaches have been introduced to combine individual neural networks to improve the overall accuracy and performance.

Methods of combining multiple neural network classifiers can be divided into two categories: ensemble and modular [14], [15]. In the ensemble-based approaches, the number of output classes is the same for all classifiers. Each classifier provides a complete solution to the problem, and the combination of classifiers is used to improve the classification rate. In the modular approach, on the other hand, a problem is broken into several simple sub-problems. For example, a problem with 5 output classes can be changed into several sub-problems with 2 output classes each. Each sub-problem can be solved using a neural network, and the combination of all classifiers provides a solution to the original problem. In this approach no single classifier provides a solution to the problem; all classifiers together are used complementarily to find the final classification.

Combination of classifiers has also been investigated extensively when other types of classifier are used. Generally, classifiers can be combined at different levels: the abstract level, the ranking level and the measurement level [19], [6]. At the abstract level, only the top choice of each classifier is used for the purpose of combination. At the measurement level, complete information about the outputs of the classifiers, e.g. a score for each possible output, is available and is used in the combination process. Although combination at the abstract level uses the least information (only the top choice of each classifier), it has been used frequently because all kinds of classifiers, such as statistical and syntactic ones, can be combined easily [19]. The ranking level approach may also be used to combine classifiers of any type, at the expense of decision detail. It can be used to combine not only classifiers whose outputs are class rankings, or confidence measures that can easily be converted into class rankings, but also classifiers that output single classes. The latter, however, is achieved by coarsening the output of the other classifiers to comply with the classifier that outputs the least information [5]. Approaches that combine classifiers at the measurement level can combine any kind of classifiers that output measurements, but for the purpose of combination these measurements should be translated into the same kind of measurement. For example, a classifier which supplies information at its output in the form of distances cannot be directly combined with a classifier which outputs posterior probabilities.

Xu et al. in [19] used three methods of combining classifiers, all at the abstract level. They combined classifiers by using the Bayesian formalism, the voting principle and the Dempster-Shafer formalism. Their results on a case study showed that the Dempster-Shafer theory of evidence gave the best results in comparison with the other methods. They used the recognition rate and the substitution rate of the classifiers to define the mass functions.

The mass function of the selected output (top choice) was defined from the recognition rate, and the mass function of the other outputs (the complement of the selected output) was defined from the substitution rate. If the sum of the recognition rate and the substitution rate was less than 100%, the remainder was called the rejection rate and was assigned to the frame of discernment. This remainder indicated that the classifier was not able to decide, so it was interpreted as lack of information according to Dempster-Shafer theory.

Rogova in [10] used the Dempster-Shafer theory of evidence to combine neural network classifiers. All classifiers have the same number of outputs, so the frame of discernment, $\Theta$, is the same for all classifiers. Hypothesis $\theta_k$ represents the proposition that the input is of class $k$. Let the $j$th classifier be denoted by $e_j$, the input vector by $x$, and the corresponding output vector by $y_j(x)$; further, let the mean vector of the outputs of classifier $e_j$, when the input is an element of the training set for class $k$, be denoted by $\bar{y}_j^k$. A proximity measure $\phi_j^k$ can be defined using $\bar{y}_j^k$ and $y_j(x)$. Rogova used such a proximity measure to define the mass functions, and investigated various proximity measures in order to identify the best one. For any classifier $e_j$ and each class $k$, the proximity measure was defined to represent the support for hypothesis $\theta_k$; evidence against $\theta_k$, i.e. for $\neg\theta_k$, was represented in the same way. The proximity measures, treated as mass functions of simple support functions, were combined using a simplified version of Dempster's rule of combination. Having combined the evidence from all classifiers, Dempster's rule of combination was used again to find the total confidence for each class, and the class with the maximum confidence was singled out as the output of the classification. Rogova claimed that this method of combining classifiers could reduce misclassification by 15-30% compared with the best individual classifiers [10].

There are other classifier combination approaches in the literature, some of which were compared in [18]. The average [17], the weighted average, the Borda count [2], the fuzzy integral, the fuzzy connectives [9], the fuzzy templates and neural network approaches are among those which have been investigated. For an up-to-date overview of this research area, see for example the collection of papers in [8].

In all studies found in the literature so far, the classifiers combined are expected to use the same classes to classify the objects in question. In this paper we address the problem of different classes which nevertheless span the same classification space. Some clarification is necessary here.

When the classes of objects are expressed by nouns, they are discrete and uniquely defined. Examples of such classes are chair, door, etc. However, there are problems where the classification of objects refers to some of their properties which may vary continuously. In such cases the defined classes are ad hoc quantisations of otherwise continuously varying properties. Such classes are, for example, very hot, hot, lukewarm, cold, etc. In all such cases there is a hidden measurable quantity which takes continuous values and which characterises the state of the object. In this particular example, the measurable quantity is the temperature of the object, and the attribute according to which the object has to be classified is its thermal state. The division of this continuum of possible states into a small number of discrete classes can be done on the basis of the actual temperature value, using class boundaries that may be specified arbitrarily. Two different experts (classifiers) may specify different class boundaries. It is this type of problem we address here, with the help of the Dempster-Shafer theory of evidence, where the results of the two classifiers are considered as items of evidence in support of a certain proposition. The problem of different classes is solved by using a superset of finer classes which can be combined to produce the classes of either of the two classifiers.

The nearest research to our work is that of using error correcting codes (ECC) [3] for classification. An ECC uses more characteristics than are necessary to classify something and then maps the superset of characteristics to the minimum meaningful number needed for the classification. One may see each characteristic as defining a different class, and consider that ECC maps a larger number of classes to a smaller one (the correct ones). Our method differs in several respects: 1) the classes used by either of the classifiers are both legitimate sets of classes; 2) our method can be used to refine/improve the performance of either classifier; 3) ECC uses classifiers that output yes/no answers, while the approach we use here comes up with probabilities assigned to each possible class.

Our method is demonstrated in conjunction with the problem of predicting the risk of soil erosion of burned forests in the Mediterranean region, using data concerning relevant factors like soil depth, ground slope and rock permeability. This problem has been solved in the past using Pearl-Bayes networks [16] and fuzzy logic [12], [11]. The results of these classifiers were available to us and they are combined here to produce a more reliable classification.

2 Data Description

Soil erosion depends on three variables: slope, soil depth and rock permeability. Other factors that may influence soil erosion are not taken into account, as they were uniform in the area of study to which our data refer. Geophysical data are available from 39 sites in four areas of Greece. For each of these sites, the classification by a human expert into one of five possible classes of soil erosion is also available. Each of the problem variables takes values from a small set of possible classes.

Stassopoulou et al. [16] implemented a Pearl-Bayes network with which they solved the problem of combining the values of the attributes, alongside the uncertainties associated with them, in order to infer the probability with which a site belonged to each of the possible classes of risk of soil erosion. The use of a Pearl-Bayes network involved the use of conditional probability functions. When the combined attributes and inferred conclusions are discrete valued quantities, these conditional probabilities are matrices. In the particular case, as three attributes were combined to assess the risk of soil erosion, if each variable involved could take $n$ possible values, the matrix ought to have been of size $n \times n \times n \times n$. So, for $n = 5$, there should be 625 elements in the matrix, each expressing the probability of the site belonging to a certain class of soil erosion, given that the site attributes have a certain combination of classes. The calculation of such a large number of probabilities, however, required the availability of a large number of data. In research problems one very seldom has at one's disposal enough relevant data for such estimation. To reduce the severity of the problem, Stassopoulou et al. quantised all variables of the problem into three classes each, thus having to compute only 81 conditional probability values. Their results were quite satisfactory: they obtained consistent results on the training set for 28 out of the 30 training sites, and hardening their conclusions produced agreement with the expert in 7 out of the 9 test sites. However, in spite of their accuracy, these results used gross classes, as each variable was quantised into only one of 3 possible classes.

Sasikala et al. [11] solved the same problem, using the same data, but as no numerical restriction existed, their results classified the risk of soil erosion into one of five possible classes, the same ones used by the expert who had made the assessment in the first place. In order to solve this problem, Sasikala et al. developed a new fuzzy methodology which involved a training stage: weights were used for the membership functions to reflect the relative importance of the combined attributes, and many different combination rules were tried. The system was trained for the selection of the best set of weights and the best combination rule. Upon hardening the final classification, they achieved consistency on the training data in 18 out of the 30 sites, and they could predict correctly the class of the test sites in 5 out of the 9 cases.

However, the use of weights and a variety of combination rules produced a blunt decision system: in some cases more than one possible class had equally high membership values.

The idea we propose here is to combine the results of the accurate probabilistic classifier, which uses gross classes, with the results of the blunt fuzzy classifier, which uses finer classes, in order to obtain a final classification which is both more accurate and less blunt.

3 Dempster-Shafer Theory

The theory of evidence was introduced by Glenn Shafer in 1976 as a mathematical framework for the representation of uncertainty. Let $m_1$ and $m_2$ be two mass functions on the same frame of discernment $\Theta$. The mass function $m = m_1 \oplus m_2$, which is called the orthogonal summation of $m_1$ and $m_2$, is defined as follows [13], [1]:

$$m(A) = \frac{1}{1-D} \sum_{X \cap Y = A} m_1(X)\, m_2(Y), \qquad D \equiv \sum_{X \cap Y = \emptyset} m_1(X)\, m_2(Y), \qquad (1)$$

for all non-empty sets $A \subseteq \Theta$. The factor $1/(1-D)$ is a normalisation factor and is needed to make sure that no mass is assigned to the empty set. In addition, $D$ is a measure of the conflict between the two masses. If $D = 0$, there is no conflict. If $D = 1$, the combination of $m_1$ and $m_2$ does not exist and the two sources are totally, or flatly, contradictory.

For illustration purposes it is convenient to represent this rule of combination graphically, as shown in figure 1. Let $A_i$ and $B_j$ be focal elements of mass functions $m_1$ and $m_2$ respectively. Along the horizontal side of a unit square we show the mass functions of all focal elements of $m_1$; the width of each strip is proportional to the value of its corresponding mass function. The mass functions of $m_2$ are shown along the vertical side of the same square. The area of the intersection of strips $A_i$ and $B_j$ (dashed area in the figure) represents the amount of mass, $m_1(A_i)\, m_2(B_j)$, that is assigned to $A_i \cap B_j$. According to Dempster's combination rule, $m(C)$ is proportional to the sum of the areas of all rectangles for which $A_i \cap B_j = C$. It is possible that for some $i$ and $j$ we have $A_i \cap B_j = \emptyset$. The mass functions therefore have to be scaled in order to make sure that the sum of the combined mass function over all subsets of $\Theta$ is 1; this is done with the help of $D$ in equation (1). If $A_i \cap B_j = \emptyset$ for all $i$ and $j$, all mass of the combination goes to the empty set and we say that $m_1$ and $m_2$ are not combinable, or totally contradictory.
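To make the rule concrete, here is a minimal Python sketch of equation (1), with a mass function represented as a dictionary keyed by frozensets of class labels; the helper name combine and the toy frame are illustrative assumptions, not part of the original implementation.

```python
from itertools import product

def combine(m1, m2):
    """Orthogonal summation m1 (+) m2 of equation (1).

    Each mass function is a dict mapping a focal element (a frozenset
    of class labels) to its mass; the masses are assumed to sum to 1.
    """
    combined = {}
    conflict = 0.0  # the quantity D of equation (1)
    for (x, mx), (y, my) in product(m1.items(), m2.items()):
        intersection = x & y
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mx * my
        else:
            conflict += mx * my  # mass that would fall on the empty set
    if conflict >= 1.0:
        raise ValueError("sources are totally contradictory (D = 1)")
    # normalise by 1 - D so that no mass is assigned to the empty set
    return {a: v / (1.0 - conflict) for a, v in combined.items()}

# Toy example on the frame {a, b}: D = 0.8 * 0.5 = 0.4 is renormalised away.
m1 = {frozenset({"a"}): 0.8, frozenset({"a", "b"}): 0.2}
m2 = {frozenset({"b"}): 0.5, frozenset({"a", "b"}): 0.5}
print(combine(m1, m2))  # {a}: 2/3, {b}: 1/6, {a, b}: 1/6
```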

Figure 1: Combination of mass functions (the focal elements of $m_1$ and $m_2$ define the horizontal and vertical strips of a unit square; the area of each intersection rectangle is the mass assigned to the intersection of the corresponding focal elements).

4 The Proposed System of Classifier Combination

In this paper we use the Dempster-Shafer theory to combine the items of evidence that come from the Bayesian network of Stassopoulou et al. and the fuzzy logic classifier of Sasikala et al. One of the conditions for being able to use the Dempster-Shafer theory is that all sources should have the same frame of discernment [13], [4]. In our case this is not true: the risk of soil erosion is classified into 3 classes, which we denote by $A_1$, $A_2$, $A_3$, in the Bayesian network method, and into 5 classes, which we denote by $B_1, \ldots, B_5$, in the fuzzy logic method.

To be able to use the Dempster-Shafer theory in this application, we look for a definition of a frame of discernment in which both methods can be expressed. Since both methods span the same classification space, we quantise the classification space into 15 classes, $C_1$ to $C_{15}$. These classes can express both methods because 15 is divisible by both 3 and 5. In other words, the union of, for example, the first 5 new classes, i.e. $C_1 \cup \cdots \cup C_5$, is the same as the first class of the Bayesian network method, i.e. $A_1$. Also, the union of the first 3 new classes, i.e. $C_1 \cup C_2 \cup C_3$, is the same as the first class, i.e. $B_1$, of the fuzzy logic method. Figure 2 shows schematically the idea of defining this superset of classes.

The next step is defining the mass functions from the two classifiers. We interpret the beliefs of the Bayesian network system as mass functions in the Dempster-Shafer theory. Since the output measurements of a Bayesian network are in the form of probabilities, no further conversion is needed to use them as mass functions. However, the membership grades of the classes in the fuzzy system, although in the range $[0, 1]$, do not sum up to 1. Therefore we cannot interpret them as mass functions directly. Instead, we use them to distribute the unit mass proportionally to the corresponding classes.
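Continuing the Python conventions of the previous sketch, the following fragment shows one way the two coarse frames could be embedded in the 15-class superset of figure 2, and how the fuzzy membership grades could be normalised into a mass function; the function names are our own illustrative choices.

```python
# Fine classes C1..C15; each coarse class is a frozenset of fine classes.
C = [f"C{i}" for i in range(1, 16)]
THETA = frozenset(C)  # the common frame of discernment

A = {f"A{i+1}": frozenset(C[5 * i:5 * (i + 1)]) for i in range(3)}  # Bayesian: 3 classes of 5
B = {f"B{j+1}": frozenset(C[3 * j:3 * (j + 1)]) for j in range(5)}  # fuzzy: 5 classes of 3

def bayesian_to_mass(beliefs):
    """Bayesian output probabilities, e.g. {'A1': 0.7, ...}, used directly as masses."""
    return {A[name]: p for name, p in beliefs.items()}

def fuzzy_to_mass(grades):
    """Fuzzy membership grades, e.g. {'B1': 0.9, ...}, normalised to sum to 1."""
    total = sum(grades.values())
    return {B[name]: g / total for name, g in grades.items()}
```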

Figure 2: Definition of the superset of classes ($A_1, A_2, A_3$ and $B_1, \ldots, B_5$ expressed as unions of $C_1, \ldots, C_{15}$).

Before using the defined mass functions in Dempster's combination rule, another factor that should be taken into consideration is the relative reliability of the two classifiers. If we know the recognition rate, substitution rate (error rate) and rejection rate of a classifier, its reliability can be defined as [19]:

$$\text{Reliability} = \frac{\text{Recognition rate}}{100\% - \text{Rejection rate}}. \qquad (2)$$

If a classifier does not include a rejection option, like the Bayesian classifier of Stassopoulou et al. [16], its reliability is the same as its recognition rate. So we use as the reliability of the Bayesian classifier its recognition rate. The fuzzy logic classifier, however, was based heavily on the use of individual production rules, which may themselves be treated as a collection of individual classifiers. One does not necessarily expect all the rules used to be equally reliable; indeed, some of them may even be wrong. To assign an overall reliability factor to the fuzzy classifier would therefore be equivalent to ignoring the peculiarities of the individual classifiers of which it is a collection. We decided instead to examine, with the help of a training phase, the reliability of the individual firing rules. It is these individual reliability factors that are used to moderate the mass functions of the fuzzy classifier.

In the Dempster-Shafer theory we can interpret the unreliability of a source as lack of information. So, after we scale down the mass functions which we have already defined for each classifier, by taking into consideration the reliability of the classifier, we assign the remaining mass to the frame of discernment as lack of information. In figure 3, the mass functions derived from the Bayesian network and the fuzzy logic system after considering the reliability of the classifiers are denoted by $m_1$ and $m_2$ respectively, and their combination is denoted by $m$.
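A minimal sketch of this discounting step, under the same representation as the previous sketches (THETA comes from the superset fragment above; the helper name discount is assumed; for the fuzzy classifier, the per-rule weights play the role of the reliability factor applied to the corresponding masses):

```python
def discount(mass, reliability, theta=THETA):
    """Scale a mass function by a reliability factor in [0, 1] and assign
    the remaining mass to the frame of discernment (lack of information)."""
    m = {a: reliability * v for a, v in mass.items()}
    m[theta] = m.get(theta, 0.0) + (1.0 - sum(m.values()))
    return m

# e.g. discounting the Bayesian masses by the recognition rate R (in %):
# m1 = discount(bayesian_to_mass(beliefs), R / 100.0)
```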

Figure 3: Combination of mass functions (the rows correspond to the focal elements $B_1, \ldots, B_5, \Theta$ of $m_2$ and the columns to the focal elements $A_1, A_2, A_3, \Theta$ of $m_1$; each cell holds the product of the corresponding masses).

Note that the square area denoted by, for example, $m(C_4 \cup C_5)$ is equal to $m_1(A_1)\, m_2(B_2)$. This value is used in Dempster's rule of combination, given by equation (1), in order to assign mass to $C_4 \cup C_5$. As can be seen, in fifteen cases the mass which results from the combination of the two sources can be assigned to non-empty sets. Here the normalisation factor $D$ is:

$$D = m_1(A_1)\,[m_2(B_3) + m_2(B_4) + m_2(B_5)] + m_1(A_2)\,[m_2(B_1) + m_2(B_5)] + m_1(A_3)\,[m_2(B_1) + m_2(B_2) + m_2(B_3)]. \qquad (3)$$

For example, we have:

$$m(C_4 \cup C_5) = \frac{m_1(A_1)\, m_2(B_2)}{1 - D}. \qquad (4)$$

Although we have classified the risk of soil erosion into 15 classes, we would like to have the result in the 5 classes used by the expert and by the fuzzy logic system. Thus, we calculate the belief function of the classes of interest by using the mass functions of the focal elements. So, after scaling all mass functions which are assigned to non-empty subsets, the summation of the masses of the classes in each row will be the belief function of the corresponding class. For example, the summation of $m(C_4 \cup C_5)$, $m(C_6)$ and $m(C_4 \cup C_5 \cup C_6)$ in the second row will be assigned as $bel(B_2)$, which is the belief of the second class out of the 5 possible classes, i.e.

$$bel(B_2) = m(C_4 \cup C_5) + m(C_6) + m(C_4 \cup C_5 \cup C_6).$$

Figure 4 shows schematically the proposed combination system.
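In the same illustrative Python conventions, the belief of a coarse class is the total combined mass of the focal elements it contains; a hypothetical decision step then picks the class of maximum belief:

```python
def belief(m, hypothesis):
    """bel(H): total mass of the focal elements contained in H."""
    return sum(v for focal, v in m.items() if focal <= hypothesis)

def decide(m, classes=B):
    """Return the name of the coarse class with maximum belief, e.g. 'B2'."""
    return max(classes, key=lambda name: belief(m, classes[name]))

# e.g. bel(B2) = m(C4 u C5) + m(C6) + m(C4 u C5 u C6), as in the text.
```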

Figure 4: Proposed system for combining classifiers to assess the risk of soil erosion. Slope, soil depth and rock permeability are fed to the Bayesian network and to the fuzzy logic system; each output passes through a mass function generator, moderated by the reliability of the classifiers, before entering Dempster's combination rule.

5 Experimental Results

If we denote the output beliefs of the Bayesian network by $b(A_1)$, $b(A_2)$, $b(A_3)$ and its recognition rate by $R$, we used:

$$m_1(A_i) = \frac{R}{100}\, b(A_i), \qquad i = 1, 2, 3,$$

with $R = 77.78\%$. To deal with the reliability of the fuzzy classifier, we multiplied by weights $w_j \in [0, 1]$, $j = 1, \ldots, 5$, the different mass functions which resulted from the different expert rules used by it. We used the 30 training sites to identify the weights which would produce the best results. It is worth mentioning that we used an exhaustive search of the weight space to find the best weights. However, in every set of weights we fixed one of the weights to be 1, in order to make the search space smaller and because this way the weights measured the relative importance of the various rules. We found that the best results could be obtained when the mass functions of classes $B_2$ and $B_4$ were scaled down by 0.5. This is a very interesting outcome, as it indicates that perhaps less emphasis should be placed on rules that lead to classes other than the two extremes and the middle. We speculate that people find it easy to classify things in classes like low, medium and high, or good, average and bad, etc. It is more difficult to ask people to classify things in classes like very low, low, medium, high and very high, or very good, good, average, bad and very bad. It seems that heuristics devised by people to yield classification in classes inserted between the 3 gross ones may not be as reliable as rules that classify into the clear-cut and well defined classes of the two extremes and a medium.
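A sketch of this training-time search follows, with an assumed weight grid and a hypothetical evaluate callback that counts correctly classified training sites; the paper does not specify the quantisation of the weight space, so the grid values here are illustrative.

```python
from itertools import product

WEIGHT_GRID = [0.25, 0.5, 0.75, 1.0]  # assumed quantisation of each weight

def search_weights(evaluate, n_rules=5):
    """Exhaustive search with the first weight pinned to 1, so the remaining
    weights measure relative importance; evaluate(weights) -> training score."""
    best_weights, best_score = None, -1.0
    for rest in product(WEIGHT_GRID, repeat=n_rules - 1):
        weights = (1.0,) + rest  # fix one weight to 1, as in the text
        score = evaluate(weights)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights
```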

After the reliability of each classifier was taken into consideration, the sum of its mass functions was no longer 1. The difference of the sum from 1 was assigned to the frame of discernment, which is interpreted as lack of information. For example, for the fuzzy classifier,

$$m_2(\Theta) = 1 - \sum_{j=1}^{5} w(B_j)\, m_2(B_j),$$

where the $w(B_j)$ are the weights identified during the training phase. For the Bayesian classifier,

$$m_1(\Theta) = 1 - \frac{R}{100} \sum_{i=1}^{3} b(A_i).$$

By using the maximum belief function criterion in the decision making step, 6 out of the 9 testing sites were correctly classified, and the other 3 sites were classified in the class next to that assigned by the expert. This should be compared with the 5 sites which were correctly classified by the fuzzy classifier alone.

6 Discussion and Conclusions

In classification problems where the classes used represent different grades of the same attribute, it is possible to have different divisions into classes used by different classifiers. A similar situation may arise when the individual classifiers are unsupervised and determine the data classes automatically. In such cases it is not possible to combine the results of the different classifiers in a straightforward way. We proposed here that one may handle such problems within the framework of Dempster-Shafer theory. The Dempster-Shafer theory as a classifier combination method allows one to deal with the different numbers of classes used by the different classifiers because, unlike Bayesian theory, it assigns probabilities to sets of possible hypotheses, not just to individual hypotheses. In addition, it allows one to take into consideration the reliability of the classifiers in the process of mass definition. We demonstrated our ideas using the example problem of predicting soil erosion. Within the limitations of this problem, we showed not only that the accuracy of the individual classifiers was improved, but also that a finer set of output classes could be obtained. Although our results are too limited for their statistical significance to be estimated, due to the limitations of our dataset, this should not detract from the proposed methodology, which is generic and applicable to any situation where the classes defined by the different classifiers are different. This stems from the ability of the Dempster-Shafer theory to assign probabilities to sets of possible classes and not just to individual classes.

Acknowledgements

Mohammad Reza Ahmadzadeh was on leave from the University of Shiraz, Iran, when this work was carried out as part of his PhD thesis. He is grateful to the Ministry of Science, Research and Technology of Iran for its financial support during his research studies.

References

[1] J. A. Barnett. Computational methods for a mathematical theory of evidence. In Proceedings of the 7th International Joint Conference on Artificial Intelligence, Vancouver, BC, Canada, volume 2, pages 868-875, Aug. 1981.

[2] D. Black. The Theory of Committees and Elections. Cambridge University Press, 1963.

[3] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263-286, 1995.

[4] J. W. Guan and D. A. Bell. Evidence Theory and its Applications. Elsevier Science Publishers B.V., 1991.

[5] T. K. Ho, J. J. Hull and S. N. Srihari. Decision combination in multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1):66-75, 1994.

[6] H. Kang, K. Kim, and J. H. Kim. A framework for probabilistic combination of multiple classifiers at an abstract level. Engineering Applications of Artificial Intelligence, 10(4):379-385, 1997.

[7] J. Kittler. Combining classifiers: A theoretical framework. Pattern Analysis and Applications, 1:18-27, Jan. 1998.

[8] J. Kittler and F. Roli (Eds). Multiple Classifier Systems. Springer LNCS 2096, ISBN 3-540-42284-6, 2001.

[9] L. Kuncheva. An application of OWA operators to the aggregation of multiple classification decisions. In R. Yager and J. Kacprzyk, editors, The Ordered Weighted Averaging Operators: Theory and Applications. Kluwer Academic Publishers, 1997.

[10] G. Rogova. Combining the results of several neural network classifiers. Neural Networks, 7(5):777-781, 1994.

[11] K. R. Sasikala and M. Petrou. Generalised fuzzy aggregation in estimating the risk of desertification of a burned forest. Fuzzy Sets and Systems, 118(1):121-137, February 2001.

[12] K. R. Sasikala, M. Petrou, and J. Kittler. Fuzzy classification with a GIS as an aid to decision making. EARSeL Advances in Remote Sensing, 4(4):97-105, November 1996.

[13] G. Shafer. A Mathematical Theory of Evidence. Princeton University Press, 1976.

[14] A. J. C. Sharkey. On combining artificial neural nets. Connection Science, 8(3/4):299-314, 1996.

[15] A. J. C. Sharkey (Ed.). Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems. Springer-Verlag, 1999.

[16] A. Stassopoulou, M. Petrou, and J. Kittler. Application of a Bayesian network in a GIS based decision making system. International Journal of Geographical Information Science, 12(1):23-45, 1998.

[17] M. Taniguchi and V. Tresp. Averaging regularized estimators. Neural Computation, 9:1163-1178, 1997.

[18] A. Verikas, A. Lipnickas, K. Malmqvist, M. Bacauskiene, and A. Gelzinis. Soft combination of neural classifiers: A comparative study. Pattern Recognition Letters, 20(4):429-444, 1999.

[19] L. Xu, A. Krzyzak, and C. Y. Suen. Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Transactions on Systems, Man and Cybernetics, 22(3):418-435, 1992.