Combining Heterogeneous Sets of Classifiers: Theoretical and Experimental Comparison of Methods


Dennis Bahler and Laura Navarro
Department of Computer Science
North Carolina State University
Raleigh, NC

Abstract

In recent years, the combination of classifiers has been proposed as a method to improve the accuracy achieved in isolation by a single classifier. We are interested in ensemble methods that allow the combination of heterogeneous sets of classifiers, which are classifiers built using differing learning paradigms. We focus on theoretical and experimental comparison of five such combination methods: majority vote, a method based on Bayes rule, a method based on Dempster-Shafer evidence combination, behavior-knowledge space, and logistic regression. We develop an upper bound on the accuracy that can be obtained by any of the five methods of combination, and show that this estimate can be used to determine whether an ensemble may improve the performance of its members. We then report a series of experiments using standard data sets and learning methods, and compare experimental results to theoretical expectations.

Keywords: classifier ensembles, voting, Bayesian combination, Dempster-Shafer, behavior-knowledge space, logistic regression

1 Introduction

The combination of the decisions of several classifiers has been proposed as a means of improving the accuracy achieved by any one of them. The reasons for combining the outputs of multiple classifiers are compelling: different classifiers may implicitly represent different useful aspects of a problem, or of the input data, while no single classifier represents all useful aspects.

In the context of pattern recognition, the idea of combining the decisions of several classifiers has been well explored. A wide variety of methods have been used to combine the results of several neural networks [6, 7, 13, 20, 23, 26], decision trees [2, 12], sets of rules [3] and other models [10, 11, 28, 31, 29, 30]. Moreover, the idea of combining decisions is not exclusive to the pattern recognition community. For example, Bates et al. [5] describe the combination of forecasts in operations research, and French [8] presents an analysis of several methods used to form group consensus statistical distributions. Also in statistics, Uebersax [27] describes how to combine the decisions of several experts on the appropriateness of indications for a specific medical procedure.

In contrast to approaches which combine models derived from multiple versions of the same learning method, the specific focus of this paper is on ensemble methods that are able to combine the decisions of multiple classifiers of different types, so-called heterogeneous sets of classifiers. Different classifiers typically express their opinions in different ways. For example, an expert may report probabilities, distances between the concept described by an example and already defined concepts, or simply a label representing the predicted class of the example.

We formalize our problem as follows. Assume a pattern space $U$ partitioned into $M$ mutually exclusive sets, $U = C_1 \cup C_2 \cup \ldots \cup C_M$, where each $C_i$, $i \in \{1,\ldots,M\}$, represents a set of patterns called a class. Each expert or classifier (denoted by $e$) assigns to a sample $x \in U$ an index $j \in \{1,\ldots,M+1\}$: if $j \le M$, the index represents $x$ as being from class $C_j$, while $j = M+1$ represents that the expert has no idea which class $x$ belongs to, i.e. $x$ is rejected by $e$. The decision of expert $e$ is denoted by $e(x) = j$. Based on this definition of the decision of an expert, our problem can be defined as the combination of $D$ different classifiers $e_k$, $k = 1,\ldots,D$, each of which assigns $x$ to a label $j_k$, denoted $e_k(x) = j_k$, into an integrated classifier $E$, which gives $x$ one definitive label $j$.

The performance of a classifier can be described by its recognition, substitution, and rejection rates. The recognition rate of a classifier $e_k$, denoted $\varepsilon_r^{(k)}$, measures the probability that its output given sample $x$ corresponds to the actual class of $x$:

$$\varepsilon_r^{(k)} = \sum_{i=1}^{M} P(x \in C_i,\ e_k(x) = i)$$

The substitution rate, on the other hand, is related to the probability that $e_k$ makes the wrong decision, i.e. the output is label $j$ for sample $x$ when $x$ actually belongs to class $C_i$, $j \ne i$:

$$\varepsilon_s^{(k)} = \sum_{i=1}^{M} \sum_{j \ne i} P(x \in C_i,\ e_k(x) = j)$$

If a classifier does not have enough evidence to assign $x$ to any class ($e_k(x) = M+1$), then $\varepsilon_r^{(k)} + \varepsilon_s^{(k)} < 1$. In fact, $1 - \varepsilon_r^{(k)} - \varepsilon_s^{(k)}$ measures how often $e_k$ rejects samples.

To investigate systematically various approaches to combination, we performed the following set of experiments. We began with four data sets from the UCI repository [18]: Tic-Tac-Toe Endgame, Wisconsin Breast Cancer, Pima Indians Diabetes, and Credit Screening. These data sets were each used to train a heterogeneous set of three classifiers using standard techniques: a decision tree, a Bayesian belief network, and a backpropagation neural network. To obtain an integrated classifier, the results of the classifiers for each data set were then combined into an ensemble using five methods: majority vote, a method based on Bayes rule, a method based on Dempster-Shafer evidence combination, behavior-knowledge space, and logistic regression. Next, by examining conditions under which the combination of classifiers does not improve the accuracy obtained by a single expert, we developed a method to predict the theoretical maximum recognition rate that can be achieved by each of the ensemble methods. Finally, we compared the experimentally observed changes in accuracy of the ensemble models with the theoretical bounds we developed earlier.

The remainder of this paper is organized as follows. In Section 2, we briefly describe the five methods that we used to combine multiple classifiers. In Section 3, we describe conditions for improvement in the recognition rate of a majority voting ensemble over the single classifiers and give estimators of the accuracy of a vote-based ensemble.

In the same section, we analyze the behavior of the remaining ensemble methods and define upper bounds on the accuracy that may be obtained by any of these combination methods. In Section 4, we present the results of experiments that show the applicability of this analysis. We offer some conclusions in Section 5.

2 Methods of Combining Multiple Classifiers

Combination methods arising in statistics [8] and operations research [5] typically combine not only classifications but also measures of confidence in those opinions. Confidence measures may be combined using statistical methods for continuous data, such as averaging [6, 16], opinion pools [8] or Bayes theory [8]. These combination techniques require that the measures be homogeneous, that is, of the same kind for all the experts in the group; for example, all of them represent probabilities, or distances, or they need to be converted to equivalent scales.

If the experts are not able to supply the same measure of confidence, or not able to supply any measure at all, we are confronted with a different problem. This is the case of interest to us, when several heterogeneous classifiers are to be combined and the only common information available from all the experts is their opinion. Methods have been developed to deal with this problem, and they are described in this section.

2.1 Voting

Generally speaking, the voting principle is just what we know as majority voting. Several variations of this idea have been proposed (a sketch of the plain majority rule is given after this list):

- The combined classifier decides that an input pattern $x$ comes from class $C_j$ if and only if all the classifiers decide that $x$ comes from class $C_j$; otherwise it rejects $x$.

- The combined classifier decides that an observation $x$ comes from class $C_j$ if some classifiers support that $x$ belongs to $C_j$, and no other classifier supports that $x$ belongs to any other class (i.e. they reject $x$); otherwise it rejects $x$.

- The majority rule, where the combined classifier decides an observation $x$ belongs to class $C_j$ if more than half of the classifiers support that $x$ belongs to $C_j$. A modification of this rule is to require a different proportion of classifiers to agree instead of half of them.

- The combined classifier decides that observation $x$ belongs to class $C_j$ if the number of classifiers that support it is considerably bigger than the number of classifiers that support any other class.
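To make the plain majority rule concrete, the sketch below counts the hard labels produced by the individual classifiers. It is our own minimal illustration, not code from the paper, and the names majority_vote, REJECT and min_fraction are ours.

```python
from collections import Counter

REJECT = "reject"  # stands in for the extra label M+1 used in the paper

def majority_vote(labels, min_fraction=0.5):
    """Combine hard labels by majority rule.

    labels: list of class labels, one per classifier (a label may be REJECT).
    min_fraction: fraction of all classifiers that must agree; 0.5 gives the
    usual majority rule, higher values give the stricter variants above.
    Returns the winning class, or REJECT if no class reaches the threshold.
    """
    votes = Counter(l for l in labels if l != REJECT)
    if not votes:
        return REJECT
    best_class, best_count = votes.most_common(1)[0]
    if best_count > min_fraction * len(labels):
        return best_class
    return REJECT

# Example: three classifiers, two of which agree.
print(majority_vote(["up", "down", "up"]))    # -> "up"
print(majority_vote(["up", "down", REJECT]))  # -> REJECT (no strict majority)
```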

Combining classifiers with this method is simple; it does not require any previous knowledge of the behavior of the classifiers, nor does it require any complex methodology to decide. It only counts the number of classifiers that agree in their decision and accordingly decides the class to which the input pattern belongs. This simplicity has a drawback, however: the decisions of all the classifiers carry equal weight, even when some of the classifiers are much more accurate than others.

2.2 Bayesian Ensemble Methods

Voting methods are based solely on the output label computed by each classifier. No expertise or accuracy is considered. In these methods the decision of each classifier is treated as one vote, but what happens if one of the classifiers is much more accurate than any other? Should its accuracy not be considered in the combination? Suppose that we ask two experts whether a stock value will increase in the next month. We are likely to weigh their opinions differently depending on how accurate they have been in the past. To address this problem we can establish weights proportional to each expert's accuracy, so that each classifier's output is considered according to its past performance. Several methods of establishing these weights have been developed, and some of them are described in the following sections.

One way to represent an expert's behavior is through a confusion matrix, which summarizes the accuracy of each classifier as follows:

$$CM_k = \begin{pmatrix}
n^{(k)}_{11} & n^{(k)}_{12} & \cdots & n^{(k)}_{1M} & n^{(k)}_{1,M+1} \\
n^{(k)}_{21} & n^{(k)}_{22} & \cdots & n^{(k)}_{2M} & n^{(k)}_{2,M+1} \\
\vdots & \vdots & & \vdots & \vdots \\
n^{(k)}_{M1} & n^{(k)}_{M2} & \cdots & n^{(k)}_{MM} & n^{(k)}_{M,M+1}
\end{pmatrix}, \qquad k = 1,\ldots,D$$

where each row $i$ corresponds to the samples of class $C_i$, and each column $j$ corresponds to the samples that have been assigned label $j$ by the expert $e_k$ ($e_k(x) = j$). This means that each $n^{(k)}_{ij}$ is the number of samples of class $C_i$ that have been assigned label $j$ by expert $e_k$.
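As a small illustration of how such a confusion matrix can be tallied from held-out data (our sketch, with hypothetical function and variable names), each row indexes the true class and each column the label assigned by the classifier, with the last column collecting rejections:

```python
import numpy as np

def confusion_matrix(true_classes, decisions, n_classes):
    """Tally the M x (M+1) confusion matrix CM_k of one classifier.

    true_classes: iterable of true class indices in {0, ..., M-1}.
    decisions:    iterable of the classifier's labels in {0, ..., M},
                  where label M means the sample was rejected.
    Entry [i, j] counts samples of class i that received label j.
    """
    cm = np.zeros((n_classes, n_classes + 1), dtype=int)
    for i, j in zip(true_classes, decisions):
        cm[i, j] += 1
    return cm

# Two-class toy example (0 = "up", 1 = "down", label 2 = rejected).
cm1 = confusion_matrix(true_classes=[0, 0, 1, 1, 0],
                       decisions=[0, 0, 1, 0, 2],
                       n_classes=2)
print(cm1)  # rows: true class, columns: assigned label (last column = rejections)
```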

 Expert                 Real class
 e_1        e_2         Up    Down
 Up         Up          79    10
 Up         Down        35     5
 Up         Unknown      5     1
 Down       Up           4    30
 Down       Down         3    35
 Down       Unknown      1     8
 Unknown    Up          13     0
 Unknown    Down         1     8
 Unknown    Unknown      2     0

Table 1: Summary of the experts' predictions in the stock example.

Consider again the example of the experts in the stock market. Table 1 shows all possible combinations of decisions made by the two experts in the past, and, for each of these combinations, indicates how many stock values actually increased in the next month, and how many decreased. Given this information, the confusion matrices for the two experts (rows: actual class Up, Down; columns: predicted Up, Down, Unknown) are:

$$CM_1 = \begin{pmatrix} 119 & 8 & 16 \\ 16 & 73 & 8 \end{pmatrix} \qquad
CM_2 = \begin{pmatrix} 96 & 39 & 8 \\ 40 & 48 & 9 \end{pmatrix}$$

The number of samples of class $C_i$ can be obtained by

$$\sum_{j=1}^{M+1} n^{(k)}_{ij} = n^{(k)}_{i+}$$

The number of samples that have been assigned label $j$ is

$$\sum_{i=1}^{M} n^{(k)}_{ij} = n^{(k)}_{+j}$$

and the total number of samples is

$$N = \sum_{i=1}^{M} \sum_{j=1}^{M+1} n^{(k)}_{ij}$$

The fact that $x$ comes from class $C_i$ given that classifier $e_k$ output label $j$ for $x$ (event $e_k(x) = j$) has uncertainty. Through the confusion matrix $CM_k$, such uncertainty can be approximated by the conditional probabilities that $x$ belongs to class $C_i$, $i = 1,\ldots,M$, given that $e_k(x) = j$:

$$P(x \in C_i \mid e_k(x) = j) = \frac{n^{(k)}_{ij}}{n^{(k)}_{+j}} = \frac{n^{(k)}_{ij}}{\sum_{l=1}^{M} n^{(k)}_{lj}} \qquad i = 1,\ldots,M \qquad (1)$$

The confusion matrix $CM_k$ can be regarded as the prior knowledge of expert $e_k$. Based on the occurrence of the event $e_k(x) = j$, the expert expresses its beliefs, with uncertainty, on each of the $M$ mutually exclusive propositions $x \in C_i$, $i \in \{1,\ldots,M\}$, by a real value called a belief. The belief value given the decision of a classifier $e_k$ with respect to an example $x$ can be defined as

$$bel(x \in C_i \mid e_k(x)) = P(x \in C_i \mid e_k(x) = j) \qquad i = 1,\ldots,M$$

When $D$ different experts $e_1,\ldots,e_D$ with $D$ confusion matrices $CM_1,\ldots,CM_D$ are used over the same observation $x$, $D$ events $e_i(x) = j_i$, $i = 1,\ldots,D$ will occur. Each event supplies its own set of $bel(x \in C_i \mid e_k(x))$, $i = 1,\ldots,M$. We now need a way to integrate these beliefs into a combined value, given by

$$\begin{aligned}
bel(C_i) &= bel(x \in C_i \mid e_1(x), e_2(x), \ldots, e_D(x)) \\
&= P(x \in C_i \mid e_1(x) = j_1, e_2(x) = j_2, \ldots, e_D(x) = j_D) \\
&= \frac{P(e_1(x) = j_1, e_2(x) = j_2, \ldots, e_D(x) = j_D \mid x \in C_i)\, P(x \in C_i)}{P(e_1(x) = j_1, e_2(x) = j_2, \ldots, e_D(x) = j_D)} \qquad i = 1,\ldots,M \qquad (2)
\end{aligned}$$

If the classifiers perform independently of each other, then the events $e_i(x) = j_i$, $i = 1,\ldots,D$ will be independent of each other and we have:

9 bel(c i ) = = = Dk=1 P (e k (x) =j k x C i ) Dk=1 P(x C i ) P (e k (x) =j k ) Dk=1 P (e k (x) =j k,x C i ) Dk=1 P (e k (x) =j k ) D k=1 P (x C i ) P (x C i) Dk=1 P (x C i e k (x) =j k ) P(x C P(x C i ) D i ) where P (x C i e k = j k ) can be estimated by (1). A good estimator of P (x C i ) should be the posterior probabilities P (x C i x), but they are not available, so we estimate the value of the bel(c i )with: bel(c i ) Dk=1 P (x C i e k (x) =j k ) Ml=1 Dk=1 P (x C l e k (x) =j k ) This substitution will ensure that M i=1 bel(c i ) = 1, condition required since the events x C i,i=1..m are mutually exclusive and exhaustive. As a result of this ensemble method, from the opinions of the stock price experts, we obtain a belief about the gain or loss in stock value depending on the combination of decisions of the two classifiers. Then the belief that a stock value x will gain value in the next month given that e 1 (x) =Up and e 2 (x) =Down is obtained by: P 1 (Up) = P(x Up e 1 (x)=up) P 1 (Down) = P(x Down e 1 (x) =Up) P 2 (Up) = P(x Up e 2 (x)=down) P 2 (Down) = P(x Down e 2 (x) =Down) bel(up) = P 1 (Up)P 2 (Up) P 1 (Up)P 2 (Up)+P 1 (Down)P 2 (Down) = =0.86 The beliefs for all possible experts decisions in the example are shown in Table 2. Finally, depending on these values of bel(.), x can be classified into class C i if: 9

Finally, depending on these values of $bel(\cdot)$, $x$ is classified according to the decision rule

$$E(x) = \begin{cases} i & \text{if } bel(C_i) = \max_{j=1,\ldots,M} bel(C_j) \text{ and } bel(C_i) \ge \alpha \\ M+1 & \text{otherwise} \end{cases}$$

where $0 \le \alpha \le 1$ is a threshold.

 e_1(x)     e_2(x)      E(x)
 Up         Up          Up
 Up         Down        Up
 Up         Unknown     Up
 Down       Up          Down
 Down       Down        Down
 Down       Unknown     Down
 Unknown    Up          Up
 Unknown    Down        Up
 Unknown    Unknown     Up

Table 2: Bayes combination example: the ensemble decision E(x) obtained from the estimated conditional probabilities $P(x \in C_i \mid e_k(x) = j)$ and the combined beliefs, for each combination of the experts' opinions.

2.3 Behavior-Knowledge Space Methods

One of the strongest conditions for the applicability of Bayes rule to combining the decisions of several experts is the requirement that they perform independently. The Behavior-Knowledge Space method described in [11, 28], by contrast, makes no assumption about the data to be combined.

It determines the belief in a proposition $x \in C_i$ based on the combination of the items of evidence $e_k(x) = j_k$, $k = 1,\ldots,D$, according to

$$bel(C_i) = \frac{P(e_1(x) = j_1, e_2(x) = j_2, \ldots, e_D(x) = j_D,\, x \in C_i)}{P(e_1(x) = j_1, e_2(x) = j_2, \ldots, e_D(x) = j_D)}$$

corresponding to Equation (2) used in the Bayesian approach.

This definition can be easily implemented through a Behavior-Knowledge Space (BKS). A BKS is a D-dimensional space, where each dimension corresponds to the decision of one classifier. Each classifier has $M+1$ possible decision values: the $M$ classes plus the rejection of the given sample by the classifier (class $M+1$). The intersection of the decisions of the individual classifiers occupies one point of the BKS. Each such point, denoted $BKS(j_1, j_2, \ldots, j_D)$, $j_i = 1,\ldots,M+1$, $i = 1,\ldots,D$, contains three kinds of data:

- $n_{j_1,\ldots,j_D}(i)$: the total number of patterns $x$ such that $e_1(x) = j_1, \ldots, e_D(x) = j_D$ and $x \in C_i$, $i = 1,\ldots,M$.

- $T_{j_1,\ldots,j_D}$: the total number of patterns $x$ such that $e_1(x) = j_1, \ldots, e_D(x) = j_D$,

$$T_{j_1,\ldots,j_D} = \sum_{i=1}^{M} n_{j_1,\ldots,j_D}(i)$$

- $R_{j_1,\ldots,j_D}$: the best representative class of $BKS(j_1, j_2, \ldots, j_D)$,

$$n_{j_1,\ldots,j_D}(R_{j_1,\ldots,j_D}) = \max_{i=1,\ldots,M} n_{j_1,\ldots,j_D}(i) \qquad (3)$$

Continuing with our example of the stock market, the BKS is shown in Table 3. Once the BKS is obtained, the belief that $x$ belongs to class $C_i$, $bel(C_i)$, $i \in \{1,\ldots,M\}$, can be calculated by

$$bel(C_i) = \frac{P(e_1(x) = j_1, e_2(x) = j_2, \ldots, e_D(x) = j_D,\, x \in C_i)}{P(e_1(x) = j_1, e_2(x) = j_2, \ldots, e_D(x) = j_D)} = \frac{n_{j_1,\ldots,j_D}(i)}{T_{j_1,\ldots,j_D}}$$
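Before stating the decision rule, here is a small sketch (ours, with hypothetical names build_bks and bks_classify) of how a BKS can be stored as a dictionary keyed by the tuple of classifier decisions and then used to pick the best representative class:

```python
from collections import defaultdict
import numpy as np

def build_bks(decision_tuples, true_classes, n_classes):
    """Build a Behavior-Knowledge Space: for every observed tuple of classifier
    decisions, count how many training samples of each class produced it."""
    bks = defaultdict(lambda: np.zeros(n_classes, dtype=int))
    for decisions, c in zip(decision_tuples, true_classes):
        bks[tuple(decisions)][c] += 1
    return bks

def bks_classify(bks, decisions, n_classes, alpha=0.5):
    """Return the best representative class of the BKS cell, or n_classes
    (rejection) if the cell is empty or its belief falls below alpha."""
    counts = bks.get(tuple(decisions))
    if counts is None or counts.sum() == 0:
        return n_classes                      # unseen combination: reject
    best = int(np.argmax(counts))             # R_{j_1,...,j_D}
    belief = counts[best] / counts.sum()      # n(best) / T
    return best if belief >= alpha else n_classes

# Toy usage: two classifiers, two classes (label 2 = rejection).
bks = build_bks([(0, 0), (0, 0), (0, 1), (1, 1)], [0, 0, 1, 1], n_classes=2)
print(bks_classify(bks, (0, 0), n_classes=2))  # -> 0
print(bks_classify(bks, (1, 0), n_classes=2))  # -> 2 (never seen, rejected)
```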

                         e_1 = Up                e_1 = Down             e_1 = Unknown
                  n[U]  n[D]   T    R     n[U]  n[D]   T    R     n[U]  n[D]   T    R
 e_2 = Up          79    10    89   U       4    30    34   D      13     0    13   U
 e_2 = Down        35     5    40   U       3    35    38   D       1     8     9   D
 e_2 = Unknown      5     1     6   U       1     8     9   D       2     0     2   U

 (U = Up, D = Down)

Table 3: Behavior-Knowledge Space example.

Similarly to the combination using Bayes rule, $x$ is classified into class $C_i$ if $bel(C_i) = \max_{j=1,\ldots,M} bel(C_j) = bel(C_{R_{j_1,\ldots,j_D}})$, given Equation (3). Finally we can classify $x$ into a class according to the decision rule

$$E(x) = \begin{cases} R_{j_1,\ldots,j_D} & \text{if } T_{j_1,\ldots,j_D} > 0 \text{ and } bel(C_{R_{j_1,\ldots,j_D}}) \ge \alpha \\ M+1 & \text{otherwise} \end{cases}$$

where $0 \le \alpha \le 1$ is a threshold.

2.4 Dempster-Shafer Ensemble Methods

The use of Bayes rule to combine the decisions of several classifiers may be inappropriate in some cases. Bayes rule requires the belief measures to behave as probabilities [25], and for the decisions of experts this requirement is often impossible to satisfy. For example, according to Bayes rule, given $\varepsilon_r^{(k)}$ and $\varepsilon_s^{(k)}$, the axioms of probability require $\varepsilon_r^{(k)} + \varepsilon_s^{(k)} = 1$, but this is not the case when $e_k$ rejects a sample $x$. In this case we cannot say that $e_k$ was correct in its decision, but neither can we say that it assigned the wrong label to $x$, since it did not assign any label. Here it would be reasonable to say that both $bel(x \in C_i)$ and $bel(x \notin C_i)$ have value 0.

The Dempster-Shafer calculus is a system for manipulating degrees of belief that is more general than the Bayesian approach and does not require the assumption that $\varepsilon_r^{(k)} + \varepsilon_s^{(k)} = 1$. [31] introduces a method to combine multiple classifiers by adapting Dempster-Shafer evidence theory.

In using Dempster-Shafer theory to combine the decisions of several classifiers, the exclusive and exhaustive possibilities that form the frame of discernment $\theta$ are the propositions $x \in C_i$, $i = 1,\ldots,M$, denoting that an input sample $x$ comes from class $C_i$. When $D$ classifiers $e_1, e_2, \ldots, e_D$ analyze $x$, they produce $D$ items of evidence $e_k(x) = j_k$, where $j_k \in \{1,\ldots,M+1\}$.

Considering $\varepsilon_r^{(k)}$ and $\varepsilon_s^{(k)}$, the recognition and substitution rates of classifier $e_k$, and $e_k(x) = j_k$ for each $k = 1,\ldots,D$, one can have uncertain beliefs that $x \in C_{j_k}$ is true with degree $\varepsilon_r^{(k)}$ and is not true with degree $\varepsilon_s^{(k)}$ when $j_k \in \{1,\ldots,M\}$. If $j_k = M+1$ (i.e. $x$ is rejected by $e_k$), no information about the $M$ propositions $x \in C_i$, $i \in \{1,\ldots,M\}$, is given. In this case full weight of support should be given to the entire frame of discernment $\theta$.

Now we can define a basic probability assignment (BPA) function $m_k$ for the evidence $e_k(x) = j_k$ in the following way:

- If $j_k = M+1$, $x$ was rejected by $e_k$, so $m_k(x \in C_i) = 0$ and $m_k(x \notin C_i) = 0$ for all $i \in \{1,\ldots,M\}$, and $m_k(\theta) = 1$, since $e_k$ says nothing about any of the propositions.

- If $j_k \in \{1,\ldots,M\}$, then there are two focal elements, $x \in C_{j_k}$ and $x \notin C_{j_k}$ (that is, $x \in \theta \setminus \{C_{j_k}\}$), with $m_k(x \in C_{j_k}) = \varepsilon_r^{(k)}$ and $m_k(x \notin C_{j_k}) = \varepsilon_s^{(k)}$. Since $e_k$ does not say anything about any other class, $m_k(\theta) = 1 - \varepsilon_r^{(k)} - \varepsilon_s^{(k)}$.

Once we obtain the $D$ BPAs $m_k$, $k = 1,\ldots,D$, Dempster's rule of combination can be used to obtain a combined $m = m_1 \oplus m_2 \oplus \ldots \oplus m_D$. We would then have the basis to calculate $Bel(C_i)$ for all $i \in \{1,\ldots,M\}$ based on the $D$ BPAs.

First, the items of evidence $e_k = M+1$ are discarded, since they have no influence on the result of the combination, leaving $D' \le D$ classifiers to be considered. We may also rule out three special cases:

- $D' = 0$. In this case all the classifiers rejected $x$, so the final decision is to reject $x$ too.

- There is one classifier $e_k$ which has a recognition rate of $\varepsilon_r^{(k)} = 1$. In this case it classifies any input correctly, so the combination is not necessary.

- There is one classifier $e_k$ which has a substitution rate of $\varepsilon_s^{(k)} = 1$. In this case it always makes the wrong decision, so it will not be used. However, if there are only two classes, the complement of the decision of $e_k$ has recognition rate $\varepsilon_r^{(k)} = 1$.

We can then concentrate on $D'$ items of evidence $e_k = j_k$, $k = 1,\ldots,D'$, with $0 < \varepsilon_r^{(k)} < 1$ and $0 \le \varepsilon_s^{(k)} < 1$.

Taking the information from the stock market example, the resulting items of evidence and their BPAs for each possible output of the experts are shown in Table 4.

Table 4: BPAs ($m_k(\mathrm{Up})$, $m_k(\neg\mathrm{Up})$, $m_k(\theta)$ for $k = 1, 2$) of each possible outcome of the experts' decisions in the stock example, given that $\varepsilon_r^{(1)} = 0.8$, $\varepsilon_s^{(1)} = 0.1$, $\varepsilon_r^{(2)} = 0.6$ and $\varepsilon_s^{(2)} = 0.33$.

To obtain the combined BPA we need to calculate $m = m_1 \oplus m_2 \oplus \ldots \oplus m_D$, but the computational cost of obtaining $m$ increases exponentially with $M$, the number of classes, since we have to consider all possible subsets of $\theta$ and $|\theta| = M$. However, in this particular problem the BPA of each item of evidence has only two focal elements, so the cost can be reduced by using the following method, proposed in [31], consisting of two main steps.

In the first step, the items of evidence are collected into $D_1$ groups, $E_1, E_2, \ldots, E_{D_1}$, with the classifiers supporting the same proposition placed in the same group, in such a way that $e_k(x) = j_k$ belongs to $E_{k'}$ if $j_k = j_{k'}$. The BPAs $m_k$ in each group are then combined recursively as follows:

$$m^1(x \in C_{j_k}) = m_{k_1}(x \in C_{j_k})$$

$$m^1(x \notin C_{j_k}) = m_{k_1}(x \notin C_{j_k})$$

Then, for $r = 2,\ldots,p$, where $p$ is the number of items of evidence in the group,

$$k^r = \frac{1}{1 - m^{r-1}(x \in C_{j_k})\, m_{k_r}(x \notin C_{j_k}) - m^{r-1}(x \notin C_{j_k})\, m_{k_r}(x \in C_{j_k})}$$

$$m^r(\theta) = k^r\, m^{r-1}(\theta)\, m_{k_r}(\theta)$$

$$m^r(x \in C_{j_k}) = k^r \left( m^{r-1}(x \in C_{j_k})\, [\, m_{k_r}(\theta) + m_{k_r}(x \in C_{j_k})\,] + m_{k_r}(x \in C_{j_k})\, m^{r-1}(\theta) \right)$$

$$m^r(x \notin C_{j_k}) = k^r \left( m^{r-1}(x \notin C_{j_k})\, [\, m_{k_r}(\theta) + m_{k_r}(x \notin C_{j_k})\,] + m_{k_r}(x \notin C_{j_k})\, m^{r-1}(\theta) \right)$$

As a result we obtain $D_1$ combined classifiers $E_1, \ldots, E_{D_1}$ with their corresponding items of evidence $E_k(x) = j_k$, $k = 1,\ldots,D_1$ ($j_1 \ne j_2 \ne \ldots \ne j_{D_1}$), with recognition rate $\varepsilon_r^{(k)} = m^k(x \in C_{j_k})$ and substitution rate $\varepsilon_s^{(k)} = m^k(x \notin C_{j_k})$ (see Table 5 for an example).

Table 5: Combined BPAs: the grouped items of evidence $E_1$, $E_2$ and their recognition and substitution rates for each pair of expert decisions in the stock example.

The second step combines these new BPAs $m^k$, $k = 1,\ldots,D_1$, into a final combined BPA $m = m^1 \oplus m^2 \oplus \ldots \oplus m^{D_1}$ and then calculates the corresponding $Bel(C_i)$ for all $i \in \{1,\ldots,M\}$. Since each of these new BPAs $m^k$, $k = 1,\ldots,D_1$, supports either one proposition ($x \in C_i$, $i = 1,\ldots,M$) or its negation ($x \notin C_i$, $i = 1,\ldots,M$), we may use the formulae developed in [4] to obtain the required beliefs.

Considering that in our case $m_k(\theta) = 1$ for $k = D_1+1,\ldots,M$, from formulae (6)-(8) in [4] we obtain

$$A = \sum_{i=1}^{D_1} \frac{m^i(x \in C_i)}{1 - m^i(x \in C_i)}, \qquad
B = \prod_{i=1}^{D_1} \left(1 - m^i(x \in C_i)\right), \qquad
C = \prod_{i=1}^{D_1} m^i(x \notin C_i)$$

$$k^{-1} = \begin{cases} (1+A)\,B - C & \text{if } D_1 = M \\ (1+A)\,B & \text{if } D_1 < M \end{cases}$$

For $Bel(C_{j_k})$ four different cases may arise:

$$Bel(C_{j_k}) = \begin{cases}
k \left( \dfrac{B\, m^k(x \in C_{j_k})}{1 - m^k(x \in C_{j_k})} + \dfrac{C\, m^k(\theta)}{m^k(x \notin C_{j_k})} \right) & \text{if } D_1 = M \\[2ex]
k\, \dfrac{B\, m^k(x \in C_{j_k})}{1 - m^k(x \in C_{j_k})} & \text{if } D_1 < M \text{ and } j_k \le D_1 \\[2ex]
k\, C & \text{if } D_1 = M-1 \text{ and } j_k = M \\[1ex]
0 & \text{otherwise}
\end{cases}$$

Table 6 shows the beliefs obtained for each possible decision outcome of the two experts in the stock example. Finally, depending on these values of $Bel(\cdot)$, $x$ is classified according to the decision rule

$$E(x) = \begin{cases} i & \text{if } Bel(C_i) = \max_{j=1,\ldots,M} Bel(C_j) \text{ and } Bel(C_i) \ge \alpha \\ M+1 & \text{otherwise} \end{cases}$$

where $0 \le \alpha \le 1$ is a threshold.
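The sketch below (ours, not the authors' code) implements Dempster's rule for the two-focal-element BPAs used in this section, i.e. the recursive grouping step described above. The tuple layout (support, against, theta) and the function names are our own conventions; the rates in the example are the ones quoted for the stock experts in Table 4.

```python
def combine_pair(m1, m2):
    """Dempster's rule for two BPAs over the same proposition A, each given as
    a tuple (support, against, theta) = (m(A), m(not A), m(theta))."""
    s1, a1, t1 = m1
    s2, a2, t2 = m2
    conflict = s1 * a2 + a1 * s2
    k = 1.0 / (1.0 - conflict)          # normalization constant
    support = k * (s1 * (t2 + s2) + s2 * t1)
    against = k * (a1 * (t2 + a2) + a2 * t1)
    theta = k * t1 * t2
    return (support, against, theta)

def combine_group(bpas):
    """Recursively combine all BPAs that support the same class (one group)."""
    combined = bpas[0]
    for m in bpas[1:]:
        combined = combine_pair(combined, m)
    return combined

# Two classifiers that both say "Up", with the rates used in the stock example:
# m_k = (recognition rate, substitution rate, remainder assigned to theta).
m1 = (0.8, 0.1, 0.1)
m2 = (0.6, 0.33, 0.07)
print(combine_group([m1, m2]))  # combined support for "Up", against it, and theta
```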

 e_1(x)     e_2(x)      E(x)
 Up         Up          Up
 Up         Down        Up
 Up         Unknown     Up
 Down       Up          Down
 Down       Down        Down
 Down       Unknown     Down
 Unknown    Up          Up
 Unknown    Down        Down
 Unknown    Unknown     Unknown  (Bel(Up) = Bel(Down) = 0)

Table 6: Combined beliefs and the resulting ensemble decision for each pair of expert decisions in the stock example.

2.5 Logistic Regression Ensembles

Another method that assigns weights to each classifier is combination using logistic regression [10].

Consider $Y_i$, a binary variable that has value 1 when $x \in C_i$ and 0 otherwise. The goal of recognition is to predict the value of $Y_i$ for all $i = 1,\ldots,M$. Since $Y_i$ is a binary value, the combination problem may be reformulated in the context of logistic regression analysis, where $Y_i$ depends on the value of the explanatory variable $z_i = (z_{i1}, z_{i2}, \ldots, z_{iD})$, where each $z_{ik} = 1$ if $e_k(x) = i$ and zero otherwise. Denoting $P(x \in C_i \mid z_i)$ by $\pi(z_i)$ and using the logistic regression function,

$$\pi(z_i) = \frac{\exp(\alpha_i + \beta_{i1} z_{i1} + \ldots + \beta_{iD} z_{iD})}{1 + \exp(\alpha_i + \beta_{i1} z_{i1} + \ldots + \beta_{iD} z_{iD})}$$

and

$$\log\left( \frac{\pi(z_i)}{1 - \pi(z_i)} \right) = \alpha_i + \beta_{i1} z_{i1} + \ldots + \beta_{iD} z_{iD}$$

where $\alpha_i, \beta_{i1}, \ldots, \beta_{iD}$ are constant parameters and $\log\left( \frac{\pi(z_i)}{1 - \pi(z_i)} \right)$ is the logit. Since the logit is linearly dependent on $z_i$, the values of the parameters $\alpha_i, \beta_{i1}, \ldots, \beta_{iD}$ can be estimated through weighted least squares [1].

To determine a good estimator for the true logit [1], assume there are $n_{ij}$ observations at the $j$th setting of $z_i$, and $y_{ij}$ is the number of examples with $Y_i = 1$ obtained in this setting. The $j$th sample logit is then $\log[y_{ij}/(n_{ij} - y_{ij})]$.

Since this value is not defined when $y_{ij}$ or $n_{ij} - y_{ij}$ are zero, the logit is adjusted to the value

$$\log\left( \frac{y_{ij} + \frac{1}{2}}{n_{ij} - y_{ij} + \frac{1}{2}} \right)$$

Considering the stock value example described earlier, the logit values for each class are shown in Table 7.

Table 7: Logit example: the sample logits for each class at each setting $(z_{i1}, z_{i2})$ of the explanatory variables in the stock example.

For each test pattern $x$, the value of the $M$ vectors $z_i$ is obtained from the events $e_k(x) = j_k$, and then $x$ is classified into a class according to the decision rule

$$E(x) = \begin{cases} i & \text{if } \mathrm{logit}(z_i) = \max_{j=1,\ldots,M} \mathrm{logit}(z_j) \text{ and } \pi(z_i) \ge \alpha \\ M+1 & \text{otherwise} \end{cases}$$

where $0 \le \alpha \le 1$ is a threshold. When the number of classes is large or there is not enough data available, a simpler model can be constructed by using only one response variable $Y$ for all the classes.
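As an illustration of a logistic-regression ensemble over the indicator variables $z_{ik}$, here is a sketch of ours; it fits by maximum likelihood with scikit-learn's LogisticRegression rather than the weighted least-squares fit on adjusted logits described in the text, and all function names and the toy data are ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_lr_ensemble(decisions, true_classes, n_classes):
    """Fit one logistic model per class i on the indicator features
    z_ik = 1 if classifier k voted for class i, else 0."""
    decisions = np.asarray(decisions)            # shape (n_samples, D)
    true_classes = np.asarray(true_classes)
    models = []
    for i in range(n_classes):
        z = (decisions == i).astype(float)       # explanatory variables z_i
        y = (true_classes == i).astype(int)      # response Y_i
        models.append(LogisticRegression().fit(z, y))
    return models

def lr_classify(models, decisions, alpha=0.5):
    """Pick the class whose model gives the highest probability pi(z_i),
    rejecting (returning len(models)) if no probability reaches alpha."""
    decisions = np.asarray(decisions)
    probs = [m.predict_proba((decisions == i).astype(float).reshape(1, -1))[0, 1]
             for i, m in enumerate(models)]
    best = int(np.argmax(probs))
    return best if probs[best] >= alpha else len(models)

# Toy usage: 3 classifiers, 2 classes, a handful of training decisions.
train_decisions = [[0, 0, 1], [0, 0, 0], [1, 1, 1], [1, 0, 1], [0, 1, 1], [1, 1, 0]]
train_classes   = [0, 0, 1, 1, 1, 1]
models = fit_lr_ensemble(train_decisions, train_classes, n_classes=2)
print(lr_classify(models, [0, 0, 1]))
```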

3 Analysis of Ensemble Methods

In the last section we described several methods of creating ensembles of classifiers. These methods have all been used in applications, but the empirical evidence of their effectiveness is mixed. In many studies they have improved the accuracy obtained by any single classifier [3, 10, 11, 28, 31]. At the same time, in other analytical studies no improvement over the single classifiers was seen [9, 16]. Characteristics of the data to be combined determine the degree of improvement obtained by the combination. Since majority voting, unlike all our other combination methods, is not based on the performance of the members of the ensemble, we analyze the performance of majority voting separately.

3.1 Majority Voting

When multiple classifiers are combined using majority vote, we expect to obtain good results based on the belief that the majority of experts are more likely to be correct in their decision when they agree in their opinion. So if the decisions of $D$ classifiers are combined, and more than half of them decide that observation $x$ belongs to class $C_i$, the ensemble decides that $x \in C_i$.

Even though common sense leads us to believe that combining experts' decisions in this way will improve the correctness of the integrated result, this is not always true. Hansen and Salamon [9] prove that if all of the classifiers being combined have an error rate of less than 50%, we may expect the accuracy of the ensemble to improve when we add more experts to the combination. However, their proof is restricted to the situation in which all of the classifiers perform independently and have the same error rate. This conclusion depends on the fact that if all $D$ single classifiers have a certain likelihood $(1-p)$ of arriving at the correct classification, then the likelihood of the majority rule being in error is

$$\sum_{k > D/2}^{D} \binom{D}{k} p^k (1-p)^{D-k}$$

However, it is not common to obtain the same error rate for several different classifiers.
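A short sketch of this estimate, assuming independent classifiers with a common error rate p and an odd number of voters so that ties cannot occur (the function name is ours):

```python
from math import comb

def majority_error(p, D):
    """Probability that a majority of D independent classifiers, each with
    error rate p, is wrong (D assumed odd so there are no ties)."""
    return sum(comb(D, k) * p**k * (1 - p)**(D - k)
               for k in range(D // 2 + 1, D + 1))

# With individual error rate 0.3, the majority of 3 classifiers errs less often,
# and adding more classifiers keeps driving the error down.
print(majority_error(0.3, 3))   # ~0.216
print(majority_error(0.3, 5))   # ~0.163
```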

Hansen and Salamon explain that the consensus error rate may be less than the individual error rate. This will occur when the minimum error rate $p_1$ obtained by any of the classifiers in an ensemble of three classifiers meets the following criterion:

$$p_1 > p_1 p_2 p_3 + (1-p_1) p_2 p_3 + p_1 (1-p_2) p_3 + p_1 p_2 (1-p_3) \qquad (4)$$

where $p_1$, $p_2$ and $p_3$ are the error rates of the three classifiers. Even though this condition may explain when the combination does not lower the error rate obtained by the best classifier, it does not predict the accuracy that can be achieved by the ensemble. In fact, it will generally estimate the accuracy of the combination to be higher than the one actually obtained, since it expects the classifiers to perform independently when, in fact, they do not.

One of the factors affecting the estimate obtained in Equation (4) is the fact that the classifiers make errors in a correlated manner. Ali and Pazzani [2] show that the degree of error reduction obtained by an ensemble of classifiers that uses majority voting is related to the degree to which the members of the ensemble make errors in an uncorrelated manner. The error correlation, $\phi$, is defined as the probability that two different classifiers $e_i$, $e_k$ make the same kind of error, and it can be measured by

$$\phi = \frac{1}{D(D-1)} \sum_{i=1}^{D} \sum_{k=i+1}^{D} P(e_i(x) = e_k(x) = j,\ x \notin C_j) \qquad (5)$$

where the value of $\phi$ is higher when the members of the ensemble make more correlated errors. A limitation of $\phi$ is that it is a pairwise measure and is restricted to ensembles that make errors when two of their members make errors; i.e., considering majority voting, $\phi$ is restricted to ensembles of three experts.

More specific is the work of Matan [15], who establishes lower and upper bounds for an ensemble of classifiers, showing that majority voting might, under certain conditions, perform worse than any of the members of the ensemble. For example, consider Table 8. The classifiers have recognition rates of 60%, 60% and 50% respectively. However, the majority voting ensemble obtains only 40% accuracy. Nevertheless, if the decisions of the three classifiers, with the same accuracies, are distributed as in Table 9, the ensemble has a recognition rate of 80%.
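The error correlation of Equation (5) can be estimated from held-out decisions roughly as follows (our sketch; the function name and the toy data are ours):

```python
import numpy as np

def error_correlation(decisions, true_classes):
    """Estimate phi of Equation (5): the fraction of samples on which a pair of
    classifiers agrees on the same wrong label, accumulated over all pairs and
    scaled by the 1/(D(D-1)) factor used in the text.
    decisions has shape (n_samples, D)."""
    decisions = np.asarray(decisions)
    true_classes = np.asarray(true_classes)
    n_samples, D = decisions.shape
    total = 0.0
    for i in range(D):
        for k in range(i + 1, D):
            same_wrong = np.mean((decisions[:, i] == decisions[:, k]) &
                                 (decisions[:, i] != true_classes))
            total += same_wrong
    return total / (D * (D - 1))

# Three classifiers, four samples; the first two classifiers share one error.
decisions = np.array([[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 1]])
true_classes = np.array([0, 0, 0, 1])
print(error_correlation(decisions, true_classes))
```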

Table 8: Majority vote example where the ensemble performs worse than any of its members: the number of samples of each true class ($C_1$, $C_2$) for every combination of the three experts' decisions, together with the majority decision.

Table 9: Majority vote example where the ensemble performs better than any of its members: the same three recognition rates as in Table 8, but with the decisions distributed differently.

So, in [15], the accuracy of a majority voting ensemble is bounded by the following theorem. Given an ensemble $E$ of $D$ classifiers $\{e_1, e_2, \ldots, e_D\}$ with respective performances $g_1 \le g_2 \le \ldots \le g_D$, the performance of a $k$-of-$D$ majority voting classifier under the same distribution is bounded by

$$\begin{aligned}
\max(E) &= \min\left(1,\ \Sigma(k),\ \Sigma(k-1),\ \ldots,\ \Sigma(2),\ \Sigma(1)\right) \\
\min(E) &= \max\left(0,\ \xi(k),\ \xi(k-1),\ \ldots,\ \xi(2),\ \xi(1)\right)
\end{aligned} \qquad (6)$$

where

$$\Sigma(m) = \frac{1}{m} \sum_{i=1}^{D-k+m} g_i \qquad m = 1,\ldots,k$$

$$\xi(m) = \frac{1}{m} \sum_{i=k-m+1}^{D} g_i - \frac{D-k}{m} \qquad m = 1,\ldots,k$$
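A small sketch of these bounds (ours), assuming the individual accuracies are given in ascending order as in the theorem:

```python
def matan_bounds(accuracies, k):
    """Upper and lower bounds of Equation (6) for a k-of-D voting ensemble.

    accuracies: individual recognition rates g_1 <= g_2 <= ... <= g_D.
    Returns (min_E, max_E)."""
    g = sorted(accuracies)
    D = len(g)
    sigma = [sum(g[:D - k + m]) / m for m in range(1, k + 1)]
    xi = [(sum(g[k - m:]) - (D - k)) / m for m in range(1, k + 1)]
    return max(0.0, *xi), min(1.0, *sigma)

# The recognition rates of Tables 8 and 9 (50%, 60%, 60%) under 2-of-3 voting.
print(matan_bounds([0.5, 0.6, 0.6], k=2))
# -> (0.35, 0.85): the 40% ensemble of Table 8 and the 80% ensemble of Table 9
#    both fall inside this range.
```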

3.2 Ensembles Based on the Performance of their Members

The majority voting combination does not consider that some classifiers of the ensemble may be more accurate than others; it treats the opinion of each expert as one vote, regardless of whether that classifier is almost always wrong or right. In an attempt to improve on majority voting, we explored other ways of combining different classifiers that consider individual classifier performance. These methods, based on Bayes theory, behavior-knowledge space, the Dempster-Shafer calculus and logistic regression, have different characteristics and rely on different assumptions, so each may combine some kinds of data better than others. We will also show how to estimate an upper bound for the performance of any ensemble, regardless of the combination method used.

3.2.1 Independence

Combinations based on Bayes theorem and Dempster-Shafer theory assume independence in the decisions of the members of the ensemble. In [31] it was proposed that this independence can be achieved if the experts are trained over different data sets. However, considering that all the experts are modeling the same function that defines the real class of each observation, it is unlikely that they behave independently as required by these ensemble methods. Does this mean that these combination methods cannot be used?

To answer this question we must consider why independence is required. For an ensemble based on Bayes theory, the assumption of independence in the decisions of the classifiers allows the calculation of the value of $P(e_1(x) = j_1, \ldots, e_D(x) = j_D \mid x \in C_i)$, $i = 1,\ldots,M$, from the probabilities $P(e_k(x) = j_k \mid x \in C_i)$, $k = 1,\ldots,D$, by

$$P(e_1(x) = j_1, \ldots, e_D(x) = j_D \mid x \in C_i) = \prod_{k=1}^{D} P(e_k(x) = j_k \mid x \in C_i).$$

These probabilities are then used to obtain the class of an observation $x$ by choosing the class with maximum belief given the events $e_k(x) = j_k$, $k = 1,\ldots,D$. The belief in $x \in C_i$, $Bel(C_i)$, $i = 1,\ldots,M$, is obtained by

$$Bel(C_i) = \frac{\prod_{k=1}^{D} P(x \in C_i \mid e_k(x) = j_k)}{\sum_{l=1}^{M} \prod_{k=1}^{D} P(x \in C_l \mid e_k(x) = j_k)} \qquad (7)$$

and this is used to estimate

$$Bel(C_i) = P(x \in C_i \mid e_1(x) = j_1, e_2(x) = j_2, \ldots, e_D(x) = j_D). \qquad (8)$$

Since the result of the ensemble depends only on the class with the highest belief, we obtain the same results from the ensemble using Equation (7) or Equation (8) if, for all the combinations of events $e_k(x) = j_k$, $k = 1,\ldots,D$ and $j_k = 1,\ldots,M+1$,

$$\arg\max_{i=1,\ldots,M} \left( \frac{\prod_{k=1}^{D} P(x \in C_i \mid e_k(x) = j_k)}{\sum_{l=1}^{M} \prod_{k=1}^{D} P(x \in C_l \mid e_k(x) = j_k)} \right) = \arg\max_{i=1,\ldots,M} P(x \in C_i \mid e_1(x) = j_1, \ldots, e_D(x) = j_D) \qquad (9)$$

So, even though Equation (9) is guaranteed to hold only when the decisions of the members of the ensemble are independent, the fact that they are not does not prevent the ensemble from performing equally well in both cases.

A similar argument may be used for ensembles based on Dempster-Shafer theory.

Ensembles using behavior-knowledge space (BKS) do not assume that the decisions of the classifiers are independent; when the decisions are independent, this method and the one based on Bayes theory should obtain the same performance, since they are based on the same principle, as was shown in Section 2.3. When the BKS method is used with $D$ experts that decide between $M$ classes, there exist $(M+1)^D$ possible combinations of decisions to be taken into account. We also require enough samples from each possible combination of decisions to obtain a representative number of observations. For example, consider an ensemble of five experts and three classes using 10 samples per combination. This will require at least $10 \cdot (3+1)^5 = 10{,}240$ observations to train it. So if the BKS ensemble method is used to combine the decisions of more than a few experts, or to decide between more than two or three classes, it requires large training sets that may not be available in many cases.

3.2.2 Maximum Accuracy of an Ensemble

The sections above explain when each kind of combination scheme is likely to perform better than others. We have seen that when there is not enough data available for training it is better to use ensembles based on Bayesian theory than BKS, and that the combination using Dempster-Shafer calculus is better suited than other combination techniques to manipulate uncertain decisions (class $M+1$) of the individual classifiers.

The above analysis could lead us to believe that even when some ensemble methods might not obtain better accuracies than the single classifiers, careful selection of the combination method, or creation of other combination methods, will always allow us to find a way to improve the performance obtained by any single expert. There is, however, a limit on the accuracy ensembles can achieve regardless of the combination method used.

Consider, for example, three experts $e_1$, $e_2$ and $e_3$ that assign one of two classes, $C_1$ and $C_2$, to a sample. As an example, the number of observations of each class corresponding to all possible combinations of decisions of the three classifiers is shown in Table 10. Consider the number of observations of each class when $e_1(x) = 2$, $e_2(x) = 1$ and $e_3(x) = 1$. The table shows that there are 13 samples of class 1 and 25 samples of class 2 that have been classified with this combination of decisions.

Table 10: Number of samples of each true class ($C_1$, $C_2$) for every combination of decisions of the three experts $e_1$, $e_2$, $e_3$.

Since the only data available to the ensemble is the class assigned to the observation by each one of the classifiers, the ensemble will always classify at least 13 of these observations incorrectly, regardless of the ensemble method used. Notice that this situation also occurs with other combinations of results. For example, when $e_1(x) = 1$, $e_2(x) = 2$ and $e_3(x) = 1$, it is impossible to decide whether the sample $x$ belongs to class 1 or 2, since the probability of $x$ belonging to either class is the same.

From these examples we can observe that an ensemble of classifiers can obtain better accuracies as long as most of the samples for each combination of decisions of the experts belong to the same class. So, if we can measure the concentration of samples $x$ that have been classified as $e_1(x) = j_1, \ldots, e_D(x) = j_D$ and belong to a specific class, we can obtain a measure of the expected accuracy of the ensemble.

Let us define $n_{j_1 \ldots j_D i}$ as the number of samples of class $C_i$ that have been classified as $e_k(x) = j_k$ by each expert $k = 1,\ldots,D$, and $n$ as the total number of samples. The relative frequency of the number of samples $x$ such that $e_1(x) = j_1, \ldots, e_D(x) = j_D$ and $x \in C_i$ is

$$\pi_{j_1 \ldots j_D i} = \frac{n_{j_1 \ldots j_D i}}{n}$$

and

$$\pi_{+\ldots+i} = \sum_{j_1=1}^{M+1} \cdots \sum_{j_D=1}^{M+1} \frac{n_{j_1 \ldots j_D i}}{n}, \qquad
\pi_{j_1 \ldots j_D +} = \frac{\sum_{i=1}^{M} n_{j_1 \ldots j_D i}}{n}$$

Concentration can then be measured using Goodman and Kruskal's tau, $\tau$, which describes the proportional reduction in the probability of an incorrect guess of the class of an observation given the decisions of the experts [1]. The measure $\tau$ is defined as

$$\tau = \frac{\sum_{j_1=1}^{M+1} \cdots \sum_{j_D=1}^{M+1} \dfrac{\sum_{i=1}^{M} (\pi_{j_1 \ldots j_D i})^2}{\pi_{j_1 \ldots j_D +}} \;-\; \sum_{i=1}^{M} (\pi_{+\ldots+i})^2}{1 - \sum_{i=1}^{M} (\pi_{+\ldots+i})^2}$$

A large value of $\tau$ represents a strong association between the decisions of the experts and the actual class of the observation. This measure, like any other measure of concentration or uncertainty, indicates when the combined decisions of the single classifiers carry enough information to decide the class of a sample. A difficulty with this measure is determining how strong a "strong" association is. This difficulty becomes more pronounced as the number of classes grows, since in that case the value of $\tau$ will tend to 0 even when there is a strong association. The concentration measure can show which of two data sets having the same number of classes is better suited to produce an accurate ensemble of experts, but it cannot estimate how accurate this ensemble can be.

Let $\theta$ be the set of all possible classes $\{1,\ldots,M\}$ to which an observation can belong. To obtain the maximum accuracy achievable by any ensemble of $D$ experts, we define the function $G : (\theta \cup \{M+1\})^D \rightarrow \theta$. Given the decisions of $D$ classifiers regarding the class of an observation $x$, $e_1(x) = j_1, \ldots, e_D(x) = j_D$, $G$ returns the class with the most observations for that combination of decisions. In the example in Table 10, $G(2,1,1) = 1$. Given $G$, we can then define the maximum accuracy that can be obtained by an ensemble as

$$\frac{\sum_{j_1=1}^{M+1} \cdots \sum_{j_D=1}^{M+1} n_{j_1 \ldots j_D\, G(j_1 \ldots j_D)}}{N}$$

where $n_{j_1 \ldots j_D\, i}$ represents the number of samples of class $C_i$ that have been classified as $e_k(x) = j_k$ by each expert $k = 1,\ldots,D$, and $N$ is the total number of samples considered. In the example in Table 10 we find that, even using the best combination method possible, the best recognition rate we may obtain is 80%.
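Both the concentration measure and the maximum-accuracy bound can be computed directly from the cell counts $n_{j_1 \ldots j_D i}$. The sketch below is our illustration with toy counts, not the data of Table 10, and the function names are ours.

```python
import numpy as np

def max_ensemble_accuracy(counts):
    """Upper bound on ensemble accuracy: for every combination of classifier
    decisions, no ensemble can do better than always predicting the majority
    class of that cell.  counts maps a decision tuple to per-class counts."""
    total = sum(c.sum() for c in counts.values())
    recoverable = sum(c.max() for c in counts.values())
    return recoverable / total

def goodman_kruskal_tau(counts):
    """Goodman and Kruskal's tau: proportional reduction in the probability of
    guessing the class incorrectly once the classifiers' decisions are known."""
    total = sum(c.sum() for c in counts.values())
    class_marginal = sum(counts.values()) / total            # pi_{+...+i}
    baseline = 1.0 - np.sum(class_marginal ** 2)
    conditional = sum(np.sum((c / total) ** 2) / (c.sum() / total)
                      for c in counts.values())
    return (conditional - np.sum(class_marginal ** 2)) / baseline

# Cell counts n_{j1 j2 j3, i}: decision tuple -> samples of class 1 and class 2.
counts = {(1, 1, 1): np.array([40, 2]),
          (2, 1, 1): np.array([13, 25]),
          (2, 2, 2): np.array([3, 17])}
print(max_ensemble_accuracy(counts))  # fraction of samples any ensemble can get right
print(goodman_kruskal_tau(counts))
```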

3.2.3 Reliability

Usually only two measures are used to describe the performance of a classifier: its accuracy or recognition rate, which measures the probability that the output label of an expert for a sample $x$ represents the actual class of $x$, and its substitution rate, which measures the probability of the expert making a mistake.

However, a classifier may also reject a sample, i.e., the expert $e_k$ might not be able to classify $x$ into any class because there is not enough evidence of $x$ belonging to a particular class. Rejected observations cannot be considered in the recognition or substitution rates. When some of the samples may be rejected, the recognition rate fails to provide a good description of the performance of the classifier because it does not take the rejection rate into consideration. In this case the reliability gives a better description of the accuracy of a classifier than the recognition rate. Reliability is defined as

$$\mathrm{Reliability} = \frac{\mathrm{Recognition}}{1 - \mathrm{Rejection}}$$

and represents the proportion of samples not rejected by the expert $e_k$ that have been classified correctly.

Consider the example in Table 10. Earlier we mentioned that when $e_1(x) = 1$, $e_2(x) = 2$ and $e_3(x) = 1$, the decision of whether $x \in C_1$ or $x \in C_2$ was not an easy one. In fact, the probability of making an error is the same for either class chosen; for this combination of evidence at least half of the samples are classified incorrectly by any ensemble. In this case, the ensemble can benefit from rejecting all the observations in this situation, leading to an increase in the ensemble's reliability over the reliability of the single classifiers. An increase in reliability means that whenever the ensemble decides to classify a sample $x$ as belonging to class $C_i$ instead of rejecting it, the probability of $x \in C_i$ is higher than when no samples are rejected.

4 Experiments

We carried out several experiments to test the predictive value of the estimates of ensemble performance proposed in Section 3, as well as the expected behavior of the ensemble methods with respect to their reliability and the independence of their decisions. Four data sets were taken from the UCI repository [18]. These data sets were used to train three different classifiers, whose results were then combined to obtain the integrated classifier.

4.1 Data

The data sets were chosen on the basis of their number of observations, number of classes, noise, and missing features. In general we chose data sets that had more than one hundred observations per class. We wanted the groups chosen to be a mixture of noisy/not noisy, with/without missing features. Based on these parameters we chose Tic-Tac-Toe Endgame (TTT), Wisconsin Breast Cancer (WBC), Pima Indians Diabetes (PIMA), and Credit Screening (CRX). All of these data sets have only two classes.

Each data set was partitioned into four subsets, $G_1$, $G_2$, $G_3$ and $G_4$, such that their union forms the complete data set and their pairwise intersections are empty. The single classifiers were trained using one of the groups, and the accuracy of the resulting classifier was measured using the examples in the other three groups. The performance of majority voting was tested using the three partitions not used to train the classifiers. The behavior of the ensembles requiring knowledge of the performance of their members was evaluated using one of the remaining groups to train the ensemble and the other two to test it. Training the ensemble methods involved obtaining:

- Bayes: the confusion matrices for each of the members of the ensemble;

- BKS: the Behavior-Knowledge Space;

- Dempster-Shafer: the recognition, substitution, and rejection rates for each individual classifier;

- Logistic Regression: the constant parameters of the logit function for each class.

This procedure was repeated for all permutations of the four groups in each data set to train the single classifiers and the combination method, and to test the ensembles. Thus, except for majority voting, the ensembles were tested in twelve different ways for each data set, as shown in Table 11.

       Train                  Test
 #     Single   Ensemble      Ensemble
 1     G1       G2            G3 G4
 2     G1       G3            G2 G4
 3     G1       G4            G2 G3
 4     G2       G1            G3 G4
 5     G2       G3            G1 G4
 6     G2       G4            G1 G3
 7     G3       G1            G2 G4
 8     G3       G2            G1 G4
 9     G3       G4            G1 G2
10     G4       G1            G2 G3
11     G4       G2            G1 G3
12     G4       G3            G1 G2

Table 11: Partition of the data sets for the 12 experiments.

4.1.1 Classifiers

The ensemble of classifiers requires, as data, the decisions of several experts with respect to a given observation. For our experiments we used three different classifiers, each of them based on a particular machine learning technique:

- Expert $e_1$: a Bayesian belief network, developed using the Netica API library [19]. No modification or selection of features was done except discretization of continuous data. This kind of classifier returns the probabilities that a sample belongs to each class. To obtain the result of the classifier we take the class with the greatest probability.

- Expert $e_2$: a neural network trained with backpropagation. Training and testing of the neural networks was done using the Aspirin/Migraines Neural Network Software [14]. This classifier returns a number between 0 and 1. We consider a sample as belonging to the class whose assigned number (between zero and one) is closest to the output produced for that sample. Since all the data sets used have only two classes, the numbers assigned to the classes are 0 and 1.

- Expert $e_3$: a decision tree, obtained by training and testing using C4.5 [22]. This kind of classifier returns a label representing the predicted class of the example.

4.2 Voting

Majority voting is, perhaps, one of the best studied of the ensemble methods described. In Section 3.1 we described some measures that may help to predict the performance of an ensemble that uses majority voting. One of the conditions for the improvement of accuracy through majority voting is the one proposed by Hansen and Salamon [9]. They established that, given $(1-p_i)$ the probability of expert $e_i$ predicting the class of an example correctly, and assuming expert $e_1$ is the member of the ensemble with the lowest error rate, the ensemble can have a lower error rate than its members if

$$p_1 > p_1 p_2 p_3 + (1-p_1) p_2 p_3 + p_1 (1-p_2) p_3 + p_1 p_2 (1-p_3) = \hat{p}_E \qquad (10)$$

However, as we can see in Table 12, even though $\hat{p}_E$ is always less than the error rate of every single classifier, the ensemble does not perform better than the best of its members, except for the credit screening data set. This is due to the correlation of errors between classifiers, as we will show next.

Consider the error ratio, the relation between the error of the ensemble and the error rate $p_1$ of the best of the single classifiers:

$$\mathrm{ErrorRatio} = \frac{\mathrm{ErrorEnsemble}}{p_1}$$

                       Error Rate                               Error
        p_1      p_2      p_3      Voting    p̂_E      φ        Ratio
 CRX    16.76%   16.96%   19.08%   16.04%    8.21%    -        -
 PIMA   25.65%   34.90%   29.86%   27.21%    21.69%   -        -
 TTT    30.79%   31.42%   24.33%   28.19%    20.11%   -        -
 WBC    36.51%   27.43%   34.34%   32.47%    25.39%   -        -

Table 12: Error rate estimation for voting ensembles and error correlation.

In Table 12 we can see how the error ratio increases when the members of the ensemble make errors in a more correlated manner (higher φ): when the classifiers make the same error on an example, the ensemble will make more errors than if the members of the ensemble made different errors. Given this explanation we might expect the error correlation also to be related to the difference between the real accuracy of the ensemble and $\hat{p}_E$. However, Table 12 shows that this is not true. Consider, for example, the case where all three classifiers agree on an error, against the case where only two classifiers make the same error. Even though in both cases the error should be counted only once, the measure of error correlation (Equation 5) counts the ensemble error produced when the three experts agree as two errors, while in the other case it counts only one. Thus, depending on the distribution of the ensemble errors made by two or three classifiers, the difference between the error rate of the ensemble and the one estimated by $\hat{p}_E$ cannot be explained in relation to φ alone.

Since the actual ensemble will have an error rate higher than the one predicted by $\hat{p}_E$, this measure can be regarded as an upper bound on the accuracy of a majority voting ensemble. However, this estimate of the error of the ensemble is only defined for three classifiers. A more general voting method is $k$-of-$D$ voting, a voting ensemble $E$ where $E(x) = j$ if at least $k$ of the $D$ classifiers of the ensemble select $j$ as the class of $x$. In [15], Matan proposed Equation (6) as a way to obtain upper and lower bounds for the accuracy of a $k$-of-$D$ majority voting ensemble.

Considering that in our experiments we are using three classifiers and 2-of-3 majority ensembles, the upper and lower bounds of the ensembles are obtained by

33 max(e) = min (1, 1 ) 2 (g 1 + g 2 + g 3 ),g 1 +g 2 ( min(e) = max 0, 1 ) 2 (g 1 + g 2 + g 3 1),g 2 +g 3 1 (11) given that g i,i=1,...,d is the accuracy of expert e i. Rec. rate E min(e) max(e) E B(E) E Voting min max min max min max CRX 83.96% 1.32% 6.54% 13.35% 18.18% 0.58% 2.51% PIMA 72.79% 19.79% 21.53% 24.65% 30.90% 0.00% 4.51% TTT 71.81% 15.83% 19.82% 26.11% 30.36% 0.28% 7.93% WBC 67.53% 11.04% 28.30% 24.32% 32.55% 1.35% 18.79% Table 13: Upper and Lower bounds for the accuracy of the voting ensemble Since four experiments were done for each data set, in Table 13 we show the minimum (min) and maximum (max) differences between the actual accuracy of the ensemble E and the bounds obtained by these equations in all the experiments over the same data. As we can see in this table, there is always a big difference between max(e) and the accuracy of the ensemble. In Section we defined a new estimator for a tighter upper bound in the accuracy that can be obtained by any of the considered ensemble methods. This estimator, for three classifiers and two classes is defined as: 3j1 3j2 3j3 =1 =1 =1 B(E) = n j 1 j 2 j 3 G(j 1 j 2 j 3 ) (12) N Figure 1 shows that this new estimator produces upper bounds closer to the real accuracy of the ensemble. Observe that for almost all the experiments the upper bound proposed by Matan obtains a maximum accuracy of 1, opposed to B(E) that represented a tighter bound that could predict the actual accuracy of the ensemble in some cases. Examples of this are the results obtained for the Pima Indians diabetes data sets, where both, the accuracy of the ensemble and B(E) are the same in one of the experiments, and 33


More information

From inductive inference to machine learning

From inductive inference to machine learning From inductive inference to machine learning ADAPTED FROM AIMA SLIDES Russel&Norvig:Artificial Intelligence: a modern approach AIMA: Inductive inference AIMA: Inductive inference 1 Outline Bayesian inferences

More information

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20.

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20. 10-601 Machine Learning, Midterm Exam: Spring 2008 Please put your name on this cover sheet If you need more room to work out your answer to a question, use the back of the page and clearly mark on the

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

Machine Learning Lecture 2

Machine Learning Lecture 2 Machine Perceptual Learning and Sensory Summer Augmented 15 Computing Many slides adapted from B. Schiele Machine Learning Lecture 2 Probability Density Estimation 16.04.2015 Bastian Leibe RWTH Aachen

More information

MATH2206 Prob Stat/20.Jan Weekly Review 1-2

MATH2206 Prob Stat/20.Jan Weekly Review 1-2 MATH2206 Prob Stat/20.Jan.2017 Weekly Review 1-2 This week I explained the idea behind the formula of the well-known statistic standard deviation so that it is clear now why it is a measure of dispersion

More information

Bayesian Concept Learning

Bayesian Concept Learning Learning from positive and negative examples Bayesian Concept Learning Chen Yu Indiana University With both positive and negative examples, it is easy to define a boundary to separate these two. Just with

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Short Note: Naive Bayes Classifiers and Permanence of Ratios

Short Note: Naive Bayes Classifiers and Permanence of Ratios Short Note: Naive Bayes Classifiers and Permanence of Ratios Julián M. Ortiz (jmo1@ualberta.ca) Department of Civil & Environmental Engineering University of Alberta Abstract The assumption of permanence

More information

A new Approach to Drawing Conclusions from Data A Rough Set Perspective

A new Approach to Drawing Conclusions from Data A Rough Set Perspective Motto: Let the data speak for themselves R.A. Fisher A new Approach to Drawing Conclusions from Data A Rough et Perspective Zdzisław Pawlak Institute for Theoretical and Applied Informatics Polish Academy

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

Lossless Online Bayesian Bagging

Lossless Online Bayesian Bagging Lossless Online Bayesian Bagging Herbert K. H. Lee ISDS Duke University Box 90251 Durham, NC 27708 herbie@isds.duke.edu Merlise A. Clyde ISDS Duke University Box 90251 Durham, NC 27708 clyde@isds.duke.edu

More information

Bayesian Reasoning. Adapted from slides by Tim Finin and Marie desjardins.

Bayesian Reasoning. Adapted from slides by Tim Finin and Marie desjardins. Bayesian Reasoning Adapted from slides by Tim Finin and Marie desjardins. 1 Outline Probability theory Bayesian inference From the joint distribution Using independence/factoring From sources of evidence

More information

Study on Classification Methods Based on Three Different Learning Criteria. Jae Kyu Suhr

Study on Classification Methods Based on Three Different Learning Criteria. Jae Kyu Suhr Study on Classification Methods Based on Three Different Learning Criteria Jae Kyu Suhr Contents Introduction Three learning criteria LSE, TER, AUC Methods based on three learning criteria LSE:, ELM TER:

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Due Thursday, September 19, in class What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Topic 3: Hypothesis Testing

Topic 3: Hypothesis Testing CS 8850: Advanced Machine Learning Fall 07 Topic 3: Hypothesis Testing Instructor: Daniel L. Pimentel-Alarcón c Copyright 07 3. Introduction One of the simplest inference problems is that of deciding between

More information

Learning with multiple models. Boosting.

Learning with multiple models. Boosting. CS 2750 Machine Learning Lecture 21 Learning with multiple models. Boosting. Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Learning with multiple models: Approach 2 Approach 2: use multiple models

More information

A novel k-nn approach for data with uncertain attribute values

A novel k-nn approach for data with uncertain attribute values A novel -NN approach for data with uncertain attribute values Asma Trabelsi 1,2, Zied Elouedi 1, and Eric Lefevre 2 1 Université de Tunis, Institut Supérieur de Gestion de Tunis, LARODEC, Tunisia trabelsyasma@gmail.com,zied.elouedi@gmx.fr

More information

Decision Trees: Overfitting

Decision Trees: Overfitting Decision Trees: Overfitting Emily Fox University of Washington January 30, 2017 Decision tree recap Loan status: Root 22 18 poor 4 14 Credit? Income? excellent 9 0 3 years 0 4 Fair 9 4 Term? 5 years 9

More information

Single Maths B: Introduction to Probability

Single Maths B: Introduction to Probability Single Maths B: Introduction to Probability Overview Lecturer Email Office Homework Webpage Dr Jonathan Cumming j.a.cumming@durham.ac.uk CM233 None! http://maths.dur.ac.uk/stats/people/jac/singleb/ 1 Introduction

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Chapter 14 Combining Models

Chapter 14 Combining Models Chapter 14 Combining Models T-61.62 Special Course II: Pattern Recognition and Machine Learning Spring 27 Laboratory of Computer and Information Science TKK April 3th 27 Outline Independent Mixing Coefficients

More information

Lecture 2. Judging the Performance of Classifiers. Nitin R. Patel

Lecture 2. Judging the Performance of Classifiers. Nitin R. Patel Lecture 2 Judging the Performance of Classifiers Nitin R. Patel 1 In this note we will examine the question of how to udge the usefulness of a classifier and how to compare different classifiers. Not only

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

Linear Models for Classification

Linear Models for Classification Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,

More information

BAYESIAN DECISION THEORY

BAYESIAN DECISION THEORY Last updated: September 17, 2012 BAYESIAN DECISION THEORY Problems 2 The following problems from the textbook are relevant: 2.1 2.9, 2.11, 2.17 For this week, please at least solve Problem 2.3. We will

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Numerical Learning Algorithms

Numerical Learning Algorithms Numerical Learning Algorithms Example SVM for Separable Examples.......................... Example SVM for Nonseparable Examples....................... 4 Example Gaussian Kernel SVM...............................

More information

Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example

Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example Günther Eibl and Karl Peter Pfeiffer Institute of Biostatistics, Innsbruck, Austria guenther.eibl@uibk.ac.at Abstract.

More information

Machine Learning Lecture 2

Machine Learning Lecture 2 Announcements Machine Learning Lecture 2 Eceptional number of lecture participants this year Current count: 449 participants This is very nice, but it stretches our resources to their limits Probability

More information

arxiv: v1 [cs.ai] 4 Sep 2007

arxiv: v1 [cs.ai] 4 Sep 2007 Qualitative Belief Conditioning Rules (QBCR) arxiv:0709.0522v1 [cs.ai] 4 Sep 2007 Florentin Smarandache Department of Mathematics University of New Mexico Gallup, NM 87301, U.S.A. smarand@unm.edu Jean

More information

Classification and Regression Trees

Classification and Regression Trees Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity

More information

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish

More information

Class 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio

Class 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio Class 4: Classification Quaid Morris February 11 th, 211 ML4Bio Overview Basic concepts in classification: overfitting, cross-validation, evaluation. Linear Discriminant Analysis and Quadratic Discriminant

More information

Introduction to Bayesian Inference

Introduction to Bayesian Inference Introduction to Bayesian Inference p. 1/2 Introduction to Bayesian Inference September 15th, 2010 Reading: Hoff Chapter 1-2 Introduction to Bayesian Inference p. 2/2 Probability: Measurement of Uncertainty

More information

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run

More information

Bayesian Learning (II)

Bayesian Learning (II) Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University August 30, 2017 Today: Decision trees Overfitting The Big Picture Coming soon Probabilistic learning MLE,

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

Midterm 2 V1. Introduction to Artificial Intelligence. CS 188 Spring 2015

Midterm 2 V1. Introduction to Artificial Intelligence. CS 188 Spring 2015 S 88 Spring 205 Introduction to rtificial Intelligence Midterm 2 V ˆ You have approximately 2 hours and 50 minutes. ˆ The exam is closed book, closed calculator, and closed notes except your one-page crib

More information

Lecture Notes 7 Random Processes. Markov Processes Markov Chains. Random Processes

Lecture Notes 7 Random Processes. Markov Processes Markov Chains. Random Processes Lecture Notes 7 Random Processes Definition IID Processes Bernoulli Process Binomial Counting Process Interarrival Time Process Markov Processes Markov Chains Classification of States Steady State Probabilities

More information

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

STA Module 4 Probability Concepts. Rev.F08 1

STA Module 4 Probability Concepts. Rev.F08 1 STA 2023 Module 4 Probability Concepts Rev.F08 1 Learning Objectives Upon completing this module, you should be able to: 1. Compute probabilities for experiments having equally likely outcomes. 2. Interpret

More information

Decision Tree Learning

Decision Tree Learning Decision Tree Learning Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Machine Learning, Chapter 3 2. Data Mining: Concepts, Models,

More information

Stephen Scott.

Stephen Scott. 1 / 35 (Adapted from Ethem Alpaydin and Tom Mitchell) sscott@cse.unl.edu In Homework 1, you are (supposedly) 1 Choosing a data set 2 Extracting a test set of size > 30 3 Building a tree on the training

More information

MATH 556: PROBABILITY PRIMER

MATH 556: PROBABILITY PRIMER MATH 6: PROBABILITY PRIMER 1 DEFINITIONS, TERMINOLOGY, NOTATION 1.1 EVENTS AND THE SAMPLE SPACE Definition 1.1 An experiment is a one-off or repeatable process or procedure for which (a there is a well-defined

More information

Outline: Ensemble Learning. Ensemble Learning. The Wisdom of Crowds. The Wisdom of Crowds - Really? Crowd wiser than any individual

Outline: Ensemble Learning. Ensemble Learning. The Wisdom of Crowds. The Wisdom of Crowds - Really? Crowd wiser than any individual Outline: Ensemble Learning We will describe and investigate algorithms to Ensemble Learning Lecture 10, DD2431 Machine Learning A. Maki, J. Sullivan October 2014 train weak classifiers/regressors and how

More information

Relationship between Least Squares Approximation and Maximum Likelihood Hypotheses

Relationship between Least Squares Approximation and Maximum Likelihood Hypotheses Relationship between Least Squares Approximation and Maximum Likelihood Hypotheses Steven Bergner, Chris Demwell Lecture notes for Cmpt 882 Machine Learning February 19, 2004 Abstract In these notes, a

More information

Statistical Learning. Philipp Koehn. 10 November 2015

Statistical Learning. Philipp Koehn. 10 November 2015 Statistical Learning Philipp Koehn 10 November 2015 Outline 1 Learning agents Inductive learning Decision tree learning Measuring learning performance Bayesian learning Maximum a posteriori and maximum

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 3

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 3 CS434a/541a: attern Recognition rof. Olga Veksler Lecture 3 1 Announcements Link to error data in the book Reading assignment Assignment 1 handed out, due Oct. 4 lease send me an email with your name and

More information

Multivariate statistical methods and data mining in particle physics

Multivariate statistical methods and data mining in particle physics Multivariate statistical methods and data mining in particle physics RHUL Physics www.pp.rhul.ac.uk/~cowan Academic Training Lectures CERN 16 19 June, 2008 1 Outline Statement of the problem Some general

More information

Nearest Neighbor Pattern Classification

Nearest Neighbor Pattern Classification Nearest Neighbor Pattern Classification T. M. Cover and P. E. Hart May 15, 2018 1 The Intro The nearest neighbor algorithm/rule (NN) is the simplest nonparametric decisions procedure, that assigns to unclassified

More information

Selection of Classifiers based on Multiple Classifier Behaviour

Selection of Classifiers based on Multiple Classifier Behaviour Selection of Classifiers based on Multiple Classifier Behaviour Giorgio Giacinto, Fabio Roli, and Giorgio Fumera Dept. of Electrical and Electronic Eng. - University of Cagliari Piazza d Armi, 09123 Cagliari,

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring /

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring / Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 / Agenda Combining Classifiers Empirical view Theoretical

More information

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag Decision Trees Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Supervised Learning Input: labelled training data i.e., data plus desired output Assumption:

More information

Chapter 1 (Basic Probability)

Chapter 1 (Basic Probability) Chapter 1 (Basic Probability) What is probability? Consider the following experiments: 1. Count the number of arrival requests to a web server in a day. 2. Determine the execution time of a program. 3.

More information

Multicomponent DS Fusion Approach for Waveform EKG Detection

Multicomponent DS Fusion Approach for Waveform EKG Detection Multicomponent DS Fusion Approach for Waveform EKG Detection Nicholas Napoli University of Virginia njn5fg@virginia.edu August 10, 2013 Nicholas Napoli (UVa) Multicomponent EKG Fusion August 10, 2013 1

More information

Chapter 6 Classification and Prediction (2)

Chapter 6 Classification and Prediction (2) Chapter 6 Classification and Prediction (2) Outline Classification and Prediction Decision Tree Naïve Bayes Classifier Support Vector Machines (SVM) K-nearest Neighbors Accuracy and Error Measures Feature

More information

MATH MW Elementary Probability Course Notes Part I: Models and Counting

MATH MW Elementary Probability Course Notes Part I: Models and Counting MATH 2030 3.00MW Elementary Probability Course Notes Part I: Models and Counting Tom Salisbury salt@yorku.ca York University Winter 2010 Introduction [Jan 5] Probability: the mathematics used for Statistics

More information

Notes on Discriminant Functions and Optimal Classification

Notes on Discriminant Functions and Optimal Classification Notes on Discriminant Functions and Optimal Classification Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Discriminant Functions Consider a classification problem

More information

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Introduction: MLE, MAP, Bayesian reasoning (28/8/13) STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this

More information

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information