Data Dependence in Combining Classifiers
Mohamed Kamel, Nayer Wanas
Pattern Analysis and Machine Intelligence Lab, University of Waterloo, Canada
Outline
- Data Dependence
- Data Dependence Architectures
- Training Algorithm
Introduction
Pattern Recognition Systems
- Seek the best possible classification rates.
- Increase efficiency and accuracy.
Multiple Classifier Systems
- Evidence of improved performance
- The problem may decompose naturally when various sensors are used
- Avoid committing to arbitrary initial conditions or parameters
Categorization of MCS
- Architecture
- Input/Output Mapping
- Representation
- Types of Classifiers
Categorization of MCS (cont'd): Architecture
Parallel [Dasarathy, 94]: Input 1 → Classifier 1, Input 2 → Classifier 2, ..., Input N → Classifier N, with all classifier outputs feeding a fusion stage that produces the output.
Serial [Dasarathy, 94]: Inputs pass through Classifier 1 → Classifier 2 → ... → Classifier N in succession to produce the output.
Categorization of MCS (cont'd): Architectures [Lam, 00]
Conditional Topology
- When a classifier is unable to classify a pattern, the following classifier is deployed
Hierarchical Topology
- Classifiers applied in succession
- Classifiers with various levels of generalization
Hybrid Topology
- The choice of which classifier to use is based on the input pattern (selection)
Multiple (Parallel) Topology
Categorization of MCS (cont'd): Input/Output Mapping
Linear Mapping
- Sum Rule
- Weighted Average [Hashem 97]
Non-linear Mapping
- Maximum
- Product
- Hierarchical Mixture of Experts [Jordan and Jacobs 94]
- Stacked Generalization [Wolpert 92]
Categorization of MCS (cont'd): Representation
Similar representations
- Classifiers need to be different
Different representations
- Use of different sensors
- Different features extracted from the same data set [Ho, 98; Skurichina & Duin, 02]
Categorization of MCS (cont'd): Types of Classifiers
Specialized classifiers
- Encourage specialization in areas of the feature space
- All classifiers must contribute to achieve a final decision
- Hierarchical Mixture of Experts [Jordan and Jacobs 94]
- Co-operative Modular Neural Networks [Auda and Kamel 98]
Ensemble of classifiers
- Set of redundant classifiers
Competitive versus cooperative combining [Sharkey, 1999]
Categorization of MCS (cont'd): Data Dependence
- Classifiers are inherently dependent on the data.
- Data dependence describes how the final aggregation uses the information present in the input pattern.
- It describes the relationship between the final output Q(x) and the pattern under classification, x.
Data Dependence
- Data Independent
- Implicitly Dependent
- Explicitly Dependent
Data Independence
Rely solely on the outputs of the classifiers to determine the final classification:

Q(x) = argmax_j f_j(C_j(x))

where
- Q(x) is the final class assigned to pattern x
- C_j(x) is the vector of outputs of the classifiers in the ensemble, {c_1j, c_2j, ..., c_Nj}, for a given class y_j
- c_ij is the confidence classifier i has that pattern x belongs to class y_j
- the mapping f_j can be linear or non-linear
Data Independence (cont'd)
Simple voting techniques are data independent:
- Average
- Maximum
- Majority
They are susceptible to incorrect estimates of the confidences; a sketch of these rules follows.
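A minimal sketch of the three data-independent voting rules, assuming each classifier produces soft confidences c_ij; the example matrix and the function names are illustrative, not from the original slides.

```python
import numpy as np

# confidences[i, j] = c_ij: confidence of classifier i that x belongs to class y_j
confidences = np.array([[0.7, 0.2, 0.1],
                        [0.4, 0.5, 0.1],
                        [0.6, 0.3, 0.1]])

def average_rule(conf):
    # f_j = mean confidence for class j across classifiers
    return int(np.argmax(conf.mean(axis=0)))

def maximum_rule(conf):
    # class receiving the single highest confidence from any classifier
    return int(np.argmax(conf.max(axis=0)))

def majority_rule(conf):
    # each classifier casts one vote for its top class
    votes = np.bincount(conf.argmax(axis=1), minlength=conf.shape[1])
    return int(np.argmax(votes))

print(average_rule(confidences), maximum_rule(confidences), majority_rule(confidences))
```

Note that none of these rules consult the input pattern x itself, which is exactly what makes them data independent.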
Implicit Data Dependence
Train the combiner on the global performance of the data:

Q(x) = argmax_j f_j(W(C_j(x)), C_j(x))

where
- W(C_j(x)) is the weighting matrix composed of elements w_ij
- w_ij is the weight assigned to class j in classifier i
A sketch of one such combiner appears below.
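One plausible instance of an implicitly data-dependent combiner, sketched under the assumption that w_ij is estimated as the global accuracy of classifier i when it predicts class j on a validation set; the weighting scheme and all names are illustrative, not the authors' method.

```python
import numpy as np

def fit_weights(val_conf, val_labels):
    """val_conf: (patterns, N, M) validation confidences; val_labels: (patterns,)."""
    n_patterns, n_clf, n_classes = val_conf.shape
    preds = val_conf.argmax(axis=2)                  # each classifier's decision
    w = np.zeros((n_clf, n_classes))
    for i in range(n_clf):
        for j in range(n_classes):
            mask = preds[:, i] == j
            # w_ij: global accuracy of classifier i when it predicts class j
            w[i, j] = (val_labels[mask] == j).mean() if mask.any() else 0.0
    return w

def combine(conf, w):
    # weighted average: f_j = sum_i w_ij * c_ij; the weights do not depend on x
    return int(np.argmax((w * conf).sum(axis=0)))
```

The weights are fitted once from global performance and then fixed, so the combiner adapts to the classifiers but not to the individual pattern.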
Implicit Data Dependence (cont'd)
Implicitly data dependent approaches include:
- Weighted average [Hashem 97]
- Fuzzy measures [Gader et al 96]
- Belief theory [Xu et al 92]
- Behavior Knowledge Space (BKS) [Huang et al 95]
- Decision Templates [Kuncheva et al 01]
- Modular approaches [Auda and Kamel 98]
- Stacked Generalization [Wolpert 92]
- Boosting [Schapire 90]
These lack consideration for the local superiority of classifiers.
Explicit Data Dependence
Classifier selection or combining is performed based on the sub-space to which the input pattern belongs. The final classification depends on the pattern being classified:

Q(x) = argmax_j f_j(W(x), C_j(x))
Explicit Data Dependence (cont'd)
Explicitly data dependent approaches include:
- Dynamic Classifier Selection (DCS)
  - DCS with Local Accuracy (DCS_LA) [Woods et al., 97]
  - DCS based on Multiple Classifier Behavior (DCS_MCB) [Giacinto and Roli, 01]
- Hierarchical Mixture of Experts [Jordan and Jacobs 94]
- Feature-based approach [Wanas et al., 99]
The weights depend on the input pattern, so intuitively these methods should perform better than the other categories; a DCS_LA-style sketch follows.
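A rough sketch in the spirit of DCS_LA: select, per test pattern, the classifier that is most accurate on the k nearest validation neighbours. The sklearn-style .predict() interface, the Euclidean neighbourhood, and k=10 are assumptions for illustration, not details from [Woods et al., 97].

```python
import numpy as np

def dcs_la(x, classifiers, val_X, val_y, k=10):
    # indices of the k validation patterns nearest to x (Euclidean distance)
    idx = np.argsort(np.linalg.norm(val_X - x, axis=1))[:k]
    # local accuracy of each classifier on that neighbourhood
    local_acc = [np.mean(clf.predict(val_X[idx]) == val_y[idx])
                 for clf in classifiers]
    best = int(np.argmax(local_acc))        # W(x) here selects a single classifier
    return classifiers[best].predict(x.reshape(1, -1))[0]
```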
Feature-based Architectures
A methodology to incorporate multiple classifiers in a dynamically adapting system. The aggregation adapts to the behavior of the ensemble:
- Detectors generate weights for each classifier that reflect the degree of confidence in that classifier for a given input
- A trained aggregation module learns to combine the different decisions
Architectures (cont'd): Architecture I
N. Wanas, M. Kamel, G. Auda, and F. Karray, "Decision Aggregation in Modular Neural Network Classifiers," Pattern Recognition Letters, 20(11-13), 1353-1359, 1999.
Architectures (cont'd)
Classifiers
- Each individual classifier, C_i, produces some output representing its interpretation of the input x
- Sub-optimal classifiers can be utilized.
- The collection of classifier outputs for class y_j is represented as C_j(x)
Detector
- Detector D_l is a classifier that uses the input features to extract information useful for aggregation
- It does not aim to solve the classification problem.
- Detector output d_lg(x) is the probability that input pattern x belongs to group g.
- The output of all the detectors is represented by D(x)
Architectures (cont'd)
Aggregation
- Fusion layer for all the classifiers
- Trained to adapt to the behavior of the various modules
- Explicitly data dependent:

Q(x) = argmax_j f_j(D(x), C_j(x))

The weights depend on the input pattern being classified; a sketch appears below.
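A minimal sketch of the Architecture I aggregation: a trained fusion map f that scores each class from the detector outputs D(x) together with the classifier outputs C(x). A single linear layer stands in for the trained neural aggregation of the original work; shapes and names are assumptions.

```python
import numpy as np

def aggregate(d_x, c_x, weight, bias):
    """d_x: detector outputs (L,); c_x: classifier confidences flattened (N*M,).
    weight: (M, L + N*M) and bias: (M,) are learned fusion parameters."""
    features = np.concatenate([d_x, c_x])
    scores = weight @ features + bias       # one score f_j per class
    return int(np.argmax(scores))           # Q(x) = argmax_j f_j(D(x), C_j(x))
```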
Architectures (cont'd): Architecture II
Architectures (cont'd)
Classifiers
- Each individual classifier, C_i, produces some output representing its interpretation of the input x
- Sub-optimal classifiers can be utilized.
- The collection of classifier outputs for class y_j is represented as C_j(x)
Detector
- Appends the input to the output of the classifier ensemble.
- Produces a weighting factor, w_ij, for each class in a classifier output.
- The dependence of the weights on both the classifier output and the input pattern is represented by W(x, C_j(x))
Architectures (cont'd)
Aggregation
- Fusion layer for all the classifiers
- Trained to adapt to the behavior of the various modules
- Combines implicit and explicit data dependence:

Q(x) = argmax_j f_j(W(x, C_j(x)), C_j(x))

The weights depend on both the input pattern and the performance of the classifiers; a sketch appears below.
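A sketch of the Architecture II decision rule: a detector sees the input pattern together with the ensemble outputs and emits a weight w_ij per classifier/class, and the final decision combines weights and confidences. `weight_net` is a hypothetical trained regressor mapping [x, C(x)] to a weight matrix; it and the weighted-sum fusion are illustrative assumptions.

```python
import numpy as np

def architecture_ii(x, confidences, weight_net):
    """confidences: (N classifiers, M classes) matrix of c_ij for pattern x."""
    w = weight_net(np.concatenate([x, confidences.ravel()]))
    w = w.reshape(confidences.shape)         # W(x, C(x)), elements w_ij
    scores = (w * confidences).sum(axis=0)   # f_j = sum_i w_ij * c_ij
    return int(np.argmax(scores))            # Q(x) = argmax_j f_j(W(x,C(x)), C(x))
```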
Experimental Setup
- Five one-hidden-layer BP classifiers trained on partially disjoint data sets
- No optimization is performed for the trained networks
- The same network parameters are maintained for all the classifiers that are trained
Three data sets:
- 20 Class Gaussian
- Satimages
- Clouds
Results (cont'd)
Classification error (%):

Data Set                     20 Class        Clouds          Satimages
Singlenet                    13.82 ± 1.16    10.92 ± 0.08    14.06 ± 1.33
Oracle                        7.29 ± 1.06     7.41 ± 0.16     7.20 ± 0.36
Data Independent Approaches
Majority                     12.92 ± 0.35    10.68 ± 0.04    13.61 ± 0.21
Maximum                      13.13 ± 0.36    10.71 ± 0.02    13.40 ± 0.16
Average                      12.83 ± 0.26    10.66 ± 0.04    13.23 ± 0.22
Borda                        13.04 ± 0.30    10.71 ± 0.02    13.77 ± 0.20
Implicitly Data Dependent Approaches
Weighted Avg.                12.57 ± 0.20    10.59 ± 0.05    13.14 ± 0.21
Bayesian                     12.48 ± 0.21    10.71 ± 0.02    13.51 ± 0.16
Fuzzy Integral               12.95 ± 0.34    10.67 ± 0.05    13.71 ± 0.19
Explicitly Data Dependent
Feature-based                 8.64 ± 0.60    10.28 ± 0.10    12.48 ± 0.19
Training
Training each component independently:
- Optimizing individual components may not lead to overall improvement
- Collinearity: high correlation between classifiers
- Components may be under-trained or over-trained
Training (cont'd)
Adaptive training is:
- Selective: reduces correlation between components
- Focused: re-training focuses on misclassified patterns
- Efficient: controls the duration of training
Adaptive Training: Main Loop
- Increase diversity among the ensemble
- Incremental learning
- Evaluation of training to determine the re-training set
Adaptive Training (cont'd)
- Save the classifier if it performs well on the evaluation set
- Determine when to terminate training for each module
Adaptive Training: Evaluation
- Train the aggregation modules
- Evaluate the training sets for each classifier
- Compose new training data
Adaptive Training: Data Selection
New training data are composed by concatenating:
- Error_i: the misclassified entries of the training data for classifier i.
- Correct_i: a random choice of a proportion P_ratio of the correctly classified entries of the training data for classifier i.
A sketch of this selection step follows.
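A minimal sketch of the data-selection step, assuming an sklearn-style classifier with .predict(); the function name and p_ratio default are illustrative.

```python
import numpy as np

def select_training_data(X, y, clf, p_ratio=0.5, rng=np.random.default_rng(0)):
    """Next training set for one classifier: Error_i plus a p_ratio sample of Correct_i."""
    preds = clf.predict(X)
    error_idx = np.flatnonzero(preds != y)            # Error_i: misclassified entries
    correct_idx = np.flatnonzero(preds == y)
    keep = rng.choice(correct_idx,
                      size=int(p_ratio * len(correct_idx)),
                      replace=False)                   # Correct_i: random subsample
    idx = np.concatenate([error_idx, keep])
    return X[idx], y[idx]
```

Keeping all errors while subsampling correct patterns focuses re-training on what the classifier gets wrong without letting it forget what it already learned.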
Experimental Setup
- Five one-hidden-layer BP classifiers trained on partially disjoint data sets
- No optimization is performed for the trained networks
- The same network parameters are maintained for all the classifiers that are trained
Three data sets:
- 20 Class Gaussian
- Satimages
- Clouds
Results (cont'd)
Classification error (%):

Data Set                     20 Class        Clouds          Satimages
Singlenet                    13.82 ± 1.16    10.92 ± 0.08    14.06 ± 1.33
Normal Training
Best Classifier              14.03 ± 0.64    11.00 ± 0.09    14.72 ± 0.43
Oracle                        7.29 ± 1.06     7.41 ± 0.16     7.20 ± 0.36
Feature-based                 8.64 ± 0.60    10.28 ± 0.10    12.48 ± 0.19
Ensemble Trained Adaptively (using WA as the evaluation function)
Best Classifier              14.75 ± 1.06    12.03 ± 0.52    17.13 ± 1.03
Oracle                        6.79 ± 2.30     5.73 ± 0.11     5.58 ± 0.17
Feature-based                 8.62 ± 0.25    10.24 ± 0.17    12.40 ± 0.12
Architecture Trained Adaptively
Best Classifier              14.80 ± 1.32    11.97 ± 0.59    16.96 ± 0.87
Oracle                        5.42 ± 1.30     5.43 ± 0.11     5.48 ± 0.18
Feature-based                 8.01 ± 0.19    10.06 ± 0.13    12.33 ± 0.14
Conclusions
Categorization of various combining approaches based on data dependence:
- Independent: vulnerable to incorrect confidence estimates
- Implicitly dependent: does not take into account the local superiority of classifiers
- Explicitly dependent: the literature focuses on selection rather than combining
Conclusions (cont'd)
Feature-based approach:
- Combines implicit and explicit data dependence
- Uses an evolving training algorithm to enhance diversity amongst classifiers
- Reduces harmful correlation
- Determines the duration of training
- Improved classification accuracy
References
[Kittler et al., 98] J. Kittler, M. Hatef, R. Duin, and J. Matas, "On Combining Classifiers," IEEE Trans. PAMI, 20(3), 226-239, 1998.
[Dasarathy, 94] B. Dasarathy, Decision Fusion, IEEE Computer Society Press, 1994.
[Lam, 00] L. Lam, "Classifier Combinations: Implementations and Theoretical Issues," MCS 2000, LNCS 1857, 77-86, 2000.
[Hashem, 97] S. Hashem, "Algorithms for Optimal Linear Combinations of Neural Networks," Int. Conf. on Neural Networks, Vol. 1, 242-247, 1997.
[Jordan and Jacobs, 94] M. Jordan and R. Jacobs, "Hierarchical Mixtures of Experts and the EM Algorithm," Neural Computation, 181-214, 1994.
[Wolpert, 92] D. Wolpert, "Stacked Generalization," Neural Networks, Vol. 5, 241-259, 1992.
[Auda and Kamel, 98] G. Auda and M. Kamel, "Modular Neural Network Classifiers: A Comparative Study," J. Intelligent and Robotic Systems, Vol. 21, 117-129, 1998.
[Gader et al., 96] P. Gader, M. Mohamed, and J. Keller, "Fusion of Handwritten Word Classifiers," Pattern Recognition Letters, 17(6), 577-584, 1996.
[Xu et al., 92] L. Xu, A. Krzyzak, and C. Suen, "Methods of Combining Multiple Classifiers and their Applications to Handwriting Recognition," IEEE Trans. Systems, Man and Cybernetics, 22(3), 418-435, 1992.
[Kuncheva et al., 01] L. Kuncheva, J. Bezdek, and R. Duin, "Decision Templates for Multiple Classifier Fusion: An Experimental Comparison," Pattern Recognition, Vol. 34, 299-314, 2001.
[Huang et al., 95] Y. Huang, K. Liu, and C. Suen, "The Combination of Multiple Classifiers by a Neural Network Approach," Int. J. Pattern Recognition and Artificial Intelligence, Vol. 9, 579-597, 1995.
[Schapire, 90] R. Schapire, "The Strength of Weak Learnability," Machine Learning, Vol. 5, 197-227, 1990.
[Giacinto and Roli, 01] G. Giacinto and F. Roli, "Dynamic Classifier Selection Based on Multiple Classifier Behaviour," Pattern Recognition, 34(9), 1879-1881, 2001.
[Wanas et al., 99] N. Wanas, M. Kamel, G. Auda, and F. Karray, "Decision Aggregation in Modular Neural Network Classifiers," Pattern Recognition Letters, 20(11-13), 1353-1359, 1999.
http://pami.uwaterloo.ca Email: mkamel@uwaterloo.ca