Data Dependence in Combining Classifiers

Data Dependence in Combining Classifiers
Mohamed Kamel, Nayer Wanas
Pattern Analysis and Machine Intelligence Lab, University of Waterloo, Canada

Outline
! Data Dependence
! Data Dependence Architecture
! Training Algorithm

Pattern Recognition Systems
! Best possible classification rates
! Increase efficiency and accuracy
Multiple Classifier Systems
! Evidence of improving performance
! Problem decomposed naturally from using various sensors
! Avoid making commitments to arbitrary initial conditions or parameters

Categorization of MCS
! Architecture
! Input/Output Mapping
! Representation
! Types of classifiers

Categorization of MCS (cont'd): Architecture
! Parallel [Dasarathy, 94]: each input is processed by its own classifier (Classifier 1 through Classifier N) and the individual outputs are fused to produce the final output.
! Serial [Dasarathy, 94]: the classifiers are applied in sequence, Classifier 1 through Classifier N, each passing its result on until the final output is produced.

Categorization of MCS (cont'd): Architectures [Lam, 00]
Conditional Topology
! Once a classifier is unable to classify the input, the following classifier is deployed
Hierarchical Topology
! Classifiers applied in succession
! Classifiers with various levels of generalization
Hybrid Topology
! The choice of the classifier to use is based on the input pattern (selection)
Multiple (Parallel) Topology

Categorization of MCS (cont'd): Input/Output Mapping
Linear Mapping
! Sum Rule
! Weighted Average [Hashem 97]
Non-linear Mapping
! Maximum
! Product
! Hierarchical Mixture of Experts [Jordan and Jacobs 94]
! Stacked Generalization [Wolpert 92]

Categorization of MCS (cont'd): Representation
Similar representations
! Classifiers need to be different
Different representations
! Use of different sensors
! Different features extracted from the same data set [Ho, 98; Skurichina & Duin, 02]

Categorization of MCS (cont'd): Types of Classifiers
Specialized classifiers
! Encourage specialization in areas of the feature space
! All classifiers must contribute to achieve a final decision
! Hierarchical Mixture of Experts [Jordan and Jacobs 94]
! Co-operative Modular Neural Networks [Auda and Kamel 98]
Ensemble of classifiers
! Set of redundant classifiers
Competitive versus cooperative [Sharkey, 1999]

Categorization of MCS (cont'd): Data Dependence
! Classifiers are inherently dependent on the data
! Describes how the final aggregation uses the information present in the input pattern
! Describes the relationship between the final output Q(x) and the pattern under classification, x

Categories of data dependence:
! Data Independent
! Implicitly Data Dependent
! Explicitly Data Dependent

Data Independence
Rely solely on the output of the classifiers to determine the final classification output:

Q(x) = arg max_j F_j(C_j(x))

! Q(x) is the final class assigned to pattern x
! C_j(x) is a vector composed of the outputs of the various classifiers in the ensemble, {c_1j, c_2j, ..., c_Nj}, for a given class y_j
! c_ij is the confidence classifier i has in pattern x belonging to class y_j
! The mapping F_j can be linear or non-linear

Data Independence (cont'd)
Simple voting techniques are data independent:
! Average
! Maximum
! Majority
They are susceptible to incorrect estimates of the confidence (see the sketch below).
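For concreteness, a minimal sketch of these data-independent rules is shown below; the array layout and the helper name combine_data_independent are illustrative assumptions rather than anything from the slides.

```python
import numpy as np

def combine_data_independent(C, rule="average"):
    """Data-independent fusion: Q(x) = argmax_j F_j(C_j(x)).

    C is an (N, M) array of confidences c_ij for N classifiers and M classes.
    """
    if rule == "average":      # average (sum) rule
        scores = C.mean(axis=0)
    elif rule == "maximum":    # maximum rule
        scores = C.max(axis=0)
    elif rule == "majority":   # majority vote over each classifier's top class
        scores = np.bincount(C.argmax(axis=1), minlength=C.shape[1])
    else:
        raise ValueError(f"unknown rule: {rule}")
    return int(np.argmax(scores))

# Three classifiers, three classes: all three rules pick class 0 here.
C = np.array([[0.7, 0.2, 0.1],
              [0.4, 0.5, 0.1],
              [0.6, 0.3, 0.1]])
print(combine_data_independent(C, "majority"))  # -> 0
```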

Implicit Data Dependence
Train the combiner on the global performance over the data:

Q(x) = arg max_j F_j(W(C_j(x)), C_j(x))

! W(C_j(x)) is the weighting matrix composed of elements w_ij
! w_ij is the weight assigned to class j in classifier i

Implicit Data Dependence (cont'd)
Implicitly data dependent approaches include:
! Weighted Average [Hashem 97]
! Fuzzy Measures [Gader et al. 96]
! Belief Theory [Xu et al. 92]
! Behavior Knowledge Space (BKS) [Huang et al. 95]
! Decision Templates [Kuncheva et al. 01]
! Modular approaches [Auda and Kamel 98]
! Stacked Generalization [Wolpert 92]
! Boosting [Schapire 90]
These approaches lack consideration for the local superiority of classifiers. A weighted-average sketch is given below.
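As an illustration of implicit data dependence, the sketch below derives one weight per classifier from its global accuracy on a held-out validation set and then applies a weighted average. This is a deliberate simplification (Hashem's method, for instance, fits an optimal linear combination), and the function names and array shapes are assumptions.

```python
import numpy as np

def fit_global_weights(val_conf, val_labels):
    """Derive one weight per classifier from its global accuracy on a
    validation set (a simplified stand-in for an optimal linear combiner).
    val_conf has shape (S, N, M): S samples, N classifiers, M classes."""
    preds = val_conf.argmax(axis=2)                    # (S, N) predicted classes
    acc = (preds == val_labels[:, None]).mean(axis=0)  # global accuracy per classifier
    return acc / acc.sum()                             # normalised weights w_i

def combine_weighted_average(weights, C):
    """Implicitly data dependent: Q(x) = argmax_j sum_i w_i * c_ij.
    The weights are fixed for every input pattern x; C has shape (N, M)."""
    return int(np.argmax(weights @ C))
```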

Explicit Data Dependence
Classifier selection or combining is performed based on the sub-space to which the input pattern belongs. The final classification is dependent on the pattern being classified:

Q(x) = arg max_j F_j(W(x), C_j(x))

Explicit Data Dependence (cont'd)
Explicitly data dependent approaches include:
! Dynamic Classifier Selection (DCS): DCS with Local Accuracy (DCS_LA) [Woods et al., 97]; DCS based on Multiple Classifier Behavior (DCS_MCB) [Giacinto and Roli, 01]
! Hierarchical Mixture of Experts [Jordan and Jacobs 94]
! Feature-based approach [Wanas et al., 99]
The weights demonstrate dependence on the input pattern, so intuitively these methods should perform better than the others (see the sketch below).
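A rough sketch of explicitly data-dependent selection in the spirit of DCS_LA: for each test pattern, the classifier that is most accurate on the k nearest validation samples makes the decision. The scikit-learn-style predict interface and the neighbourhood size k are assumptions, not details from the slides.

```python
import numpy as np

def dcs_local_accuracy(x, classifiers, X_val, y_val, k=10):
    """DCS_LA-flavoured selection: estimate each classifier's accuracy on the
    k validation samples nearest to x and let the locally best one decide."""
    neighbours = np.argsort(np.linalg.norm(X_val - x, axis=1))[:k]
    local_acc = [
        np.mean(clf.predict(X_val[neighbours]) == y_val[neighbours])
        for clf in classifiers
    ]
    best = classifiers[int(np.argmax(local_acc))]
    return best.predict(x.reshape(1, -1))[0]
```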

Architectures
A methodology to incorporate multiple classifiers in a dynamically adapting system, where the aggregation adapts to the behavior of the ensemble:
! Detectors generate weights for each classifier that reflect the degree of confidence in each classifier for a given input
! A trained aggregation learns to combine the different decisions

Architectures (cont'd): Architecture I
[N. Wanas, M. Kamel, G. Auda, and F. Karray, Decision Aggregation in Modular Neural Network Classifiers, Pattern Recognition Letters, 20(11-13), 1353-1359, 1999]

Architectures (cont'd)
Classifiers
! Each individual classifier, C_i, produces some output representing its interpretation of the input x
! Sub-optimal classifiers are utilized
! The collection of classifier outputs for class y_j is represented as C_j(x)
Detector
! Detector D_l is a classifier that uses the input features to extract information useful for aggregation
! It does not aim to solve the classification problem
! The detector output d_lg(x) is the probability that the input pattern x is categorized to group g
! The output of all the detectors is represented by D(x)

Architectures (cont'd)
Aggregation
! Fusion layer for all the classifiers
! Trained to adapt to the behavior of the various modules
! Explicitly data dependent:

Q(x) = arg max_j F_j(D(x), C_j(x))

The weights depend on the input pattern being classified (see the sketch below).
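The sketch below shows the shape of Architecture I's aggregation under the formula above; the object interfaces (confidences, group_probabilities, score) are hypothetical stand-ins for the trained components, not the authors' code.

```python
import numpy as np

def architecture_one(x, classifiers, detector, aggregator):
    """Shape of Architecture I: Q(x) = argmax_j F_j(D(x), C_j(x)).
    The detector looks only at the input features; the trained aggregator
    (fusion layer) sees the detector output together with all classifier
    outputs and returns one score per class."""
    C = np.stack([clf.confidences(x) for clf in classifiers])  # (N, M) matrix of c_ij
    D = detector.group_probabilities(x)                        # d_lg(x), from x alone
    scores = aggregator.score(np.concatenate([D, C.ravel()]))  # trained mapping F
    return int(np.argmax(scores))
```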

Architectures (cont'd): Architecture II

Architectures (cont'd)
Classifiers
! Each individual classifier, C_i, produces some output representing its interpretation of the input x
! Sub-optimal classifiers are utilized
! The collection of classifier outputs for class y_j is represented as C_j(x)
Detector
! Appends the input to the output of the classifier ensemble
! Produces a weighting factor, w_ij, for each class in a classifier output
! The dependence of the weights on both the classifier output and the input pattern is represented by W(x, C_j(x))

Architectures (cont'd)
Aggregation
! Fusion layer for all the classifiers
! Trained to adapt to the behavior of the various modules
! Combines implicit and explicit data dependence:

Q(x) = arg max_j F_j(W(x, C_j(x)), C_j(x))

The weights depend on both the input pattern and the performance of the classifiers (see the sketch below).
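A matching sketch for Architecture II, where the detector is fed the input concatenated with the ensemble outputs and returns per-classifier, per-class weights; again the confidences and weights interfaces are assumed placeholders, and a weighted sum stands in for the trained fusion layer.

```python
import numpy as np

def architecture_two(x, classifiers, detector):
    """Shape of Architecture II: Q(x) = argmax_j F_j(W(x, C_j(x)), C_j(x)).
    The detector is fed the input pattern concatenated with the ensemble
    outputs and returns a weight w_ij per classifier and class; a weighted
    sum stands in here for the trained fusion layer F."""
    C = np.stack([clf.confidences(x) for clf in classifiers])  # (N, M) matrix of c_ij
    W = detector.weights(np.concatenate([x, C.ravel()]))       # (N, M) matrix of w_ij
    return int(np.argmax((W * C).sum(axis=0)))                 # weighted score per class
```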

Experiments
! Five one-hidden-layer BP classifiers trained on partially disjoint data sets
! No optimization is performed for the trained networks
! The parameters of all the networks are maintained for all the classifiers that are trained
Three data sets:
! 20 Class Gaussian
! Satimages
! Clouds

Experiments (cont'd): classification error (%), mean ± standard deviation

Method | 20 Class | Clouds | Satimages
Singlenet | 13.82 ± 1.16 | 10.92 ± 0.08 | 14.06 ± 1.33
Oracle | 7.29 ± 1.06 | 7.41 ± 0.16 | 7.20 ± 0.36
Data Independent Approaches
Maximum | 12.92 ± 0.35 | 10.68 ± 0.04 | 13.61 ± 0.21
Majority | 13.13 ± 0.36 | 10.71 ± 0.02 | 13.40 ± 0.16
Average | 12.83 ± 0.26 | 10.66 ± 0.04 | 13.23 ± 0.22
Borda | 13.04 ± 0.30 | 10.71 ± 0.02 | 13.77 ± 0.20
Implicitly Data Dependent Approaches
Weighted Avg. | 12.57 ± 0.20 | 10.59 ± 0.05 | 13.14 ± 0.21
Bayesian | 12.48 ± 0.21 | 10.71 ± 0.02 | 13.51 ± 0.16
Fuzzy Integral | 12.95 ± 0.34 | 10.67 ± 0.05 | 13.71 ± 0.19
Explicitly Data Dependent
Feature-based | 8.64 ± 0.60 | 10.28 ± 0.10 | 12.48 ± 0.19

Training each component independently
! Optimizing individual components may not lead to overall improvement
! Collinearity: high correlation between classifiers
! Components may be under-trained or over-trained

Training (cont'd): Adaptive training
! Selective: reduces correlation between components
! Focused: re-training focuses on misclassified patterns
! Efficient: controls the duration of training

Adaptive Training: Main loop
! Increase diversity among the ensemble
! Incremental learning
! Evaluation of training to determine the re-training set

Adaptive Training (cont'd)
! Save a classifier if it performs well on the evaluation set
! Determine when to terminate training for each module

Adaptive Training: Evaluation
! Train the aggregation modules
! Evaluate the training sets for each classifier
! Compose new training data

Adaptive Training: Data Selection
New training data are composed by concatenating:
! Error_i: the misclassified entries of the training data for classifier i
! Correct_i: a random choice of a ratio P of the correctly classified entries of the training data for classifier i
A sketch of this step follows.
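A minimal sketch of this data-selection step, assuming a scikit-learn-style predict method; the sampling ratio p_ratio and the helper name are illustrative choices, not the authors' exact settings.

```python
import numpy as np

def compose_retraining_set(clf, X_train, y_train, p_ratio=0.2, rng=None):
    """Data-selection step of adaptive training: the new set for classifier i
    is Error_i (its misclassified samples) plus a random fraction p_ratio of
    Correct_i (its correctly classified samples)."""
    rng = rng or np.random.default_rng(0)
    correct = clf.predict(X_train) == y_train
    err_idx = np.flatnonzero(~correct)                  # Error_i
    cor_idx = rng.permutation(np.flatnonzero(correct))
    cor_idx = cor_idx[: int(p_ratio * cor_idx.size)]    # sampled Correct_i
    keep = np.concatenate([err_idx, cor_idx])
    return X_train[keep], y_train[keep]

# Schematic main loop: retrain each module on its recomposed set, re-train the
# aggregation, and keep/stop a module based on its evaluation-set performance.
```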


Experiments (cont'd): classification error (%), mean ± standard deviation, for normal and adaptive training

Normal training:
Data Set | Singlenet | Best Classifier | Oracle | Feature-based
20 Class | 13.82 ± 1.16 | 14.03 ± 0.64 | 7.29 ± 1.06 | 8.64 ± 0.60
Clouds | 10.92 ± 0.08 | 11.00 ± 0.09 | 7.41 ± 0.16 | 10.28 ± 0.10
Satimages | 14.06 ± 1.33 | 14.72 ± 0.43 | 7.20 ± 0.36 | 12.48 ± 0.19

Ensemble trained adaptively, using WA as the evaluation function:
Data Set | Best Classifier | Oracle | Feature-based
20 Class | 14.75 ± 1.06 | 6.79 ± 2.30 | 8.62 ± 0.25
Clouds | 12.03 ± 0.52 | 5.73 ± 0.11 | 10.24 ± 0.17
Satimages | 17.13 ± 1.03 | 5.58 ± 0.17 | 12.40 ± 0.12

Architecture trained adaptively:
Data Set | Best Classifier | Oracle | Feature-based
20 Class | 14.80 ± 1.32 | 5.42 ± 1.30 | 8.01 ± 0.19
Clouds | 11.97 ± 0.59 | 5.43 ± 0.11 | 10.06 ± 0.13
Satimages | 16.96 ± 0.87 | 5.48 ± 0.18 | 12.33 ± 0.14

Conclusions
! Categorization of various combining approaches based on data dependence
! Data independent: vulnerable to incorrect confidence estimates
! Implicitly dependent: does not take the local superiority of classifiers into account
! Explicitly dependent: the literature focuses on selection rather than combining

Conclusions (cont'd)
Feature-based approach
! Combines implicit and explicit data dependence
! Uses an evolving training algorithm to enhance diversity amongst classifiers
! Reduces harmful correlation
! Determines the duration of training
! Improved classification accuracy

References
[Kittler et al., 98] J. Kittler, M. Hatef, R. Duin, and J. Matas, On Combining Classifiers, IEEE Trans. PAMI, 20(3), 226-239, 1998.
[Dasarathy, 94] B. Dasarathy, Decision Fusion, IEEE Computer Society Press, 1994.
[Lam, 00] L. Lam, Classifier Combinations: Implementations and Theoretical Issues, MCS 2000, LNCS 1857, 77-86, 2000.
[Hashem, 97] S. Hashem, Algorithms for Optimal Linear Combinations of Neural Networks, Int. Conf. on Neural Networks, Vol. 1, 242-247, 1997.
[Jordan and Jacobs, 94] M. Jordan and R. Jacobs, Hierarchical Mixtures of Experts and the EM Algorithm, Neural Computation, 181-214, 1994.
[Wolpert, 92] D. Wolpert, Stacked Generalization, Neural Networks, Vol. 5, 241-259, 1992.
[Auda and Kamel, 98] G. Auda and M. Kamel, Modular Neural Network Classifiers: A Comparative Study, Journal of Intelligent and Robotic Systems, Vol. 21, 117-129, 1998.
[Gader et al., 96] P. Gader, M. Mohamed, and J. Keller, Fusion of Handwritten Word Classifiers, Pattern Recognition Letters, 17(6), 577-584, 1996.
[Xu et al., 92] L. Xu, A. Krzyzak, and C. Suen, Methods of Combining Multiple Classifiers and Their Applications to Handwriting Recognition, IEEE Trans. Systems, Man and Cybernetics, 22(3), 418-435, 1992.
[Kuncheva et al., 01] L. Kuncheva, J. Bezdek, and R. Duin, Decision Templates for Multiple Classifier Fusion: An Experimental Comparison, Pattern Recognition, Vol. 34, 299-314, 2001.
[Huang et al., 95] Y. Huang, K. Liu, and C. Suen, The Combination of Multiple Classifiers by a Neural Network Approach, Int. J. Pattern Recognition and Artificial Intelligence, Vol. 9, 579-597, 1995.
[Schapire, 90] R. Schapire, The Strength of Weak Learnability, Machine Learning, Vol. 5, 197-227, 1990.
[Giacinto and Roli, 01] G. Giacinto and F. Roli, Dynamic Classifier Selection Based on Multiple Classifier Behaviour, Pattern Recognition, Vol. 34, 1879-1881, 2001.
[Wanas et al., 99] N. Wanas, M. Kamel, G. Auda, and F. Karray, Decision Aggregation in Modular Neural Network Classifiers, Pattern Recognition Letters, 20(11-13), 1353-1359, 1999.

http://pami.uwaterloo.ca Email: mkamel@uwaterloo.ca