
CARNEGIE MELLON UNIVERSITY

OPTIMAL CLASSIFIER ENSEMBLES FOR IMPROVED BIOMETRIC VERIFICATION

A Dissertation Submitted to the Faculty of the Graduate School in Partial Fulfillment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY in ELECTRICAL AND COMPUTER ENGINEERING

by KRITHIKA VENKATARAMANI

COMMITTEE:
Advisor: Prof. Vijayakumar Bhagavatula
Prof. Tsuhan Chen
Prof. David Casasent
Prof. Arun Ross
Prof. Marios Savvides

Pittsburgh, Pennsylvania
January 2007

Copyright © 2007 by Krithika Venkataramani. All Rights Reserved.

To my parents

ABSTRACT

In practical biometric verification applications, we expect to observe a large variability of biometric data. Single classifiers have insufficient accuracy in such cases, and fusion of multiple classifiers is proposed to improve accuracy. Typically, classifier decisions are fused using a decision fusion rule. Research is usually done on finding the best decision fusion rule for a given set of classifiers; the decision fusion rule is not taken into account during classifier ensemble generation. By taking the decision fusion rule into account during classifier ensemble generation, the accuracy on decision fusion can be improved. The goal of this thesis research is to generate optimal classifier ensembles. The focus is on ensemble generation rather than on evaluating the best decision rule. It has been found in the literature that diversity in classifier decisions improves the Majority decision rule accuracy. The first part of the thesis examines the role of diversity in ensemble accuracy. The ensemble accuracy is equated to the accuracy of the best monotonic decision rule at the given diversity. It is found in this analysis that the And, Or, and Majority decision rules are important, and hence these rules are investigated in detail to find their optimal diversity. The second part of the thesis shows how the theory on optimal diversity can be used to generate optimal classifier ensembles in practice. An illustration of the design of multiple classifiers is shown on 2D simulated data. From this, it can be observed how the ensemble design is linked to optimal classifier diversity. It is assumed that the same base classifier is used; the classifiers in the ensemble differ because they are trained on different subsets of the training set. It is also seen that the data distribution and the base classifier play a role in determining the optimal fusion rule as well as in the generation of its optimal classifier ensemble. The last part of the thesis applies the guidelines learned for optimal ensemble generation to real data. The approaches to ensemble design for the And, Or, and Majority decision rules are demonstrated on the CMU Pose, Illumination, and Expression (PIE) face database and the NIST 24 fingerprint database, showing the applicability of these ideas to general biometric verification problems.

ACKNOWLEDGMENTS

This thesis has been possible with constant help from many people. I thank my advisor, Prof. Vijayakumar Bhagavatula, for his help and guidance on the complex dissertation problem. His advice on technical writing, presentation skills, and approaches to solving complex problems, and his step-by-step questioning to obtain deeper insight into the problem and its solution, have been invaluable in guiding me to be a good researcher. I express gratitude to my thesis committee members for providing focus to the research problem and asking the questions instrumental in obtaining insight into the research problem. Dr. Marios Savvides has in addition provided advice, encouragement, and help in the practical aspects of studying here. Prof. Rohit Negi aided me in developing the theory and led me in the right direction during the initial phase of my thesis research. This thesis effort would not have been possible without the constant support of my parents. I thank my father for encouraging me in my graduate study and my mother for spending her time getting me through difficult times. My brother has helped by providing practical suggestions and lightening the situation in order to move forward in my research. My friends have been instrumental in maintaining good morale, and have supported me during good and bad times. My research colleagues, Chunyan Xie, Pablo Hennings, Ryan Kerekes, Jason Thornton, Mike Beattie, Lakshmi Ramamoorthy, Sheida Nabavi, Jin Xie, Lingyan Sun, and Hongwei Song, have all helped me in many different ways. Chunyan, Pablo, Ryan, Jason, and Mike have helped me in clarifying my ideas through discussion and in reaching solutions to the thesis research. Lakshmi, Sheida, Jin, Lingyan, and Hongwei have provided encouragement. Ryan verified and provided direction to my theoretical solutions. My friends Evsen Yanmaz, Xun Zhang, Chunyan Xie, and Pablo Hennings supplied much needed support and showed by example that difficult times can be overcome. Ripple Bora has provided constant companionship, support, and patience in listening to my problems. Siddhartha Misra gave encouragement and support. Smarahara Misra is a good friend, and I benefited from his philosophical insights for self-improvement. Ramesh Nallapati has also offered help to me. I am grateful to Lynn Philibin for caring about me and giving the advice that failures help us improve ourselves. I acknowledge Elaine Lawrence for her help throughout my graduate study. I also thank my mentors Chitra Dorai and Deepak Turaga.

Lastly, I am extremely fortunate to have had Dr. Hermina Szeles by my side during the last few months of my PhD. Her assistance has been key in maintaining my focus towards the completion of my dissertation.

TABLE OF CONTENTS

Abstract

1 Introduction
   1.1 Reasons for the usefulness of a classifier ensemble
      1.1.1 Statistical
      1.1.2 Computational
      1.1.3 Representation

2 Literature Survey on Classifier Fusion
   2.1 Overview of Information Integration
      Application
      Fusion objective
      Fusion process input-output (I/O) characteristics
      Sensor Suite Configuration
   2.2 Classifier output fusion
      Hard Decision Fusion
      Soft Decision Fusion
      Classifier Selection
   2.3 Classifier ensemble design
      Different base classifiers
      Different data subsets
      Different feature subsets
   2.4 Discussion

3 Role of Statistical Dependence Between Classifiers
   Optimal Decision Rules
   Role of Statistical Dependence on the Minimum Probability of Error
      3.2.1 Multi-dimensional Search for the Best Set of Thresholds
      Application to Biometric Verification
   Analysis of conditionally dependent classifiers for the OR Rule
      Two Classifier OR rule: False Acceptance Probability
      Two Classifier OR rule: False Rejection Probability
      Analysis of favorable statistical dependence for the N classifier OR rule
   Analysis of conditionally dependent classifiers for the AND rule
   Analysis of conditionally dependent classifiers for the Majority Rule
   Optimal ROC Fusion of Decision Fusion Rules
   Summary and Conclusions

4 Classifier Ensemble Design For Different Rules on Simulated Data
   Ensemble design approach for the OR rule (data distributions 1, 2, 3, and 4)
   Ensemble design approach for the AND rule (data distributions 1, 2, 3, and 4)
   Ensemble design approach for the Majority rule (data distributions 1, 2, 3, and 4)
   Conclusions

5 Ensemble design for Decision Fusion Rules on Biometric Data
   5.1 Classifier ensemble design for the OR rule
      PIE database evaluation
      NIST 24 plastic distortion dataset evaluation
      NIST 24 rotation dataset evaluation
   5.2 Classifier ensemble design for the AND rule
      NIST 24 plastic distortion dataset evaluation
      AR database evaluation
   5.3 Classifier ensemble design for the MAJORITY rule
      NIST 24 plastic distortion dataset evaluation
      NIST 24 rotation dataset evaluation
   5.4 Summary and Conclusions

6 Conclusions and Future Work
   Summary and Conclusions
   Summary of Original Contributions
   Future Work

Appendix
   Optimal ROC Fusion of Decision Fusion Rules
   Optimal Decision Rules for Statistically Independent Decisions
   Optimal Decision Rules for Statistically Dependent Decisions
   Diversity Measures
      Correlation Coefficient ρ
      Q Statistic
      Disagreement Measure D
      Double Fault Measure DF
      Measure of Interrater Agreement, κ, for N > 2
      Entropy Measure E
      Kohavi-Wolpert Variance
      Measure of difficulty θ
      Generalized Diversity
      Coincident Failure Diversity
      Relative Error Measure
      McNemar Test

References

10 LIST OF FIGURES 1.1 The statistical reason for combining classifiers. D* is the best classifier for the data; the outer curve shows the space of all classifiers; the shaded area is the space of classifiers with good performance The computational reason for combining classifiers. D* is the best classifier for the data; the outer curve shows the space of all classifiers; the dashed lines are the hypothetical trajectories for the classifiers during training Optimal classifier for the displayed banana dataset is nonlinear Alternative fusion input-output characteristics Parallel sensor suite fusion Serial sensor suite fusion (a) Authentic scores of a 3-classifier ensemble (a) A zoomed in figure of authentic scores of the 3-classifier ensemble. (b) Impostor scores of the 3-classifier ensemble Region where monotonically increasing decision fusion rules are optimal the authentic correlation coefficient ρ a = Limits on the impostor correlation coefficient in region (1 P FR12 ) P FA12 at the authentic correlation coefficient ρ 12,1 = 0.8 for monotonically increasing rules to be optimal. (a) Upper limits (b) Lower limits (a) Minimum probability of error of 3 classifiers for the best fusion rule as a function of statistical dependence. (b) The best fusion rule as a function of statistical dependence Favorable conditional dependence for (a)and (b)or Favorable conditional dependence for (a)majority (b)and(1,or(2,3)) (a)favorable conditional dependence for OR(1,AND(2,3)). (b)unfavorable conditional dependence for all rules x

11 3.7 Authentic Q values as a function of the correlation coefficient between scores. The Q values are computed at the optimal thresholds of the best decision rule (at the given statistical dependence) Impostor Q values as a function of the correlation coefficient between scores. The Q values are computed at the optimal thresholds of the best decision rule (at the given statistical dependence) Slices of the three dimensional probability of error for the 3 classifier and rule as a function of thresholds on each classifier score at different correlation coefficients between authentic and impostor scores. (a)ρ a = 1, ρ i = 1 with min. error at thresholds (-4,-4,0.5) (b)ρ a = 1, ρ i = 0.5 with min. error at thresholds (-0.13,- 0.13,-0.13) Sample distorted images of a finger in the NIST 24 plastic distortion dataset Sample images of the variations present in AR database (a) General case of impostor classification by two classifiers. The classifiers declare impostor in the sets shown. Each classifier has a different color. The intersection of the two sets is declared as impostor by the OR rule. The complement of the intersection is the FAR of the OR rule. (b) The largest FAR for the OR rule when the sum of individual classifier FARs is large. (c) The largest FAR of the OR rule when the sum of individual classifier FARs is small. (d) The smallest FAR of the OR rule (a) General case of classification on authentics by two classifiers. Classifiers declare authentic in the sets shown (each classifier has a different color). The union of the two sets is declared authentic by the OR rule. The complement of the union is the FRR of the OR rule. (b) Largest FRR for the OR rule (c) Smallest FRR for the OR rule when the sum of FRRs of the individual classifiers is large. (d) Smallest FRR for the OR rule when the sum of FRRs of the individual classifiers is small a. General case of sets of impostor images classified correctly by two classifiers. Each set of impostor images classified correctly by a classifier has a different color. The union of the two sets is correctly classified by the AND rule and the c omplement of the union corresponds to the probability of false acceptance (FA). b. The largest probability of FA for the AND rule when the sum of individual classifier probabilities is large. c. The smallest probability of FA for the AND rule when the sum of individual classifier FA probabilities is small. d. The smallest probability of FA for the AND rule when the sum of individual classifier FA probabilities is large. 86 xi

12 3.15 a. General case of sets of authentic images classified correctly by two classifiers. Each set of authentic images classified correctly by a classifier has a different color. The intersection of the two sets is correctly classified by the AND rule and the complement of the intersection corresponds to the probability of false rejection (FR). b. Largest probability of FR for the AND rule when the sum of the FR probabilities of the individual classifiers is large. c. Largest probability of FR for the AND rule when the sum of the FR probabilities of the individual classifiers is small. d. Smallest probability of FR for the OR rule Data Distribution Data Distribution Data Distribution Data Distribution Optimum ensemble for OR rule fusion on data distribution 1. The pink, green and blue lines are the linear classifier decision boundaries. The dashed line is the OR rule decision boundary Design of multiple linear classifiers for OR rule fusion on data distribution 3. The marked area denotes the authentic decision region for the OR rule Design of multiple linear classifiers for OR rule fusion on data distribution 4. The marked area denotes the authentic decision region for the OR rule Design of multiple linear classifiers for AND rule fusion on data distribution 2. The marked area denotes the impostor decision region for the AND rule Design of multiple linear classifiers for AND rule fusion on data distribution 4. The marked area denotes the impostor decision region for the AND rule Design of multiple linear classifiers for Majority rule fusion on data distribution 1. The marked area denotes the authentic decision region for the Majority rule Design of multiple linear classifiers for Majority rule fusion on data distribution 2. The marked area denotes the impostor decision region for the Majority rule Design of multiple linear classifiers for Majority rule fusion on data distribution 3. The marked area denotes the authentic decision region for the Majority rule Design of multiple linear classifiers for Majority rule fusion on data distribution 4. The marked area denotes the authentic decision region for the Majority rule Images of different face poses of a person ROC of a single classifier per person trained on all 39 (13 poses * 3 illuminations) authentic training images xii

13 5.3 Individual classifier ROCs of our designed ensemble on the entire PIE pose and illumination database. The legend refer to the labels given to different poses in the PIE database Average authentic and imposter Q values of our ensemble designed for the OR rule ROC of the OR rule fusion using the designed classifiers ROCs of individual classifiers of Adaboost for a sample (40th) person. The EERs of the individual classifiers are between 5% and 12% Sample (40th) person s ROC of weighted decision fusion of individual classifiers by Adaboost. The EER is 5.2% Average ROC of Adaboost applied on all the authentic training images of a person. The averaging is done by averaging FRR across all persons for a given FAR. The EER is 6.2% Average pair-wise classifiers Q values for the Adaboost ensemble on the PIE database. This is obtained by first averaging pair-wise Q values for each person, and then averaging over all persons Average ROC for major decision rules applied on the bagging ensemble for the PIE database Average pair-wise Q values of the bagging ensemble on the PIE database. The averaging is done over all pairs of classifiers of a person, and then over all persons Distorted and partial fingerprints of a sample finger in the NIST 24 plastic distortion dataset Average authentic and imposter Q values of pair-wise classifiers in our ensemble on the NIST 24 plastic distortion set. A set of best thresholds on each of the classifiers are found for a given FAR/FRR point on the OR fusion ROC. The x-axis in this figure represents the index for these threshold sets Set 1 of the three authentic training image subsets of a sample finger. Each training subset is used to make one UOTF filter in the OR rule ensemble Set 2 of the three authentic training image subsets of a sample finger. This is used is building the second UOTF filter in the ensemble Set 3 of the three authentic training image subsets of a sample finger. This is used is building the third UOTF filter in the ensemble Comparison of ROCs for NIST 24 plastic distortion set: Three individual classifiers in our ensemble (each trained a subset of the authentic training ser), single OTF classifier using the entire authentic training set, and OR fusion of our ensemble. This shows that classifier ensemble fusion can be better than the best individual classifier xiii

14 5.18 Test ROCs of all fusion rules for the three classifier ensemble designed for the OR rule on the NIST 24 plastic distortion dataset Three classifier decision fusion ROCs for the bagging classifier ensemble on the NIST 24 plastic distortion set ROCs for the And, Or, Majority rules for the bagging classifier ensemble on the NIST 24 plastic distortion set Sample images of rotated fingerprints of the same finger in the NIST 24 rotation dataset Sample faint fingerprints of the same finger in Figure 5.21 in the NIST 24 rotation dataset Comparison of PSRs of the designed OTCHF classifier ensemble for the OR rule Vs the PSRs of a single OTCHF designed for the entire rotation range Comparison of ROCs on NIST 24 rotation set: ROC of a single OTCHF designed for rotation tolerance to all rotations in the test set, ROCs of 5 individual classifiers in the proposed ensemble, and OR fusion of the proposed ensemble Test ROCs of the OR, AND, MAJORITY rules on the designed classifier ensemble for the OR rule on the NIST 24 rotation dataset Authentic and imposter Q values of pair-wise classifiers in our ensemble on NIST 24 rotation set Authentic PSRs of classifiers 4 and 5 in our ensemble for a sample finger (Finger 16) on NIST 24 rotation set Authentic PSRs of classifiers 4 and 5 in our ensemble for all fingers in NIST 24 rotation set Authentic PSRs of classifiers 1 and 5 in our ensemble for all fingers in NIST 24 rotation set Authentic and impostor PSRs (match scores) for all 20 classifiers in the ensemble for a sample finger in the NIST rotation database Authentic correlation coefficient between each pair of classifier scores in the 20 classifier ensemble in the NIST rotation database Impostor correlation coefficient between each pair of classifier scores in the 20 classifier ensemble in the NIST rotation database ROCs of the 20 individual classifiers in the ensemble generated for the NIST rotation database along with their OR rule fusion xiv

15 5.34 ROCs of classifier ensemble designed for the AND rule and ROC of their AND fusion. An exhaustive search for the best thresholds on the PSRs of the three classifiers is used to find the best ROC. The average EER of the 3 individual classifiers is 2.82% and the AND decision fusion EER is 2.71% Performance of proposed And rule ensemble with SVM classifiers on NIST 24 plastic distortion database. ROCs of 15 monotonic decision fusion rules Performance of Bagging with SVM classifiers on NIST 24 plastic distortion database. ROCs of 15 monotonic decision fusion rules Pair-wise classifier Q values of the proposed SVM And rule ensemble on the NIST 24 plastic distortion database. These are shown as a function of the best threshold set for the And rule. Dashed (solid) lines are impostor (authentic) Q values Pair-wise Q values of the Bagging ensemble of SVM classifiers generated on the NIST 24 plastic distortion database. An optimal set of thresholds are selected for the And rule, for which the Q values are shown. Dashed (solid) lines are impostor (authentic) Q values Sample images of the variations present in AR database Individual classifier ROCs of the proposed Majority rule ensemble PSRs of a sample finger when trained with every 15th authentic training image, starting from the 1st that are divided into the same three groups as used in the OR rule design Test results on the ensemble designed for Majority rule on the NIST 24 plastic distortion set. ROCs of three classifier decision rules are shown. And 123 : And fusion of all three classifiers. And 1,Or2,3 : Or fusion of classifiers 2 and 3 is done first. This result is then fused with classifier 1 by the And rule PSRs of authentic test images of a sample finger (finger 7) and some impostor test images for the nine classifier ensemble designed for the Majority rule. Eight training images are used Eleven authentic training images are used to design a nine classifier ensemble designed for the majority rule for a sample finger (finger 7). PSRs of authentic test images and some impostor test images for the nine classifier ensemble are shown Thirteen authentic training images are used to design a nine classifier ensemble designed for the majority rule for a sample finger (finger 7). PSRs of authentic test images and some impostor test images for the nine classifier ensemble are shown Sixteen authentic training images are used to design a nine classifier ensemble designed for the majority rule for a sample finger (finger 7). PSRs of authentic test images and some impostor test images for the nine classifier ensemble are shown xv

16 LIST OF TABLES 1.1 Comparison of accuracies for the majority vote fusion for an independent classifier set and a diverse classifier set The Adaboost algorithm for two classes Prediction of the best fusion rule using correlation coefficients between classifier scores along with the top two observed fusion rules (in terms of TER/2) for bootstrap classifiers on NIST 24 data Evaluation of the optimality of the ensemble design using correlation coefficients between classifier scores. The ensemble is designed for the or rule on NIST 24 data Prediction of best fusion rule using correlation coefficients between classifier scores along with top two observed fusion rules (in terms of TER/2) for Bagging on AR data Classifier design evaluation using correlation coefficients between classifier scores for the classifiers designed for the and rule on AR data Two step procedure to obtain the optimal ROC for the AND rule from individual ROCs of statistically dependent classifiers The Adaboost algorithm for the UMACE base classifier Training image registration Correlation coefficients between pair-wise classifier scores. The ensemble is designed for AND rule using UOTF filters on the NIST 24 plastic distortion database Pair-wise score correlation coefficients of the proposed AND rule SVM ensemble Pair-wise score correlation coefficients of the Bagging SVM ensemble Performance (with 95% confidence intervals) of single classifiers on the AR database Performance (with 95% confidence intervals) of two classifier LDA fused with the AND rule Performance (with 95% confidence intervals) of two classifier SVMs fused with the AND rule xvi

17 5.9 Performance (with 95% confidence intervals) of two classifier DCCFs fused with the AND rule Comparison between the proposed AND rule ensemble with 94 LDA classifiers and Bagging with 94 LDA classifiers Training of each classifier for Majority Decision Fusion of a three classifier set. Each training subset is used in the training sets of 2 classifiers. This results in maximum accuracy since at least two classifiers produce a correct decision on that training subset. Each classifier is trained on a different set of two training subsets for maximum diversity and most significant improvement over the single classifier Pair-wise correlation coefficients of UOTF filter PSRs The desired rotation tolerance range for each of the nine classifiers used in Majority fusion is provided here Transform to simultaneously diagonalize the authentic and impostor covariances Two step procedure to obtain the optimal ROC for the AND rule from individual ROCs of statistically dependent classifiers Two classifier joint probability table (a + b + c + d = 1) xvii

CHAPTER 1

INTRODUCTION

Characteristics that differ from person to person, such as face, fingerprint, iris, palm print, and voice, are considered biometrics. Nowadays, laptops have fingerprint sensors for computer access. Iris scans are used for verification in government-controlled buildings. Cell phones have cameras and are beginning to include fingerprint sensors. These can be used to lock access through biometric authentication [1]. They also add user-friendly features such as linking phone numbers with people's faces for easy recognition of the person calling. The US-VISIT program verifies a person through face and fingerprint images. Some passports have fingerprint and face images embedded for verification. Thus, biometrics are pervading everyday life and will become more commonplace in the future. Biometric verification is being considered for secure access to physical and virtual spaces in place of techniques employing cards and passwords, since biometrics cannot be lost or stolen. In verification, a person claims an identity. The biometric verification system compares the biometric features of the person with the stored templates of the claimed person and provides a yes/no answer. A related term is identification. In identification, the person's biometric features are compared to the stored templates of all people in a database, and the system provides the identity of the person. Recognition refers to both identification and verification. In this research work, the focus is on verification.
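
The verification and identification tasks described above differ only in how match scores are used: verification compares the score against one claimed template and a threshold, while identification searches over all enrolled templates. The following minimal Python sketch is an illustration of that distinction only; the similarity function, enrolled templates, and threshold are hypothetical placeholders, not part of the dissertation.

import numpy as np

def match_score(features, template):
    # Hypothetical similarity score: higher means a closer match.
    return -np.linalg.norm(features - template)

def verify(features, claimed_id, templates, threshold):
    # Verification: 1:1 comparison against the claimed identity only (yes/no answer).
    return match_score(features, templates[claimed_id]) >= threshold

def identify(features, templates):
    # Identification: 1:N search over all enrolled identities.
    return max(templates, key=lambda pid: match_score(features, templates[pid]))

# Toy enrolled templates (one feature vector per person).
templates = {"alice": np.array([0.1, 0.9]), "bob": np.array([0.8, 0.2])}
probe = np.array([0.15, 0.85])
print(verify(probe, "alice", templates, threshold=-0.2))  # yes/no answer
print(identify(probe, templates))                         # best-matching identity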

Fingerprints have a long history of usage in person recognition, especially in criminal identification [2]. Face recognition is currently popular because capturing face images is non-invasive. Iris recognition has been shown to be highly accurate [3]. However, the natural variability of biometric features presents challenges to recognition. In practice, there is a large variability present in the authentic features and perhaps only small differences between the authentic and the impostor features. Impostor features are the biometric features of any person other than the claimed person. At times, the stored templates in biometric verification contain information not only of the claimed person but also of a few other false clients. In this research work, impostors refer not just to these false clients but to any person other than the authentic user. Varying illumination and poses in face images make face recognition challenging. Challenges in fingerprint recognition arise from distortion caused by pressing the finger on the sensor surface, and from varying conditions such as dryness, moisture, and dirt on the fingers. Varying eyelid occlusion in iris images causes difficulty in iris recognition. For example, in the Face Recognition Grand Challenge (FRGC) [4] Experiment 4, the baseline Principal Component Analysis (PCA) method has a correct verification rate of only 12% at 0.1% False Accept Rate (FAR). To mitigate the effect of large distortion in biometric recognition, multiple sources of information, experts, or classifiers can be fused to improve accuracy. A classifier processes the input biometric feature and provides an output. This output differs depending on the recognition task. Recognition consists of two problems: in identification, on presenting a biometric feature such as a face or fingerprint image, a classifier outputs the identity of the person; in verification, the person claims an identity and a classifier outputs a yes or no answer to the claim. Biometric verification is the focus of this work. There are advantages to a set of classifiers, or a classifier ensemble. Dietterich [5] suggests three reasons why a classifier ensemble might be better than a single classifier.

1.1 Reasons for the usefulness of a classifier ensemble

1.1.1 Statistical

Let us assume there are a number of classifiers with good performance on a given labeled dataset, as shown in Figure 1.1. However, each of these classifiers may have a different generalization performance on the data. If we pick a single classifier as the solution, we risk choosing a classifier that turns out to be poor for the problem. A better choice could be to use multiple classifiers and average their outputs. The new classifier may not be better than the single best classifier, but it diminishes the risk of picking an inadequate single classifier.

Figure 1.1: The statistical reason for combining classifiers. D* is the best classifier for the data; the outer curve shows the space of all classifiers; the shaded area is the space of classifiers with good performance.

1.1.2 Computational

Some training algorithms perform hill-climbing or random search, which may lead to different local optima, as shown in Figure 1.2. We assume that the training process of each individual classifier starts somewhere in the space of possible classifiers and ends closer to the optimal classifier D*. Some form of aggregation may lead to a classifier that is a better approximation to D* than any single classifier D_i.

Figure 1.2: The computational reason for combining classifiers. D* is the best classifier for the data; the outer curve shows the space of all classifiers; the dashed lines are the hypothetical trajectories for the classifiers during training.

1.1.3 Representation

It is possible that the classifier space considered for the problem does not contain the optimal classifier. For example, the optimal classifier for the banana dataset given in Figure 1.3 is nonlinear. If we restrict the space of possible classifiers to linear classifiers only, then the optimal classifier for the problem will not belong to this space. However, an ensemble of linear classifiers can approximate any decision boundary with arbitrary accuracy. If the classifier space is defined differently, the optimal classifier D* may be an element of it. In this case, the argument is that training an ensemble to achieve a certain high accuracy is more straightforward than directly training a classifier to achieve high complexity.

Figure 1.3: Optimal classifier for the displayed banana dataset is nonlinear.
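
As an illustration of the representation argument, the following minimal sketch (an illustrative toy, not taken from the dissertation) fuses three fixed linear classifiers with an AND-style vote, so that the fused decision region is a triangle, a boundary no single linear classifier can produce.

import numpy as np

# Three linear classifiers, each defined by a half-plane w.x + b >= 0.
# Individually each boundary is a straight line; their conjunction is a triangle.
halfplanes = [
    (np.array([0.0, 1.0]), 0.0),    # y >= 0
    (np.array([1.0, -1.0]), 1.0),   # x - y + 1 >= 0
    (np.array([-1.0, -1.0]), 1.0),  # -x - y + 1 >= 0
]

def linear_decision(x, w, b):
    # Single linear classifier: declare authentic (1) on one side of the line.
    return int(np.dot(w, x) + b >= 0)

def fused_decision(x):
    # AND fusion: authentic only if every linear classifier says authentic.
    return int(all(linear_decision(x, w, b) for w, b in halfplanes))

print(fused_decision(np.array([0.0, 0.5])))   # inside the triangle -> 1
print(fused_decision(np.array([0.0, 2.0])))   # outside the triangle -> 0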

An improvement on the single best classifier or on the group's average performance is not always guaranteed in the general case. However, the experimental work published so far and the theories developed for a number of special cases demonstrate the success of classifier combination methods [6]. The improvement is obviously affected by the fusion strategy, which aims to combine the diverse information from the multiple experts/classifiers. The typical approach is to study the information obtained from the different sources and then find the most efficient fusion strategy. Tumer and Ghosh [7] note that there is little to be gained from combining, regardless of the chosen scheme, if the classifiers make the same or similar errors. In many cases, we have the freedom and/or the necessity to create multiple classifiers. In such situations, we can pick a fusion strategy and then create the set of multiple classifiers which have the diverse information needed to improve the accuracy. This thesis focuses on the situation where there is freedom to design classifiers for a given fusion rule and develops ways to create the diverse classifier set. A toy example in Table 1.1 shows the importance of diversity among the classifier decisions in the classifier set. Diversity is a fuzzy concept and there are no clear definitions in the literature. In this dissertation, we denote differences in classifier decisions as diversity in classifier decisions, and dissimilar scores as diversity in classifier scores. By diverse classifiers, we mean that the classifiers have dissimilar scores and different decisions. We have two sets of three classifiers, S_1 and S_2, making decisions on a two-class problem. Each of the classifiers has a False Accept Rate (FAR) of 10% and a False Reject Rate (FRR) of 15%. S_1 is a set of independent classifiers, while S_2 is a set of diverse classifiers. For majority vote fusion, set S_1 has an FRR of 6.1% with an FAR of 2.8%, while set S_2 has zero error (as explained below). Thus, the diverse classifier set S_2 can achieve improved accuracy over that of the independent classifiers. The analysis for obtaining the optimal set S_2 for the Majority rule is provided in detail in Chapter 3. The key to obtaining an optimal classifier ensemble is to consider the probabilities for the set of decisions. The Majority rule makes an error on the authentic data for the decision combinations 000, 010, 001, and 100. When the authentic data probabilities for these decision combinations are lowered, the Majority rule error on authentics reduces. The optimal classifier set minimizes the sum of these probabilities. This is a constrained minimization problem, with constraints that fix the individual classifier errors and require the decision-combination probabilities to sum to 1. For this example, the constrained minimization solution yields a zero error for the Majority rule. The error rates quoted above for the independent set can be verified directly, as sketched below.
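
The following minimal Python sketch (an illustrative check, not code from the dissertation) reproduces the independent-classifier numbers in the toy example: with three independent classifiers, each with FAR = 10% and FRR = 15%, the majority vote errs whenever at least two classifiers err.

from itertools import product

def majority_error(p_err, n=3):
    # Probability that a strict majority of n independent classifiers err,
    # where each classifier errs independently with probability p_err.
    total = 0.0
    for votes in product([0, 1], repeat=n):      # 1 = classifier makes an error
        if sum(votes) > n // 2:
            prob = 1.0
            for v in votes:
                prob *= p_err if v else (1.0 - p_err)
            total += prob
    return total

print(majority_error(0.15))  # FRR of the majority rule: about 6.1%
print(majority_error(0.10))  # FAR of the majority rule: about 2.8%

For the diverse set S_2 this function does not apply, because the classifier errors are not independent; its zero error comes from choosing the joint decision probabilities subject to the same marginal constraints, as analyzed in Chapter 3.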

Table 1.1 (columns: classifier decision combinations; probability of each decision combination for impostors and authentics under S_1 and S_2; Majority rule error): Comparison of accuracies for the majority vote fusion for an independent classifier set and a diverse classifier set.

There are numerous fusion strategies, and Chapter 2 provides a survey of fusion methods. When the classifier provides a soft decision or probability, the sum, weighted sum, product, min, and max of these soft decisions or scores are commonly employed. These may not be the best score fusion rules, and hence the whole space of score fusion rules should be considered; that space is vast but unknown, and research is needed to enumerate the score fusion rules. We choose to work with decision fusion rules, as the number of such rules is fixed. The literature on the theoretical behavior of the classifier set for optimal ensemble performance is also provided in Chapter 2. It is found that the classifier outputs in the ensemble need to be diverse to achieve optimal accuracy. Once the classifier ensemble is generated, the diversity in classifier outputs remains unchanged. This implies that classifier ensemble generation is the more important step for obtaining optimal diversity between the classifier outputs. Common ensemble generation methods such as Bagging [8] and Boosting [9] are described in that chapter, and it is found that they lack classifier output diversity [10]. The literature on producing or selecting a diverse classifier ensemble is also reviewed; this literature is surprisingly sparse and shows only limited success. The goal of this thesis is to generate optimal classifier ensembles. These ensembles have optimal diversity in order to achieve maximum ensemble fusion accuracy. Chapter 3 analyzes the effect of classifier diversity on ensemble fusion. Diversity measures are provided in the Appendix to quantify classifier ensemble statistical dependence. A three classifier set is analyzed to relate the diversity to the ensemble accuracy.

The ensemble accuracy is equated to the accuracy of the best monotonic decision rule at the given classifier ensemble statistical dependence. It is found that the best decision rule is most likely to be one of the Or, And, or Majority decision rules. Hence these rules are examined in detail in that chapter to determine their optimal diversity. The biggest missing link in the literature is the connection between the analytical optimal ensemble diversity and the practical generation of the ensemble. This dissertation bridges the gap between theory and practice. As a first step, illustrations of optimal classifier ensemble design on simulated 2D data are shown in Chapter 4. These clarify how the best classifier ensemble achieves optimal diversity. It is assumed throughout this dissertation that the classifier ensemble is built from the same base classifier. For example, for a linear base classifier, the different classifiers in the ensemble all have linear boundaries, but the location of the boundary changes from classifier to classifier. From the illustrations, it is observed that the data distribution and the base classifier together influence the best decision rule. Guidelines for designing optimal classifier ensembles for the And, Or, and Majority decision rules are obtained in that chapter. The design of a classifier ensemble having the desired diversity has been shown to be challenging in the literature. Using the ideas obtained from the classifier design on simulated data, the application of classifier design to real databases is shown in Chapter 5. We propose new classifier ensemble generation methods that use different data subsets chosen to achieve the desired diversity (a small illustrative sketch is given at the end of this chapter). These ideas are illustrated on the CMU Pose, Illumination, and Expression (PIE) face database and the NIST 24 fingerprint database to show their applicability to practical biometric verification problems. As a comparison, the performance of ensembles generated by Bagging [8] and Boosting [9] is shown. Chapter 6 summarizes the key contributions of the dissertation and provides possible future directions. For efficient selection of impostor training subsets, good clustering of the impostor data is required. Good clustering for classifier ensemble generation should depend on the base classifier. Finding such clustering techniques is part of the future work of this thesis.
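
As a preview of the ensemble-generation approach developed in Chapters 4 and 5, the sketch below trains one instance of a common base classifier on each of several data subsets and fuses their decisions with the Or rule. It is a minimal illustration only: the subset choice here is a simple split by index, whereas the dissertation chooses subsets specifically to achieve the diversity that is optimal for the chosen fusion rule, and the scikit-learn base classifier is an assumed stand-in, not the correlation-filter classifiers used later.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_or_rule_ensemble(X_auth, X_imp, n_classifiers=3):
    # Train one base classifier per authentic subset; every classifier sees all impostors.
    subsets = np.array_split(np.arange(len(X_auth)), n_classifiers)
    ensemble = []
    for idx in subsets:
        X = np.vstack([X_auth[idx], X_imp])
        y = np.hstack([np.ones(len(idx)), np.zeros(len(X_imp))])
        ensemble.append(LogisticRegression().fit(X, y))
    return ensemble

def or_rule_decision(ensemble, x):
    # Or fusion: accept if any classifier in the ensemble accepts.
    return int(any(clf.predict(x.reshape(1, -1))[0] == 1 for clf in ensemble))

# Toy data: two clusters of "authentic" samples and one impostor cluster.
rng = np.random.default_rng(0)
X_auth = np.vstack([rng.normal([0, 0], 0.3, (20, 2)), rng.normal([3, 3], 0.3, (20, 2))])
X_imp = rng.normal([0, 3], 0.3, (40, 2))
ensemble = train_or_rule_ensemble(X_auth, X_imp)
print(or_rule_decision(ensemble, np.array([3.0, 3.0])))  # authentic cluster -> likely 1
print(or_rule_decision(ensemble, np.array([0.0, 3.0])))  # impostor cluster -> likely 0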

CHAPTER 2

LITERATURE SURVEY ON CLASSIFIER FUSION

Our approach to improving the performance of biometric verification utilizes the fusion of multiple classifier outputs. This chapter reviews some representative papers on classifier fusion in pattern recognition. Classifier fusion is a specialized topic within the much broader field of information integration. Other common terms for information integration are data fusion and multisensor data fusion. A brief overview of information integration is provided first to familiarize the reader with the terminology used and to show where classifier output fusion fits in the overall field. The next section focuses on common methods of classifier output fusion. Classifier outputs can be in the form of scores, which can be mapped to posterior probabilities of classes, or in the form of decisions. These outputs can be combined in different ways to produce a final score or decision. An alternative is to select one classifier output based on a prediction of the accuracy of the various classifiers. Typical classifier output combination and selection strategies employed in the literature are described in Section 2.2. The performance of fusion/selection of classifier outputs is affected significantly by the generated multiple classifiers, and the generation of multiple classifiers has different requirements than the design of a single classifier. Hence, Section 2.3 is devoted to a literature review of multiple classifier generation techniques. Finally, a discussion of the strategies followed in this dissertation and their advantages is given in Section 2.4.

2.1 Overview of Information Integration

Information integration concepts and techniques can be categorized according to several different perspectives. Some categories and their subgroupings, as given by Dasarathy [11], are:

- Application
- Fusion objective
- Fusion process input-output (I/O) characteristics
- Sensor suite configuration

2.1.1 Application

Studies in the information fusion area can be grouped on the basis of application domain. This is a natural distinction, because some concepts may be relevant only in some application environments, for example, verification in biometric applications and noncooperative target recognition in military applications. Some typical application areas are:

- Biometrics
- Defense
- Robotics
- Medicine
- Space

The application areas can have considerable overlap in terms of objectives as well as the fusion concepts and techniques used to realize those objectives. While our research effort can be applied to many different domains, the focus is on biometric applications.

2.1.2 Fusion objective

Fusion concepts can be categorized according to the goals set for the fusion process. Typically, the fusion objectives of a specific application scenario include one or more of the following functions:

- Detection of the presence of an object (e.g., face detection, target detection in military applications)
- Recognition of an object or an event (e.g., face identification and face verification)
- Identification of the category of an object or event (classification, e.g., fingerprint classification into global patterns such as whorl, loop, arch, etc.)
- Tracking of an object or continued monitoring of an event
- Conjunction of information from multiple sources to make an intelligent decision (e.g., multi-modal biometric fusion)

In this thesis, we focus on the recognition task, specifically on biometric verification.

2.1.3 Fusion process input-output (I/O) characteristics

An important characterization of the fusion process is based on what is being fused. Major classifications are:

- Data fusion: data from multiple sources (sensors) are fused
- Feature fusion: features from multiple sources are combined
- Classifier output fusion: outputs (subdivided into scores and decisions) from multiple classifiers are combined
- Temporal fusion: fusion of data, features, or classifier outputs obtained at different times

Some of the related terminology equivalences [6] in the literature are given below.

- data fusion / information integration / multisensor and multisource data fusion

- feature fusion / symbolic fusion / information fusion / fusion at an intermediate level
- object / input / data point / example / instance / case
- classifier / hypothesis / learning machine / expert
- classifier ensemble / set of classifiers

Xu et al. [12] distinguish among three types of classifier outputs: 1) class label outputs, 2) rank level, where the alternatives are ranked in order of plausibility, and 3) measurement level, namely scores or probabilities of classes. We refer to all types of outputs as decisions. Temporal fusion [13], [14], [15], i.e., fusion of data or information acquired over a period of time, can occur at any of the first three levels mentioned here and hence can be considered orthogonal to their categorization. In addition, depending on the input and output modes, the three-level hierarchy can be further categorized into five fusion process input- and output-dependent modes, as shown in Figure 2.1 [11].

Figure 2.1: Alternative fusion input-output characteristics (data in - data out, data in - feature out, feature in - feature out, feature in - decision out, and decision in - decision out fusion).

Data in - Data out Fusion

This fusion mode is typically referred to as data fusion. Fusion paradigms in this category are generally based on techniques developed in the traditional signal and image processing fields. Multidimensional data fusion can be accomplished through principal component analysis or other transform techniques, including frequency domain analysis tools. Image fusion falls into this category. Image fusion algorithms attempt to produce a single fused image that is more informative than any of the multiple source images used to produce it. Satellite imaging for terrain visualization, geographic information system generation from multiple sources for mapping and charting, medical imaging for human body visualization and diagnosis, and multisensor image fusion for robot guidance are some applications where image and spatial fusion are utilized.

Data in - Feature out Fusion

Here, data from different sensors are combined to derive some form of feature of the object under observation. Fusion in this mode can be looked upon as either data fusion (fusion of data) or feature fusion (fusion resulting in features), depending on whether one is concerned with the input or the output. Most classifiers process data to obtain features. For example, the Polynomial Correlation Filters by Mahalanobis and Kumar [16] can fuse data from multiple sensors, such as infrared data and ladar data, to provide a single correlation output (features).

Feature in - Feature out Fusion

In this mode of feature fusion, derived features, instead of sensed measurements, are typically combined quantitatively, as in a multi-dimensional feature space; qualitatively, as in a heuristic decision logic process; or through a combination of such qualitative and quantitative information. This stage especially includes fusion in systems in which each sensor has a unique data structure and features obtainable from one sensor are not derivable from another. For example, minutiae features from fingerprints and facial features such as edge information from the eyes, nose, etc., are distinctive to their respective sensors.

Feature in - Decision out Fusion

This is one of the more common fusion paradigms encountered in the literature. Here, the inputs are the features from different sensors and the output of the fusion process is a soft score or hard decision. This mode is referred to as either feature fusion (that is, fusion of feature inputs) or decision fusion (fusion resulting in a decision output). One example is [17], where correlation output features from multiple modalities (fingers) are combined using a Support Vector Machine (SVM) to provide a more reliable global decision.

Decision in - Decision out Fusion

This mode has both the inputs and the outputs as decisions and is commonly referred to as decision fusion. This thesis is focused on this approach. Depending on the specific sensors deployed in the sensor suite, fusion at the data and feature levels, that is, fusion in the previous four modes, may not always be practical. For example, data fusion requires compatible sensors that are appropriately registered to permit data-level integration. In cases where this is not practical, decisions will have to be made at each sensor based on the data derived from the local sensor, and these local decisions will have to be passed on to the fusion processor for integration. Another example of this mode of fusion: different biometric verifiers may be sold by different vendors, who are in general unwilling to give access to scores and only provide the final decisions. Even if it is not the best fusion strategy in all cases, decision fusion is always a feasible approach. This fusion mode is the main thrust of this thesis.

2.1.4 Sensor Suite Configuration

Another perspective for grouping studies, particularly in the decision fusion field, is that of sensor suite configuration. Parallel and serial (tandem) configurations are most common. Combinations of serial and parallel configurations are also conceivable. Sensor networks are used in wireless communication, building access control through biometrics [18] or other sensors, automatic target recognition (ATR) applications to provide better decisions, etc. Varshney [19] presents techniques for fusion in sensor suites having independent sensors.

Figure 2.2: Parallel sensor suite fusion.

Figure 2.3: Serial sensor suite fusion.

Parallel Suite

A parallel sensor suite consists of a set of n sensors that are interrogated in parallel, as shown in Figure 2.2. The data, features, or decisions derived from these sensors are combined by the fusion processor. This scenario is well suited to model similar, if not identical, sensors capable of operating independently of one another, more or less simultaneously. Typical multi-modal biometric fusion applications are in this mode. ATR applications also frequently use this mode.

Serial Suite

A serial sensor suite, as shown in Figure 2.3, consists of a set of m sensors that are interrogated in series or tandem; the data, features, or decisions derived from these sensors are combined sequentially. This mode is particularly suited to scenarios with sensors of varying ranges of effectiveness and can model sequential target hand-over from one sensor to the next. One application in the biometrics domain is sequential person verification/identification as the person moves from one room to another in a building, or from one building to another [18], in a defined area. Classifier design and generation may make use of one or more topics within the broad field of data fusion, for example, image fusion from multiple sensors (data fusion), fusion of linear

transformations of log spectral components of speech (feature fusion), and a final fusion of features to provide a decision. While a brief overview of information integration has been provided here, there are several books, conferences, and publications that represent this broad theme, to which the reader is referred for further information. For example, the books by Hall [20], Hall and Llinas [21], Abidi et al. [22] and Aggarwal [23], and the conference series SPIE Conference on Multisensor Data Fusion, IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, and IEEE International Conference on Information Fusion have more information on the broad field of information integration. For the application envisaged in this thesis, i.e., multiple classifiers making decisions on the same test object, the parallel suite is the most suitable, as there is no temporal information. We focus on literature using the parallel fusion scheme in this chapter. In the next section, we present a categorization of decision fusion, or classifier output fusion, and discuss representative literature.

2.2 Classifier output fusion

This section presents different methods of combining multiple classifier outputs to provide a final output. The classifier fusion area has been extensively investigated. The conference series Multiple Classifier Systems has many papers on the topic of classifier fusion. The book Combining Pattern Classifiers [6] by Kuncheva is a well-researched survey of publications in this field. Kleinberg's [24] important paper provides a statistical framework for classifier generation and improvement on combination. Kleinberg [24] introduced the concept of stochastic discrimination (SD) for the generation of multiple weak classifiers that are combined using the sum rule to form a strong classifier. These classifiers, as well as their combination, are over-training resistant under certain strong mathematical assumptions of indiscernibility between training and test sets with respect to the weak classifiers. This requirement would lead to the need for a large training set, which may not be available in practice. However, there have been many heuristic approaches to classifier combination that have been found to be successful, and some of the typical approaches are presented in this section. Classifier outputs can either be fused or selected [6]. In classifier fusion, typically each classifier is supposed

to have knowledge of the whole feature space, while in classifier selection, each classifier is supposed to have good knowledge of a part of the feature space and is expected to make correct classifications of objects in that part. Typically, majority vote or average decisions are used in classifier fusion, while one classifier is selected based on the test sample in classifier selection. There are schemes that fall between these two pure approaches. The fusion of experts [25], where another separate classifier is used to decide which classifier in the ensemble is selected, or how the classifier outputs are weighted to make the final decision on the test sample, is an example of an in-between approach. Some synonyms of the fusion-selection dichotomy in the literature [6] are provided below.

- fusion vs. selection
- competitive classifiers vs. modular approach
- multiple topology vs. hybrid topology

In this section, representative papers on hard decision fusion methods are discussed in Subsection 2.2.1, and typical soft decision fusion rules are presented in Subsection 2.2.2. A third category is classifier selection, where the most accurate classifier is selected for a given test input. Classifier selection strategies are discussed in Subsection 2.2.3.

2.2.1 Hard Decision Fusion

OVERVIEW: This subsection enumerates decision combination methods. Majority and Weighted-Majority voting are commonly used in two-class (verification) problems; Majority voting is defined only for two-class problems. In multi-class problems (identification), Plurality voting is typically used: the class having the largest number of votes is the chosen class. Plurality voting on a two-class problem is Majority voting. A review of papers analyzing the limits of accuracy of these combination methods is also provided in this subsection. The Naive Bayes approach provides its final decision based on the assumption of independent classifier outputs; the final estimate of the posterior probability of the test input is proportional to the product of the class-conditional probabilities of the different classifiers. Multinomial methods also combine classifier output vectors based on posterior probability estimates; they estimate the posterior probability of the classifier output vector from the training data.
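
The Naive Bayes combination just described can be written in a few lines. The sketch below is a minimal illustration, assuming each classifier's class-conditional label probabilities P(d_i | class) have already been estimated (for example, from a confusion matrix on training data); the numbers are hypothetical.

import numpy as np

def naive_bayes_fusion(labels, cond_probs, priors):
    # labels[i] is the label output by classifier i (0 = impostor, 1 = authentic).
    # cond_probs[i][c, d] is P(classifier i outputs d | true class c).
    # Posterior for each class is proportional to prior * product of conditionals.
    posterior = np.array(priors, dtype=float)
    for i, d in enumerate(labels):
        posterior *= cond_probs[i][:, d]
    return posterior / posterior.sum()

# Two classifiers with hypothetical conditional label probabilities.
cond_probs = [
    np.array([[0.90, 0.10],    # classifier 1: P(d | impostor)
              [0.15, 0.85]]),  #               P(d | authentic)
    np.array([[0.80, 0.20],
              [0.10, 0.90]]),
]
print(naive_bayes_fusion([1, 1], cond_probs, priors=[0.5, 0.5]))
# posterior heavily favors the authentic class when both classifiers say authentic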

Rather than a decision for the most likely class, classifier outputs can also be in the form of ranks, and different strategies for the combination of rank outputs from different classifiers are presented. The fusion of class label outputs in a parallel manner is considered in this thesis. Majority [26], [12], [27] and plurality voting [28] are commonly used class label fusion rules. Plurality voting is used in identification, where the class with the maximum number of votes is declared as the identified class. An extension of the majority rule is the weighted majority rule [29], which is a reasonable extension when classifiers in the ensemble do not have the same accuracy, so that more weight can be given to classifiers that are more accurate. In addition to the majority vote, there are other decision fusion rules for two-class problems, such as the and and or rules, for which some analysis has been done [19], [30]. Other approaches to label fusion include Naive Bayes [31], Multinomial methods [32] and Wernecke's methods [33]. Literature on these rules and their analysis to find the limits of their accuracy is reviewed in this subsection.

Label Fusion Rules

Lam and Suen [26] offer theoretical analysis of the limits of majority voting for independent classifiers that have the same individual classification probability. Results are provided for the limits of the majority vote classification probability when the number of classifiers tends to infinity. An analysis of the increase in correct classification probability when two new classifiers are added is also given. When classifiers are independent and each has a different accuracy, the weighted majority rule can be used instead of the majority rule. Shapley and Grofman [29] show that the weights on the classifier decisions depend on the classifier accuracies: the weight $\log\left(\frac{p_i}{1-p_i}\right)$ on the $i$th classifier decision, where $p_i$ is its classification accuracy, is optimal for maximizing the accuracy of the independent classifier ensemble. The proof is based on the Bayes optimal discriminant functions for the classifier outputs. It should be noted that the proof shows that assigning these weights to the classifiers does not guarantee minimum classification error, since the prior probabilities of the classes are not taken into account.
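
A minimal sketch of this weighted majority rule is given below (an illustration of the Shapley-Grofman weighting, with made-up accuracies): each independent classifier votes with weight log(p_i / (1 - p_i)), and the fused decision follows the sign of the weighted sum.

import numpy as np

def weighted_majority(decisions, accuracies):
    # decisions[i] in {0, 1}; accuracies[i] = p_i of the i-th independent classifier.
    # Optimal weights for independent classifiers: log(p_i / (1 - p_i)).
    weights = np.log(np.array(accuracies) / (1.0 - np.array(accuracies)))
    votes = 2 * np.array(decisions) - 1          # map {0, 1} -> {-1, +1}
    return int(np.dot(weights, votes) > 0)       # 1 = accept, 0 = reject

# A highly accurate classifier can outvote two weaker ones.
print(weighted_majority([1, 0, 0], accuracies=[0.95, 0.6, 0.6]))  # -> 1
print(weighted_majority([0, 1, 1], accuracies=[0.95, 0.6, 0.6]))  # -> 0

With equal accuracies this reduces to simple majority voting; as noted above, the rule does not account for class priors.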

independent classifiers when using majority vote fusion for identification. Upper and lower limits on the majority vote accuracy are derived theoretically with respect to the (equal) individual classifier accuracy $p$, the number of classifiers $N$, and the pairwise dependence between classifiers, measured by the Q statistic [34]. The Q statistic is a measure of the diversity in verification decisions produced by two classifiers. The results support the intuition that negative pairwise dependence between classifiers is beneficial. Intuitively, the best improvement of the majority vote accuracy over the individual classifier accuracy is achieved when exactly $\lfloor N/2 \rfloor + 1$ votes are correct for a test sample (or none of the votes are correct); any extra correct vote is wasted because it is not needed to give the correct class label. When $p$ is less than $\frac{\lfloor N/2 \rfloor + 1}{N}$, this combination can be achieved, leading to a maximum majority vote accuracy of $P_{maj} = \frac{pN}{\lfloor N/2 \rfloor + 1}$, and this happens when all pairs of classifiers are negatively dependent. Matan [35] gives tight upper and lower bounds on majority voting accuracy for two classes (verification) when the classifiers are statistically dependent and have unequal error probabilities. Demirekler and Altincay [28] compare the plurality voting performance of classifier ensembles that have the best and the worst joint distributions to classifier ensembles that have independent joint distributions. These distributions are obtained by formulating the combination operation as an optimization problem. The best and the worst correct classification probability is given for a two-class problem with $N$ classifiers where all classifiers have equal individual classification probabilities. Extension to multi-class problems is shown by an example of a 3-class, 4-classifier problem. For classifiers with unequal error probabilities, an example is given for a 2-class, 5-classifier problem where one classifier has a different error probability. The optimum solution gives a recognition probability of 1 as $N$ approaches infinity if $p > 1/M$, where $M$ is the number of classes. The best and the worst distributions when an additional classifier is added to the system are also found. There are a large number of label fusion rules; for example, for two-class labels from $N$ classifiers, there are $2^{2^N}$ label fusion rules. For independent classifiers, the best fusion rule is monotonic [19], thus reducing the search space for the best decision fusion rule (explained in Appendix 7.2). For independent classifiers, the likelihood ratio increases as the number of classifiers declaring authentic increases. Hence, if a given set of classifier decisions is declared authentic by the fusion rule, monotonically increasing rules would also declare authentic when a larger number of classifiers in the

set declare authentic. However, for a large number of classifiers, the number of monotonic rules is still large (exponential). Varshney [19] considers the fusion of independent classifiers and provides an analysis to find the best fusion rule and the best thresholds on the scores of individual classifiers, given the probability distributions of the scores. When the classifiers are not conditionally independent, the joint probability distribution of the scores needs to be known. In the typical biometric applications of interest in this thesis, the training data size is too small to estimate the joint probability distributions of the scores. The number of non-linear coupled equations, $N + 2^N$ [19], that need to be solved simultaneously increases exponentially with the number of classifiers $N$, and the computational effort becomes prohibitive. This computation becomes easier for conditionally independent classifiers. Schubert et al. [36] find the worst-case Receiver Operating Characteristic (ROC) curves of statistically dependent classifiers for the two-classifier OR/AND rules, given the ROC curves of the individual classifiers. Measures of dependence or correlation are defined. However, they assume that there is a constant correlation between classifiers at every pair of thresholds on the classifier scores. This is not a valid assumption, since the correlation between classifier decisions depends on the thresholds chosen. Schubert et al. corrected this assumption in [30], where they account for the fact that the correlation between classifier decisions changes at different thresholds. They show a simulated example of two-class, two-classifier AND fusion when analytical expressions are available for the individual classifier ROCs as well as for the errors of the AND rule given the thresholds on the individual classifiers. At different points on the ROC for the AND rule, there are different values of correlation between classifier decisions. In practice, we will not have analytical expressions for the individual classifier errors and, more importantly, for the errors of the fusion rule, which depend on the joint distribution of the classifier scores. This thesis uses an approach similar to Demirekler and Altincay's [28] to analyze other decision fusion rules. The difference is that, for any decision fusion rule other than the majority rule, conditional dependence matters, while it does not matter for the majority rule. In addition, we consider favorable and unfavorable distributions rather than just the best-case and worst-case distributions. At favorable (unfavorable) dependence, accuracy on fusion is better (worse) than accuracy on fusion at statistical independence. We also consider classifiers with different classification probabilities instead of considering only classifiers with the same accuracy for combination.
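As a small illustration of the pairwise dependence measure used in this subsection, the following is a minimal Python sketch of the Q statistic between the verification decisions of two classifiers, using its standard definition from the diversity literature; the decision vectors in the example are hypothetical.

```python
import numpy as np

def q_statistic(correct_a, correct_b):
    """Q statistic between two classifiers.

    correct_a, correct_b: boolean arrays, True where each classifier
    decides the test sample correctly. Q lies in [-1, 1]; Q = 0 for
    statistically independent decisions, Q < 0 for negatively
    dependent (diverse) classifiers.
    """
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    n11 = np.sum(a & b)        # both correct
    n00 = np.sum(~a & ~b)      # both wrong
    n10 = np.sum(a & ~b)       # only classifier A correct
    n01 = np.sum(~a & b)       # only classifier B correct
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

# Hypothetical decision outcomes on 8 test samples.
a = np.array([1, 1, 0, 1, 0, 1, 1, 0], dtype=bool)
b = np.array([0, 1, 1, 0, 1, 1, 0, 1], dtype=bool)
print(q_statistic(a, b))   # negative value, i.e., a diverse pair
```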

This approach is detailed in Chapter 3.

Naive Bayes Combination

The Naive Bayes or Simple Bayes method assumes that the classifier decisions are conditionally independent. The Bayes rule is used to estimate the posterior probability conditioned on the decision vector $\mathbf{u} = [u_1\ u_2\ \ldots\ u_N]$, where $u_i$ is the $i$th classifier's decision, i.e., $P(\omega_c|\mathbf{u})$, where $c$ is the class label. Assuming the classifiers are conditionally independent, the posterior probability is given by

$P(\omega_c|\mathbf{u}) = \frac{P(\omega_c)P(\mathbf{u}|\omega_c)}{P(\mathbf{u})} = \frac{P(\omega_c)\prod_{i=1}^{N} P(u_i|\omega_c)}{P(\mathbf{u})}$   (2.1)

As the denominator does not depend on the class label, it can be ignored. Domingos and Pazzani [31] found that the Naive Bayes method is surprisingly accurate even when the classifier decisions are not conditionally independent. They compared the performance of the Naive Bayes method assuming independent classifier probabilities to three other approaches to classification, using 28 datasets with dependent features for evaluation. The Naive Bayes method was the best for 10 of the 28 datasets. They computed the pairwise feature dependence using mutual information. For independent features, the mutual information is zero; it is highest when the features provide identical information. However, they did not normalize the mutual information. While the lower limit of mutual information is zero, the upper limit is not fixed. Since they did not normalize the diversity measure, it is not known how much pairwise dependence is present in the datasets. For a two-class problem, they show that the Naive Bayes method assuming independent classifiers is optimal in half the space of $(p, r, s)$, where $p$ is the prior probability of one of the classes, and $r$ and $s$ are the posterior probabilities of the two classes computed assuming independence of all classifier probabilities. Further, attempts to amend the Naive Bayes method by including estimates of some dependencies do not always pay off. The difficulty in implementing this method in practice for biometric applications, with their typically small data sets, is that the estimates of the individual classifier conditional probabilities are inaccurate.
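The following is a minimal sketch of the Naive Bayes combination of hard decisions in Eq. (2.1), assuming the per-classifier conditional probabilities $P(u_i|\omega_c)$ have already been estimated from training data; the probability tables below are hypothetical.

```python
import numpy as np

def naive_bayes_fuse(decisions, priors, cond_prob):
    """Naive Bayes fusion of hard decisions.

    decisions : length-N sequence of 0/1 decisions u_i from the N classifiers.
    priors    : priors[c] = P(omega_c), one entry per class.
    cond_prob : cond_prob[c][i][u] = P(u_i = u | omega_c), estimated on
                training data.
    Returns the class with the largest unnormalized posterior,
    P(omega_c) * prod_i P(u_i | omega_c).
    """
    scores = []
    for c, prior in enumerate(priors):
        p = prior
        for i, u in enumerate(decisions):
            p *= cond_prob[c][i][u]
        scores.append(p)
    return int(np.argmax(scores)), scores

# Hypothetical two-class (impostor = 0, authentic = 1), three-classifier example.
priors = [0.5, 0.5]
cond_prob = [
    # class 0: each classifier says "authentic" (1) with low probability
    [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15]],
    # class 1: each classifier says "authentic" with high probability
    [[0.2, 0.8], [0.3, 0.7], [0.1, 0.9]],
]
label, scores = naive_bayes_fuse([1, 0, 1], priors, cond_prob)
print(label, scores)   # label 1 wins here
```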

Multinomial methods

In this group of methods, the posterior probabilities $P(\omega_c|\mathbf{u})$ are estimated from the training data. The behavior knowledge space (BKS) [32] and Wernecke's method [33] are examples of the multinomial approach. To have reliable estimates, the data set should be large. Huang and Suen [32] proposed the Behavior Knowledge Space (BKS) method. Based on the training data, a BKS lookup table is created, with a cell for each decision vector $\mathbf{u}$ containing the estimates of $P(\omega_c|\mathbf{u})$. The class with the highest posterior probability is labeled as the representative class for that cell. During the test phase, the BKS lookup table is accessed using the decision vector $\mathbf{u}$, and its representative class is assigned as the decision on the test object. Empty cells in the lookup table are either labeled randomly or labeled using the majority rule; ties are resolved arbitrarily. The BKS method is often overtrained due to limited data, with poor results on the test data. Wernecke's [33] method aims at reducing overtraining. In each cell of the lookup table, the 95 percent confidence intervals of the posterior probability estimates are calculated. If the confidence intervals overlap, then an alternate labeling method is used for that cell: the least wrong classifier for that cell is used to label the cell.

Rank Level Combination

Instead of providing the most probable class, ranks of the top choices of classes can be provided. The Borda Count [37] is a well-known combination scheme for rank level combination and is a generalization of the majority voting rule. Let $B_j^i$ be the number of classes ranked below class $j$ by the $i$th classifier. The Borda Count for the $j$th class is given by

$B_j = \sum_{i=1}^{N} B_j^i$   (2.2)

The Borda Count decision rule is to pick the class with the highest $B_j$.
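Below is a minimal sketch of the Borda Count combination of Eq. (2.2); the rankings are hypothetical, and each classifier's ranking lists class indices from best to worst.

```python
import numpy as np

def borda_count(rankings, n_classes):
    """Borda Count fusion of rank outputs.

    rankings : list of per-classifier rankings, each a list of class
               indices ordered from best (rank 1) to worst.
    Returns the winning class and the Borda scores B_j, where each
    classifier contributes the number of classes it ranked below class j.
    """
    scores = np.zeros(n_classes)
    for ranking in rankings:
        for position, cls in enumerate(ranking):
            # number of classes ranked below `cls` by this classifier
            scores[cls] += n_classes - 1 - position
    return int(np.argmax(scores)), scores

# Hypothetical: three classifiers ranking four classes (0..3).
rankings = [
    [2, 0, 1, 3],
    [0, 2, 3, 1],
    [2, 1, 0, 3],
]
winner, scores = borda_count(rankings, n_classes=4)
print(winner, scores)   # class 2 wins in this example
```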

Ho [38] introduces some rank combination schemes. An intersection method and a union method are proposed for class set reduction. The first method computes the intersection of large neighborhoods taken from each classifier. The lowest rank given by a classifier to the true class in the training set is chosen as a threshold (which determines the size of the neighborhood) for that classifier. For a test pattern, classes ranked above the thresholds are selected and intersected. The second method computes the union of small neighborhoods taken from each classifier. The thresholds on the ranks are selected by a max-min procedure. For each training object, the rank of the correct class given by each of the classifiers is noted. The best (minimum) rank among these is placed under the classifier that produces it. For each classifier, the maximum of these best ranks is found over all the training objects; this maximum rank is the neighborhood of that classifier. The union of the neighborhoods of all the classifiers contains the true class. The highest rank method proposed by Ho [38] is a rank reordering method. The N classifiers are applied to the given test object to find the ranks of the different classes, so each class receives N ranks. The minimum (highest) of these N ranks is assigned to that class as its score. The classes are then sorted by these scores to derive a combined ranking for that object. Ties in the combined ranking may be broken arbitrarily to achieve a strict linear ordering. In empirical results, the author notes that the Borda Count is better and improves the accuracy at all ranks. The highest rank method can improve the accuracy in the top ten choices substantially, but because of the arbitrarily broken ties, it does not give good top-choice performance. Al-Ghoneim and Kumar [39] introduce the Pooled Ranking Figure of Merit (PRFM), which is a generalized rank level combination scheme; the plurality decision rule, the average rule and the Borda Count are special cases of the PRFM. The Ranking Figure of Merit (RFM) defined in [40] for each of the N classifiers is averaged to obtain the PRFM. The RFM is a differentiable family of objective functions that rewards a classifier for making better rankings. An object that is correctly classified at the first rank receives the maximum reward; if the object is ranked second, it receives a smaller reward, the third-ranked object a smaller reward than the second, and so on up to the top k choices. The PRFM does not require numerical scores from all classifiers. It provides a unifying framework and can combine classifiers providing scores with those providing class labels or ranks.

Soft Decision Fusion

OVERVIEW: This subsection enumerates soft classifier output combination methods. Classifiers typically produce scores. These scores are converted to probabilities using normalization schemes, which are enumerated below. Sum, Weighted-Sum, product, and, to a smaller

extent, order statistic combiners are the commonly used soft decision fusion schemes. A review of the analysis and empirical results on these fusion rules is also given. The posterior probabilities of a given class from all classifiers are combined using these fusion rules to yield the final posterior probability of the same class; hence these fusion rules are class-conscious combiners. Dempster-Shafer combination and Decision Templates are class-indifferent combiners, which use the posterior probabilities of all classes from each classifier to provide the final posterior probabilities. Details of these class-indifferent combination methods are also provided below.

Most classifiers produce a soft decision or score, and soft decision fusion is a popular fusion strategy. The scores are normalized to obtain estimates of the a posteriori probabilities. A further reason for normalizing the scores is to fuse classifier scores of different ranges obtained from different types of classifiers. The simple or weighted sum/average and the order statistic combiners are typical fusion rules applied to the normalized scores. In this section, some typical methods of normalizing scores are given, followed by a literature review of the typical fusion rules and their accuracy.

Normalizing Scores

Duda et al. [41] propose a softmax output for normalizing discriminant and neural network scores. Let $s_1(x), s_2(x), \ldots, s_c(x)$ be the outputs of the classifier for the $c$ classes. The normalized output is given by

$\nu_j(x) = \frac{\exp(s_j(x))}{\sum_{k=1}^{c} \exp(s_k(x))}$   (2.3)

Some typical methods for normalizing scores for two-class problems, enumerated in Jain et al. [42], are given below. The score $s$ is normalized to $\nu$, and $S$ represents the set of all scores for the classifier.

Min-Max: $\nu = \frac{s - \min(S)}{\max(S) - \min(S)}$   (2.4)

z-score: $\nu = \frac{s - \mathrm{mean}(S)}{\mathrm{std}(S)}$   (2.5)

Median Absolute Difference (MAD): $\nu = \frac{s - \mathrm{median}(S)}{\mathrm{median}(|s - \mathrm{median}(S)|)}$   (2.6)

double sigmoid function: $\nu = \begin{cases} \frac{1}{1+\exp(-2(s-t)/r_1)} & \text{if } s < t \\ \frac{1}{1+\exp(-2(s-t)/r_2)} & \text{otherwise} \end{cases}$   (2.7)

where $t$ is the reference operating point.

Tanh: $\nu = \frac{1}{2}\left[\tanh\left(0.01\,\frac{s - \mathrm{mean}(S)}{\mathrm{std}(S)}\right) + 1\right]$   (2.8)

The Min-Max and z-score normalizations are sensitive to outliers [42]. The Min-Max normalization retains the original distribution except for a scaling factor, whereas the z-score normalization does not retain the original input distribution. The median and median absolute difference (MAD) are insensitive to outliers and to points in the extreme tails of the distribution, so a normalization scheme using the median and MAD is robust. However, when the score distribution is not Gaussian, the median and MAD are poor estimates of the location and scale parameters. Therefore, this normalization technique does not retain the original input distribution and does not transform the scores into a common numerical range. Cappelli et al. [43] introduced the double sigmoid function, which has a linear range in $(t - r_1, t + r_2)$ for a given operating point $t$ and an exponential characteristic outside this range. However, it requires careful tuning of the parameters $t$, $r_1$, $r_2$ to obtain good efficiency. Generally, the value of $t$ is chosen in the region of overlap between authentic and impostor scores, and $r_1$, $r_2$ are the extents of overlap on the left and right side, respectively. The scores in the overlap region are linearly transformed, while those outside are non-linearly transformed. The Tanh normalization introduced by Hampel et al. [44] is robust to outliers and is efficient. Shu and Ding propose an Adaptive Confidence Transform (ACT) [45], based on the theory of Classifier Confidence Analysis, as a better normalization method for distance (similarity) scores. It is a two-step process and outperforms the Min-Max and Tanh normalization methods. For $c$ classes and $N$ classifiers, the first step computes a generalized confidence $g(\omega_j|x_i)$ for class $j$, given the data vector $x_i$ used by the $i$th classifier:

$g(\omega_j|x_i) = 1 - \frac{s_j(x_i)}{\min_{k \neq j}(s_k(x_i))}$   (2.9)

or

$g(\omega_j|x_i) = 1 - \frac{\max_{k \neq j}(s_k(x_i))}{s_j(x_i)}$   (2.10)

The generalized confidence is given by Eq. (2.9) when $s_j(x_i)$ is a distance measure between the data vector $x_i$ and the template of class $j$ used by the $i$th classifier. When $s_j(x_i)$ is a similarity measure, the generalized confidence is computed by Eq. (2.10).
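The following is a minimal sketch of three of the normalizations above (Min-Max, z-score and Tanh, Eqs. (2.4), (2.5) and (2.8)), applied to a hypothetical set of match scores.

```python
import numpy as np

def min_max(scores):
    # Eq. (2.4): map scores to [0, 1]; sensitive to outliers.
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min())

def z_score(scores):
    # Eq. (2.5): zero mean, unit variance; does not bound the range.
    s = np.asarray(scores, dtype=float)
    return (s - s.mean()) / s.std()

def tanh_norm(scores):
    # Eq. (2.8): robust normalization into (0, 1).
    s = np.asarray(scores, dtype=float)
    return 0.5 * (np.tanh(0.01 * (s - s.mean()) / s.std()) + 1.0)

# Hypothetical raw similarity scores from one classifier.
raw = [12.3, 15.1, 9.8, 30.2, 14.7]
print(min_max(raw))
print(z_score(raw))
print(tanh_norm(raw))
```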

The second step maps the generalized confidence to the a posteriori probability using a mapping function $f(\cdot)$: $P(\omega_j|x_i) = f(g(\omega_j|x_i))$. Let the domain of the generalized confidence be $T$, let $X$ be a sample in the training set $S_T$, and let $y = g(\omega_j|x_i)$. For $y \in T$, choose a small closed interval $[y - \delta, y + \delta]$. The mapping function $f(y)$ is computed by

$f(y) = \frac{\sum_{j=1}^{N} \mathrm{count}\left(\{X \mid X \in S_T \text{ and } g(\omega_j|x_i) \in [y-\delta, y+\delta] \text{ and } X \in \omega_j\}\right)}{\sum_{j=1}^{N} \mathrm{count}\left(\{X \mid X \in S_T \text{ and } g(\omega_j|x_i) \in [y-\delta, y+\delta]\}\right)}$   (2.11)

where the function count(.) counts the number of elements in a set.

Typical fusion rules on probability estimates

Some of the commonly used fusion rules on these normalized scores or probabilities are the simple and weighted sum/average, product, min, max and other order statistics, and the generalized mean [46] given by

$\mu(x, \alpha) = \left(\frac{1}{N}\sum_{i=1}^{N} s_i(x)^{\alpha}\right)^{1/\alpha}$   (2.12)

where $s_i$ is the $i$th classifier's normalized score and $x$ is the test object.
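As a small illustration of the generalized mean combiner in Eq. (2.12): varying α recovers several of the fusion rules listed above (α = 1 gives the mean rule, α → −∞ approaches the min rule, α → +∞ approaches the max rule). A minimal sketch with hypothetical normalized scores:

```python
import numpy as np

def generalized_mean(scores, alpha):
    """Eq. (2.12): generalized mean of one test object's normalized scores."""
    s = np.asarray(scores, dtype=float)
    return (np.mean(s ** alpha)) ** (1.0 / alpha)

# Hypothetical normalized scores of a 3-classifier ensemble for one test object.
s = [0.9, 0.6, 0.7]
print(generalized_mean(s, alpha=1))      # arithmetic mean rule
print(generalized_mean(s, alpha=-20))    # approaches the min rule
print(generalized_mean(s, alpha=20))     # approaches the max rule
```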

Ross and Jain [47] empirically evaluate some score fusion rules on biometric data. The sum rule on the scores, the Fisher Discriminant [41] linear classifier on concatenated classifier ensemble scores, and decision trees are compared; the sum rule is found to outperform the other rules. Wang and Casasent [48] find that using the data quality information of the images to weight the sum of scores improves accuracy over the simple sum or weighted sum rule. The weights for the weighted sum rule are typically based on the error rates of the individual classifiers [49]. Wang and Casasent provide results of face and fingerprint score combination. They weight the fingerprint scores based on the quality of the training and test fingerprint images and the error rates of the fingerprint and face classifiers; the face scores are weighted such that the sum of the weights equals 1. Zhang and Chen [50] propose discriminant classifiers that project features formed by concatenating the classifier ensemble scores onto a subspace: Symmetric Max Minimal distance on Subspace (SMMS) and Generalized SMMS (G-SMMS). The SMMS tries to identify a subspace where all authentic training scores are projected onto a single point and impostor training scores are projected far away from that point; it maximizes the minimal distance of the impostor projections from the authentic projection point. The Generalized SMMS relaxes the constraint that all authentic training samples are projected to a single point. It also considers the optimal direction of the linear SVM as a feasible solution, to ensure that its solution is no worse than the SVM [51]. Verification of a test feature is done by projecting it onto the obtained subspace and comparing the projection to a threshold. While there have been many empirical evaluations of score fusion rules, limited work has been done on understanding the underlying theory behind score fusion. Kittler et al. [52] attempt to justify the sum, product, min, max, and median rules as simplifications or bounds of the maximum a posteriori probability of an ensemble of conditionally independent classifiers. However, most of these assumptions are not reasonable. For example, the justification of the sum rule uses the assumption that the posterior probabilities of the classifiers are close to the prior probability; with this assumption, the maximum sum (of the classifiers' posterior probabilities) is equivalent to the maximum a posteriori probability. This assumption would imply a uniform conditional probability density on the data or features, which is unrealistic. Tumer and Ghosh [53] provide a better justification than Kittler et al. [52] for the good performance of the mean (or sum). They show that this is because the bias and variance of the deviation of the classifiers' decision boundaries from the optimal Bayes decision boundary are reduced on combination by the mean (or sum). In addition to the simple mean, Tumer and Ghosh [53], [54] provide an analytical framework to quantify the improvements in classifier fusion for order statistic combiners such as min, max and median. Tumer and Ghosh [53], [54] assume that the classifier outputs approximate the a posteriori probability of the class. They consider a single-dimensional input with a single-mode probability density and relate the classifier's error/decision boundary to the optimal Bayes error/decision boundary. The total error of the classifier is equal to the minimum Bayes error plus an added error due to shifting the decision boundary with respect to the optimal Bayes decision boundary. A first-order (linear) approximation of the a posteriori probability within a suitably chosen region about the optimum boundary is assumed. Using this assumption, the bias and variance components of the combiner decision boundary with respect to the optimal boundary are examined. The bias and variance terms should be reduced by the combiner to improve accuracy. In the absence of classifier bias, the reduction in the added error is directly proportional to the reduction in the variance. For linear combiners, if the errors of individual classifiers are zero mean

i.i.d., the factor of reduction in boundary variance is shown to be $N$, the number of classifiers that are combined. If the errors of individual classifiers are zero mean i.i.d., the added error of an order statistic combiner is obtained by applying a reduction factor to the added error of the individual classifier. The reduction factor is obtained from tables that depend on the distribution of the data and the order statistic used. The authors assume a Gaussian distribution for the deviation of the combiner decision boundary from the optimum Bayes boundary. The errors of the order statistic combiners are higher than the error of the linear (average) combiner. When the classifiers are biased and/or have positively correlated outputs, the reduction factors on the added error of the individual classifier are smaller. Tumer and Ghosh [53], [54] use the correlation coefficient as a diversity measure on classifier scores; the correlation coefficient on scores is defined in the Appendix. For statistically independent classifiers, the correlation coefficient is zero. Tumer and Ghosh [53], [54] find that the error reduces as the value of the correlation coefficient reduces. They only consider positive values of the correlation coefficient and do not deduce the effect of a negative correlation coefficient between classifier outputs. This is important because, at a negative value of the correlation coefficient, the added error of the mean (sum) score combiner would be smaller than the added error of statistically independent classifiers. While they attempt to design classifier ensembles for linear combiners, they are not successful; their methods are discussed in the next section on ensemble generation techniques. Xie and Kumar [55] proposed the class-dependent feature analysis (CFA) method, where the scores of the $N$ classifiers are used as features. The CFA feature for an input $x$ is given by the array

$\mathrm{CFA}(x) = \left[ s_1(x)\ \cdots\ s_i(x)\ \cdots\ s_N(x) \right]^T$   (2.13)

Let $\mathrm{CFA}_j$ be the representative CFA feature array for the $j$th class. The CFA feature of the test input, $\mathrm{CFA}(x)$, is compared to the representative CFA vector of the $j$th class using a distance or similarity measure to determine the class label. Xie and Kumar [55] used this approach when the

classifiers can only act on two-class problems. The classifiers are designed to recognize one class and reject all other classes. For correlation filter [56], [57] based classifiers, the cosine distance or normalized correlation was found to be the most accurate.

Class Indifferent Combiners

Decision Templates [58] and Dempster-Shafer combiners [59] are examples of class indifferent combiners. For a given input $x$, each classifier can provide a degree of support or probability for each of the $c$ classes. Without loss of generality, we can assume the $c$ degrees of support are in the interval $[0, 1]$. The outputs from the $N$ classifiers can be organized in a decision profile as a matrix

$DP(x) = \begin{bmatrix} s_{1,1}(x) & \cdots & s_{1,j}(x) & \cdots & s_{1,c}(x) \\ \vdots & & \vdots & & \vdots \\ s_{i,1}(x) & \cdots & s_{i,j}(x) & \cdots & s_{i,c}(x) \\ \vdots & & \vdots & & \vdots \\ s_{N,1}(x) & \cdots & s_{N,j}(x) & \cdots & s_{N,c}(x) \end{bmatrix}$   (2.14)

where $s_{i,j}$ is the $i$th classifier's probability for the $j$th class. Class-conscious combiners such as the sum and weighted sum, product, and order statistic combiners operate on one column of the decision profile at a time. Class-indifferent combiners utilize all the elements of the decision profile to make the final decision. The elements of the decision profile can be treated as features in a new feature space, referred to as the intermediate feature space in the class indifferent combiner literature. Any type of classifier that takes the intermediate feature space as input and outputs a class label can be used. In the decision templates approach proposed by Kuncheva et al. [58], a representative decision profile for each class $j$, say the mean of all decision profiles of that class, is chosen as the decision template $DT_j$. The decision profile of the test input is compared to the decision template of each class using a similarity or distance measure, and the closest match is declared as the class label of the test input. In other words, this is a nearest mean approach in the intermediate feature space. Any distance measure, such as the Euclidean, Minkowski or Mahalanobis distance, can be used. The authors compare 11 different fuzzy measures of similarity and conclude that integral measures of similarity seem to perform better.
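A minimal sketch of the decision templates combiner described above, using the mean decision profile per class and the Euclidean distance in the intermediate feature space; the training profiles and the test profile below are hypothetical.

```python
import numpy as np

def train_decision_templates(profiles, labels, n_classes):
    """Decision template DT_j = mean of the N x c decision profiles of class j."""
    profiles = np.asarray(profiles, dtype=float)   # shape (n_samples, N, c)
    labels = np.asarray(labels)
    return [profiles[labels == j].mean(axis=0) for j in range(n_classes)]

def decision_template_label(dp, templates):
    """Assign the class whose template is closest (Euclidean) to DP(x)."""
    dists = [np.linalg.norm(dp - dt) for dt in templates]
    return int(np.argmin(dists))

# Hypothetical: 2 classifiers (rows), 2 classes (columns), 4 training samples.
train_profiles = [
    [[0.9, 0.1], [0.8, 0.2]],   # class 0
    [[0.7, 0.3], [0.9, 0.1]],   # class 0
    [[0.2, 0.8], [0.3, 0.7]],   # class 1
    [[0.1, 0.9], [0.2, 0.8]],   # class 1
]
train_labels = [0, 0, 1, 1]
templates = train_decision_templates(train_profiles, train_labels, n_classes=2)

test_dp = np.array([[0.25, 0.75], [0.35, 0.65]])
print(decision_template_label(test_dp, templates))   # expected: class 1
```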

Dempster-Shafer combination [59], which takes inspiration from the Dempster-Shafer belief theory, uses three steps to arrive at the support for each class.

1. Let $DT_j^i$ denote the $i$th row of decision template $DT_j$, and let $D_i(x)$ be the $i$th classifier's output vector of probabilities for all classes. The proximity $\Phi_{j,i}(x)$ between $DT_j^i$ and $D_i(x)$ is computed as

$\Phi_{j,i}(x) = \frac{\left(1 + \|DT_j^i - D_i(x)\|^2\right)^{-1}}{\sum_{k=1}^{c}\left(1 + \|DT_k^i - D_i(x)\|^2\right)^{-1}}$   (2.15)

where $\|\cdot\|$ denotes any matrix norm. Thus, for each decision template there are $N$ proximities.

2. For every class $j = 1, 2, \ldots, c$ and for every classifier $i = 1, 2, \ldots, N$, the following belief degrees are calculated:

$b_j(D_i(x)) = \frac{\Phi_{j,i}(x)\prod_{k \neq j}\left(1 - \Phi_{k,i}(x)\right)}{1 - \Phi_{j,i}(x)\left[1 - \prod_{k \neq j}\left(1 - \Phi_{k,i}(x)\right)\right]}$   (2.16)

3. The final degrees of support for each class are

$\mu_j(x) = K \prod_{i=1}^{N} b_j(D_i(x))$   (2.17)

where $K$ is a normalizing constant.
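A minimal sketch of the three Dempster-Shafer combination steps above (Eqs. (2.15)-(2.17)), with hypothetical decision templates and a hypothetical test decision profile; the normalizing constant K is omitted since it does not change the winning class.

```python
import numpy as np

def dempster_shafer_support(dp, templates):
    """Support mu_j(x) for each class, Eqs. (2.15)-(2.17), with K omitted."""
    dp = np.asarray(dp, dtype=float)               # shape (N, c)
    n_classifiers, n_classes = dp.shape
    # Step 1: proximities Phi[j, i]
    phi = np.zeros((n_classes, n_classifiers))
    for i in range(n_classifiers):
        inv = np.array([1.0 / (1.0 + np.linalg.norm(templates[k][i] - dp[i]) ** 2)
                        for k in range(n_classes)])
        phi[:, i] = inv / inv.sum()
    support = np.ones(n_classes)
    for j in range(n_classes):
        for i in range(n_classifiers):
            # Step 2: belief degree b_j(D_i(x))
            prod_others = np.prod([1.0 - phi[k, i]
                                   for k in range(n_classes) if k != j])
            b = phi[j, i] * prod_others / (1.0 - phi[j, i] * (1.0 - prod_others))
            # Step 3: product over classifiers
            support[j] *= b
    return support

# Hypothetical templates (2 classes) and test decision profile (2 classifiers, 2 classes).
templates = [np.array([[0.8, 0.2], [0.85, 0.15]]),
             np.array([[0.15, 0.85], [0.25, 0.75]])]
test_dp = np.array([[0.25, 0.75], [0.35, 0.65]])
print(dempster_shafer_support(test_dp, templates))   # larger support for class 1
```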

Classifier Selection

Ho et al. [38] introduced the concept of dynamic classifier selection (DCS) as an alternative to classifier ensemble combination, where the most appropriate classifier is chosen to make the decision. The principle behind classifier selection is that different classifiers may be more competent in different regions of the data/feature space. Depending on the region where the test input lies, the most competent classifier for that region is chosen to make the decision. Finding the regions of competence, estimating the competence of the classifiers in each region, and choosing a selection strategy (e.g., choosing the best classifier or weighting the decisions by the competence) are the main problems in classifier selection. Classifier selection is typically done by estimating the local accuracy (around the test point) of the classifiers in the test phase. This is attempted by finding the $K$ nearest neighbors $Kn_x$ of the test input $x$ in the training or validation set, and then computing the competence of the classifiers on these $K$ objects [60]. Another way, proposed by Giacinto and Roli [61], is to estimate the classifier competence by a weighted average of the classifier's predictions for the correct labels of the $K$ nearest neighbors. If an object $x_j$ has the class label $\omega_k = l(x_j)$, let $P_i(\omega_k = l(x_j)|x_j)$ be the a posteriori probability estimate of the $i$th classifier for the correct class $\omega_k$. These probabilities are weighted by the distances of the $K$ nearest neighbors to the test input. The competence of the $i$th classifier for $x$ is given by

$A_i(x) = \frac{\sum_{x_j \in Kn_x} P_i(\omega_k = l(x_j)|x_j)\,(1/d(x, x_j))}{\sum_{x_j \in Kn_x} (1/d(x, x_j))}$   (2.18)

where $d(x, x_j)$ is the distance between $x$ and $x_j$. An alternative approach is to find the competence after the class labels for the test input $x$ have been found by all classifiers. If the $i$th classifier provides a class label $l_i$, then the $K$ nearest neighbors that the $i$th classifier declared as $l_i$ are found. The classifier competence is the proportion of objects in that set whose true class label is $l_i$ [60]. Woods et al. [60] proposed another modified approach. Among the $K$ nearest neighbors of the test input $x$, find the set $Kn_x^k$ of objects that have the true class label $k$. The a posteriori probability estimate of the $i$th classifier for class $k$, $P_i(\omega_k|x_j)$, is found for these objects. These are weighted based on the distance of each object to the test input to find the competence:

$A_i(x) = \frac{\sum_{x_j \in Kn_x^k} P_i(\omega_k|x_j)\,(1/d(x, x_j))}{\sum_{x_j \in Kn_x^k} (1/d(x, x_j))}$   (2.19)

Shin and Sohn [62] proposed a combination of DCS and classifier ensemble methods for fusion. Multiple decision trees are built, but two clusters of trees are chosen based on the local accuracy on the test sample, and these are combined through majority voting. Methods of classifier selection as well as classifier fusion are applied after the classifier ensemble is designed. However, the classifier ensemble design is much more important than the selection or fusion, because neither classifier selection nor classifier fusion can be effective in reducing the ensemble error when the classifier ensemble has poor diversity. Appendix 7.4 describes measures of diversity of classifier outputs.
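A minimal sketch of the distance-weighted local competence estimate of Eq. (2.18), assuming each classifier can provide posterior probability estimates for the validation samples; all arrays below are hypothetical.

```python
import numpy as np

def local_competence(x, neighbors, neighbor_labels, posterior):
    """Eq. (2.18): distance-weighted competence of one classifier at x.

    neighbors       : (K, d) array, the K nearest validation samples to x.
    neighbor_labels : length-K array of their true class labels.
    posterior       : posterior[j, c] = the classifier's estimate of
                      P(omega_c | x_j) for the j-th neighbor.
    """
    x = np.asarray(x, dtype=float)
    d = np.linalg.norm(neighbors - x, axis=1)
    w = 1.0 / d                                   # assumes x is not itself a neighbor
    p_correct = posterior[np.arange(len(neighbor_labels)), neighbor_labels]
    return np.sum(p_correct * w) / np.sum(w)

# Hypothetical test point, 3 nearest neighbors in 2-D, and one classifier's posteriors.
x = [0.0, 0.0]
neighbors = np.array([[0.1, 0.0], [0.0, 0.3], [0.5, 0.5]])
labels = np.array([1, 0, 1])
posterior = np.array([[0.2, 0.8],     # P(class 0 | x_j), P(class 1 | x_j)
                      [0.7, 0.3],
                      [0.4, 0.6]])
print(local_competence(x, neighbors, labels, posterior))
```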

Optimal ensemble design methods are the key contributions of this dissertation. Details of common ensemble generation techniques are provided in the next section.

2.3 Classifier ensemble design

As was mentioned earlier, for improved performance on classifier fusion, the classifier ensemble design is crucial. If all classifiers in the ensemble make errors on the same objects, any fusion strategy will yield poor results. Thus, diverse classifiers should be present in the ensemble for effective fusion. Ensemble design has several types of groupings in the literature, which are enumerated as follows.

Decision Optimization vs. Coverage Optimization: The ensemble design approach can be geared towards either decision optimization or coverage optimization. Decision optimization refers to finding and optimizing the fusion rule given a fixed ensemble of classifiers, while coverage optimization refers to methods of creating a diverse set of classifiers assuming a fixed fusion rule. The fusion-selection grouping can be thought of as a subdivision of the decision optimization - coverage optimization grouping. The latter approach is followed in this thesis.

Overproduce and Select methods vs. Generative methods: Classifier ensembles can be obtained either from generative methods, which construct the desired diverse classifier ensemble directly, or by overproducing classifiers and selecting a subset having the desired diversity. The former approach is more difficult, and several papers follow the latter approach. Giacinto and Roli [63] aim to choose an independent classifier ensemble for fusion with majority voting using the overproduce and choose strategy. Multiple neural network classifiers are generated through a randomized approach (i.e., random initialization). Different base classifiers (e.g., different types of neural networks (NN), such as Radial Basis Function (RBF) NN, Multilayer Perceptron (MLP), etc.) are used in this overproduction process. If the classifiers in the ensemble are all obtained from the same classification algorithm by using different parameters of the classification algorithm (including different training sets or different feature sets), then the classification algorithm is called the base classifier. They use the double fault diversity measure [34] (described in the Appendix) for clustering classifiers. A pair of

classifiers from different clusters has a lower double fault measure (i.e., more diversity) than a pair of classifiers from the same cluster; in other words, classifiers in different clusters are more likely to make different errors on the same objects than classifiers in the same cluster. One classifier from each of the different clusters is selected to form the ensemble, and the ensemble with the highest accuracy is chosen as the best ensemble. Margineantu and Dietterich [64], [65] use ensemble pruning to select a diverse ensemble from the set of classifiers produced by AdaBoost [9]. This is done by iteratively choosing the pair of classifiers having the highest value of a diversity measure (based on the kappa measure [34]) until the desired number of classifiers is reached. Banfield et al. [66] propose thinning the entire set of classifiers to obtain the diverse ensemble. They define an ensemble diversity measure which is the proportion of uncertain data points; for these uncertain points, the number of correct votes from the classifier ensemble is between 10% and 90%. The classifier that is most often incorrect on the uncertain points is removed from the ensemble, and this step is repeated until the desired ensemble size is reached. However, all these overproduce-and-select methods have the problem that either the overproduction strategy or the selection strategy used to find the best ensemble is not optimal. It will be shown later in this thesis that the overproduction strategy will not be successful unless there is a careful selection of different base classifiers. The selection strategy will not be optimal unless diversity among all classifiers in the ensemble is considered (rather than the pairwise classifier diversity used in [63], [64], [65]). In this thesis, we propose generative methods to design the classifier ensemble directly. Instead of using diversity measures as passive tools for monitoring the ensemble, new ways of using the diversity requirement in the classifier ensemble generation itself are put forth. These are provided in Chapter 5.

Random Generation vs. Nonrandom Generation: There are many papers on randomized ensemble generation aimed at statistically improving fusion accuracy when combining a large number of classifiers. There are also approaches that use prior knowledge to improve classifier diversity, aiming to improve upon the random generation methods. Some of these random and non-random approaches are discussed in the following subsections. These different groupings of classifier ensemble design are present in each of the following

levels of creating the ensemble.

Classifier level: use different base classifiers
Data level: use different data subsets
Feature level: use different feature subsets

Different base classifiers

Different base classifiers may be chosen to form the ensemble with the idea that these classifiers will be independent. This is an ad-hoc approach to building the ensemble. There are several types of base classifiers. Some of the commonly used classifiers for face recognition are Principal Component Analysis (PCA) [67], [68], Linear Discriminant Analysis (LDA) [69], Independent Component Analysis [70], and Support Vector Machines (SVM) [71], [51]. Minutiae-based classifiers [2] are common for fingerprint recognition. Gabor wavelet based classifiers [3] are typically used for iris recognition. Correlation Filters have been successfully used for all image-based biometrics [57], [72], [73], [74], [75], [76], [77]. Tree classifiers, such as decision trees and classification and regression trees [78], are also used in the classifier fusion literature. Using different base classifiers is a naive approach to ensemble generation; it does not have any theory to back it up, and it is not possible to predict the diversity between the classifier outputs of different base classifiers. It is a purely empirical approach and hence is not followed in this dissertation.

Different data subsets

Bagging [8] and boosting [79], [9] are common methods of generating classifier ensembles by using different data subsets. These methods are also used in ensemble generation through different feature subsets. Hence, we first present some common approaches to classifier ensemble generation using different data subsets.

Bagging

Bagging was first proposed by Breiman [8] as Bootstrap AGGregatING. In bagging, classifiers are generated in a parallel manner by training on bootstrap replicates of the training set. The bootstrap samples are obtained by random sampling with replacement from the original training set. The classifier outputs are combined through majority (in verification applications) or plurality (in identification applications) voting. Bagging works when the base classifier is such that small changes in the training set result in large changes in the classifier output (unstable classifiers). If the classifier outputs were independent, then the majority vote would be guaranteed to increase the ensemble accuracy over those of the individual classifiers [26]. Bagging aims to create independent classifiers by using independent training sets. In practice, there is only one training set available, and the bootstrap training sets are obtained by choosing random subsets of the training data. In reality, these bootstrap samples are not independent samples of the data distribution. Further, even if independent samples were used for training, the classifier decisions need not be independent: the independent training samples may not drastically change the classification boundaries, and the statistical dependence between classifier decisions depends on the classification boundaries. If the classification boundaries are similar, the classifier outputs will be positively dependent. Empirical evaluations support the fact that the classifier outputs in bagging are typically positively dependent [10]. Even though the classifier outputs are not independent, bagging does reduce the ensemble error. Domingos [80] provides some hypotheses about why bagging works. According to Domingos, bagging shifts the prior distribution of the classifier models towards models that have higher complexity (such as the ensemble itself); such models are assigned a larger likelihood of being the right model for the problem. The ensemble is in fact a single complex classifier picked from the new distribution. Schapire et al. [81] support this argument, saying that voting in fact increases the complexity of the system. The ensemble generation methods proposed in this dissertation are compared to bagging in Chapter 5. It is our observation that the random generation of classifier ensembles used in bagging yields poorer performance. Further, Majority voting (as used in bagging) need not be optimal. Reasons and further discussion are provided along with the results in Chapter 5.
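A minimal sketch of bagging with majority-vote fusion for a two-class verification problem, assuming a scikit-learn-style base classifier with fit/predict; the choice of decision trees and the data below are only illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, n_classifiers, rng):
    """Train an ensemble on bootstrap replicates of (X, y)."""
    ensemble = []
    n = len(X)
    for _ in range(n_classifiers):
        idx = rng.integers(0, n, size=n)        # sample with replacement
        clf = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
        ensemble.append(clf)
    return ensemble

def majority_vote(ensemble, X):
    """Fuse the hard decisions of the ensemble by majority vote."""
    votes = np.array([clf.predict(X) for clf in ensemble])   # (N, n_samples)
    return (votes.sum(axis=0) > len(ensemble) / 2).astype(int)

# Hypothetical 2-D data, class 1 = authentic, class 0 = impostor.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
ensemble = bagging_train(X, y, n_classifiers=5, rng=rng)
print(majority_vote(ensemble, X[:5]), majority_vote(ensemble, X[-5:]))
```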

Boosting

In boosting [79], the classifier ensemble is generated in a sequential and incremental manner, by adding one classifier at a time to the ensemble. The current classifier is generated by resampling or reweighting the training data. Initially the sampling or weighting distribution is uniform, and it progresses towards increasing the likelihood of difficult data points. The distribution is modified based on the weighted error of the ensemble. The classifier outputs are typically combined through weighted majority voting. With the reweighting implementation, it is typically assumed that the base classifiers can directly use the modified distribution of the training data as weights. The current classifier in boosting is aimed at correctly classifying the misclassified samples of the previous ensemble. While the initial classifiers typically have good generalization characteristics and good individual classifier accuracy, the later classifiers focus on outliers in the training data and may have lower individual classifier accuracy. Due to the weighted majority voting, the ensemble will have improved accuracy over the individual classifiers. Freund and Schapire [9] introduced a variant of boosting called Adaboost, from ADAptive BOOSTing. Adaboost for two classes is explained below [9]. It is assumed that there are $M$ training samples $x_i$, $i = 1, 2, \ldots, M$, with their associated labels $y_i \in \{0, 1\}$. Assume $N$ classifiers are designed sequentially. The distribution of the training samples is associated with a set of weights $w_i^l$, where $l$ refers to the classifier index and $i$ refers to the training sample index. Initially, the distribution is uniform with weights $w_i^1 = 1/M$. In the resampling version of Adaboost, the distribution is used to sample the training data. In the reweighting version, the base classifier weights all the training samples based on the distribution weights. The $l$th classifier's label for the $k$th training sample is given by $h_l(x_k)$. The Adaboost algorithm is given in Table 2.1 and is explained here for the reweighting version; the explanation can be suitably modified for the resampling version. The initial classifier uses all the training images equally in the reweighting version. An average error on the training set is found for this initial classifier. The weights of the training samples are then modified: the misclassified training samples are given more weightage. The next classifier in the sequence is computed using the new weights, so that more weightage is given to the samples misclassified by the previous classifier. In this way, each new classifier in the sequence is generated

by giving more weightage to the training samples misclassified by the previous classifier. During the test phase, a decision is made by the Adaboost algorithm using all the classifiers present in the ensemble. This decision is a weighted sum of the individual classifier decisions, weighted based on the accuracy of each classifier; more accurate classifiers are given a larger weight.

Initialize the weights $w_i^1 = 1/M$, $i = 1, 2, \ldots, M$.
For $l = 1, 2, \ldots, N$:
1. Compute the classifier $h_l$.
2. Calculate the weighted error of $h_l$: $\epsilon_l = \sum_{i=1}^{M} w_i^l\,|h_l(x_i) - y_i|$.
3. Set $\beta_l = \frac{\epsilon_l}{1 - \epsilon_l}$.
4. Set the new weights to be $w_i^{l+1} = \frac{w_i^l\,\beta_l^{1 - |h_l(x_i) - y_i|}}{\sum_{i=1}^{M} w_i^l\,\beta_l^{1 - |h_l(x_i) - y_i|}}$.
Output the final decision:
$h_f(x) = 1$ if $\sum_{l=1}^{N} \log(1/\beta_l)\,h_l(x) \geq \frac{1}{2}\sum_{l=1}^{N} \log(1/\beta_l)$, and $h_f(x) = 0$ otherwise.
Table 2.1: The Adaboost algorithm for two classes.

For multiple classes, two extensions of the Adaboost algorithm, Adaboost.M1 and Adaboost.M2, are provided in [9]. With Adaboost, the ensemble error on the training data quickly converges to zero even with very few classifiers. Interestingly, the test error reduces with additional classifiers even after the training error reaches zero. This phenomenon is possibly explained through the margin theory [9], [81] in relation to the Vapnik-Chervonenkis (VC) dimension [71]. There are several papers on extensions and modifications of Adaboost. DOOM (Direct Optimization of Margins) [82] is another extension. Schapire and Singer [83] and Allwein et al. [84] provide interesting variants of Adaboost with Error Correcting Codes (ECOC) ensembles, multilabel classifiers and multi-class problems. Several ad-hoc variants include arc-x4 by Breiman [85], MultiBoosting by Webb [86], and Adaboost-VC by Long and Vega [87]. In this thesis, we sample the training set to create statistically dependent classifiers. The base classifier behavior, i.e., how the classifier outputs change with different training sets, is taken into account while sampling the training set, in order to create the desired diversity in the classifier ensemble. This is completely different from bagging, where the base classifier behavior is never taken into consideration.
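The following is a minimal Python sketch of the two-class Adaboost of Table 2.1 in its reweighting form, assuming a base learner that accepts per-sample weights; the decision stump used here is only an illustrative choice.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, n_classifiers):
    """Two-class Adaboost (Table 2.1), reweighting version; y in {0, 1}."""
    M = len(X)
    w = np.full(M, 1.0 / M)
    classifiers, betas = [], []
    for _ in range(n_classifiers):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = np.abs(h.predict(X) - y)            # |h_l(x_i) - y_i|
        eps = np.sum(w * miss)
        if eps == 0 or eps >= 0.5:                 # degenerate cases, stop early
            classifiers.append(h)
            betas.append(1e-10 if eps == 0 else 1.0)
            break
        beta = eps / (1.0 - eps)
        w = w * beta ** (1.0 - miss)               # down-weight correct samples
        w /= w.sum()
        classifiers.append(h)
        betas.append(beta)
    return classifiers, np.array(betas)

def adaboost_predict(classifiers, betas, X):
    """h_f(x): weighted vote with weights log(1/beta_l)."""
    alphas = np.log(1.0 / betas)
    votes = np.array([h.predict(X) for h in classifiers])    # (L, n_samples)
    return (alphas @ votes >= 0.5 * alphas.sum()).astype(int)

# Hypothetical 1-D two-class data.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-1, 1, 50), rng.normal(1, 1, 50)]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)
clfs, betas = adaboost_train(X, y, n_classifiers=10)
print(np.mean(adaboost_predict(clfs, betas, X) == y))   # training accuracy
```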

Boosting empirically takes the base classifier behavior into account by creating the current classifier on the misclassified samples of the previous ensemble. However, it has been found that there is not much diversity present even in the ensemble created by Boosting [10]. A comparison of Bagging and Boosting with our proposed ensemble design is done in Chapter 5. The results demonstrate the improvement in ensemble diversity as well as ensemble accuracy with our proposed approach.

Different feature subsets

Classifiers using different feature subsets can be used to form the ensemble. These subsets can be randomly chosen, obtained through some natural grouping, or obtained by other heuristic methods aimed at creating diverse ensembles.

Natural Grouping: Some problems have natural groupings of features, which are utilized in designing classifier ensembles. For example, in text-independent speaker identification, different groups of features are related to the pitch of the signal and to the speech spectrum. The speech spectrum can be further characterized by the linear predictive coefficients, the cepstrum, and so on [88]. Each grouping of features can be used to design a classifier, and their outputs can be fused, rather than designing a single classifier on all features.

Random Selection

The random subspace method (RSM) by Ho [89] uses randomly chosen feature subsets of a predefined size to build each classifier of the ensemble. The classifiers' outputs are then weighted based on their errors and combined. Ho [89] suggests that good performance for tree classifiers is found when the predefined size is about half the total feature size. This method works well when there is redundant information dispersed over all the features rather than concentrated in subsets of features. Latinne et al. [90] combine bagging with the RSM: B random bootstrap replicates are sampled from the training data, and for each bootstrap replicate, R random subsets of features are sampled. The classifier ensemble is thus of size N = B x R classifiers. The idea is to create more diverse classifiers than either bagging or the RSM alone. Latinne et al. [90] claim that the combined method outperforms each of the individual methods.
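A minimal sketch of the random subspace method: each classifier is trained on a random feature subset of predefined size (here half the features, following Ho's suggestion); the base classifier and data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def rsm_train(X, y, n_classifiers, subset_size, rng):
    """Random subspace method: one classifier per random feature subset."""
    ensemble = []
    for _ in range(n_classifiers):
        feats = rng.choice(X.shape[1], size=subset_size, replace=False)
        clf = DecisionTreeClassifier(random_state=0).fit(X[:, feats], y)
        ensemble.append((feats, clf))
    return ensemble

def rsm_predict(ensemble, X):
    """Plurality vote of the subspace classifiers (unweighted, for simplicity)."""
    votes = np.array([clf.predict(X[:, feats]) for feats, clf in ensemble])
    return (votes.mean(axis=0) > 0.5).astype(int)

# Hypothetical data with 10 features, two classes.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
y = (X[:, :5].sum(axis=1) > 0).astype(int)    # discriminative information spread over features
ensemble = rsm_train(X, y, n_classifiers=7, subset_size=5, rng=rng)
print(np.mean(rsm_predict(ensemble, X) == y))
```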

Skurichina et al. [91] compare the performance of the RSM to bagging, i.e., to classifiers built on randomly chosen data subsets. The RSM is resistant to the number of redundant features when the classification ability of the feature set is spread over all the features. The bagging method is unaffected by either the number of redundant features or the redundancy representation, because all features are used in designing each classifier. Bagging may outperform the RSM for highly redundant feature subspaces where the discrimination power is condensed in a few features; the RSM outperforms bagging when the discrimination power is spread over all features. Bagging is useful for classifiers with a nondecreasing learning curve constructed on critical training sample sizes. The RSM is useful for classifiers having a decreasing learning curve constructed on small and critical training sample sizes. Pekalska et al. [92] use distances (dissimilarity representations) as features. This is useful when the number of features for each object is prohibitively large or when the features have little discriminative power. The distances of the given object to the n objects of the training set are used as a feature set (of size n). The RSM is applied to select random subsets of these features, and a linear discriminant classifier is applied to each random subset. Pekalska et al. recommend choosing a subset between 4 and 30 percent of the total features. Pekalska et al. [93] extend this idea by using p multiple distance measures, so that each object is characterized by pn features. They find that using distance measures or dissimilarity representations of a different nature results in better performance on their combination. Genetic algorithms (GA) offer a guided random search (to find the ensemble of feature subsets) in the space of all possible feature subsets. A GA operates on a set of M chromosomes. The population is evolved by producing offspring and keeping the set of M fittest individuals, using some fitness criterion. There are two approaches that use a GA to find the feature subsets in the ensemble. In the first approach [94], each chromosome is a representation of one feature subset. For example, a binary-valued chromosome of length n (for a total of n features) has its ith value equal to 1/0 if the ith feature is present/not present in the feature subset. If the fitness function is purely the classifier accuracy (based on the given feature subset), the GA will converge to a single solution and all feature subsets of the population will be identical. Hence, a fitness function that combines the classifier accuracy and its diversity with respect to the classifier ensemble (either based on the feature subset, which is not very useful, or based on the classifier output diversity) is used to obtain a good

ensemble. The problem with this approach is that the ensemble accuracy is not figured into the evolution process. The second approach [95] overcomes this problem by having each chromosome represent the entire feature subset ensemble and using the ensemble accuracy as the fitness function. When the feature subsets are disjoint, the chromosome is an integer-valued vector of size n, with the ith value denoting the classifier that uses the ith feature; zero means that none of the classifiers uses the ith feature. When the feature subsets intersect, the chromosome is binary valued with size N x n, for an ensemble of size N; the kth row of n bits represents the feature subset of the kth classifier.

Non-Random Selection

Oza and Tumer [96] propose input decimation as a simple way of selecting feature subsets that outperforms the random subset selection method. They design N = c classifiers, where c is the number of classes, so that each classifier has a favorite class. The correlation between each feature and the class label variable is used to determine the feature subsets. The class label variable has value 0 for all objects that are not in the ith class and 1 for all objects that are in the ith class. The n correlations of the features are sorted according to their absolute value, and the features corresponding to the n_i largest values are chosen as the subset for the ith classifier. Each classifier has a favorite class but is trained to recognize all classes. Various feature selection methods, such as floating selection [97] and sequential group selection [98], can be used instead of sorted correlations. However, this approach yielded poor results in [7] compared to data subset classifier ensemble generation. An improvement to the above approach by Puuronen et al. [99] uses an iterative procedure to increase the ensemble accuracy. Here, the initial ensemble is obtained from the favorite class model. The classifier whose output differs the least from the outputs of the other classifiers is identified as the median classifier; this classifier is found using pairwise diversity measures. Then n feature subset replacements are generated by changing the feature subset of the median classifier, altering the present/absent status of each feature one at a time. The ensemble accuracy of each replacement is computed, and the ensemble with the highest accuracy is kept. This process is repeated iteratively until no further improvement in the ensemble accuracy is found. This greedy algorithm has been shown experimentally to converge quickly and to improve on the initial ensemble. Several variants of this approach can be designed. In [100], the initial ensemble is obtained from the random

subspace selection method. Instead of varying the feature subset of only the median classifier, all classifier feature subsets can be varied. In this case, computing the ensemble accuracy for all variations becomes prohibitively expensive. Hence, the authors use a weighted combination of the classifier accuracy and its diversity with respect to the ensemble as the measure for choosing the best ensemble. Gunter and Burke [101] propose an incremental method, i.e., feature subset based classifiers are added to the ensemble one at a time. The feature subsets of the previous classifiers are banned, although an intersection of the feature subsets is allowed. The authors recommend the floating search method for being both robust and computationally reasonable, although they suggest that any feature selection method can be used. The authors use different feature selection algorithms, relying on their suboptimality to produce different feature subsets. The ensemble accuracy is used as the evaluation criterion for the incremental ensemble. This dissertation does not use the feature subset ensemble generation approach, since it was found to be less successful than data subset ensemble design in [7].

2.4 Discussion

For classification problems, we need a final classifier to provide a decision. The classifier design and generation may make use of one or more fusion processes with different input-output characteristics, for example, image fusion from multiple sensors (data fusion), fusion of cepstral coefficients and linear predictive coefficients of the speech spectrum (feature fusion), and a final fusion of features to provide a decision. We focus on classifier fusion as the best means to achieve our goal of improving verification performance on a single biometric modality. The possible combiners for soft decision fusion are many and hence difficult to investigate exhaustively; in the literature, some typical methods such as sum, product, weighted sum and order statistics are investigated. In hard decision fusion, there is a fixed number of fusion rules, and hence we investigate these rules. Furthermore, with an optimal choice of thresholds, the fusion of hard decisions provides better accuracy than fusion with a comparable soft decision rule. This point can be illustrated with the help of some results obtained in Chapter 5 on classifier ensemble fusion. A 3-classifier ensemble is generated for the OR fusion rule on the NIST 24 plastic distortion fingerprint database. A portion of the 3-classifier ensemble scores is plotted in Figures 2.4 and 2.5. The authentic score plots contain

sorted authentic scores of the first classifier. The other scores (the other classifiers' authentic scores) are the corresponding scores of the sorted 1st-classifier authentic scores; in other words, the scores of the three classifiers on the same image are plotted at the same index on the x-axis. The OR decision fusion rule is similar to a thresholded version of the MAX score fusion rule. For a threshold of 18.17 on the MAX score fusion rule, the FRR is 1.4% and the FAR is zero. If all three classifiers are thresholded with the same threshold of 18.17, and the resulting decisions are fused with the OR rule, the same FRR of 1.4% and FAR of zero are obtained; in this case, OR fusion and MAX followed by a threshold are exactly the same. Now, if the thresholds on the classifier scores are changed to 18.17 for the 1st classifier, 17.66 for the 2nd, and a lower value for the 3rd, the OR fusion results in an FRR of 0.7% and an FAR of zero. Here, the OR fusion has a lower error than the MAX score fusion. The thresholds of two classifiers (the 2nd and 3rd) are lowered from the threshold on the MAX score, and this results in better accuracy (a small numerical sketch of this comparison is given at the end of this section). Since decision fusion rules have similar or better accuracy than the corresponding score fusion rules, decision fusion rules are the focus of this thesis.

Figure 2.4: (a) Authentic scores of a 3-classifier ensemble.

We investigate hard decision fusion rules and the classifier characteristics needed to improve accuracy on fusion in the next chapter, Chapter 3. As in classifier selection, our approach in this thesis is also to make each classifier in the ensemble responsible for a part of the feature space, but the final decision is based on the fusion rule.

Figure 2.5: (a) A zoomed-in view of the authentic scores of the 3-classifier ensemble. (b) Impostor scores of the 3-classifier ensemble.

The part of the feature space that each classifier is designed to be responsible for is based on the fusion rule. This approach has stricter constraints than the classifier selection approach. Chapters 3 and 4 offer insight into this approach. Tumer and Ghosh [7] note that the gains obtained by combining, however, are often affected more by the selection of what is presented to the combiner than by the actual combining method chosen. In [7], the authors compare four different methods of designing the classifier ensemble: bagging or bootstrapping with the chosen linear combiner, data partitioning by training each classifier on a different (k-1)-of-k subset of the training set, feature partitioning by input decimation, and spatial partitioning using a weighted mixture of experts [25]. They found that neither the feature partitioning nor the spatial partitioning leads to significant improvements. Furthermore, both of these methods are difficult to fine-tune, as small changes in the design (e.g., changing the number of input features) lead to large changes in the combiner performance. However, they found the data partitioning methods to be promising. In this thesis too, we investigate data partitioning methods for designing the classifier ensemble, and we tune the data partitioning according to the chosen combiner. These methods are illustrated on simulated data in Chapter 4 in order to understand the approach, and demonstrated on real data in Chapter 5.
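As noted above, here is a minimal numerical sketch of the point that per-classifier thresholds followed by OR decision fusion can do at least as well as a single threshold on the MAX score; the score arrays and threshold values below are hypothetical, not the NIST 24 results.

```python
import numpy as np

def max_score_fusion(scores, threshold):
    """Accept if the maximum of the classifier scores exceeds the threshold."""
    return scores.max(axis=1) >= threshold

def or_decision_fusion(scores, thresholds):
    """Threshold each classifier separately, then fuse the decisions with OR."""
    return (scores >= np.asarray(thresholds)).any(axis=1)

def frr(decisions_on_authentics):
    """False rejection rate on authentic attempts."""
    return 1.0 - decisions_on_authentics.mean()

# Hypothetical authentic scores of a 3-classifier ensemble (rows = images).
authentic = np.array([
    [19.0, 16.5, 15.0],
    [17.5, 18.2, 14.0],   # max = 18.2, accepted by the MAX rule at 18.0
    [17.9, 17.8, 16.0],   # max = 17.9, rejected by a single MAX threshold of 18.0
    [20.3, 15.0, 21.0],
])
t = 18.0
print(frr(max_score_fusion(authentic, t)))                 # single threshold on MAX
print(frr(or_decision_fusion(authentic, [t, 17.5, 17.5]))) # lower 2nd/3rd thresholds
```

On real data the impostor scores must be checked as well: the lowered thresholds are only useful if the FAR remains acceptable, as it does in the NIST 24 example above.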

60 CHAPTER 3 ROLE OF STATISTICAL DEPENDENCE BETWEEN CLASSIFIERS OVERVIEW: Often in Biometric applications, there is insufficient data to model the distributions of authentic and impostor scores of multiple classifiers. Even if they are estimated, the optimal Bayes rule error is difficult to obtain analytically because of the complex distributions of biometric scores. For Gaussian score distributions with unequal authentic and impostor covariances, the Bayes decision boundary is quadratic. It is difficult to find an expression for the resulting minimum probability of error. Due to this, finding the ROC for this case is hard. Decision fusion rules are computationally easier to apply on the scores. A threshold on each classifier score provides the classifier s decision. These decisions are fused using a decision fusion rule to obtain a final decision of the classifier ensemble. The decision boundary in the N dimensional classifier ensemble score space (when there are N classifiers) is composed of a set of lines for decision fusion rules. These decision boundaries are simpler than the optimal Bayes decision boundary, which would be a curve for quadratic decision boundaries. For N classifiers, there are 2 2N decision fusion rules, and evaluating the performance of all these rules is computationally infeasible. In Section 3.1, we investigate if the optimal decision rule is monotonic for statistically dependent classifier decisions. In Section 3.2, we evaluate the performance of monotonic fusion rules for three-classifier decision fusion of statistically dependent Gaussian score distributions. We examine if statistical dependence plays a role in the accuracy of a decision fusion rule. We also examine whether the best monotonic decision rule varies with 43

61 statistical dependence. We find the OR, AND, MAJORITY rules are the important decision fusion rules in Section 3.2. Hence we analyze each of these decision fusion rules separately in Sections 3.3 to 3.5 to know their best/worst and favorable/unfavorable values of statistical dependence. Classifier scores are thresholded to obtain decisions. An optimal set of thresholds on the multiple classifier scores needs to be obtained for optimal fusion with each of the decision fusion rules. Section 3.6 provides a method to minimize the search for optimal thresholds on the classifier scores for AND/OR decision fusion. Section 3.7 provides the summary, conclusions and original contributions provided in this chapter. In practice, the distributions of biometric scores may be complex, (e.g., Gaussian Mixtures). More importantly, there is often insufficient data to even estimate the distributions of the authentic and impostor scores. In such cases, estimating the Bayes rule error is hard. It is computationally easier to evaluate simple decision fusion rules. Decision fusion is focused on in this thesis. It is found that diverse classifiers improve accuracy [27]. The notion of diversity between classifiers is not clear cut in the literature. A complete description of the joint probability of classifier decisions is ideal for analysis: to find the best decision fusion rule, the accuracy on fusion, the comparison to independent classifier fusion, etc. Complete information about the statistical dependence between classifiers is difficult to obtain. The joint probability of the classifier decisions is difficult to obtain due to insufficient data for statistical analysis. Further, the computational complexity for computing the joint probability is high. Kuncheva et al [34] attempt to answer the following questions on diversity: 1. How do we define and measure diversity? 2. How are the diversity measures related to the accuracy of the ensemble? 3. Is there a best diversity measure that is useful to describe ensembles that have minimum error? 4. How can we use the diversity measures to design the classifier ensembles? To answer the first question, they studied ten different diversity measures and found their limits and values for independent classifier ensembles. These diversity measures are included in a list of diversity measures provided in Appendix 7.4. They found that the diversity values for independent 44

62 classifiers from these measures are not always constant. Further, the limits of the diversity values are also not constant and may depend on the accuracy of the classifiers. Ruta and Gabrys [102] raised a question of symmetry of the diversity measures. Some measures have different values of diversity when the class labels are switched and these measures are not symmetric. One example of a non-symmetric measure is the double fault measure (provided in Appendix 7.4), which is the probability P(u 1 = 0, u 2 = 0 H 1 ). Since P(u 1 = 0, u 2 = 0 H 1 ) P(u 1 = 1, u 2 = 1 H 0 ), this measure is not symmetric. For the second question, they simulate many different classifier ensembles, find the diversity measure values and the accuracy for the majority vote for each of these ensembles. For ensembles that have better accuracy than the independent classifier ensemble, they study the diversity values and attempt to find a threshold on these values that can predict better accuracy than independent classifiers. This approach yielded moderate answers to the question. There was not a clear separation between diversity values for the more accurate ensembles and the less accurate ensembles (than independent ensembles), hence the ambiguity. Another approach they use is to design classifier ensembles for real databases and check if there is a relation between the diversity measures and the accuracy. They did not find any definite relation between the accuracy and the diversity measures. However, they say this could be due to the classifier ensemble design strategy. Ensembles designed for obtaining diversity such as Adaboost classifiers could show a relation between the diversity and the ensemble accuracy. They were also unsuccessful in giving any answers to the questions 3 and 4. In this dissertation, we use a different approach to answer the last three questions. The key to answer these questions is to start from the joint probability distributions since they completely characterizes the problem. Since Kuncheva et al. [34] did not figure the role of the probability distributions into their analysis, it is not surprising that they found inconclusive results. We modify these questions to: 1. How do diversity measures describe the joint probability of the classifier decisions? 2. Is it possible to completely decribe the joint probability through one or a group of diversity measures? 3. What are the joint probability distributions that improve the accuracy (over those of independent classifiers)? 45

63 4. What are the values of the diversity measures for the above probability distributions? 5. How can we design classifier ensembles to have favorable joint probability distributions? In this chapter, we start with the assumption that the complete joint probability between classifiers known. The relation between the joint probability and the ensemble fusion accuracy is found. When the joint probability of scores is known, the optimal decision fusion accuracy as a function of the statistical dependence is found in Section 3.2. With known joint probability of the decisions, the accuracy of important decision fusion rules as a function of statistical dependence is found in Sections 3.3 to 3.5. The values of diversity measures at specific values of joint probability that improve or reduce the accuracy over statistical independence are then found in the sections mentioned. This establishes a clearer relation between values of the diversity measures and the ensemble fusion accuracy. The number and type of diversity measures that are required to describe the joint probability between classifiers (for ensembles more accurate than independent classifier ensembles) is also known in these sections. The answers to modified questions 1 to 4 are explored in this chapter. Later in this thesis, we provide a more definite answer to design classifier ensembles that have favorable joint probability distributions. Biometric verification is the application of focus in this thesis. This is a two class problem. The conditional dependence on authentics and impostors should be considered for accurate understanding of the problem. By not conditioning on authentics and impostors, misleading results will be obtained. For example, it was concluded by Shipp and Kuncheva [103] that there was not much dependence between combination methods and diversity measures. Ten combination methods including Majority, Min, Max and average were studied. Ten diversity measures including the Q statistic and the correlation coefficient were evaluated. However, the analysis done here is misleading because the class-conditional errors and class-conditional diversity values were not taken into account. However, there is a specific relation between the class conditional accuracies of these combination methods and the class-conditional diversity values. This relation will be shown for the AND, OR and Majority decision fusion rules in this chapter. In this dissertation, we will loosely refer to conditional independence (dependence) or conditionally independent (dependent) classifiers for classifiers whose decisions are conditionally independent (dependent), but the meaning should be clear. When classifier decisions are conditionally 46

independent, the joint classifier statistics and error probabilities for different binary decision fusion rules are known from the individual classifier error probabilities. When the classifier decisions are conditionally dependent, the error probabilities after decision fusion may be larger or smaller than when the classifier decisions are conditionally independent. The conditional dependence for which the error probabilities after fusion with a given decision fusion rule are smaller (larger) than those of conditionally independent classifier decisions is known as favorable (unfavorable) conditional dependence for the given decision fusion rule. When the error probability is the smallest for a given decision fusion rule at a particular value of conditional dependence, that value is known as the optimal conditional dependence for the given decision fusion rule.

Let us start by describing the notation followed in this chapter. We assume that there are two classes, authentics and impostors, which need to be discriminated in verification applications. Let H_0 and H_1 be the two hypotheses denoting impostors and authentics, respectively, with prior probabilities P_0 and P_1. We assume that there are N classifiers that map the input biometric data into scores y_i, (i = 1, 2, ..., N). Let the conditional densities of the scores under the two hypotheses be denoted by p(y_i | H_j), j = 0, 1. These distributions are typically unknown in practice and must be estimated. By applying thresholds \tau_i to these scores y_i, decisions u_i, (i = 1, 2, ..., N) are obtained. The decisions declare which of the two classes, i.e., authentics or impostors, the input data belongs to. The false acceptance rate (FAR) for the ith classifier is given by

P_{FA_i} = \mathrm{Prob}(u_i = 1 \mid H_0) = \int_{Z_{1,i}} p(y_i \mid H_0)\, dy_i \qquad (3.1)

where Z_{j,i} is the decision region corresponding to hypothesis H_j for the ith classifier, i.e., hypothesis H_j is declared true by the ith classifier for any score y_i falling in the region Z_{j,i}. Let Z be the entire score space, so that Z = Z_{0,i} \cup Z_{1,i} for all i, and Z_{0,i} \cap Z_{1,i} = \phi (the null set) for all i. The false rejection rate (FRR) for the ith classifier is given by

P_{FR_i} = \mathrm{Prob}(u_i = 0 \mid H_1) = \int_{Z_{0,i}} p(y_i \mid H_1)\, dy_i, \qquad i = 1, ..., N \qquad (3.2)

A related term commonly referred to is the probability of detection, P_D. The probability of detection for the ith classifier is

P_{D_i} = 1 - P_{FR_i} = \mathrm{Prob}(u_i = 1 \mid H_1) = \int_{Z_{1,i}} p(y_i \mid H_1)\, dy_i, \qquad i = 1, ..., N \qquad (3.3)
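For completeness, the error rates in Eqs. (3.1)-(3.3) can also be estimated empirically from sample scores; the sketch below (synthetic Gaussian scores and an arbitrary threshold, used purely for illustration) computes FAR, FRR and P_D for a single classifier:

```python
import numpy as np

def classifier_error_rates(auth_scores, imp_scores, tau):
    """Empirical FAR, FRR and P_D of a single classifier at threshold tau,
    with the decision u = 1 (authentic) when the score y >= tau."""
    far = np.mean(imp_scores >= tau)    # P(u = 1 | H0)
    frr = np.mean(auth_scores < tau)    # P(u = 0 | H1)
    return far, frr, 1.0 - frr          # P_D = 1 - FRR

rng = np.random.default_rng(1)
auth = rng.normal(1.0, 1.0, 10_000)     # synthetic scores under H1 (authentics)
imp = rng.normal(0.0, 1.0, 10_000)      # synthetic scores under H0 (impostors)
print(classifier_error_rates(auth, imp, tau=0.5))
```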

65 Decision fusion using the fusion rule ζ is applied on the decisions u i, (i = 1, 2,..N) to provide a global decision u 0, i.e., u 0 = ζ(u 1, u 2,...,u N ). (3.4) The probability of false acceptance for the fusion rule is P FA = Prob(u 0 = 1 H 0 ) (3.5) and the probability of false rejection for the fusion rule is P FR = Prob(u 0 = 0 H 1 ) (3.6) The Bayes risk function minimization performed over the set of fusion rules ζ and the best thresholds τ = [τ 1,...,τ N ] on the scores can be written as min R = min ζ,τ (P 0 C FA P FA + P 1 C FR P FR ). (3.7) Here, C FA is the cost associated with false acceptance and C FR is the cost associated with false rejection. It is assumed that there is no cost for making correct decisions. The best fusion rule minimizes the average cost, and is chosen based on the likelihood ratio test: u 0 = 1 P ( u 1, u2,...,u N H 1 ) P ( u 1, u2,...,u N H 0 ) > < u 0 = 0 P 0 C FA P 1 C FR = η (3.8) 3.1 Optimal Decision Rules OVERVIEW: After designing the classifier ensemble, it is of interest to know the optimum decision fusion rule that maximizes the accuracy of the ensemble. This would mean searching over the set of possible decision fusion rules for the optimum decision fusion rule. Search over the total 2 2N decision fusion rules is computationally infeasible. For statistically independent classifiers, it is shown that monotonic rules are optimal [104], [19], thus reducing the search space. In this section, it is studied if monotonic rules are optimal for statistically dependent classifiers. The results 48

of our analysis show that monotonic rules are not optimal in general for even two statistically dependent decisions. However, there are specific regions of (P_FR, P_FA, \rho_a, \rho_i) where monotonic rules are optimal; \rho_a and \rho_i are the correlation coefficients of the decisions on authentics and impostors, respectively. The results of the analysis and the regions where monotonic rules are optimal for two statistically dependent classifiers are shown in this section. The proof is given in Appendix 7.3.

As mentioned before, for N classifiers on a two-class problem, there are 2^{2^N} decision fusion rules [19], and searching for the best decision rule for a given set of classifiers is exponential in N. It has been shown that monotonic rules are optimal for statistically independent classifiers [104], [19]. When (1 - P_{FR_i}) \ge P_{FA_i}, i = 1, ..., N, which is a reasonable assumption for good classifiers, monotonically increasing rules are optimal. Monotonically decreasing rules are optimal when (1 - P_{FR_i}) < P_{FA_i}, i = 1, ..., N. Let [S_1(k) S_0(N - k)] be a set of decisions with k 1's and (N - k) 0's. When the optimum decision rule is monotonically increasing, the likelihood ratio of [S_1(k) S_0(N - k)] is less than or equal to the likelihood ratio of [S_1(k') S_0(N - k')], where k' > k and S_1(k) \subset S_1(k'). The inequalities of the likelihood ratios are reversed for monotonically decreasing rules. The proofs of these statements are given in Appendix 7.2. This reduces the search space for the optimum decision fusion rule from the full set of 2^{2^N} decision fusion rules.

For statistically dependent classifiers, it is not yet known if monotonic rules are optimal. In this section, it is studied whether monotonic rules are optimal when 2 out of N classifier decisions are statistically dependent while the rest are statistically independent. It is assumed that P_{D_i} \ge P_{FA_i}, i = 1, ..., N, for all classifiers. Conditions are found as to when the optimal rule is monotonically increasing. Since P_{D_i} \ge P_{FA_i} for the N - 2 statistically independent decisions, the optimal rule for the N classifier decisions is monotonically increasing if the optimal rule for the 2 statistically dependent decisions is monotonically increasing. When the optimal rule for 2 classifiers is monotonically increasing, the likelihood ratios (LLRs) of the two classifier decisions should satisfy the following inequalities:

\frac{P(u = [0, 0] \mid H_1)}{P(u = [0, 0] \mid H_0)} \le \frac{P(u = [1, 0] \mid H_1)}{P(u = [1, 0] \mid H_0)} \le \frac{P(u = [1, 1] \mid H_1)}{P(u = [1, 1] \mid H_0)} \qquad (3.9)

and

\frac{P(u = [0, 0] \mid H_1)}{P(u = [0, 0] \mid H_0)} \le \frac{P(u = [0, 1] \mid H_1)}{P(u = [0, 1] \mid H_0)} \le \frac{P(u = [1, 1] \mid H_1)}{P(u = [1, 1] \mid H_0)}. \qquad (3.10)
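These orderings are easy to check numerically from a joint decision table. The following sketch (illustrative probabilities only) confirms that Eqs. (3.9) and (3.10) hold for conditionally independent decisions and fail for the kind of counter-example discussed later in this section, where the impostor decisions are perfectly correlated but the authentic decisions are not:

```python
import numpy as np

def monotone_increasing_llr_ordering(p_h1, p_h0, eps=1e-12):
    """Check the likelihood-ratio orderings of Eqs. (3.9) and (3.10) for two decisions,
    given joint decision probabilities p_h[i, j] = P(u1 = i, u2 = j | H)."""
    llr = p_h1 / (p_h0 + eps)           # eps avoids division by zero for impossible outcomes
    holds_39 = llr[0, 0] <= llr[1, 0] <= llr[1, 1]
    holds_310 = llr[0, 0] <= llr[0, 1] <= llr[1, 1]
    return bool(holds_39 and holds_310)

p_d, p_fa = 0.9, 0.2   # per-classifier detection and false-acceptance probabilities

# Conditionally independent decisions: the orderings hold, as expected.
p_h1_ind = np.outer([1 - p_d, p_d], [1 - p_d, p_d])      # indices are (u1, u2)
p_h0_ind = np.outer([1 - p_fa, p_fa], [1 - p_fa, p_fa])
print(monotone_increasing_llr_ordering(p_h1_ind, p_h0_ind))   # True

# Perfectly correlated impostor decisions (rho_{12,0} = 1) with independent authentic
# decisions (rho_{12,1} = 0): the orderings fail, so a monotonic rule need not be optimal.
p_h0_dep = np.array([[1 - p_fa, 0.0], [0.0, p_fa]])
print(monotone_increasing_llr_ordering(p_h1_ind, p_h0_dep))   # False
```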

The Bahadur-Lazarsfeld expansion [105] expresses the LLRs in terms of the classifier decision probabilities and the correlation coefficients between the decisions. A set of inequalities based on the constraints in Eq. (3.9) and Eq. (3.10) is found that satisfies the condition for a monotonically increasing optimum decision fusion rule. These inequalities are expressed in terms of (P_FR, P_FA, \rho_a, \rho_i). One example of a set of (\rho_a, \rho_i) values where the inequalities do not hold is shown. This is sufficient to state that monotonic rules are not optimal in general for statistically dependent decisions. Regions of (P_FR, P_FA, \rho_a, \rho_i) where the inequalities hold, i.e., where monotonic rules are optimal for 2 statistically dependent decisions, are also shown. A brief description and the results of the analysis are given below. Details of the proof are given in Appendix 7.3.

Kam et al. [106] described the optimum fusion rule for correlated decisions (Eq. 3.8) in terms of the Bahadur-Lazarsfeld expansion [105] of the joint probability of the decisions. The Bahadur-Lazarsfeld expansion expresses the conditional joint probability in terms of normalized decisions z_{i,h} and correlation coefficients of the normalized decisions. The normalized decisions z_{i,h} have zero mean and unit standard deviation. They are given by

z_{i,h} = \frac{u_i - P(u_i = 1 \mid H_h)}{\sqrt{P(u_i = 1 \mid H_h)\,(1 - P(u_i = 1 \mid H_h))}}, \qquad h = 0, 1 \qquad (3.11)

The correlation coefficients of the normalized decisions are as follows:

\text{second order: } \rho_{ij,h} = E(z_{i,h} z_{j,h}) = \sum_{u} z_{i,h} z_{j,h}\, P(u \mid H_h), \qquad h = 0, 1

\text{third order: } \rho_{ijk,h} = E(z_{i,h} z_{j,h} z_{k,h}) = \sum_{u} z_{i,h} z_{j,h} z_{k,h}\, P(u \mid H_h), \qquad h = 0, 1

\text{Nth order: } \rho_{12...N,h} = E(z_{1,h} z_{2,h} \cdots z_{N,h}) = \sum_{u} z_{1,h} z_{2,h} \cdots z_{N,h}\, P(u \mid H_h), \qquad h = 0, 1 \qquad (3.12)

The Bahadur-Lazarsfeld expansion of the likelihood ratio [106] for the optimum decision rule (Eq. 3.8) for N classifiers is

\frac{\prod_{i=1}^{N} P(u_i \mid H_1)\left(1 + \sum_{i<j} \rho_{ij,1}\, z_{i,1} z_{j,1} + \cdots + \rho_{12...N,1}\, z_{1,1} z_{2,1} \cdots z_{N,1}\right)}{\prod_{i=1}^{N} P(u_i \mid H_0)\left(1 + \sum_{i<j} \rho_{ij,0}\, z_{i,0} z_{j,0} + \cdots + \rho_{12...N,0}\, z_{1,0} z_{2,0} \cdots z_{N,0}\right)} \;\; \underset{u_0 = 0}{\overset{u_0 = 1}{\gtrless}} \;\; \eta \qquad (3.13)
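The second-order coefficients in Eq. (3.12) can be estimated directly from observed decisions under a single hypothesis; a minimal sketch (synthetic decisions generated from a hypothetical shared-cause model) is shown below:

```python
import numpy as np

def pairwise_decision_correlation(decisions):
    """Second-order Bahadur-Lazarsfeld correlation coefficients (Eq. 3.12), estimated
    from samples of binary decisions under one hypothesis.
    decisions: array of shape (n_samples, n_classifiers) with entries in {0, 1}."""
    p1 = decisions.mean(axis=0)                         # P(u_i = 1 | H_h)
    z = (decisions - p1) / np.sqrt(p1 * (1.0 - p1))     # normalized decisions (Eq. 3.11)
    return (z.T @ z) / decisions.shape[0]               # rho_{ij,h} = E[z_i z_j]

rng = np.random.default_rng(2)
# Synthetic decisions from three classifiers: the first two share a common latent cause,
# so their decisions are positively correlated; the third is independent of both.
latent = rng.random(10_000) < 0.2
u1 = latent & (rng.random(10_000) < 0.9)
u2 = latent & (rng.random(10_000) < 0.9)
u3 = rng.random(10_000) < 0.2
print(np.round(pairwise_decision_correlation(np.column_stack([u1, u2, u3]).astype(float)), 2))
```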

We now examine the correlations for which the optimal fusion rule is monotonic. Let us first consider the case of N decisions where two of the decisions are correlated and the remaining decisions are independent of the others. This may occur when two classifiers are based on the same algorithm (say correlation filters) with different training sets, and the remaining classifiers are based on other algorithms (say SVM, neural networks, PCA, etc.), which may result in decisions independent of the other classifiers. Without loss of generality, we assume that u_1 and u_2 are correlated and u_i, i = 3, ..., N are independent. Let u_{3,...,N} = [u_3 ... u_N] denote the independent decisions from i = 3, ..., N. To simplify the problem further, we assume that only the second-order correlation coefficient between the first two decisions is nonzero while the higher-order correlation coefficients are zero. It is assumed that (1 - P_{FR_i}) \ge P_{FA_i}, i = 1, ..., N. Let

\delta_{FA_i} = \frac{P_{FA_i}}{1 - P_{FA_i}} \ge 0 \qquad (3.14)

and

\delta_{FR_i} = \frac{P_{FR_i}}{1 - P_{FR_i}} \ge 0 \qquad (3.15)

For further simplification, it is assumed that the error probabilities of the first two decisions are equal, P_{FR_1} = P_{FR_2} = P_{FR12} and P_{FA_1} = P_{FA_2} = P_{FA12}. Let \delta_{FR_1} = \delta_{FR_2} = \delta_{FR12} and \delta_{FA_1} = \delta_{FA_2} = \delta_{FA12}. From the derivations given in Appendix 7.3, the following conditions must be satisfied for the optimal decision rule on u = [u_1, u_2, u_{3,...,N}] to be monotonically increasing:

\rho_{12,1} \le \frac{\rho_{12,0}\,\delta_{FA12}\,(1 + \delta_{FR12}) + (1 - \delta_{FR12}\,\delta_{FA12})}{1 + \delta_{FA12}} \qquad (3.16)

\rho_{12,1} \ge \frac{\rho_{12,0}\,(1 + \delta_{FR12}) - (1 - \delta_{FR12}\,\delta_{FA12})}{\delta_{FR12}\,(1 + \delta_{FA12})} \qquad (3.17)

\rho_{12,0} \le \frac{\rho_{12,1}\,\delta_{FR12}\,(1 + \delta_{FA12}) + (1 - \delta_{FA12}\,\delta_{FR12})}{1 + \delta_{FR12}} \qquad (3.18)

\rho_{12,0} \ge \frac{\rho_{12,1}\,(1 + \delta_{FA12}) - (1 - \delta_{FA12}\,\delta_{FR12})}{\delta_{FA12}\,(1 + \delta_{FR12})} \qquad (3.19)

Eqs. (3.16) to (3.19) place bounds on the values of \rho_{12,1} and \rho_{12,0} for which the optimal fusion rule is monotonically increasing. The bound on a correlation coefficient \rho_{12,j}, j = 0, 1, depends on the values of \delta_{FR12} and \delta_{FA12} as well as on the value of the other correlation coefficient \rho_{12,l}, l \ne j,

l = 0, 1. Substituting the values \rho_{12,1} = 0 and \rho_{12,0} = 0 in Eqs. (3.16) and (3.17), the following inequalities are obtained:

\frac{-(1 - \delta_{FR12}\,\delta_{FA12})}{\delta_{FR12}\,(1 + \delta_{FA12})} \le 0 \le \frac{1 - \delta_{FR12}\,\delta_{FA12}}{1 + \delta_{FA12}} \qquad (3.20)

For (1 - P_{FR12}) \ge P_{FA12}, which is a reasonable assumption for good classifiers, it can be shown that

\delta_{FR12}\,\delta_{FA12} \le 1 \qquad (3.21)

By Eq. (3.21), and since \delta_{FR12} and \delta_{FA12} are positive, it can be observed that Eq. (3.20) holds for any values of P_{FR12} and P_{FA12} satisfying (1 - P_{FR12}) \ge P_{FA12}. In other words, the optimal fusion rule is monotonically increasing for statistically independent classifiers when (1 - P_{FR12}) \ge P_{FA12}. This validates our approach, as this is the same claim made in [104], [19] for statistically independent classifiers. It is interesting to note that when either of the correlation coefficients \rho_{12,1} or \rho_{12,0} is equal to 1, the other correlation coefficient must also be equal to 1 for these inequalities to hold:

\frac{\rho_{12,0}\,(1 + \delta_{FR12}) - (1 - \delta_{FR12}\,\delta_{FA12})}{\delta_{FR12}\,(1 + \delta_{FA12})} \le \rho_{12,1} \le \frac{\rho_{12,0}\,\delta_{FA12}\,(1 + \delta_{FR12}) + (1 - \delta_{FR12}\,\delta_{FA12})}{1 + \delta_{FA12}}, \quad \text{with } \rho_{12,0} = 1 \implies \rho_{12,1} = 1 \qquad (3.22)

This shows by counter-example that the optimal fusion rule is not monotonically increasing for \rho_{12,0} = 1, \rho_{12,1} \ne 1. Hence, monotonic rules are not optimal in general for statistically dependent classifier decisions. The regions where monotonically increasing decision rules are optimal for statistically dependent two-classifier decisions can be obtained from Eq. (3.16) to Eq. (3.19). Since the interest is in diverse decisions, the values of the correlation coefficients at maximum diversity are found. It is of interest to know whether there are any regions where monotonically increasing rules are optimal at maximum diversity in the decisions. There is maximum diversity when the two classifier decisions are exactly complementary, and this represents the lower limit on the correlation coefficient. When the two classifier decisions u_1 and u_2 are exactly complementary, i.e., if one classifier's decision is 1, the other classifier's decision is 0, it can be shown that the correlation coefficients between the

70 authentic and impostor decisions are as follows. P FR12 (1 P minρ 12,1 = FR12 ) = δ FR12, P FR (1 P FR12) P FR12 = 1 δ FR12, P FR12 > 0.5 min ρ 12,0 = (3.23) 1 min ρ 12,1 0 (3.24) P FA12 (1 P FA12 ) = δ FA12, P FA (1 P FA12) P FA12 = 1 δ FA12, P FA12 > 0.5 (3.25) (3.26) 1 min ρ 12,0 0 (3.27) (3.28) The validity of the first of the above four equations is shown later in Section 3.3 for two-classifer complementary authentic decisions. The derivation of the second of the above two equations is exactly same when the conditional dependence on authentics is replaced with the conditional dependence on impostors. It should be noted that the maximum diversity depends on the value of P FR12 and P FA12 and the theoretical lower limit of the correlation coefficient lies between -1 and 0. The region where monotonic rules are optimal is of interest. By staying within this region in classifier combination, the search for the optimum decision rule will be limited to monotonic rules. For a given value of the authentic correlation coefficient ρ 12,0, the region where monotonically increasing rules are optimal is shown below. When the authentic correlation coefficient, ρ 12,1 is equal to 0.8, Figure 3.1 shows the regions of (P FA12, P FR12 ) where the optimum decision fusion rule is monotonically increasing. This represents the entire possible region where 1 P FR12 P FA12. In other words, the region of (P FR12, P FA12 ) for optimal monotonically increasing rules is not limited for the chosen value of conditionally dependent authentic decisions. These regions are found by checking where the inequality constraints on ρ 12,0 given in Eq.(7.33) and Eq.(7.33) hold when the value of the authentic correlation coefficient is 0.8. The limits of the impostor correlation coefficient, ρ 12,0 at this region of (P FR12, P FA12 ) are shown in Figure 3.2. The upper limit of the impostor correlation correlation coefficient varies between 1 and 0.8. The lower limit varies between -1 and 0.8. This implies that there is a large region 53

of (P_{FR12}, P_{FA12}, \rho_{12,0}) where monotonically increasing rules are optimal even for statistically dependent decisions.

Figure 3.1: Region where monotonically increasing decision fusion rules are optimal when the authentic correlation coefficient \rho_a = 0.8.

Figure 3.2: Limits on the impostor correlation coefficient in the region (1 - P_{FR12}) \ge P_{FA12} at the authentic correlation coefficient \rho_{12,1} = 0.8 for monotonically increasing rules to be optimal. (a) Upper limits. (b) Lower limits.

It will be beneficial to design the statistically dependent classifier ensemble to lie in the region

72 where monotonic rules are optimal. Then the search is limited to monotonic rules. The number of monotonic rules, while less than 2 2N, is also exponential. Hence, searching even among monotonic rules is computationally infeasible for a large number (N > 3) of classifiers. The next section investigates if this search for the optimal monotonic rule can be reduced. 3.2 Role of Statistical Dependence on the Minimum Probability of Error In this section, we investigate the role of statistical dependence on the best decision fusion rule and check for the best accuracy dependence on the statistical dependence. We also check if some diversity measures can help in predicting the best fusion rule. The statistical dependence between classifier decisions implies a statistical dependence between classifier scores. The role of statistical dependence is investigated by evaluating the accuracy of decision fusion rules on three jointly Gaussian scores with various covariances. For jointly Gaussian scores with known means and variances of individual classifiers, the correlation coefficient between pairs of classifier scores completely characterizes the statistical dependence. We study the problem by synthesizing jointly Gaussian scores and use the correlation coefficient as the classifier diversity measure [34]. It has been shown that statistical dependence between classifiers can improve accuracy of different decision fusion rules [27], [28], [107], [108]. Some classifier design methods which are used to obtain an improved accuracy for these rules are given in [109], [107], [108]. However, a unified theory to explain which fusion rule is the best for a given statistical dependence is not yet available. Further, it is not clear if statistical dependence affects the overall best performance. In other words, it is not clear whether the accuracy of the corresponding best fusion rule is different for different values of statistical dependence. This section attempts to answer these questions for verification applications using decision fusion rules. The results of the analysis on the role of statistical dependence are applied to predict the best fusion rule [110] and to evaluate how well the classifiers are designed for biometric verification in Section The NIST 24 [111] fingerprint database and the AR face database [112] are used for these evaluations. The effect of statistical dependence between classifiers on the fusion performance is analyzed by finding the minimum probability of error for the best fusion rule for different statistical dependences 55

between the classifiers [110]. For N classifiers, there are 2^{2^N} decision fusion rules [19]. The optimal decision fusion rule for independent classifiers is monotonic [19]. For statistically dependent classifiers, the optimum decision fusion rule is non-monotonic in general. Even though we know this, we focus only on monotonic rules to limit the computational complexity. For two, three and four classifiers, there are 6, 20 and 168 monotonic rules, respectively [19]. For a large number of classifiers, the number of monotonic rules becomes too large, and searching for the best rule becomes computationally infeasible. In this thesis, we analyze the performance of all the monotonic rules for three classifiers for different statistical dependence between classifiers to study if there are any important rules to focus on. A simulation to analyze the role of statistical dependence is described, followed by an analysis of the results. Three synthetic classifier scores are generated from the following joint Gaussian distributions, with equal variances and the same pairwise correlation coefficient:

\text{Authentic Scores} \sim N\!\left(\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 & \rho_a & \rho_a \\ \rho_a & 1 & \rho_a \\ \rho_a & \rho_a & 1 \end{bmatrix}\right), \qquad -0.5 \le \rho_a \le 1 \qquad (3.29)

\text{Impostor Scores} \sim N\!\left(\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \rho_i & \rho_i \\ \rho_i & 1 & \rho_i \\ \rho_i & \rho_i & 1 \end{bmatrix}\right), \qquad -0.5 \le \rho_i \le 1 \qquad (3.30)

The correlation coefficient for authentic scores, \rho_a, can be different from that of the impostor scores, \rho_i. The limits on the correlation coefficients ensure that the covariance matrices are positive semidefinite. \rho_a and \rho_i are varied from -0.5 to 1 in steps of 0.1, and for each combination of (\rho_a, \rho_i), 10,000 authentic and 10,000 impostor scores are generated from their respective joint Gaussian distributions. There are 20 monotonic rules for 3-classifier decision fusion. However, not all these rules need to be considered. One rule declares everything as authentic; one rule declares everything as impostor. These two rules need not be considered since either the FAR or the FRR is 100%. Three rules pay attention to only one of the three single classifiers. Since the three classifiers are identical and we know the single classifier performance, we need not consider these rules. Three rules are two

classifier and rules; three rules are two classifier or rules. By setting the threshold on the third classifier to -\infty or +\infty for the three classifier and and or rules, respectively, the two classifier and and or rules, respectively, are obtained. If the two classifier rules are optimal, a brute force search for the optimal thresholds on the three classifiers will lead to the two classifier rules. Here, unless the authentic as well as impostor correlation coefficients are 1, the two classifier rules will not have higher accuracy than the three classifier rules. This is because there is information to be gained from fusion with the third classifier. Among the three classifier decision fusion rules, three rules are of the form or(i, and(j, k)) (where i, j, k represent the classifiers, i, j, k = 1, 2, 3, i \ne j \ne k), three rules are of the form and(i, or(j, k)), and there is one rule each for the three classifier and, or and majority rules. Again, we can choose one each of the or(i, and(j, k)) and and(i, or(j, k)) decision rules since each of the classifiers and the pairwise correlation coefficients are also identical. The minimum probability of error, assuming equi-probable priors for authentics and impostors, is found for the 5 three-classifier monotonic fusion rules for each combination of (\rho_a, \rho_i). The thresholds on the classifier scores are chosen jointly (by brute force) to minimize the probability of error. It may happen that the thresholds are different for each classifier. More details on finding this joint set of thresholds are given in Section 3.2.1. Fig. 3.3a shows the minimum probability of error for the best decision fusion rule at each value of (\rho_a, \rho_i), and Fig. 3.3b shows the best decision fusion rule as a function of (\rho_a, \rho_i). It can be seen from Fig. 3.3a that the minimum probability of error is different for different statistical dependences. Hence it is desirable to design classifiers to have a particular statistical dependence that leads to the smallest probability of error. The maximum error probability in Fig. 3.3a is for the case of maximum positive correlation (\rho_a = 1, \rho_i = 1), with the probability of error equal to the minimum probability of error for the single classifier, which is 31% for this experiment. Here, the best thresholds for the fusion rules are such that only one classifier is used and the other two are ignored. All other points in the figure are smaller, showing that the fusion of multiple classifiers improves the accuracy over the individual classifiers. The probability of error surface has its minima at the corners of the plot, i.e., at (\rho_a = 1, \rho_i = -0.5), (\rho_a = -0.5, \rho_i = -0.5) and (\rho_a = -0.5, \rho_i = 1), for which the and, majority and or rules, respectively, are the best, with probabilities of error of 7%, 11% and 7%, respectively. In other words, the and and the or rules are the best rules since they can achieve the smallest probability of error at their most favorable conditional dependence.
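The simulation described above can be reproduced in outline; the sketch below uses a smaller sample size and a much coarser threshold grid than the experiment reported here, so its error values will only roughly match the quoted numbers. It generates the jointly Gaussian scores of Eqs. (3.29)-(3.30) at one corner of the (\rho_a, \rho_i) grid and estimates the minimum probability of error of the and, or and majority rules:

```python
import numpy as np
from itertools import product

def simulate_scores(rho_a, rho_i, n=5_000, seed=0):
    """Jointly Gaussian 3-classifier scores with the structure of Eqs. (3.29)-(3.30)."""
    rng = np.random.default_rng(seed)
    cov = lambda r: (1 - r) * np.eye(3) + r * np.ones((3, 3))
    auth = rng.multivariate_normal(np.ones(3), cov(rho_a), n)
    imp = rng.multivariate_normal(np.zeros(3), cov(rho_i), n)
    return auth, imp

RULES = {
    "and": lambda d: d.all(axis=1),
    "or": lambda d: d.any(axis=1),
    "majority": lambda d: d.sum(axis=1) >= 2,
}

def min_error(auth, imp, rule, grid=np.linspace(-2.0, 3.0, 11)):
    """Coarse brute-force search over per-classifier thresholds for the smallest
    probability of error, assuming equi-probable authentics and impostors."""
    best = 1.0
    for taus in product(grid, repeat=3):
        frr = np.mean(~rule(auth >= np.array(taus)))
        far = np.mean(rule(imp >= np.array(taus)))
        best = min(best, 0.5 * (frr + far))
    return best

# One corner of the (rho_a, rho_i) grid: negatively correlated authentic scores and
# perfectly correlated impostor scores, which should favor the OR rule.
auth, imp = simulate_scores(rho_a=-0.5, rho_i=1.0)
print({name: round(min_error(auth, imp, rule), 3) for name, rule in RULES.items()})
```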

From Fig. 3.3b, it can be seen that the and, or and majority rules are the important fusion rules to focus on, since one of these three is the best rule at any given (\rho_a, \rho_i). In general, the best fusion rule appears to be as follows:

\text{best rule} = \begin{cases} \text{and}, & \rho_a > 0, \; \rho_i < \rho_a \\ \text{majority}, & \rho_a \le 0, \; \rho_i \le 0 \\ \text{or}, & \rho_i > 0, \; \rho_i > \rho_a \end{cases} \qquad (3.31)

It is also observed in Fig. 3.3b that there are multiple fusion rules having the best performance at and around the boundaries of the regions given in Eq. (3.31).

Figure 3.3: (a) Minimum probability of error of 3 classifiers for the best fusion rule as a function of statistical dependence. (b) The best fusion rule as a function of statistical dependence.

The favorable correlation coefficients for the 5 three-classifier decision fusion rules are shown in Figures 3.4 to 3.6. It is interesting to note that there is a region that is unfavorable for any of these rules. By inspection, this region is approximately as shown in Figure 3.6b. This region is approximately described by the following equation (by inspection):

\text{unfavorable classifiers: } |\rho_a - \rho_i| \le g(\rho_a, \rho_i), \; \rho_a > 0, \; \rho_i > 0, \quad \text{with } g(\rho_a, \rho_i) = \begin{cases} a, \; 0 < a \le 1, & 0.1 < \rho_a, \rho_i < 0.2 \\ 0.1, & (\rho_a^2 + \rho_i^2) \le \; ... \\ ..., & (\rho_a^2 + \rho_i^2) \ge 0.2 \end{cases} \qquad (3.32)
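Eq. (3.31) amounts to a simple look-up on the two correlation coefficients; a sketch of such a predictor is given below (how ties on the region boundaries are reported is a design choice of this sketch, not part of Eq. (3.31)):

```python
def predict_best_rule(rho_a, rho_i):
    """Best decision fusion rule predicted from the authentic and impostor score
    correlation coefficients, following the regions of Eq. (3.31)."""
    if rho_a > 0 and rho_i < rho_a:
        return "and"
    if rho_i > 0 and rho_i > rho_a:
        return "or"
    if rho_a <= 0 and rho_i <= 0:
        return "majority"
    return "boundary: multiple rules comparable"

# Examples corresponding to the three corners of Fig. 3.3:
print(predict_best_rule(1.0, -0.5))    # and
print(predict_best_rule(-0.5, 1.0))    # or
print(predict_best_rule(-0.5, -0.5))   # majority
```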

Figure 3.4: Favorable conditional dependence for (a) and, (b) or.

Figure 3.5: Favorable conditional dependence for (a) majority, (b) and(1, or(2,3)).

There is a link between the diversity of classifier decisions and the diversity of classifier scores. The Q statistic is a good diversity measure for classifier decisions [113] (Appendix 7.4.2). The Q value is zero for statistically independent decisions and has limits of -1 and 1. The Q value has the same sign as the correlation coefficient of classifier decisions, \rho_d. It can also be proved that |\rho_d| \le |Q|. The best decision rule's authentic and impostor Q values at the optimal thresholds of the 3-classifier scores are shown as a function of the correlation coefficient of the scores in Figures 3.7 and 3.8. The authentic (impostor) Q values plotted are the average pairwise classifier authentic (impostor) Q values. It can be observed that the sign of the authentic (impostor) Q value is the same as the sign of the authentic (impostor) correlation coefficient between scores. Further,

the magnitude of the Q values increases (decreases) as the magnitude of the correlation coefficients increases (decreases). Hence there is a direct relation between the Q values of the decisions (at the best thresholds) and the \rho of the scores.

Figure 3.6: (a) Favorable conditional dependence for or(1, and(2,3)). (b) Unfavorable conditional dependence for all rules.

Figure 3.7: Authentic Q values as a function of the correlation coefficient between scores. The Q values are computed at the optimal thresholds of the best decision rule (at the given statistical dependence).
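The pairwise Q values plotted in Figures 3.7 and 3.8 can be computed from decision vectors using relative frequencies, as in the following sketch (the Q statistic itself is defined in Eq. (3.44) and Appendix 7.4; the decision vectors here are synthetic):

```python
import numpy as np

def yule_q(u_j, u_k):
    """Yule's Q statistic between two binary decision vectors, estimated from
    relative frequencies of the four joint outcomes."""
    u_j, u_k = np.asarray(u_j, bool), np.asarray(u_k, bool)
    n11 = np.mean(u_j & u_k)
    n00 = np.mean(~u_j & ~u_k)
    n01 = np.mean(~u_j & u_k)
    n10 = np.mean(u_j & ~u_k)
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

rng = np.random.default_rng(3)
u1 = rng.random(10_000) < 0.5
print(round(yule_q(u1, u1), 2))                          # identical decisions: Q = 1
print(round(yule_q(u1, ~u1), 2))                         # complementary decisions: Q = -1
print(round(yule_q(u1, rng.random(10_000) < 0.5), 2))    # independent decisions: Q near 0
```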

Figure 3.8: Impostor Q values as a function of the correlation coefficient between scores. The Q values are computed at the optimal thresholds of the best decision rule (at the given statistical dependence).

In the next subsection, the search strategy to find the best set of thresholds on the classifier scores is given.

3.2.1 Multi-dimensional Search for the Best Set of Thresholds

The best set of thresholds on the multiple statistically dependent classifier scores has been found by searching over the multi-dimensional space of thresholds. The probability of error surface for the joint Gaussian scores is studied for each fusion rule at a few values of statistical dependence to get an initial estimate of the minima points, which are refined using gradient descent approaches to get the actual minima points. Fig. 3.9 shows slices of the probability of error (P_e) as a function of the thresholds for the and fusion of three (identical) classifiers for different values of conditional dependence between the classifiers: at (\rho_a = 1, \rho_i = 1) and (\rho_a = 1, \rho_i = -0.5). For (\rho_a = 1, \rho_i = 1), all the classifier scores are the same, and here it is sufficient to use just one classifier to find the minimum P_e. The best threshold set here is (-4, -4, 0.5), i.e., two classifiers are ignored by setting their threshold

to the minimum score, and the best threshold for the single classifier is the threshold for the third classifier. For (\rho_a = 1, \rho_i = -0.5), which is the favorable statistical dependence for the and rule, all authentic scores are the same, and setting the same threshold on the three classifier scores will not increase the false rejection rate over that of the single classifier. Since \rho_i = -0.5, there is information in the multiple impostor scores which can be used to potentially lower the false acceptance rate of the and rule below that of the single classifier. In this case, the best set of thresholds is (-0.13, -0.13, -0.13), i.e., the same threshold on the three classifier scores. In this way, each fusion rule is analyzed to obtain a few initial estimates of the locations of the minima points. We use a multi-dimensional binary search around these initial estimates to find the minima of the error surface. The error surface is evaluated at three thresholds for each classifier score, centered at the initial threshold set and separated by an initial step size. Around the threshold set with the smallest value of P_e, the search is repeated with half the step size. This is iterated till a local minimum is found, i.e., till there is not much difference in the smallest value of P_e from the previous iteration, or till the number of iterations exceeds a given number, N_i.

Figure 3.9: Slices of the three-dimensional probability of error for the 3-classifier and rule as a function of the thresholds on each classifier score at different correlation coefficients between authentic and impostor scores. (a) \rho_a = 1, \rho_i = 1, with minimum error at thresholds (-4, -4, 0.5). (b) \rho_a = 1, \rho_i = -0.5, with minimum error at thresholds (-0.13, -0.13, -0.13).

In the next subsection, we check to see if the conclusions obtained from the synthesized scores in this section can be applied to real biometric scores.
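The grid-refinement search just described can be sketched as follows; the early-stopping test on the change in P_e is omitted for brevity, and the error surface in the usage example is a toy quadratic rather than a fusion error surface:

```python
import numpy as np
from itertools import product

def threshold_search(p_error, tau0, step=1.0, n_iter=25):
    """Grid-refinement search sketched above: evaluate the error at three candidate
    values per classifier threshold (current value and +/- step), move to the best
    candidate set, halve the step, and repeat for n_iter iterations."""
    tau = np.asarray(tau0, dtype=float)
    best = p_error(tau)
    for _ in range(n_iter):
        offsets = product((-1.0, 0.0, 1.0), repeat=len(tau))
        candidates = [tau + step * np.array(d) for d in offsets]
        errors = [p_error(c) for c in candidates]
        i = int(np.argmin(errors))
        tau, best = candidates[i], errors[i]
        step *= 0.5   # refine the grid around the current best threshold set
    return tau, best

# Usage on a toy, separable error surface with its minimum at (0.2, -0.3, 0.1):
target = np.array([0.2, -0.3, 0.1])
tau, err = threshold_search(lambda t: float(np.sum((t - target) ** 2)), tau0=[0.0, 0.0, 0.0])
print(np.round(tau, 3), round(err, 6))   # approximately [ 0.2 -0.3  0.1] and 0.0
```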

80 3.2.2 Application to Biometric Verification Using the results from Section 3.2, this section investigates whether the correlation coefficients between biometric classifier scores can 1. predict the best decision fusion rule for a given set of classifiers, 2. evaluate classifier design techniques by stating if the classifiers are favorable for the rule they are designed for. This is done because the distribution of biometric scores is not known, and in general, is not Gaussian. The NIST 24 fingerprint database [111] and the AR face database [112] are used for evaluation. A description of the databases is followed by the results of the evaluation [110]. It should be noted that the design and evaluation of classifier ensembles on these databases is done in Chapter 5. In this section, it is investigated if the prediction/evaluation of the best monotonic fusion rule can be done for those classifier ensembles. Only a brief description of the classifier ensemble design is provided here. The reader is referred to Chapter 5 for more details on the ensemble design, test procedure, the statistical dependence obtained in the ensemble, and the best decision fusion rule NIST 24 Database The NIST Special Database 24 of digital live-scan fingerprint video data [111] corresponds to 10 fingers of 10 people. Each finger has a 10 second video, containing 300 images of size pixels, which have been padded to size pixels here. The plastic distortion set is used for evaluation here, where the fingers are rolled and twisted producing a lot of distortion in the images. Some of the images have partial fingerprints. The distored images of a sample finger are shown in Figure The unconstrained optimal trade-off (UOTF) correlation filters [114] are the base classifiers used here. They offer good discrimination and distortion tolerance capability. While more details of the filter design are given in [114], the filter parameters chosen are a noise tolerance coefficient of 10 6 and an average correlation energy minimization coefficient of 1. These parameters are chosen because they provide the best accuracy in a study done in [74] Two ensembles are designed for this database in Section Only a brief description of the training and test procedures is provided here. The details of how the ensemble is designed for the 63

81 Figure 3.10: Sample distorted images of a finger in the NIST 24 plastic distortion dataset. OR rule and the statistical dependence between the designed classifiers are discussed in Section The training set consists of 20 uniformly sampled images from the 300 images of a finger, starting with the 1st image. The 20 authentic images per finger and the first image from all the 99 impostor fingers are used for training each filter, which is specific to a finger. The test set for each finger consists of 280 authentic images other than the training set and 20 randomly sampled images from each of the 99 impostor fingers, since the UOTF filter is shown to be discriminative [74]. Best Decision Rule Prediction: An ensemble designed by Bagging is described in Section Bagging is a common method used to generate a set of classifiers Only a brief description of the training/test is given here, and the reader is referred to Section for more details. The best fusion rule for three bootstrap [79] UOTF classifiers are predicted here. The bootstrap [79] classifiers are obtained by training on a random subset of the authentic data and a random subset of the impostor training data. The random subsets are obtained by random sampling of training images, with replacement, from the training set. For each fusion rule, the minimum probability of error, assuming equi-probable priors for authentic and impostors, is found for each finger and averaged over all fingers. This can also be stated as half the total error rate (TER), which is the sum of the false accept rate (FAR) and the false reject 64

82 Table 3.1: Prediction of the best fusion rule using correlation coefficients between classifier scores along with the top two observed fusion rules (in terms of TER/2) for bootstrap classifiers on NIST 24 data. Average ρ Predicted Best Rule Best Rule Next Best Rule Authentic.72 or majority or Impostor.797 borderline 1.10 ±.13% 1.14 ±.13% rate (FRR). The mean and correlation coefficient of the authentic and impostor scores for the three classifiers (averaged over all fingers) are given in Table 3.1. While the analysis in Section 3.2 assumes identical classifiers with the same correlation coefficient between each pair of classifiers, this is not generally valid in practical design of classifiers. The average correlation coefficients, assuming identical classifiers, lie close to the edge of the or region in Fig. 3.3b, predicting that multiple fusion rules, the majority, or and or(i,and(j, k)), have the best performance. By evaluating the TER/2 for all the fusion rules, it is found that the majority and the or rules have comparable best performance, as shown in Table 3.1. Thus, correlation coefficients between scores have provided a good prediction here. Evaluating Classifier Design: An ensemble designed for OR rule fusion on the NIST 24 plastic distortion set is described in Section The reader is referred there for details of the design, the statistical dependence achieved for the designed ensemble and the best decision fusion rule. Three UOTF classifiers are designed for the or rule in Section by an informed selection of the authentic training set and are found to have favorable statistical dependence for the OR rule by evaluating the Q value [34] between classifier decisions. Here, we investigate if the correlation coefficients between the scores can evaluate if the classifiers are favorable to the or rule. Table 3.2 shows the mean and correlation coefficient between the scores. Using the average correlation coefficients, Fig. 3.3b predicts that the or rule is the best rule and that these classifiers are favorable for the or rule. This is the result obtained in Section Thus correlation coefficients between scores have made a good prediction as well as evaluation here AR Database The AR face database [112] contains color images of expression, illumination and occlusion variations taken at two sessions separated by two weeks. There is a slight pose variation also in 65

83 Table 3.2: Evaluation of the optimality of the ensemble design using correlation coefficients between classifier scores. The ensemble is designed for the or rule on NIST 24 data. Average ρ Predicted Best Rule Best Rule Next Best Rule Authentic -.37 or or majority Impostor.45 favorable for or 0.4 ±.06% 1.6 ±.10% the images. Figure 3.11 shows sample images of one person in the AR database. Registered and cropped grayscale images (size pixels) of 95 people are used for evaluation here because of missing data for some of the people. Performance on 20 images of expression, illumination and scarf occlusion per class is evaluated here since the registration of sunglass images is difficult. Figure 3.11: Sample images of the variations present in AR database. Details of the design and test of the ensembles on the AR database that are evaluated here are given in Section The ensembles designed with the Fisher linear Discriminant [115] as the base classifier are chosen here. To avoid the singularity of the within class scatter matrix in classical Linear Discriminant Analysis when only a few training images are present, a Gram Schmidt (GS) Orthogonalization based approach for LDA proposed in [115] is used here. 66

84 Table 3.3: Prediction of best fusion rule using correlation coefficients between classifier scores along with top two observed fusion rules (in terms of TER/2) for Bagging on AR data. ρ Predicted best rule Best rule Next best rule Authentic 0.79 and and or Impostor 0.72 borderline 3.8 ±.46% 4.0 ±.50% Three images (neutral expression and scream at indoor ambient lighting, and neutral expression at left lighting) from each person are used for training. The training set for each person is 3 authentic images and (94 impostors)*(3 images per person) = 282 impostor images. The test set for each person is the entire database, i.e. 20 authentic images and (94 impostors)*(20 images per person)=1880 impostor images. Best Decision Rule Prediction: A Bagging classifier ensemble composed of two LDA classifiers is designed in Section The mean and correlation coefficient between the scores for the two Bagging [79] LDA classifiers are shown in Table 3.3. From Fig. 3.3b, the authentic and impostor correlation coefficients are close to the and region and multiple fusion rules may have the same best performance. On evaluation, it is found that the and and or rules have comparable performance, thus providing good prediction. Evaluating Classifier Design: A classifier ensemble for the and rule composed of two LDA classifiers is designed in Section Details of the design can be obtained in Section Two LDA classifiers are designed by an informed selection of the impostor training set. The impostor training set is divided into male and female impostor clusters, each of which are used to train the two different classifiers. The entire authentic training set is used in both the classifiers. These are found to have favorable conditional dependence on impostor decisions. From the mean and correlation coefficient between the scores given in Table 3.4, Fig. 3.3b, predicts that the and rule is the best rule and the classifiers are favorable for the and rule. Hence, the correlation coefficient provides a good prediction and evaluation. Statistical dependence between classifiers plays a role in the accuracy of the best decision fusion rule. This confirms the need for designing classifiers to have a specific statistical dependence in order to maximize their fusion performance. It has been shown for three classifiers that one of 67

85 Table 3.4: Classifier design evaluation using correlation coefficients between classifier scores for the classifiers designed for the and rule on AR data. ρ Predicted best rule Best rule Next best rule Authentic 0.86 and and or Impostor 0.40 favorable for and 2.7 ±.40% 3.8 ±.46% and, or, majority is the best decision fusion rule at any given statistical dependence, and hence classifier design can focus on these rules. It has also been shown that the correlation coefficient between classifier scores can predict the best decision fusion rule, thus avoiding a search for the best rule. They can also evaluate if the classifiers have a better performance than independent classifiers on fusion. These results are useful in classifier fusion for biometric verification. Results on the NIST 24 fingerprint database and the AR face database confirm that the prediction and evaluation are good. Since we have seen in Section 3.2 that the or, and, and majority rules are the important fusion rules, we study how statistical dependence affects each of these rules in detail in the following sections. 3.3 Analysis of conditionally dependent classifiers for the OR Rule OVERVIEW: This section analyzes the conditional dependence that is optimal/worst as well as favorable/unfavorable for the or decision fusion rule. Obtaining classifier ensembles with optimal conditional dependence os very unlikely. This is due to the difficulties in ensemble design for a given database. Hence, the favorable / unfavorable conditional dependences are also analyzed. The two-classifier or rule fusion is simple to analyze. The optimal/worst conditional dependence can be easily visualized for the two-classifier or rule. Hence, the two-classifier or rule is analyzed to know the optimal/worst as well as favorable/unfavorable conditional dependence. The optimal conditional dependence for the general case of N classifier fusion cannot be found. This is due to the coupling between the dependence of kth classifier dependence and the dependence of all 1, 2,...,(k 1)th classifier decisions. The favorable conditional dependence for the general case of N classifier or is obtained. 68
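Before the derivations that follow, the effect of favorable conditional dependence on the or rule can be illustrated with a two-classifier joint decision table; in the sketch below the marginal error rates are held fixed at illustrative values while the joint probabilities are changed from conditional independence to complementary authentic errors and coinciding impostor errors:

```python
import numpy as np

def or_rule_errors(joint_h1, joint_h0):
    """FRR and FAR of the two-classifier OR rule from joint decision probabilities,
    joint_h[i, j] = P(u1 = i, u2 = j | H); OR rejects only when both decisions are 0."""
    return joint_h1[0, 0], 1.0 - joint_h0[0, 0]

def independent_joint(p1, p2):
    """Joint decision table for conditionally independent decisions with P(u_i = 1) = p_i."""
    return np.outer([1.0 - p1, p1], [1.0 - p2, p2])

p_d, p_fa = 0.9, 0.1   # per-classifier detection and false-acceptance probabilities

# Conditionally independent ensemble: FRR = 0.01, FAR = 0.19.
print(or_rule_errors(independent_joint(p_d, p_d), independent_joint(p_fa, p_fa)))

# Same marginal error rates, but complementary authentic errors (the classifiers never
# reject an authentic together, so Q on authentics = -1) and coinciding impostor errors
# (Q on impostors = +1): FRR drops to 0.0 and FAR drops to 0.1.
auth_favorable = np.array([[0.0, 0.1], [0.1, 0.8]])
imp_favorable = np.array([[0.9, 0.0], [0.0, 0.1]])
print(or_rule_errors(auth_favorable, imp_favorable))
```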

86 It is found that a negative (positive) correlation coefficient between authentic decisions and a positive (negative) correlation coefficient between impostor decisions is favorable (unfavorable) for two-classifier or fusion. The Q value [113] (provided in Appendix 7.4) of -1 on authentics and +1 on impostors is optimal for the two-classifier or rule. A Q value of +1 on authentics and -1 on impostors is the worst conditional dependence for the two-classifier or rule. For the general case of N classifier or fusion, kth order k = 2,...,N correlation coefficients between can describe the favorable conditional dependence. For authentic decisions, even-order correlation coefficients should be negative and odd-order correlation coefficients should be positive. For impostor decisions, even-order correlation coefficients should be positive and odd-order correlation coefficients should be negative. The OR fusion rule outputs a 1 if any of the classifiers outputs a 1; otherwise, the OR fusion output is 0. The OR fusion rule decreases the false rejection rate (FRR) from the individual classifier FRRs in general. The false acceptance rate (FAR) of the or rule in general increases over the individual classifier FARs. When the classifier decisions are conditionally-independent, the probability of false accept is given by P FA = Prob(u 0 = 1 u)p(u H 0 ) = N Prob(u 0 = 1 u) P(u i H 0 ) (3.33) u and the probability of false rejection for the fusion rule is given as follows. P FR = Prob(u 0 = 0 u)p(u H 1 ) = N Prob(u 0 = 0 u) P(u i H 1 ) (3.34) u The false acceptance rate (FAR) is determined by the classifier decisions on impostors, i.e., on the hypothesis H 0. Hence, to examine if the FAR of conditionally dependent classifiers is smaller or larger than the FAR of conditionally independent classifiers, we only need to consider the conditional dependence on impostors, i.e., on the hypothesis H 0. Similarly, the false rejection rate (FRR) is determined by the classifier decisions on authentics, i.e., on the hypothesis H 1. We only need to consider the conditional dependence on authentics, i.e., on the hypothesis H 1 to examine if the FRR is smaller or larger than the FRR of conditionally independent classifiers. Hence we separately analyze the FAR and FRR. The analysis of two classifier OR fusion is done first, following which the analysis for the i=1 i=1 69

87 general case of N classifier OR fusion is done Two Classifier OR rule: False Acceptance Probability We first analyze the FAR of the OR rule. When the classifier decisions are conditionally independent, the FAR for OR fusion of two classifiers is given by P FA = P(u H 0 ) = 2 P(u j H 0 ) u S u S j=1 = 1 P(u 1 = 0 H 0 )P(u 2 = 0 H 0 ) = P FA1 + P FA2 P FA1 P FA2 (3.35) where the subscripts refer to the individual classifiers. When the FAR of the OR rule is smaller (larger) than Eq.(3.35), favorable (unfavorable) conditional dependence on impostors for the OR rule is present. If FAR is smaller than Eq.(3.35), then the joint conditional probability of the classifier decisions must satisfy the following. P FA = 1 P(u = [ 0 0 ] H 0 ) < P FA1 + P FA2 P FA1 P FA2 (3.36) P FA1 = P(u = [ 1 0 ] H 0 ) + P(u = [ 1 1 ] H 0 ) (3.37) P FA2 = P(u = [ 0 1 ] H 0 ) + P(u = [ 1 1 ] H 0 ) (3.38) P(u = [ 0 0 ] H 0 ) + P(u = [ 0 1 ] H 0 ) + P(u = [ 1 0 ] H 0 ) + P(u = [ 1 1 ] H 0 ) = 1 (3.39) From the inequality in Eq.(3.36) and the 3 constraints in Eq.(3.37) to Eq.(3.39), four inequality constraints on each of the joint probability values can be obtained. From the inequality in Eq.(3.36), the following inequality constraint on P(u = [ 0 0 ] H 0 ) is obtained. P(u = [ 0 0 ] H 0 ) > (1 P FA1 )(1 P FA2 ) (3.40) An inequality constraint on P(u = [ 1 1 ] H 0 ) is obtained from Eq.(3.37) + Eq.(3.38) + Eq.(3.40) - Eq.(3.39) as shown below. P(u = [ 1 1 ] H 0 ) > P FA1 + P FA2 + (1 P FA1 )(1 P FA2 ) 1 = P FA1 P FA2 (3.41) 70

88 Eq.(3.37)- Eq.(3.41) provide the following inequality constraint on P(u = [ 1 0 ] H 0 ). P(u = [ 1 0 ] H 0 ) < P FA1 P FA1 P FA2. (3.42) Similarly, Eq.(3.38)- Eq.(3.41) provides an inequality constraint on P(u = [ 0 1 ] H 0 ), given by P(u = [ 0 1 ] H 0 ) < P FA2 P FA1 P FA2. (3.43) All the above 4 inequalities are subject to the constraint of Eq.(3.39). We note from the above first two inequalities that P(u 1 = u 2 H 0 ) is larger for the favorable conditionally-dependent case than the conditionally-independent case. In other words, when the classifier decisions on impostors agree more often than in the case of independent classifiers, the FAR for the OR rule is smaller than that of conditionally-independent classifier decisions. If FAR of the OR rule is larger than Eq.(3.35), then the inequality signs in Eqs.(3.40) to (3.43) are all reversed. This represents the unfavorable conditional dependence on impostors for the OR rule. There are several diversity measures enumerated in Appendix 7.4. Among them, the Yule s Q statistic [113] and the correlation coefficient ρ have a constant value of zero for independent classifiers. They also have limits of -1 and 1. For identical classifier decisions, both Q and ρ have a value of 1. For complementary classifier decisions (if one classifier decides 1, the other classifier decides 0 ), the 2-classifier Q value is -1. The correlation coefficient ρ will not always be -1 for complementary classifier decisions. Hence, the Q statistic is used here as a diversity measure for the 2 classifier decision rules. For a large number of observations, the frequencies can be approximated by probabilities and the approximate Q statistic in terms of probabilities is given by Q jk = P(u j = 1, u k = 1)P(u j = 0, u k = 0) P(u j = 0, u k = 1)P(u j = 1, u k = 0) P(u j = 1, u k = 1)P(u j = 0, u k = 0) + P(u j = 0, u k = 1)P(u j = 1, u k = 0) (3.44) An indication of the sign of the Q statistic on the impostors for favorable (unfavorable) conditional dependence for the AND rule can be found by considering the numerator of the approximate Q statistic in Eq For favorable conditional dependence, i.e., when FAR of the OR rule is 71

smaller than Eq. (3.35),

P(u_j = 1, u_k = 1 | H_0) P(u_j = 0, u_k = 0 | H_0) - P(u_j = 0, u_k = 1 | H_0) P(u_j = 1, u_k = 0 | H_0)
  > ((1 - P_{FA1})(1 - P_{FA2}))(P_{FA1} P_{FA2}) - (P_{FA2}(1 - P_{FA1}))(P_{FA1}(1 - P_{FA2})) = 0    (3.45)

This implies that the Q value is positive at favorable conditional dependence on the impostor data for the OR rule. Similarly, it can be shown that the Q value on impostors is negative at unfavorable conditional dependence for the OR rule.

Alternatively, the FAR of the OR rule using two classifiers can be written in terms of the individual classifier correct-classification probabilities on impostors (P(u_i = 0 | H_0)) as follows.

P_{FA} = P(u = [1 0] | H_0) + P(u = [0 1] | H_0) + P(u = [1 1] | H_0) = 1 - P(u = [0 0] | H_0) = 1 - P(u_1 = 0 ∩ u_2 = 0 | H_0)    (3.46)

The sets in Figure 3.12a represent the individual classifier impostor decisions on the impostor data space; the areas of the sets represent the probabilities. The complement of the intersection of the two sets represents the FAR of the OR rule. For fixed individual classifier error probabilities, we would like to find the error probability of the OR rule when there is conditional dependence between the classifier decisions. The area of each of the sets in Figure 3.12a is then fixed, but the areas of the intersection and union of the sets can vary depending on the conditional dependence. When there is conditional independence, the area of the intersection of the sets is also fixed and is equal to

P(u_1 = 0 ∩ u_2 = 0 | H_0) = P(u_1 = 0 | H_0) P(u_2 = 0 | H_0) = (1 - P_{FA1})(1 - P_{FA2})    (3.47)

We can see from Figure 3.12a, as well as from Eq. (3.46), that the FAR of the OR rule is determined by the probability of the intersection of the two sets, i.e., the probability that both classifiers correctly reject the impostors. When this probability is smaller (larger) than that of conditional independence (Eq. (3.47)), the OR rule FAR under conditional dependence is larger (smaller) than that of conditional independence. In practice, the classifiers in the ensembles used are likely to have small FARs. When the error probabilities are small as in Figure 3.12b, i.e., when P_{FA1} + P_{FA2} ≤ 1, then the

smallest intersection probability is

smallest P(u = [0 0] | H_0) = 1 - (P_{FA1} + P_{FA2}),    P_{FA1} + P_{FA2} ≤ 1    (3.48)

When the error probabilities are large as in Figure 3.12c, then the smallest intersection probability is zero.

smallest P(u = [0 0] | H_0) = 0,    P_{FA1} + P_{FA2} ≥ 1    (3.49)

When the intersection probability is the smallest, the FAR of the OR rule is the largest, given by

largest P_{FA} = 1 - P(u = [0 0] | H_0) = { 1,  P_{FA1} + P_{FA2} ≥ 1;   P_{FA1} + P_{FA2},  P_{FA1} + P_{FA2} ≤ 1 }    (3.50)

The joint conditional probabilities at the largest FAR of the OR rule are given by

P(u = [0 1] | H_0) = { 1 - P_{FA1},  P_{FA1} + P_{FA2} ≥ 1;   P_{FA2},  P_{FA1} + P_{FA2} ≤ 1 }    (3.51)
P(u = [1 0] | H_0) = { 1 - P_{FA2},  P_{FA1} + P_{FA2} ≥ 1;   P_{FA1},  P_{FA1} + P_{FA2} ≤ 1 }    (3.52)
P(u = [1 1] | H_0) = { P_{FA1} + P_{FA2} - 1,  P_{FA1} + P_{FA2} ≥ 1;   0,  P_{FA1} + P_{FA2} ≤ 1 }    (3.53)

From Figure 3.12d, it can be seen that when one set is completely enclosed by the other, the probability of intersection is the largest.

largest P(u = [0 0] | H_0) = min((1 - P_{FA1}), (1 - P_{FA2}))    (3.54)

Then the FAR of the OR rule is the smallest, and is given by

smallest P_{FA} = 1 - P(u = [0 0] | H_0) = max(P_{FA1}, P_{FA2})    (3.55)
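The piecewise limits in Eqs. (3.50) and (3.55) are easy to tabulate. The short sketch below (illustrative only; the function and variable names are my own) returns the attainable range of the two-classifier OR-rule FAR for given individual FARs, together with the conditional-independence value.

```python
def or_far_range(p_fa1, p_fa2):
    """Attainable two-classifier OR-rule FAR range for fixed individual FARs.

    Largest FAR, Eq. (3.50): min(1, p_fa1 + p_fa2), i.e. false-accept regions made disjoint.
    Smallest FAR, Eq. (3.55): max(p_fa1, p_fa2), i.e. one false-accept region inside the other.
    The conditionally independent value of Eq. (3.35) always lies inside this range.
    """
    largest = min(1.0, p_fa1 + p_fa2)
    smallest = max(p_fa1, p_fa2)
    independent = p_fa1 + p_fa2 - p_fa1 * p_fa2
    return smallest, independent, largest

print(or_far_range(0.05, 0.10))   # (0.10, 0.145, 0.15) for these hypothetical FARs
```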

The joint conditional probabilities at the smallest FAR of the OR rule are given by

P(u = [0 1] | H_0) = { P_{FA2} - P_{FA1},  P_{FA2} ≥ P_{FA1};   0,  P_{FA2} < P_{FA1} }    (3.56)
P(u = [1 0] | H_0) = { 0,  P_{FA2} ≥ P_{FA1};   P_{FA1} - P_{FA2},  P_{FA2} < P_{FA1} }    (3.57)
P(u = [1 1] | H_0) = min(P_{FA1}, P_{FA2})    (3.58)

The conditional Q values on impostors at these smallest and largest FARs of the OR rule are given by

Q_{H0} = { +1 at min P_{FA}, since P(u = [0 1] | H_0) = 0 or P(u = [1 0] | H_0) = 0;
           -1 at max P_{FA}, since P(u = [0 0] | H_0) = 0 or P(u = [1 1] | H_0) = 0 }    (3.59)

To summarize, when Q_{H0} is positive (from Eq. (3.45)), favorable conditional dependence for the OR rule is present. When Q_{H0} = +1 (from Eq. (3.59)), the FAR of the OR rule has its smallest value. Q_{H0} is negative at unfavorable conditional dependence for the OR rule, and the largest FAR is obtained when Q_{H0} = -1 (from Eq. (3.59)).

Two Classifier OR Rule: False Rejection Probability

We now analyze the FRR of the OR rule. When the classifier decisions are conditionally independent, the FR probability for OR fusion of two classifiers is given by

P_{FR} = Σ_{u ∈ S^c} P(u | H_1) = Σ_{u ∈ S^c} ∏_{j=1}^{2} P(u_j | H_1) = P(u_1 = 0 | H_1) P(u_2 = 0 | H_1) = P_{FR1} P_{FR2}    (3.60)

The dependence statistics conditioned on H_1 for which the FRR of the OR rule is smaller (larger) than Eq. (3.60) are of interest here. If the FRR is smaller than Eq. (3.60), then the joint conditional

Figure 3.12: (a) General case of impostor classification by two classifiers. The classifiers declare impostor in the sets shown; each classifier has a different color. The intersection of the two sets is declared impostor by the OR rule, and the complement of the intersection is the FAR of the OR rule. (b) The largest FAR for the OR rule when the sum of the individual classifier FARs is large. (c) The largest FAR of the OR rule when the sum of the individual classifier FARs is small. (d) The smallest FAR of the OR rule.
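Since the analysis above is driven by the pairwise Q statistic of Eq. (3.44), it is convenient to estimate it directly from labelled decisions. The sketch below is a minimal illustration under the stated large-sample approximation; the function name and the example decision vectors are hypothetical.

```python
import numpy as np

def yule_q(u_j, u_k):
    """Pairwise Yule's Q statistic of Eq. (3.44), estimated from 0/1 decision vectors."""
    u_j, u_k = np.asarray(u_j), np.asarray(u_k)
    n11 = np.mean((u_j == 1) & (u_k == 1))   # both decide authentic
    n00 = np.mean((u_j == 0) & (u_k == 0))   # both decide impostor
    n01 = np.mean((u_j == 0) & (u_k == 1))
    n10 = np.mean((u_j == 1) & (u_k == 0))
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

# Hypothetical impostor decisions from two classifiers that usually agree (Q_H0 > 0),
# which the analysis in this section identifies as favorable dependence for the OR rule.
u1 = np.array([0, 0, 0, 0, 1, 1, 0, 0, 0, 0])
u2 = np.array([0, 0, 0, 0, 1, 0, 0, 0, 1, 0])
print(yule_q(u1, u2))
```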

probability of the classifier decisions must satisfy the following.

P_{FR} = P(u = [0 0] | H_1) < P_{FR1} P_{FR2}    (3.61)
P_{FR1} = P(u = [0 0] | H_1) + P(u = [0 1] | H_1)    (3.62)
P_{FR2} = P(u = [0 0] | H_1) + P(u = [1 0] | H_1)    (3.63)
P(u = [0 0] | H_1) + P(u = [0 1] | H_1) + P(u = [1 0] | H_1) + P(u = [1 1] | H_1) = 1    (3.64)

Using the above inequality and the three constraints, inequality constraints on the joint probabilities conditioned on authentics can be obtained. The derivation is similar to that of the inequality constraints on the joint probabilities conditioned on impostors given in Eqs. (3.40) to (3.43). The four inequality constraints on the joint probabilities conditioned on authentics are given below.

P(u = [0 0] | H_1) < P_{FR1} P_{FR2}    (3.65)
P(u = [1 1] | H_1) < (1 - P_{FR1})(1 - P_{FR2})    (3.66)
P(u = [0 1] | H_1) > P_{FR1}(1 - P_{FR2})    (3.67)
P(u = [1 0] | H_1) > P_{FR2}(1 - P_{FR1})    (3.68)

The above four inequalities are subject to the constraint of Eq. (3.64). We note from Eqs. (3.67) and (3.68) that P(u_1 ≠ u_2 | H_1) is larger for favorable conditional dependence for the OR rule than for conditional independence. In other words, when the classifier decisions disagree on the authentic data more than independent classifiers do, the FRR of the OR rule is smaller than that of conditionally independent classifier decisions. Similarly, if the FRR of the OR rule is larger than Eq. (3.60), then the inequality signs in Eqs. (3.65) to (3.68) are all reversed, again subject to the constraint of Eq. (3.64). This signifies unfavorable conditional dependence on authentics for the OR rule.

By considering the numerator of the Q statistic (Eq. (3.44)) on authentic data in terms of probabilities, we can get an indication of the sign of the favorable dependence statistics for the OR rule

conditioned on the authentics. For favorable dependence,

P(u_j = 1, u_k = 1 | H_1) P(u_j = 0, u_k = 0 | H_1) - P(u_j = 0, u_k = 1 | H_1) P(u_j = 1, u_k = 0 | H_1)
  < ((1 - P_{FR1})(1 - P_{FR2}))(P_{FR1} P_{FR2}) - (P_{FR1}(1 - P_{FR2}))(P_{FR2}(1 - P_{FR1})) = 0    (3.69)

This implies that the Q statistic of the conditionally dependent classifiers on the authentic data, Q_{H1}, is negative for favorable dependence for the OR rule. Similarly, it can be seen that Q_{H1} is positive for unfavorable conditional dependence for the OR rule.

The P_{FR} of the OR rule using two classifiers can be written in terms of the individual classifier correct-classification probabilities on the authentics (P(u_i = 1 | H_1)) as follows.

P_{FR} = P(u = [0 0] | H_1)
       = 1 - {P(u = [1 0] | H_1) + P(u = [0 1] | H_1) + P(u = [1 1] | H_1)}
       = 1 - P(u_1 = 1 ∪ u_2 = 1 | H_1)
       = 1 - {P(u_1 = 1 | H_1) + P(u_2 = 1 | H_1) - P(u_1 = 1 ∩ u_2 = 1 | H_1)}
       = 1 - {(1 - P_{FR1}) + (1 - P_{FR2}) - P(u_1 = 1 ∩ u_2 = 1 | H_1)}
       = P_{FR1} + P_{FR2} - 1 + P(u_1 = 1 ∩ u_2 = 1 | H_1)    (3.70)

The sets in Figure 3.13a represent the individual classifier authentic decisions on the authentic data. The area of each set represents the probability of correct classification of that classifier, and the complement of the union of the two sets represents the P_{FR} of the OR rule. For fixed individual classifier error probabilities, the area of each of the sets in Figure 3.13 is fixed, while the areas of the intersection and union of the sets can vary depending on the conditional dependence. When there is conditional independence, the areas of the intersection and union are fixed and are equal to the following.

Figure 3.13: (a) General case of classification on authentics by two classifiers. The classifiers declare authentic in the sets shown (each classifier has a different color). The union of the two sets is declared authentic by the OR rule, and the complement of the union is the FRR of the OR rule. (b) Largest FRR for the OR rule. (c) Smallest FRR for the OR rule when the sum of the FRRs of the individual classifiers is large. (d) Smallest FRR for the OR rule when the sum of the FRRs of the individual classifiers is small.

P(u_1 = 1 ∩ u_2 = 1 | H_1) = P(u_1 = 1 | H_1) P(u_2 = 1 | H_1) = (1 - P_{FR1})(1 - P_{FR2})    (3.71)

P(u_1 = 1 ∪ u_2 = 1 | H_1) = P(u_1 = 1 | H_1) + P(u_2 = 1 | H_1) - P(u_1 = 1 ∩ u_2 = 1 | H_1)
                           = (1 - P_{FR1}) + (1 - P_{FR2}) - (1 - P_{FR1})(1 - P_{FR2}) = 1 - P_{FR1} P_{FR2}    (3.72)

We can see from Figure 3.13, as well as from Eq. (3.70), that the FRR of the OR rule is determined by the probability of the intersection of the two sets, i.e., the probability that both classifiers make correct decisions on the authentics. When this probability is smaller (larger) than that of conditional

independence (Eq. (3.71)), the P_{FR} of the OR rule under conditional dependence is smaller (larger) than that of conditional independence (Eq. (3.60)).

From Figure 3.13b, it can be seen that when one set is completely enclosed by the other, the probability of intersection is the largest.

largest P(u = [1 1] | H_1) = min((1 - P_{FR1}), (1 - P_{FR2}))    (3.73)

When the probability of intersection is the largest, the P_{FR} of the OR rule is the largest.

largest P_{FR}^{OR(1,2)} = P(u = [0 0] | H_1) = P_{FR1} + P_{FR2} - 1 + min((1 - P_{FR1}), (1 - P_{FR2})) = min(P_{FR1}, P_{FR2})    (3.74)

The joint conditional probabilities at the largest P_{FR} of the OR rule are given in Eqs. (3.75) to (3.77).

P(u = [1 0] | H_1) = { P_{FR2} - P_{FR1},  P_{FR2} ≥ P_{FR1};   0,  P_{FR2} < P_{FR1} }    (3.75)
P(u = [0 1] | H_1) = { 0,  P_{FR2} ≥ P_{FR1};   P_{FR1} - P_{FR2},  P_{FR2} < P_{FR1} }    (3.76)
P(u = [0 0] | H_1) = { P_{FR1},  P_{FR2} ≥ P_{FR1};   P_{FR2},  P_{FR2} < P_{FR1} } = min(P_{FR1}, P_{FR2})    (3.77)

When the error probabilities are small as in Figure 3.13c, specifically when P_{FR1} + P_{FR2} ≤ 1, then the smallest intersection probability is given as follows.

smallest P(u = [1 1] | H_1) = 1 - (P_{FR1} + P_{FR2}),    P_{FR1} + P_{FR2} ≤ 1    (3.78)

When the error probabilities are large as in Figure 3.13d, specifically when P_{FR1} + P_{FR2} > 1, then the smallest intersection probability is zero.

smallest P(u = [1 1] | H_1) = 0,    P_{FR1} + P_{FR2} ≥ 1    (3.79)

When the probability of intersection is the smallest, the FRR of the OR rule is the smallest.

smallest P_{FR}^{OR(1,2)} = P(u = [0 0] | H_1) = { P_{FR1} + P_{FR2} - 1,  P_{FR1} + P_{FR2} ≥ 1;   0,  P_{FR1} + P_{FR2} ≤ 1 }    (3.80)

The joint conditional probabilities at the smallest FRR of the OR rule are given below.

P(u = [1 0] | H_1) = { 1 - P_{FR1},  P_{FR1} + P_{FR2} ≥ 1;   P_{FR2},  P_{FR1} + P_{FR2} ≤ 1 }    (3.81)
P(u = [0 1] | H_1) = { 1 - P_{FR2},  P_{FR1} + P_{FR2} ≥ 1;   P_{FR1},  P_{FR1} + P_{FR2} ≤ 1 }    (3.82)
P(u = [0 0] | H_1) = { P_{FR1} + P_{FR2} - 1,  P_{FR1} + P_{FR2} ≥ 1;   0,  P_{FR1} + P_{FR2} ≤ 1 }    (3.83)

The conditional Q values on authentics at these largest and smallest P_{FR} of the OR rule are

Q_{H1} = +1 for the largest P_{FR},    (3.84)

since either P(u = [0 1] | H_1) or P(u = [1 0] | H_1) is zero, and

Q_{H1} = -1 for the smallest P_{FR},    (3.85)

since either P(u = [0 0] | H_1) or P(u = [1 1] | H_1) is zero.

To summarize, Q_{H1} is negative (positive) when favorable (unfavorable) conditional dependence on authentics for the OR rule is present. When Q_{H1} = -1, optimal conditional dependence on authentics for the OR rule is present, and Q_{H1} = +1 at the worst conditional dependence on authentics for the OR rule [108].

Analysis of favorable statistical dependence for the N classifier OR rule

We now extend the results on favorable conditional dependence between classifiers for the OR rule to the general case of N > 2 classifier fusion. We first focus on the false rejection rate (FRR) of the OR rule at favorable conditional dependence. For OR rule fusion of N classifiers, the

P_{FR} is given by

P_{FR} = P(u = [0 0 ⋯ 0] | H_1)    (3.86)

We can write the probability of detection (P_D), the complement of the above equation, as follows.

P_D = 1 - P_{FR} = 1 - P(u = [0 0 ⋯ 0] | H_1) = P(⋃_{j=1}^{N} {u_j = 1} | H_1)
    = Σ_{j=1}^{N} P(u_j = 1 | H_1) - Σ_{j<k} P(u_j = 1 ∩ u_k = 1 | H_1) + Σ_{j<k<l} P(u_j = 1 ∩ u_k = 1 ∩ u_l = 1 | H_1) - ⋯ + (-1)^{N-1} P(u_1 = 1 ∩ u_2 = 1 ∩ ⋯ ∩ u_N = 1 | H_1)    (3.87)

Minimizing the FRR is the same as maximizing P_D. From the above equation, we can maximize P_D by minimizing all the terms on the right-hand side with a negative sign in front of them, i.e., the second-, fourth- and other even-order terms, and maximizing all the terms with a positive sign in front of them, i.e., the first-, third-, fifth- and other odd-order terms. We can quantify the conditional dependence through correlation coefficients of the normalized decisions. The normalized decisions with zero mean and unit variance, z_h(i), are given by

z_h(i) = (u_i - P(u_i = 1 | H_h)) / sqrt(P(u_i = 1 | H_h)(1 - P(u_i = 1 | H_h))),    h = 0, 1    (3.88)

The second- and higher-order correlation coefficients of these normalized variables are defined as follows.

second-order coefficient: ρ_h(i_1, i_2) = E(z_h(i_1) z_h(i_2)),  i_1 ≠ i_2,  1 ≤ i_1, i_2 ≤ N
third-order coefficient: ρ_h(i_1, i_2, i_3) = E(z_h(i_1) z_h(i_2) z_h(i_3)),  i_1 ≠ i_2 ≠ i_3,  1 ≤ i_1, i_2, i_3 ≤ N    (3.89)
kth-order coefficient: ρ_h(i_1, i_2, ⋯, i_k) = E(z_h(i_1) z_h(i_2) ⋯ z_h(i_k)),  i_j ≠ i_l,  1 ≤ i_j, i_l ≤ N,  1 ≤ j, l ≤ k

We first consider the sign of these correlation coefficients for favorable conditional dependence on authentics. There, the values of the kth-order probabilities in Eq. (3.87) are better

(smaller for the even-order probabilities and larger for the odd-order probabilities) than the independent-classifier values. For independent classifier ensembles, the probability of the intersection of events is equal to the product of the probabilities of the individual events. The values of the kth-order probabilities in Eq. (3.87) for independent classifiers are given by

second-order probability: P(u_{i1} = 1, u_{i2} = 1 | H_1) = P(u_{i1} = 1 | H_1) P(u_{i2} = 1 | H_1),  i_1 ≠ i_2,  1 ≤ i_1, i_2 ≤ N
third-order probability: P(u_{i1} = 1, u_{i2} = 1, u_{i3} = 1 | H_1) = ∏_{j=1}^{3} P(u_{ij} = 1 | H_1),  i_1 ≠ i_2 ≠ i_3,  1 ≤ i_1, i_2, i_3 ≤ N    (3.90)
kth-order probability: P(u_{i1} = 1, u_{i2} = 1, ⋯, u_{ik} = 1 | H_1) = ∏_{j=1}^{k} P(u_{ij} = 1 | H_1),  i_j ≠ i_l,  1 ≤ i_j, i_l ≤ N,  1 ≤ j, l ≤ k

For the values of the kth-order probabilities in Eq. (3.87) to be better (smaller for the even-order probabilities and larger for the odd-order probabilities) than the independent-classifier values, they need to be as follows.

second-order probability: P(u_{i1} = 1, u_{i2} = 1 | H_1) ≤ P(u_{i1} = 1 | H_1) P(u_{i2} = 1 | H_1),  i_1 ≠ i_2,  1 ≤ i_1, i_2 ≤ N
third-order probability: P(u_{i1} = 1, u_{i2} = 1, u_{i3} = 1 | H_1) ≥ ∏_{j=1}^{3} P(u_{ij} = 1 | H_1),  i_1 ≠ i_2 ≠ i_3,  1 ≤ i_1, i_2, i_3 ≤ N    (3.91)
kth-order probability: P(u_{i1} = 1, ⋯, u_{ik} = 1 | H_1) ≤ ∏_{j=1}^{k} P(u_{ij} = 1 | H_1) if k is even, and P(u_{i1} = 1, ⋯, u_{ik} = 1 | H_1) ≥ ∏_{j=1}^{k} P(u_{ij} = 1 | H_1) if k is odd,  i_j ≠ i_l,  1 ≤ i_j, i_l ≤ N,  1 ≤ j, l ≤ k

We can get the sign of the kth-order correlation coefficients based on the above equations. The

second-order correlation coefficient of the normalized decisions for authentics (H_1) is given by

ρ_1(i_1, i_2) = E(z_1(i_1) z_1(i_2) | H_1),  i_1 ≠ i_2
  = E((u_{i1} - P(u_{i1} = 1 | H_1))(u_{i2} - P(u_{i2} = 1 | H_1)) | H_1) / sqrt(P(u_{i1} = 1 | H_1)(1 - P(u_{i1} = 1 | H_1)) P(u_{i2} = 1 | H_1)(1 - P(u_{i2} = 1 | H_1)))
  = (E(u_{i1} u_{i2} | H_1) - P(u_{i1} = 1 | H_1) P(u_{i2} = 1 | H_1)) / sqrt(P(u_{i1} = 1 | H_1)(1 - P(u_{i1} = 1 | H_1)) P(u_{i2} = 1 | H_1)(1 - P(u_{i2} = 1 | H_1)))
  ≤ 0,  since E(u_{i1} u_{i2} | H_1) ≤ P(u_{i1} = 1 | H_1) P(u_{i2} = 1 | H_1) from Eq. (3.91).    (3.92)

Similarly, the third-order correlation coefficient of the normalized decisions for authentics (H_1) is given by

ρ_1(i_1, i_2, i_3) = E(z_1(i_1) z_1(i_2) z_1(i_3) | H_1)
  = E(∏_{j=1}^{3} (u_{ij} - P(u_{ij} = 1 | H_1)) | H_1) / ∏_{j=1}^{3} sqrt(P(u_{ij} = 1 | H_1)(1 - P(u_{ij} = 1 | H_1)))
  = (E(u_{i1} u_{i2} u_{i3} | H_1) - Σ_{j=1}^{3} E(u_{im} u_{in} | H_1) P(u_{ij} = 1 | H_1) + 2 ∏_{j=1}^{3} P(u_{ij} = 1 | H_1)) / ∏_{j=1}^{3} sqrt(P(u_{ij} = 1 | H_1)(1 - P(u_{ij} = 1 | H_1))),  where {m, n} = {1, 2, 3} \ {j}
  ≥ (E(u_{i1} u_{i2} u_{i3} | H_1) - ∏_{j=1}^{3} P(u_{ij} = 1 | H_1)) / ∏_{j=1}^{3} sqrt(P(u_{ij} = 1 | H_1)(1 - P(u_{ij} = 1 | H_1))),  since E(u_{im} u_{in} | H_1) ≤ P(u_{im} = 1 | H_1) P(u_{in} = 1 | H_1), 1 ≤ m, n ≤ 3, m ≠ n, from Eq. (3.91)
  ≥ 0,  since E(u_{i1} u_{i2} u_{i3} | H_1) ≥ ∏_{j=1}^{3} P(u_{ij} = 1 | H_1) from Eq. (3.91).    (3.93)

Following the same procedure, it can be shown that, for favorable conditional dependence for the OR rule, the even-order correlation coefficients on authentics are negative and the odd-order correlation coefficients on authentics are positive. It should be noted that this is only a sufficient condition for favorable conditional dependence. Since the probability of detection (Eq. (3.87)) is composed of all

the kth-order probabilities, k = 2, 3, ..., N, the relative weighting of these terms plays a role in the favorable conditional dependence for the OR rule. It is possible for favorable conditional dependence for the OR rule to exist when some of the even-order correlation coefficients are positive and some of the odd-order correlation coefficients are negative.

We now focus on the FAR of the OR rule to determine the favorable conditional dependence on impostors. For OR rule fusion of N classifiers, P_{FA} is given by

P_{FA} = 1 - P(u = [0 0 ⋯ 0] | H_0)    (3.94)

Similar to Eq. (3.87), we can write the FAR of the OR rule as

P_{FA} = 1 - P(u = [0 0 ⋯ 0] | H_0) = P(⋃_{j=1}^{N} {u_j = 1} | H_0)
       = Σ_{j=1}^{N} P(u_j = 1 | H_0) - Σ_{j<k} P(u_j = 1 ∩ u_k = 1 | H_0) + Σ_{j<k<l} P(u_j = 1 ∩ u_k = 1 ∩ u_l = 1 | H_0) - ⋯ + (-1)^{N-1} P(u_1 = 1 ∩ u_2 = 1 ∩ ⋯ ∩ u_N = 1 | H_0)    (3.95)

Using an analysis similar to that used to maximize P_D for the OR rule, reducing the FAR of the OR rule implies that, on impostors, the even-order normalized correlation coefficients should be positive and the odd-order normalized correlation coefficients should be negative. For the two-classifier case, the second-order correlation coefficient has the same sign as the Q statistic, which should be positive for impostors and has an upper limit of 1 for the best conditional dependence on impostors, thus agreeing with our previous analysis of two-classifier OR fusion [108].

The limits of the kth-order conditional probabilities for the best conditional dependence for the OR rule are difficult to obtain, because they depend on the individual classifier error probabilities as well as on the limits of the lth-order conditional probabilities, l = 2, 3, ..., k-1. Since this problem is complex, the solution is intractable. The limits of the correlation coefficients that minimize the error of N-classifier OR fusion are even more difficult to obtain, since they depend on non-linear functions of multiple correlation coefficients, which are difficult to solve.
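The normalized-decision correlation coefficients of Eqs. (3.88) and (3.89) can be estimated empirically from an ensemble's decisions to check whether its dependence is favorable for the OR rule. The sketch below is a minimal illustration (the function names and the decision matrix are hypothetical); it estimates the second- and third-order coefficients from sample decisions under one hypothesis.

```python
import numpy as np
from itertools import combinations

def normalized_decisions(U):
    """Normalize 0/1 decisions per Eq. (3.88); U is (num_samples, N) under one hypothesis."""
    p = U.mean(axis=0)                          # estimate of P(u_i = 1 | H_h)
    return (U - p) / np.sqrt(p * (1.0 - p))

def correlation_coefficients(U, order):
    """Estimate the order-k coefficients of Eq. (3.89) for every subset of k classifiers."""
    Z = normalized_decisions(U.astype(float))
    return {idx: np.mean(np.prod(Z[:, idx], axis=1))
            for idx in combinations(range(U.shape[1]), order)}

# Hypothetical authentic-data decisions from a 3-classifier ensemble.
U_h1 = np.array([[1, 0, 1],
                 [0, 1, 1],
                 [1, 1, 0],
                 [1, 0, 1],
                 [0, 1, 1],
                 [1, 1, 1]])
print(correlation_coefficients(U_h1, 2))   # favorable for the OR rule if these are negative
print(correlation_coefficients(U_h1, 3))   # favorable for the OR rule if these are positive
```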

3.4 Analysis of conditionally-dependent classifiers for the AND rule

The AND fusion rule is the complement of the OR fusion rule, and an analysis similar to that of the OR rule can be carried out to find the conditional dependence favorable and unfavorable to the AND fusion rule. The analysis of P_{FA} for the AND rule is similar to the analysis of P_{FR} for the OR rule, and the analysis of P_{FR} for the AND rule is similar to the analysis of P_{FA} for the OR rule. For favorable conditional dependence for the AND rule, the Q statistics between pairs of classifiers should be positive on the authentics and negative on the impostors (and the reverse for unfavorable dependence). Figures 3.14 and 3.15 show the limits of P_{FA} and P_{FR} for the two-classifier AND rule. For N-classifier AND fusion, positive even-order correlation coefficients and negative odd-order correlation coefficients are favorable on the authentics, while negative even-order correlation coefficients and positive odd-order correlation coefficients are favorable on the impostors. In the next section, we analyze the Majority rule in a way similar to our analysis of the OR and AND rules, which is somewhat different from the analysis of the Majority rule done in the literature.

3.5 Analysis of conditionally dependent classifiers for the Majority Rule

OVERVIEW: The optimal statistical dependence for the 3-classifier Majority rule is found in this section. The pairwise classifier Q values can be positive or negative at the optimal statistical dependence. Hence, a 3-classifier diversity measure is needed to characterize the statistical dependence for the Majority rule. This result clarifies the ambiguous results found by Kuncheva et al. [27] in predicting optimal statistical dependence using the pairwise classifier Q statistic.

Matan [35], Demirekler et al. [28] and Kuncheva et al. [27] (in chronological order) have provided analyses of the limits of accuracy of the Majority rule. The corresponding distributions of decisions at the best/worst accuracy of the Majority rule are provided by Demirekler et al. [28]. Our analysis is similar to that of Demirekler et al. [28]. It differs from that of Kuncheva et al. [27] because we use the values of the conditional distributions at the best/worst-case accuracy to find the values of the diversity measures at those distributions. Kuncheva et al. [27] empirically find the values of the Q statistic for the best/worst distributions for various simulated classifier ensembles, and they observed that these provide ambiguous results. They consider only the second-order statistics, i.e., the average Q values between pairs of classifiers, which we find to be insufficient to describe completely the statistical dependence of an N > 2 classifier ensemble.

Figure 3.14: (a) General case of the sets of impostor images classified correctly by two classifiers. The set of impostor images classified correctly by each classifier has a different color. The union of the two sets is correctly classified by the AND rule, and the complement of the union corresponds to the probability of false acceptance (FA). (b) The largest probability of FA for the AND rule when the sum of the individual classifier FA probabilities is large. (c) The smallest probability of FA for the AND rule when the sum of the individual classifier FA probabilities is small. (d) The smallest probability of FA for the AND rule when the sum of the individual classifier FA probabilities is large.

Figure 3.15: (a) General case of the sets of authentic images classified correctly by two classifiers. The set of authentic images classified correctly by each classifier has a different color. The intersection of the two sets is correctly classified by the AND rule, and the complement of the intersection corresponds to the probability of false rejection (FR). (b) Largest probability of FR for the AND rule when the sum of the FR probabilities of the individual classifiers is large. (c) Largest probability of FR for the AND rule when the sum of the FR probabilities of the individual classifiers is small. (d) Smallest probability of FR for the AND rule.
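Because the AND rule is the complement of the OR rule, its error rates under conditional independence follow by swapping the roles of the FAR and FRR in Eqs. (3.35) and (3.60). The sketch below is a small illustration of that duality (not from the dissertation; the numeric values and function name are hypothetical).

```python
def and_fusion_errors_independent(p_fa, p_fr):
    """FAR/FRR of N-classifier AND fusion for conditionally independent decisions.

    The AND rule accepts only if every classifier accepts, so its FAR is the product
    of the individual FARs and its FRR is one minus the product of (1 - FRR_i):
    the dual of the OR-rule expressions in Eqs. (3.35) and (3.60).
    """
    far = 1.0
    both_correct = 1.0
    for fa, fr in zip(p_fa, p_fr):
        far *= fa
        both_correct *= (1.0 - fr)
    return far, 1.0 - both_correct

print(and_fusion_errors_independent([0.2, 0.3], [0.01, 0.02]))  # (0.06, 0.0298)
```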

For 3-classifier Majority rule fusion, it is found that a 3-classifier diversity measure is needed to characterize the optimal statistical dependence. We show that the 2-classifier Q values can be positive or negative at the optimal statistical dependence for the Majority rule. Hence, it is not surprising that Kuncheva et al. [27] found ambiguous results when using 2-classifier Q values to predict the optimality of the statistical dependence for the Majority rule. In this section, we follow the notation of, and an analysis similar to, the OR/AND fusion rules for consistency within this dissertation, instead of following the notation of the previous work on Majority rule analysis [35], [28], [27].

When the classifier decisions are conditionally independent, the error probabilities of MAJORITY fusion using three classifiers are as follows.

P_{FA} = P_{FA1} P_{FA2} (1 - P_{FA3}) + P_{FA2} P_{FA3} (1 - P_{FA1}) + P_{FA3} P_{FA1} (1 - P_{FA2}) + P_{FA1} P_{FA2} P_{FA3}
       = P_{FA1} P_{FA2} + P_{FA2} P_{FA3} + P_{FA3} P_{FA1} - 2 P_{FA1} P_{FA2} P_{FA3}    (3.96)

P_{FR} = P_{FR1} P_{FR2} (1 - P_{FR3}) + P_{FR2} P_{FR3} (1 - P_{FR1}) + P_{FR3} P_{FR1} (1 - P_{FR2}) + P_{FR1} P_{FR2} P_{FR3}
       = P_{FR1} P_{FR2} + P_{FR2} P_{FR3} + P_{FR3} P_{FR1} - 2 P_{FR1} P_{FR2} P_{FR3}    (3.97)

We show the optimal conditional dependence needed for three-classifier fusion with the MAJORITY rule to illustrate the advantages of our approach. The Majority rule is symmetric in authentics and impostors; therefore, the favorable/optimal conditional dependence for the Majority rule is the same for both, and it is sufficient to analyze the favorable/optimal conditional dependence for only one of them (authentics or impostors). The favorable/optimal conditional dependence on authentics for the Majority rule is analyzed below. In order to have favorable conditional dependence for the given individual classifier FRRs, there

are the following constraints.

P_{FR} = P(u = [0 0 0] | H_1) + P(u = [1 0 0] | H_1) + P(u = [0 1 0] | H_1) + P(u = [0 0 1] | H_1)
       < P_{FR1} P_{FR2} + P_{FR2} P_{FR3} + P_{FR3} P_{FR1} - 2 P_{FR1} P_{FR2} P_{FR3}    (3.98)

P_{FR1} = P(u = [0 0 0] | H_1) + P(u = [0 0 1] | H_1) + P(u = [0 1 0] | H_1) + P(u = [0 1 1] | H_1)    (3.99)
P_{FR2} = P(u = [0 0 0] | H_1) + P(u = [0 0 1] | H_1) + P(u = [1 0 0] | H_1) + P(u = [1 0 1] | H_1)    (3.100)
P_{FR3} = P(u = [0 0 0] | H_1) + P(u = [0 1 0] | H_1) + P(u = [1 0 0] | H_1) + P(u = [1 1 0] | H_1)    (3.101)
Σ_u P(u | H_1) = 1    (3.102)

We have 4 equations and 1 inequality for favorable conditional dependence, while there are 8 variables. Hence, there are infinitely many solutions with favorable conditional dependence. To find the optimal conditional dependence, we have to solve a constrained minimization problem with the constraints in Eqs. (3.99) to (3.102). There are two cases for the solution, which are provided below.

Case 1: P_{FR1} + P_{FR2} + P_{FR3} ≤ 1

When the classifier FRRs are low, it is possible to have a minimum error of zero for the Majority rule. Specifically, the sum of the classifier error probabilities has to be at most 1 for this to happen, and it is likely that the classifiers used have low FRRs satisfying this case. The idea here is to concentrate all the probability on the cases where two or three classifiers are correct. By ensuring zero probability when zero or only one classifier is correct, the Majority rule makes no errors on the authentics. This is simpler to understand by considering the joint probabilities when one classifier is correct. Assume, without loss of generality, that classifier 1 is correct. There are four combinations of decisions from the other two classifiers, as shown in the last four equations

in Eq. (3.103). The last three of those equations represent the cases where two or three classifiers are correct, and the sum of these probabilities gives the accuracy of classifier 1. By splitting the correct-classification probability of classifier 1 across those last three equations, it is ensured that the probability that only classifier 1 is correct is zero. In addition to the 4 constraints in Eqs. (3.99) to (3.102), the probabilities that exactly one classifier is correct and that all classifiers are incorrect should be zero, for a zero FRR of the Majority rule. The resulting 8 equations in 8 variables have the unique solution given in Eq. (3.103).

P(u = [0 0 0] | H_1) = 0
P(u = [0 0 1] | H_1) = 0
P(u = [0 1 0] | H_1) = 0
P(u = [0 1 1] | H_1) = P_{FR1}
P(u = [1 0 0] | H_1) = 0
P(u = [1 0 1] | H_1) = P_{FR2}
P(u = [1 1 0] | H_1) = P_{FR3}
P(u = [1 1 1] | H_1) = 1 - (P_{FR1} + P_{FR2} + P_{FR3})    (3.103)

At optimal conditional dependence on authentics when P_{FR1} + P_{FR2} + P_{FR3} ≤ 1, the Majority rule FRR is zero.

smallest P_{FR} = 0,    P_{FR1} + P_{FR2} + P_{FR3} ≤ 1    (3.104)

The Q statistic for two classifiers is given by

Q_{j,k} = (N^{11} N^{00} - N^{01} N^{10}) / (N^{11} N^{00} + N^{01} N^{10})
        = ((1 - (P_{FRj} + P_{FRk}))(0) - P_{FRj} P_{FRk}) / ((1 - (P_{FRj} + P_{FRk}))(0) + P_{FRj} P_{FRk}) = -1    (3.105)

Here, the Q statistic on the authentics between any pair of classifiers is -1, since the probability that both classifiers of the pair are simultaneously incorrect on the authentics is zero at this solution.

Case 2: P_{FR1} + P_{FR2} + P_{FR3} > 1

When the classifier FRRs are large, the Majority rule FRR cannot be zero. Specifically, when

the sum of the classifier FRRs is greater than 1, the Majority rule FRR cannot be zero. In this case, the idea is to maximize the probability that exactly two classifiers are correct. The probability that all three classifiers make correct decisions actually increases the Majority rule FRR. Although this may seem non-intuitive, it can be proved. Consider the sum of the three equations (3.99) to (3.101) minus Eq. (3.102). After some algebra, this leads to the following equation.

P_{FR1} + P_{FR2} + P_{FR3} - 1 = P_{FR} + P(u = [0 0 0] | H_1) - P(u = [1 1 1] | H_1)    (3.106)
P_{FR} = (P_{FR1} + P_{FR2} + P_{FR3} - 1) - P(u = [0 0 0] | H_1) + P(u = [1 1 1] | H_1)

From the above equation, it can be seen that the Majority rule FRR is minimized when the probability that all three classifiers make correct decisions is minimized. This proves the statement (for Case 2) that the probability that exactly two classifiers are correct should be maximized in order to minimize the Majority rule FRR. By setting to zero the probabilities that exactly one classifier makes a correct decision and that all three classifiers make correct decisions, the smallest Majority rule FRR is obtained. This provides 4 equations in addition to the 4 constraints in Eqs. (3.99) to (3.102). The resulting 8 equations in 8 variables have the unique solution given in Eq. (3.107).

P(u = [0 0 0] | H_1) = (1/2)(P_{FR1} + P_{FR2} + P_{FR3} - 1)
P(u = [0 0 1] | H_1) = 0
P(u = [0 1 0] | H_1) = 0
P(u = [0 1 1] | H_1) = P_{FR1} - (1/2)(P_{FR1} + P_{FR2} + P_{FR3} - 1)
P(u = [1 0 0] | H_1) = 0
P(u = [1 0 1] | H_1) = P_{FR2} - (1/2)(P_{FR1} + P_{FR2} + P_{FR3} - 1)
P(u = [1 1 0] | H_1) = P_{FR3} - (1/2)(P_{FR1} + P_{FR2} + P_{FR3} - 1)
P(u = [1 1 1] | H_1) = 0    (3.107)

At optimal conditional dependence on authentics when P_{FR1} + P_{FR2} + P_{FR3} > 1, the Majority rule FRR is given by

smallest P_{FR} = (1/2)(P_{FR1} + P_{FR2} + P_{FR3} - 1),    P_{FR1} + P_{FR2} + P_{FR3} > 1    (3.108)
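The two optimal joint distributions of Eqs. (3.103) and (3.107) can be written down directly from the individual FRRs. The sketch below (an illustration only; the function name is mine, and it assumes individual FRRs for which all entries of Eq. (3.107) remain non-negative) builds the distribution and reports the resulting Majority-rule FRR of Eqs. (3.104) and (3.108).

```python
def optimal_majority_joint_h1(p_fr1, p_fr2, p_fr3):
    """Optimal joint authentic-decision pmf for the 3-classifier Majority rule.

    Returns a dict mapping decision vectors (u1, u2, u3) to P(u | H1), following
    Eq. (3.103) when the FRRs sum to at most 1 and Eq. (3.107) otherwise.
    """
    s = p_fr1 + p_fr2 + p_fr3
    if s <= 1.0:                       # Case 1: a zero Majority-rule FRR is achievable
        p000, p111, leftover = 0.0, 1.0 - s, 0.0
    else:                              # Case 2: mass is also needed on the "all wrong" cell
        p000, p111, leftover = 0.5 * (s - 1.0), 0.0, 0.5 * (s - 1.0)
    joint = {
        (0, 0, 0): p000, (0, 0, 1): 0.0, (0, 1, 0): 0.0, (1, 0, 0): 0.0,
        (0, 1, 1): p_fr1 - leftover,   # only classifier 1 wrong
        (1, 0, 1): p_fr2 - leftover,   # only classifier 2 wrong
        (1, 1, 0): p_fr3 - leftover,   # only classifier 3 wrong
        (1, 1, 1): p111,
    }
    frr = sum(p for u, p in joint.items() if sum(u) <= 1)   # the Majority rule rejects these
    return joint, frr

print(optimal_majority_joint_h1(0.2, 0.3, 0.4)[1])   # 0.0  (Case 1, Eq. (3.104))
print(optimal_majority_joint_h1(0.5, 0.5, 0.5)[1])   # 0.25 (Case 2, Eq. (3.108))
```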

Here, at optimal conditional dependence on authentics, the Q value between a pair of classifiers is given by

Q_{j,k} = (N^{11} N^{00} - N^{01} N^{10}) / (N^{11} N^{00} + N^{01} N^{10})
        = ((P_{FR})(P_{FRi} - P_{FR}) - (P_{FRk} - P_{FR})(P_{FRj} - P_{FR})) / ((P_{FR})(P_{FRi} - P_{FR}) + (P_{FRk} - P_{FR})(P_{FRj} - P_{FR})),    i ≠ j, k    (3.109)

where P_{FR} denotes the Majority rule FRR of Eq. (3.108) and i is the index of the remaining classifier. Here, neither N^{11} nor N^{00} is zero for any pair of classifiers. Hence, a two-classifier Q statistic can be positive or negative depending on the values of P_{FR1}, P_{FR2}, P_{FR3}. Therefore, the Q statistic is not a suitable statistic for measuring the best statistical dependence for the MAJORITY rule. Kuncheva et al. [27] used the pairwise Q statistic as a measure to identify the optimal dependence for the Majority rule, so it is not surprising that they obtained ambiguous results; our approach is more powerful. From the joint conditional probabilities, it is to be noted that, in both Eq. (3.103) and Eq. (3.107), the probability that only one of the classifiers makes a correct decision is zero, and most of the probability is concentrated on the cases where two classifiers make correct decisions while the other makes an incorrect decision. Hence, a 3-classifier diversity measure is needed to quantify the optimal statistical dependence for the 3-classifier Majority rule.

The last three sections have investigated the important decision fusion rules, namely OR, AND, and MAJORITY, which were found to be important in Section 3.2. The selection of thresholds on the classifier scores makes a difference in the statistical dependence of the classifier decisions. Section 3.2 used an exhaustive search strategy to find the optimal threshold sets for the different 3-classifier decision fusion rules. This brute-force search for the optimal thresholds becomes computationally infeasible for a large number of classifiers. The next section investigates whether this search can be reduced.

3.6 Optimal ROC Fusion of Decision Fusion Rules

OVERVIEW: Obtaining optimal ROCs for decision fusion rules requires finding the optimal set of thresholds on the classifier scores. An exhaustive search for the best threshold set becomes

computationally infeasible for a large number of classifiers. Finding the best points on the individual ROCs that are used to obtain the optimal ROC for the AND rule has been shown for statistically independent classifiers in the literature. This approach is not valid for statistically dependent classifiers. A two-step approach for finding the best points on the individual ROCs for optimal AND/OR rule fusion of statistically dependent classifiers is presented here.

Each of the classifier scores is thresholded and the decisions are fused using a decision fusion rule such as AND, OR, etc. For a given value of the FAR of, say, the AND rule, there will be many FRR points corresponding to different sets of thresholds on the classifier scores. We need to find the optimal set of thresholds that gives the lowest FRR of the AND rule for a given FAR of the AND rule. The brute-force search for the best thresholds on the N classifier scores is a search for the minimum value of FRR in an N-dimensional space and is computationally very expensive. Zhang and Chen [116] present a method to find the optimal thresholds of independent classifiers for fusion with the AND rule. However, for statistically dependent classifiers, this method would not provide the optimal ROC. We present a method of finding the optimal thresholds on statistically dependent classifier scores for fusion with the AND rule for a special case in which the third- and higher-order correlations between the scores are zero. When the scores are Gaussian, decorrelating the scores makes them independent. Applying Zhang and Chen's [116] method to these statistically independent scores provides the optimal AND rule ROC, and this ROC is then optimal for the original statistically dependent scores too. The transform that decorrelates the scores provides a function relating the thresholds on the statistically dependent scores to the thresholds on the statistically independent scores: a threshold on an obtained statistically independent score is a function of all the thresholds on the statistically dependent scores. The procedure to find the optimal ROC for the AND decision rule with statistically dependent classifiers is described in detail in Appendix 7.1. The two-step process used to obtain the optimal ROC of the AND rule from the individual ROCs of statistically dependent classifiers is summarized in Table 3.5.

Table 3.5: Two-step procedure to obtain the optimal ROC for the AND rule from the individual ROCs of statistically dependent classifiers.

Step 1. The statistically dependent scores y are assumed to follow a Gaussian distribution. Statistically independent scores z are obtained by applying a linear transform S that decorrelates the statistically dependent scores y: z = Sy. A set of thresholds t' on z is related to the set of thresholds t on y by t' = St.

Step 2. The procedure described by Zhang and Chen [116] is used to obtain the optimal ROC for the AND rule from the individual ROCs of the statistically independent scores z.
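Step 1 of Table 3.5 is a standard decorrelation (whitening) operation. The sketch below illustrates it under the Gaussian-score assumption stated above; the particular transform choice (an eigen-decomposition of the score covariance) and all names are my own and are not the dissertation's implementation. It also shows how a threshold set t on the dependent scores maps to t' = St on the decorrelated scores.

```python
import numpy as np

def decorrelating_transform(scores):
    """Return S such that z = S y has (approximately) uncorrelated components.

    scores: (num_samples, N) matrix of statistically dependent classifier scores y.
    Under the Gaussian assumption of Table 3.5, decorrelation implies independence.
    """
    cov = np.cov(scores, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # cov = V diag(w) V^T
    return np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T

# Hypothetical correlated scores from two classifiers.
rng = np.random.default_rng(0)
y = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=5000)
S = decorrelating_transform(y)
z = y @ S.T                                            # decorrelated scores
print(np.corrcoef(z, rowvar=False).round(2))           # approximately the identity matrix

t = np.array([0.5, 0.7])                               # thresholds on the dependent scores
t_prime = S @ t                                        # coupled thresholds on z (t' = St)
```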

The OR rule is the complement of the AND rule. The procedure for ROC fusion for the OR rule is not presented here because of its similarity to the ROC fusion for the AND rule; more explanation is given in Appendix 7.1. Zhang and Chen [116] do not present a method for obtaining the optimal ROC of the OR rule from the statistically independent individual classifier ROCs. However, by replacing the AND rule FAR by the OR rule FRR, and the AND rule FRR by the OR rule FAR, the same method provides the optimal ROC for the OR rule from the statistically independent classifier ROCs. For statistically dependent classifiers, the first step proposed in our method creates statistically independent classifiers (whose thresholds are coupled). The optimal ROC for the OR rule can be found for these statistically independent classifiers by Zhang and Chen's [116] approach, and the optimal thresholds on these statistically independent classifiers can be related to the coupled optimal thresholds on the original statistically dependent classifiers.

Currently, we do not have a method (other than brute-force search) to obtain the optimal thresholds on the individual classifier ROCs for the optimal ROC of the Majority rule, even with independent classifiers. ROC curve fusion for the Majority and other decision fusion rules calls for further investigation; however, this topic is not a primary focus of this thesis.

3.7 Summary and Conclusions

The optimal score fusion error using the Bayes rule is difficult to obtain analytically, even for simple Gaussian distributions of the authentic and impostor scores. This is because of the complex quadratic decision boundaries that arise for unequal authentic and impostor covariances. In practice, the distribution of the scores may be more complex than Gaussian, and often in biometric applications there is insufficient data to even model the distributions. Decision fusion rules are easier to apply on

the scores. Finding the best decision fusion rule is a complex problem. For statistically dependent classifiers, the search space of 2^{2^N} possible decision rules cannot be reduced in general, but it can be reduced to the monotonic rules for a specific range of statistical correlation of the authentics and the impostors. Moreover, the value of the statistical correlation of the authentic and impostor scores can be useful in indicating the best monotonic fusion rule.

It was found in Section 3.2 that the OR, AND and MAJORITY decision rules are the major rules for three-classifier fusion; hence, we focus on analyzing these rules. There are optimal values of statistical dependence for each of these rules, which are found in Sections 3.3, 3.4 and 3.5 for two- and three-classifier fusion. Conversely, these rules are shown to be optimal at those values of statistical dependence in Section 3.2. This is useful in reducing the search for the best decision rule. For general N-classifier fusion, finding the optimal values of statistical dependence is difficult; however, favorable values of statistical dependence are found for the N-classifier OR/AND decision rules in Sections 3.3 and 3.4.

The original contributions in this chapter are as follows.

A theoretical method to obtain the optimal ROC of the AND/OR rules from the individual classifier ROCs and the authentic and impostor score covariances.

A proof showing that monotonic fusion rules are not necessarily optimal in general for statistically dependent classifiers, and a derivation of the regions of statistical dependence where monotonic fusion rules are optimal.

A theoretical analysis of the best/worst as well as favorable/unfavorable conditional dependence for the OR/AND rules for 2 classifiers.

A theoretical analysis of the favorable/unfavorable conditional dependence for the N-classifier OR/AND rules.

Simulation results showing that the best monotonic decision fusion rule is indicated by the second-order correlation coefficient between classifier scores.

The key contribution of this dissertation is the generation of such favorable classifier ensembles for these decision fusion rules, which has not been achieved in the current literature. In the next chapter,

we investigate some ideas for generating such favorable classifier ensembles for the OR, AND and MAJORITY rules.

CHAPTER 4

CLASSIFIER ENSEMBLE DESIGN FOR DIFFERENT RULES ON SIMULATED DATA

In the last chapter, the statistical dependences of the classifier ensemble that are favorable to the OR, AND and MAJORITY rules were investigated. In this chapter, the design of classifier ensembles that are favorable for each of these rules is investigated. This classifier ensemble design is the key idea of this thesis that differs from the ideas proposed in the literature. A generative strategy is used for designing classifier ensembles using different training data subsets: a data partitioning method is followed, but it is not a random partitioning method. The data distribution and the base classifier effectively decide the fusion rule, and the ensemble design strategy, that is best for that fusion rule. Typically, biometric data lies in multiple clusters or on different parametric curves (e.g., in-plane rotation of a face or iris image by 360 degrees can be represented by a closed curve in pixel space, while out-of-plane rotation lies on a different curve). The optimal Bayes decision boundary would be very complex, and the base classifiers (e.g., correlation filters, SVM, LDA, etc.) cannot offer such complex boundaries. When the base classifier is unable to fit all the clusters/parametric curves, the single-classifier accuracy is insufficient; for complex biometric data, no single classifier can exactly fit the data. This inadequacy of the single classifier can be compensated by using multiple classifier fusion. This idea is illustrated using simulated data clusters, assuming that the base classifiers are limited to linear classifiers, and designing the classifier ensemble for a given fusion rule to minimize error.

It perhaps may be thought that score fusion has more flexibility than decision fusion. However,

the fact is that once the classifier ensemble is designed, the scores are fixed, and the diversity in the scores will affect the fusion, including the best fusion method and the accuracy. If the scores are all similar, no fusion strategy on the scores will help in improving accuracy. Since the OR, AND, and MAJORITY decision rules are similar to quantized versions of the MAX, MIN and AVERAGE score fusion rules, respectively, the classifier design strategies discussed in this chapter are similar to the classifier design for the MAX, MIN and AVERAGE score fusion rules.

Some sample two-dimensional data distributions are presented in Figures 4.1 to 4.4. These distributions of authentics and impostors are mixtures of Gaussian distributions. For each of these distributions, the optimal Bayes decision boundary is either quadratic or more complex. For purposes of illustration, we assume the base classifier to be linear (to reflect the difference in the type of decision boundary between the optimal Bayes rule and those of the base classifiers used on real data). In Sections 4.1 to 4.3, each of these data distributions is used to illustrate the classifier ensemble design procedure for each of the three major decision fusion rules, viz., OR, AND, and MAJORITY. It should be noted that we do not need the real data distributions to be Gaussians or mixtures of Gaussians: if they are in the form of clusters (each cluster need not have a Gaussian distribution), then the design procedure provided here is useful. It should also be noted that these sample data distributions are merely used to illustrate the design procedure, since it is easy to visualize the decision boundaries of each classifier and of the decision fusion rule in 2D. A sketch of how such clustered 2D data can be simulated is given after Figures 4.1 to 4.4.

4.1 Ensemble design approach for the OR rule

In this section, we design ensembles for OR rule fusion for each of the four data distributions in Figures 4.1 to 4.4. If the authentic data is in clusters, the ensemble design principle for the OR rule is to design each classifier to separate one authentic cluster from the entire set of impostors. One of the challenges in applying this to real data is being able to identify these authentic clusters and to separate the images/features of one cluster from the others. Each classifier would then have a large false rejection rate but a very low false acceptance rate. The authentic decision region for the OR rule is the union of the authentic decision regions of all the classifiers, which lowers the false rejection rate drastically compared with the individual classifiers. The impostor decision region for the OR rule is the intersection of the impostor decision regions of all the classifiers, and if this covers most of the impostors for a well-designed ensemble, then there will still be a low false acceptance rate.

Figure 4.1: Data Distribution 1. (Scatter plot of authentic and impostor samples over data dimensions 1 and 2.)

Figure 4.2: Data Distribution 2.

Figure 4.3: Data Distribution 3.

Figure 4.4: Data Distribution 4.
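As noted above, each sample distribution is a mixture of Gaussian clusters for the authentics and impostors. The sketch below is purely illustrative (the cluster means, spreads and sample counts are hypothetical stand-ins, not the dissertation's actual parameters); it generates 2D data of this kind, in the spirit of Data Distribution 1 with three authentic clusters surrounded by a single broad impostor cluster.

```python
import numpy as np

def sample_gaussian_mixture(means, cov, samples_per_cluster, rng):
    """Draw equal-sized samples from Gaussian clusters centered at `means`."""
    return np.vstack([rng.multivariate_normal(m, cov, samples_per_cluster) for m in means])

rng = np.random.default_rng(1)
cov = 0.1 * np.eye(2)

# Hypothetical layout: three authentic clusters, one broad impostor cluster around them.
authentic_means = [(0.0, 2.0), (2.0, 0.0), (0.0, -2.0)]
authentics = sample_gaussian_mixture(authentic_means, cov, 200, rng)
impostors = rng.multivariate_normal((0.0, 0.0), 0.5 * np.eye(2), 600)

X = np.vstack([authentics, impostors])
y = np.concatenate([np.ones(len(authentics)), np.zeros(len(impostors))])  # 1 = authentic
```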

In each of Subsections 4.1.1 to 4.1.4, we illustrate this simple principle of ensemble design for the OR rule and compare the errors with the optimal Bayes rule error.

4.1.1 OR rule ensemble design for data distribution 1

For the sample data distribution in Fig. 4.1, no single linear classifier can effectively separate the authentics from the impostors. For illustration purposes, the authentic data is in 3 clusters and the impostor data is in a single large cluster. It is to be noted that real data need not be Gaussian or a mixture of Gaussians for this ensemble design strategy to work. The optimal Bayes decision boundary for the given sample data is complex, and hence linear classifiers would not be accurate. However, if each linear classifier, as shown in Fig. 4.5, can effectively separate one cluster of authentics from most of the impostors, then OR rule fusion of all the classifier decisions can do a good job of authentic/impostor separation.

It can be observed from Figure 4.5 that each pair of linear classifiers makes different decisions on most of the authentic data and similar decisions on most of the impostor data. Let 1 represent an authentic decision and 0 an impostor decision. If the pink and green linear classifiers are considered, they make decisions of (1,0) on the top-most authentic cluster, (0,1) on the right-most authentic cluster, and (0,0) on the bottom-most authentic cluster. Hence, for about two-thirds of the authentic data, they make different decisions. The pink and green linear classifiers make decisions of (0,0) on almost all of the impostor data; hence, they mostly agree (correctly) on the impostor data. This classifier pair therefore has a favorable correlation coefficient or Q value between decisions for OR rule fusion. All other classifier pairs of this ensemble have a similar disagreement on most authentic data and agreement on most impostor data. Hence, this classifier ensemble is suitable for fusion with the OR rule. The OR rule fusion has an FAR of 4.7% and an FRR of 4.3%.

The AND rule and MAJORITY rule ensemble designs on this data distribution with linear base classifiers are discussed in Sections 4.2 and 4.3, respectively. While details can be obtained from those sections, it is found that the AND rule ensemble design would be poor and the MAJORITY rule ensemble design would not have the optimal MAJORITY rule diversity, leading to a higher error rate

than the OR rule ensemble. In other words, the data distribution and the base classifier choice play a major role in choosing the best fusion rule and the corresponding ensemble design strategy.

Figure 4.5: Optimum ensemble for OR rule fusion on data distribution 1. The pink, green and blue lines are the linear classifier decision boundaries. The dashed line is the OR rule decision boundary.

4.1.2 OR rule ensemble design for data distribution 2

For this distribution, we cannot design a classifier ensemble suitable for OR rule fusion using linear classifiers. This is because the authentic data is in one cluster, while the impostor data is in multiple clusters. For the OR rule, each linear classifier should be able to separate most of the impostors while separating one cluster of authentics; such linear classifiers cannot be designed for this distribution.

4.1.3 OR rule ensemble design for data distribution 3

In Figure 4.3, we see that the data has 3 authentic clusters and 3 impostor clusters. The impostor clusters are situated in between the authentic clusters. In Figure 4.6, we can see that for this data distribution it is possible to design each linear classifier to effectively separate one authentic cluster

from the rest of the impostors. The decision boundaries of each classifier are shown in Figure 4.6. The marked area shows the authentic decision region for the OR rule, and the unmarked area is the impostor decision region for the OR rule. We can see that this classifier ensemble design, followed by fusion with the OR rule, has a low error and can effectively separate the authentics from the impostors. Following the design principle for the OR rule ensemble, a linear classifier can separate each authentic cluster from all the impostor clusters in this example.

From Figure 4.6, it is observed that the three linear classifiers make different decisions on authentic data and similar (correct) decisions on impostor data. The pink and green classifiers make decisions of (1,0) on most authentic data in the top-most authentic cluster, (0,1) on most authentic data of the bottom-left authentic cluster, and (0,0) on the authentic data of the bottom-right authentic cluster. Hence, for about two-thirds of the authentic data, they disagree. For most of the impostor data in all clusters, they make the same correct decision of (0,0). Since these classifiers disagree more on authentic data and agree more on impostor data, this pair of classifiers is favorable for fusion with the OR decision rule. Similarly, the other pairs of classifiers also disagree more on authentic data and agree on impostor data. Due to this property, this linear classifier ensemble is favorable for fusion with the OR rule. The OR rule fusion has an FAR of 6.6 ± 1.1% and an FRR of 6.6 ± 1.1%.

We shall see later that for this data distribution, the ensemble design for the Majority rule followed by fusion with the Majority rule has similar effectiveness in separating the authentics from the impostors. This is an example showing that, for some data distributions, ensemble design for several rules (in this case, the Majority rule and the OR rule), each followed by fusion with the corresponding rule, can have similar accuracy.

4.1.4 OR rule ensemble design for data distribution 4

In Figure 4.4, we see that the data has 3 authentic clusters and 6 impostor clusters. Three impostor clusters are situated in between the authentic clusters, and the other three impostor clusters are situated outside the authentic clusters. In Figure 4.7, we can see that for this data distribution it is not possible to design each linear classifier to effectively separate one authentic cluster from the rest of the impostors. Each linear classifier can separate one authentic cluster from the inner three

Figure 4.6: Design of multiple linear classifiers for OR rule fusion on data distribution 3. The marked area denotes the authentic decision region for the OR rule.

impostor clusters. The decision boundaries of each classifier are shown in Figure 4.7. The marked area shows the authentic decision region for the OR rule, and the unmarked area is the impostor decision region for the OR rule. We can see that this classifier ensemble design followed by fusion with the OR rule has been able to effectively separate the inner three impostor clusters from the authentics, but is unable to separate the outer three impostor clusters from the authentics.

From Figure 4.7, and the analysis in Section 4.1.3, the three classifier pairs disagree on most of the authentic data; in other words, they have a negative Q/correlation coefficient on authentic data. On the impostor data, the pink and green classifiers disagree on most of the impostor data in the left-most impostor cluster, disagree on about half of the right-most impostor cluster, agree (with a decision of (0,0)) on most of the impostor data in the bottom-most impostor cluster, and agree (with a decision of (0,0)) on all three of the inner set of impostor clusters. On the whole, they agree on more than half of the impostor data. The other classifier pairs have similar diversity to this classifier pair. For favorable fusion with the OR rule, the classifier pairs should agree on most of the impostor decisions. Since the three pairs of classifiers disagree on most of the impostor data in the outer impostor cluster set, they make errors on this outer impostor cluster set on OR rule fusion. Hence, this classifier ensemble has a large FAR with a small FRR.
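The design principle used in Subsections 4.1.1 to 4.1.4 (one linear classifier per authentic cluster, each trained against the entire impostor set, with the decisions OR-fused) can be sketched in a few lines. The sketch below is only an illustration of that principle with off-the-shelf components; k-means to identify the authentic clusters and a linear SVM as the linear base classifier are my own choices, not the dissertation's experimental setup.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def train_or_ensemble(X_auth, X_imp, n_clusters):
    """One linear classifier per authentic cluster vs. all impostors (OR-rule design)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X_auth)
    ensemble = []
    for c in range(n_clusters):
        X = np.vstack([X_auth[labels == c], X_imp])
        y = np.concatenate([np.ones(np.sum(labels == c)), np.zeros(len(X_imp))])
        ensemble.append(LinearSVC(C=1.0, max_iter=10000).fit(X, y))
    return ensemble

def or_fuse(ensemble, X):
    """Declare authentic if any classifier in the ensemble declares authentic."""
    votes = np.column_stack([clf.predict(X) for clf in ensemble])
    return votes.max(axis=1)

# Usage with the simulated authentics/impostors generated earlier:
# ensemble = train_or_ensemble(authentics, impostors, n_clusters=3)
# decisions = or_fuse(ensemble, X)   # 1 = authentic, 0 = impostor
```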

We shall see later that for this data distribution, the ensemble design for the Majority rule followed by fusion with the Majority rule is more effective in separating the authentics from the impostors and has a lower error rate. This is an example showing that, for some data distributions, ensemble design for some rules (in this case, the Majority rule) is more effective than ensemble design for other rules (in this case, the OR and AND rules).

Figure 4.7: Design of multiple linear classifiers for OR rule fusion on data distribution 4. The marked area denotes the authentic decision region for the OR rule.

4.2 Ensemble design approach for the AND rule

In this section, we design ensembles for AND rule fusion for each of the four data distributions in Figures 4.1 to 4.4. The AND rule is the complement of the OR rule, and hence the ensemble design strategy for the AND rule is the complement of the ensemble design strategy for the OR rule. If the impostor data is in clusters, the ensemble design principle for the AND rule is to design each classifier to separate one impostor cluster from the entire set of authentics. Each classifier would then have a large false acceptance rate (FAR) but a very low false rejection rate (FRR). The authentic decision region for the AND rule is the intersection of the authentic decision regions of all the classifiers, and the impostor decision region for the AND rule is the union of the impostor decision regions of all the classifiers. For well-designed ensembles, the authentic decision region

for the AND rule covers most of the authentics and the impostor decision region for the AND rule covers most of the impostors. In each of Subsections 4.2.1 to 4.2.4, we illustrate this simple principle of ensemble design for the AND rule and compare the errors with the optimal Bayes rule error.

4.2.1 AND rule ensemble design for data distribution 1

For this distribution, shown in Fig. 4.1, we cannot design a classifier ensemble suitable for AND rule fusion using linear classifiers. This is because the impostor data is in one cluster, while the authentic data is in multiple clusters. For the AND rule, each linear classifier should be able to separate most of the authentics while separating one cluster of impostors; such linear classifiers cannot be designed for this distribution.

4.2.2 AND rule ensemble design for data distribution 2

From Figure 4.2, we see that the data has 3 impostor clusters and a single authentic cluster. The impostor clusters are situated around the authentic distribution. For optimal ensemble design for the AND rule, each classifier should effectively separate one impostor cluster from the authentics. From Figure 4.8, we can see that for this data distribution it is possible to design three linear classifiers to effectively separate each of the three impostor clusters from the authentics. The decision boundaries of each classifier are shown in Figure 4.8. The marked area shows the impostor decision region for the AND rule, and the unmarked area is the authentic decision region for the AND rule.

For this 2D data distribution, with the specific alignment of impostor clusters, two linear classifiers designed to separate the two extreme impostor clusters are sufficient to produce an effective separation between authentics and impostors on AND fusion. However, in general, for multi-dimensional data, this may not be possible. If each classifier can effectively separate one impostor cluster from all the authentic clusters, the desirable diversity for the AND rule is obtained. The pink and green linear classifiers that separate the extreme impostor clusters from the authentic cluster have different decisions on these impostor clusters and the same decisions on the authentic cluster. The pink and blue classifiers make decisions of (0,0) on most impostor data in the top-left impostor cluster, (0,1) on most impostor data in the bottom-left impostor cluster,

and (1,1) on most impostor data in the top-right impostor cluster. AND fusion of just the pink and blue classifiers would improve the accuracy on impostors. However, they do not have favorable diversity on the impostor data, since their decisions agree on most of the impostor data. They do have favorable diversity on the authentic data decisions, since they both decide (1,1) on most of the authentic data. The same holds between the green and blue classifiers. Thus, all pairs of classifiers have favorable diversity on the authentic data, and one of the three classifier pairs (pink and green) has favorable diversity on the impostor data. The AND rule fusion with the three linear classifiers has an FAR of 3.8 ± 0.8% and an FRR of 2.1 ± 0.5%.

We shall see later that this classifier ensemble design followed by fusion with the AND rule has an error rate similar to that of ensemble design for the Majority rule followed by Majority fusion (Section 4.3). This is an example showing that, for some data distributions, there are multiple decision rules for which ensemble design has maximum accuracy (in this case, the AND rule and the MAJORITY rule).

Figure 4.8: Design of multiple linear classifiers for AND rule fusion on data distribution 2. The marked area denotes the impostor decision region for the AND rule.
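The complementary AND-rule design (one linear classifier per impostor cluster, each trained against the entire authentic set, with the decisions AND-fused) can be sketched in the same way as the OR-rule example given earlier. Again, the clustering step and the linear base classifier are my own illustrative choices, not the dissertation's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def train_and_ensemble(X_auth, X_imp, n_clusters):
    """One linear classifier per impostor cluster vs. all authentics (AND-rule design)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X_imp)
    ensemble = []
    for c in range(n_clusters):
        X = np.vstack([X_auth, X_imp[labels == c]])
        y = np.concatenate([np.ones(len(X_auth)), np.zeros(np.sum(labels == c))])
        ensemble.append(LinearSVC(C=1.0, max_iter=10000).fit(X, y))
    return ensemble

def and_fuse(ensemble, X):
    """Declare authentic only if every classifier in the ensemble declares authentic."""
    votes = np.column_stack([clf.predict(X) for clf in ensemble])
    return votes.min(axis=1)
```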

125 4.2.3 AND rule ensemble design for data distribution 3 For this distribution in Fig. 4.3, we cannot design a classifier ensemble suitable for AND rule fusion using linear classifiers. There are three authentic clusters and three impostor clusters. However, since the authentic clusters are outside the impostor clusters, it is not possible to design a linear classifier to separate an impostor cluster from all the authentics. Hence a classifier ensemble suited for the AND rule is not possible for this distribution AND rule ensemble design for data distribution 4 In Figure 4.4, we see that the data has 3 authentic clusters and 6 impostor clusters. Three impostor clusters are situated in between the authentic clusters and the other three impostor clusters are situated outside the authentic clusters. In Figure 4.9, we can see that for this data distribution, it is possible to design a linear classifier to effectively separate one outer impostor cluster from the rest of the authentics. The decision boundaries of these three classifiers are shown in Figure 4.9. However, it is not possible to design a linear classifier to effectively separate one inner impostor cluster from the authentics. AND fusion using the three linear classifiers that separate one of the outer impostor clusters results in the impostor decision region for the AND rule shown by the marked area in the figure. The unmarked area is the authentic decision region for the AND rule. From Figure 4.9, it is observed that the pink and green classifiers make decisions of (0,1) on the left-most impostor cluster, (1,0) on the right-most impostor cluster, (1,1) on the bottom-most impostor cluster and (1,1) on all the three inner set of impostor clusters. Since they agree on four out of the six impostor clusters, they will have a positive Q/correlation coefficient on impostors. This is unfavorable for AND rule fusion. The pink and green classifiers decide (1,1) on all the three authentic clusters, which leads to a positive Q/correlation coefficient on authentics. This is favorable to the AND rule fusion. The other classifier pairs also have similar diversity as the pink and green classifier pair. For this classifier ensemble, there is unfavorable Q/correlation coefficient between classifier pairs for the AND rule. Since the three inner set of impostor clusters are incorrectly classified, there is a large error on impostors. We shall see later that for this data distribution, the ensemble design for the Majority rule followed by fusion with the Majority rule is more effective in separating the authentics from impostors 108

126 and has a lower error rate. This is an example that shows that for some data distributions, only for one decision fusion rule, the ensemble design is more effective. In this case, ensemble design for the Majority rule followed by fusion with the Majority rule outperforms ensemble design for other rules. Figure 4.9: Design of multiple linear classifiers for AND rule fusion on data distribution 4. The marked area denotes the impostor decision region for the AND rule. 4.3 Ensemble design approach for the Majority rule In this section, we design ensembles for Majority rule fusion for each of the four data distributions in Figures 4.1 to 4.4. The optimal ensemble design strategy for the Majority rule is different and more complex than the ensemble design strategies for the AND, OR rules. The ensemble design strategy for the Majority rule is optimal when both the authentic and impostor data are in clusters. For optimal ensemble design for the majority rule, each classifier should separate (N a + 1)/2 authentic clusters from (N i + 1)/2 impostor clusters, where N a is the number of authentic clusters and N i is the number of impostor clusters. The number of classifiers in the ensemble would be N = min(n a, N i ). This ensemble design procedure would result in negative conditional dependence on both authentics and impostors, which is optimal for the Majority rule. If N a < N i, then each authentic cluster is declared authentic by (N a +1)/2 classifiers, resulting in a authentic 109

127 decision for the Majority rule that covers all the authentic clusters. Hence, the majority rule fusion would result in a low fasle reject rate. In each of the Subsections to 4.3.4, we illustrate this general principle of ensemble design for the Majority rule and compare the errors with the optimal Bayes rule error Majority rule ensemble design for data distribution 1 From Figure 4.1, we see that the data has 3 authentic clusters and a single impostor cluster. The authentic clusters are situated around the impostor distribution. For optimal ensemble design for the Majority rule, each classifier should separate two (out of the three) authentic clusters from (at least 2/3rds of) the impostors. The decision boundaries of each classifier for the Majority rule ensemble are shown in Figure The marked area shows the authentic decision region for the Majority rule and the unmarked area is the impostor decision region for the Majority rule. From Figure 4.10, we can see that for this data distribution, it is possible to design two linear classifiers ( pink and cyan ) to effectively separate two authentic clusters from the impostors. These two classifiers have the same decision (of authentic) on one authentic cluster centered at (6,0), and different decisions on the other two authentic clusters. Hence they have a negative pairwise classifier Q/correlation coefficient on authentics. The third linear classifier (green) that separates the extreme authentic clusters has a very poor separation from the impostors, much less than the optimal 2/3rds correct classification on the impostor distribution. This classifier makes the same decision for two of the three authentic clusters with each the other two classifiers. The pink and green classifiers make decisions of (1,1) on the top-most and right-most authentic clusters and (0,1) on the bottom-most authentic cluster. They agree more (2/3rd) on the authentic data. Hence there is a positive pairwise classifier Q/correlation coefficient on authentics between this pair of classifiers. On the other hand, they disagree more on the impostor data, which would lead to a negative Q/correlation coefficient on impostors. Similar diversity is present between the green and blue classifier pairs. A negative pairwise classifier Q /correlation coefficient between all pairs of classifiers on both authentic and impostor data is favorable for the Majority rule. Hence, this classifier ensemble will not have optimal statistical dependence between classifier decisions for the Majority rule. The Majority rule fusion has an FAR of 40.7% and an FRR of 4.85%. 110

128 From the classification boundaries in Figures 4.10 and 4.5, it is observed that the Majority rule classifier ensemble design followed by fusion with the Majority rule has higher error than ensemble design for the OR rule followed by OR fusion in Subsection This is an example that shows that for some data distributions, there is one optimal decision rule for which ensemble design has maximum accuracy. Figure 4.10: Design of multiple linear classifiers for Majority rule fusion on data distribution 1. The marked area denotes the authentic decision region for the Majority rule Majority rule ensemble design for data distribution 2 From Figure 4.2, we see that the data has 3 impostor clusters and a single authentic cluster. The impostor clusters are situated around the authentic distribution. For good ensemble design for the Majority rule, each classifier should separate two (out of the three) impostor clusters from (at least 111

129 2/3rds of) the authentics. From Figure 4.11, we can see that for this data distribution, it is possible to design two linear classifiers to effectively separate two impostor clusters from the authentics. The third linear classifier that separates the extreme impostor clusters has very poor separation from the authentics, much less than the optimal 2/3rds correct classification on the authentic distribution. This classifier ensemble will not have optimal statistical dependence between classifier decisions for the Majority rule. However, it is still the optimal ensemble design with linear classifiers for the Majority rule on this data distribution. The decision boundaries of each classifier are shown in Figure The marked area shows the impostor decision region for the Majority rule and the unmarked area is the authentic decision region for the Majority rule. The pink linear classifier separates the top left and the top right impostor clusters from the authentic cluster. The green linear classifier separates the top left and the bottom left impostor clusters from the authentic cluster. These two classifiers have different decisions on the top right and bottom left impostor clusters and same (correct) decisions on the authentic cluster. Due to the disagreement on most impostor data, they have a negative Q value/ correlation coefficient on impostor decisions. This is favorable for the Majority rule fusion. However the agreement on most authentic data is not favorable to the Majority fusion rule. The blue classifier separates the two extreme impostor clusters from the authentic data. The performance of this classifier is poor on the authentic data. The pink and blue classifiers make decisions of (0,0) on most impostor data in the top right and top left impostor cluster, (1,0) on most impostor data in the bottom left impostor cluster. In other words, they agree more on the impostor data. This leads to a positive Q/correlation coefficient value on impostors, which is not favorable for the Majority rule. On the authentic data, they disagree more since the blue classifier makes an incorrect decision on most authentic data while the pink classifier makes a correct decision on most authentic data. This leads to a negative Q/correlation coefficient on the authentics, which is favorable for the Majority rule. The green and blue classifiers have similar diversity as the pink and blue classifiers. Although the classifier ensemble does not have favorable diversity for the Majority rule, the ensemble design guidelines lead to a good decision boundary. The Majority rule fusion with the three linear classifiers has an FAR of 5.5 ± 1.0% and an FRR of 2.0 ± 0.5%, which is comparable to the AND rule fusion error rates. We can see that this classifier ensemble design followed by fusion with the Majority rule has 112

130 similar error to the ensemble design for the AND rule followed by AND fusion in Subsection This is an example that shows that for some data distributions, there is multiple decision rules for which ensemble design has maximum accuracy. Figure 4.11: Design of multiple linear classifiers for Majority rule fusion on data distribution 2. The marked area denotes the impostor decision region for the Majority rule Majority rule ensemble design for data distribution 3 From Figure 4.3, we see that the data has 3 authentic clusters and 3 impostor clusters. The impostor clusters are situated in between the authentic clusters. For optimal ensemble design for the Majority rule, each classifier should separate two (out of the three) authentic clusters from two (out of the three) impostor clusters. From Figure 4.12, we can see that for this data distribution, it is possible to design each linear classifier to effectively separate two authentic clusters from the two impostor clusters. The decision boundaries of each classifier are shown in Figure The marked area shows the authentic decision region for the Majority rule and the unmarked area is the impostor decision region for the Majority rule. From Figure 4.12, it can be observed that all pairs of classifiers disagree more on authentic as well as impostor data. The pink and green linear classifiers make decisions of (1,0) on the topmost authentic cluster, (1,1) on the bottom right authentic cluster and (0,1) on the bottom left 113

131 authentic cluster, respectively. Hence, they disagree on most (about 2/3rds) of the authentic data. This pair of classifiers makes decisions of (0,0) on the top left impostor cluster, (1,0) on the top right impostor cluster and (0,1) on the bottom-most impostor cluster. Thus they disagree more (on 2/3rds) of the impostor data too. This leads to a negative Q/correlation coefficient on authentics and impostors, which is favorable to the Majority rule. The same diversity can be seen between other pairs of classifiers in this ensemble. Hence, this ensemble is favorable to the Majority rule. It is difficult to achieve a negative Q/correlation coefficient on both authentics and impostors. Only for some data distribution and base classifier combination, it is possible to achieve this. The Majority rule fusion has an FAR of 8.2 ± 1.2% and an FRR of 8.5 ± 1.2%. We can see that this classifier ensemble design followed by fusion with the Majority rule has a low error rate and can effectively separate the authentics from the impostors. For this data distribution, this ensemble design (followed by Majority fusion) has similar accuracy (overlapping 95% confidence intervals) to the ensemble design for the OR rule (followed by OR fusion) in Subsection The OR rule ensemble has an FAR of 6.6 ±1.1% and FRR of 6.6 ±1.1%. This is an example that shows that for some data distributions, ensemble design for some rules (in this case, the majority rule and the OR rule) followed by fusion by the corresponding rules can have similar accuracy Majority rule ensemble design for data distribution 4 In Figure 4.4, we see that the data has 3 authentic clusters and 6 impostor clusters. Three impostor clusters are situated in between the authentic clusters and the other three impostor clusters are situated outside the authentic clusters. For optimal ensemble design for the majority rule, each classifier should separate (N a + 1)/2 authentic clusters from (N i + 1)/2 impostor clusters, where N a is the number of authentic clusters and N i is the number of impostor clusters. Here, we have N a = 3 and N i = 6. From Figure 4.13, we can see that for this data distribution, it is possible to design each linear classifier to effectively separate (3 + 1)/2 = 2 authentic clusters from (6 + 1)/2 = 4 impostor clusters. The decision boundaries of each of the three linear classifiers are shown in Figure The marked area shows the authentic decision region for the Majority rule and the unmarked area is the impostor decision region for the Majority rule. 114

132 Figure 4.12: Design of multiple linear classifiers for Majority rule fusion on data distribution 3. The marked area denotes the authentic decision region for the Majority rule. The pink classifier separates the top-most authentic cluster and the bottom right authentic cluster from four out of the six impostor clusters. The green classifier separates the bottom left and the bottom right authentic clusters from a different four impostor clusters. The blue classifier separates the top-most and bottom left authentic clusters from a yet another set of four impostor clusters. From the analysis in Section 4.3.3, it is known that all pairs of classifiers disagree more on authentics, resulting in a negative Q/correlation coefficient on authentics. This is favorable to the Majority rule. On the impostor data, the pink and green classifiers make decisions of (0,0) on most impostor data of the left-most impostor cluster, (0,0) on the top left of the inner set of impostor clusters, (1,0) on the top right of the inner set of impostor clusters, (1,0) on the right-most impostor cluster, (0,1) on the bottom cluster of the inner set of impostor clusters and (0,1) on the bottom-most impostor cluster. Thus, on most of the 4 of the 6 impostor clusters, the pink and green classifiers disagree. This leads to a negative Q/Correlation coefficient on the impostor data for this classifier pair. All other pairs of classifiers also disagree on most of the impostor data. This ensemble has a negative Q/correlation coefficient on the authentic and impostor data, which is favorable for Majority rule fusion. We can see that this classifier ensemble design followed by fusion with the Majority rule has 115

133 been able to effectively separate the authentics from the most of the impostors. For this data distribution, this ensemble design strategy followed by fusion by the Majority rule is more accurate than ensemble design for the OR rule followed by fusion with the OR rule in Subsection This is an example that shows that for some data distributions, ensemble design for some rules (in this case, the majority rule) are more effective than ensemble design for other rules (in this case, the OR rule). Figure 4.13: Design of multiple linear classifiers for Majority rule fusion on data distribution 4. The marked area denotes the authentic decision region for the Majority rule. 4.4 Conclusions In this chapter, we have provided general principles for ensemble optimal design for the three major decision fusion rules: AND, OR, MAJORITY. Since the decision rules OR, AND, and MA- JORITY decision rules are similar to quantized versions of MAX, MIN and AVERAGE score fusion rules, respectively, the classifier design strategies discussed in this chapter are similar to the classifier design for MAX, MIN and AVERAGE score fusion rules. The classifier ensemble design is more important than the fusion method (score fusion or decision fusion), since once the classifiers are designed, the statistical dependence between scores is fixed. We have illustrated the ensemble design procedure with linear classifiers on simulated 2D data distributions where the authentic and impostor data are in multiple clusters. The 2D sample data 116

134 distributions are just to showcase our ideas for optimal ensemble design for different decision fusion rules. For each of the decision fusion rules, the optimal ensemble design procedure is different. For the same data distributions and the same type of base classifiers, in this case, linear classifiers, any one of the following scenarios has been shown possible in this chapter even with the limited examples of possible data distributions. It is possible that there is similar accuracy for optimal ensembles for two different decision rules followed by fusion with the corresponding rule. This is likely between Majority and the OR rule, or the Majority and the AND rule; but unlikely between OR and AND rules. It is possible that there exists an optimal ensemble design for one rule but an optimal ensemble design for another rule does not exist (especially between the complementary OR, AND rules). It is possible for there is only one (optimum) decision fusion rule whose optimal ensemble design (followed by fusion with the corresponding rule) has the best accuracy. The correlation coefficients or Q values between the classifier decisions are affected by the data distribution and the ensemble design guidelines. In other words, obtaining the favorable Q/correlation coefficient values for a decision fusion rule depends on the data distribution. This has been observed for the different data distributions and the ensemble design for the OR, AND and Majority fusion rules illustrated in this chapter. The conclusions that are made in this chapter are not restricted to linear base classifiers. If the base classifiers are non-linear, it is effectively increasing the data dimensionality. The data distribution in the higher dimensional space (corresponding to the non-linear classifiers) will have a different set of authentic and impostor clusters than the original space. However, all the ensemble design ideas and conclusions hold in this higher dimensional space too. In other words, the ensemble design guidelines are the same for any type of base classifier, either linear or non-linear. While we have provided ensemble generation strategies favorable to the given decision rule, the feasibility of ensemble design and ensemble fusion accuracy depends on the data distribution and the base classifier chosen. In the next chapter, we apply these design techniques to fingerprint and face databases. The challenge will be in identifying the clusters (of authentics and impostors), because it 117

135 is difficult to recognize these cluster configurations in the high-dimensional image space. The other challenge will be type of decision boundaries of different base classifiers; more specifically, how the base classifiers separate the authentic and impostor clusters. 118

136 CHAPTER 5 ENSEMBLE DESIGN FOR DECISION FUSION RULES ON BIOMETRIC DATA Most of the literature is on methods of improving fusion for a given set of classifiers. Not much attention is given to the classifier ensemble design for a given fusion rule. This may be more important because if the classifier ensemble has poor diversity, no fusion method can achieve significant improvement. Most of the ensemble design methods such as Bagging [8], Random Subspaces [89], etc. aim to produce independent classifiers. However, in practice, these have positive statistical dependence and are not statistically independent [10]. Even Adaboost [9] and other boosting methods [79] aimed at producing diverse classifiers have also been shown to exhibit positive statistical dependence [10]. Classifiers with positive statistical dependence produce similar decisions on a large subset of the test inputs. Classifiers that make different decisions on large subsets of the test inputs offer more potential for improving the accuracy on fusion. There have been some attempts at obtaining diverse classifier ensembles, but these have had limited success. One approach proposed in literature is to overproduce the classifiers, and pick those that are best according to a diversity measure. Giacinto and Roli [63] use the Double Fault measure [34], Margineantu and Dietterich [64] use the measure of Interrater Agreement κ [34], Banfield et al. [66] define and use the ensemble diversity measure (reviewed in Chapter 2) to select the classifiers. However, the set of classifiers used in these approaches is only a subset of possible classifiers and hence does not guarantee that the best ensemble is obtained at the end of the selection process. Further, the diversity measures do not provide complete information of the ensemble joint 119

137 probability, and therefore do not imply the best selection strategy. Some generative strategies for classifier ensembles are proposed in [7]. As explained in Chapter 2, Tumer and Ghosh [7] partition data randomly into k subsets, and train each classifier on different k 1 subsets. They found that this was more promising than feature partitioning or spatial partitioning. For example, the individual RBF classifier had an error rate of 6.79%±0.41%std, while a seven classifier ensemble generated by data partitioning had an error rate of 5.97%±0.22%std on fusion with the average score fusion rule (which was found to be the best fusion rule) for one of the databases (oceanic database) used in [7]. However, it is not a large improvement and the data partitioning is still random. Our generative approach is not random; the data partitioning is done according to the decision fusion rule for which the classifier ensemble is designed for. The ensemble design strategies tuned to major decision fusion rules (OR, AND, MAJORITY) that were presented in the last chapter are applied to real databases in this chapter. Examples of these design strategies are shown on the NIST 24 fingerprint database [111], the CMU PIE face database [117], and the AR face database [112]. 5.1 Classifier ensemble design for the OR rule OVERVIEW: Classifier ensembles that are favorable to the OR rule are generated on the PIE face database, NIST 24 fingerprint plastic distortion and rotation databases. The proposed classifier ensembles are compared to ensembles generated by Bagging and Boosting, wherever possible. The accuracy of the proposed ensembles is significantly greater than Bagging and Boosting accuracy. The conditional dependence of the proposed ensembles is favorable to the OR rule for all three databases, validating the ensemble generation approach for the OR rule. From the analysis for the conditionally-dependent classifiers for the OR rule in Section 3.3, Q statistics between pairs of classifiers should be positive for impostors and negative for authentics for favorable dependence. Based on the illustration of OR rule ensemble design in Chapter 4, e.g., Figure 4.5, we hope to make the overall false rejection rate (FRR) for the OR rule smaller, by designing the classifiers such that each classifier correctly classifies a different, preferably disjoint set of authentic images. Further, based on Figure 4.5, we hope to make the overall false acceptance rate (FAR) for the OR rule smaller by designing the classifiers such that both classifiers correctly 120

138 reject similar, preferably same, sets of impostors. We choose optimal tradeoff correlation filters [114] as classifiers because these filters can reject impostors well. But their FRR may not be low. By training the correlation filter based classifiers on different sets of authentics, we hope to make the FRR low after the OR rule fusion. The NIST 24 [111] plastic distortion dataset of fingerprints and the CMU PIE [117] pose and illumination subsets of faces are chosen for performance evaluation PIE database evaluation OVERVIEW: The Unconstrained Minimum Average Correlation Energy (UMACE) filter is used as the base classifier for classifier ensembles generated on the PIE face database. The UMACE filters are tolerant to illumination changes. Based on this fact, the proposed OR rule ensemble is composed of 13 UMACE filters per person, one UMACE filter for each of the 13 poses. Each UMACE filter is trained on extreme illumination images of a pose. The decisions of these 13 correlation filters on a test face image are fused with the OR rule. The Equal Error Rate (EER) of this proposed ensemble is 0.75% and is an order of magnitude lower than the EERs of Bagging and Boosting. Bagging and Boosting on the same training and test set with 13 UMACE classifiers per person have an EER of 9.3% and 6.2%, respectively. The conditional dependence of the proposed ensemble is optimal to the OR rule, while the conditional dependences for the Bagging and the Boosting ensembles are not optimal to any decision fusion rule. The CMU PIE [117] dataset contains face images of 65 people with 13 poses for each person and 21 different illuminations without background lighting for each pose. Sample images of different poses are shown in Figure 5.1. The unconstrained minimum average correlation energy (UMACE) filter [118] has been shown to be tolerant to illumination variation when designed using training images representing extreme lighting conditions [73]. For the frontal face images in the PIE database, there are only a few misclassified authentic images at zero FAR with the UMACE filter [73]. The Unconstrained Optimal Tradeoff Filter (UOTF) [114] adds a noise tolerance term to the UMACE filter. Since the face verification performance is good for illumination tolerance in [73], even without noise tolerance, the UMACE filter is chosen the base classifier. Only authentic image training (without impostor image training) is sufficient to achieve the good face verification accuracy in [73]. Hence, here too, 121

139 Figure 5.1: Images of different face poses of a person. only authentic images are used in the training of the UMACE classifier ensemble. The UMACE filter is not very tolerant to pose variation. The UMACE filter is quite specific to the training images used in its design. In other words, if a UMACE filter is built from images of one pose, it would falsely reject authentic images from other poses. However, since images from impostor images are different from the authentic training images, the UMACE filter is expected to correctly reject impostor images of all poses. Following the ensemble design strategy for the OR rule in the last Chapter, designing one UMACE classifier for each pose and applying the OR fusion to the classifier decisions would be very effective in separating authentics from impostors overall. This can achieve the desired conditional dependence of negative Q statistic for authentics and positive Q statistic for impostors, which would be best fused with the OR rule. For obtaining tolerance to illumination in that pose, 3 extreme illumination authentic images of that pose are used to train a pose-specific UMACE filter. The remaining images of the dataset, i.e., images other than the training images, form the test set. For each person, there are [(21-3(training)=18) illuminations* 13 poses] = 234 authentic images and [21 illuminations * 13 poses * 64 impostor persons] = impostor images. The peak-to-sidelobe ratio (PSR) is used as the performance metric for the UMACE filter, and a decision is obtained by thresholding the PSR [57]. To 122

140 obtain statistically meaningful results, authentic and impostor decisions from all persons are used. In other words, statistical analysis of the test error and diversity of the ensemble is done on (234 authentics per person * 65 persons) = 15,210 authentic ensemble decisions (ensemble composed of 13 classifiers) and (17472 impostors per person * 65 persons) = 1,135,680 impostor ensemble decisions. The performance of the OR rule ensemble design with 13 UMACE classifiers per person is compared to the performance of one UMACE classifier per person trained on the entire training set. The single UMACE filter is trained on all the 39 (13 poses * 3 illuminations) authentic training images of a person. On the other hand, each of the 13 UMACE classifiers in the OR rule ensemble are trained on 3 authentic class illuminations of a single pose. The test set ROC of one UMACE filter per person, which is trained on all 39 authentic images, is displayed on Figure 5.2. As explained before, there are 234 authentic scores and impostor scores per person in the test set. The performance curve shown in Figure 5.2 is a Global ROC. In other words, for a threshold τ on the match score (PSR), the number of authentic scores from all persons, which would be 15,210 (234 authentic scores per person *65 people) authentic scores, that are below this threshold τ would be an FRR point. The number of impostor scores from all persons, which would be 1,135,680 (17472 impostor scores per person * 65 persons) impostor scores, that are above this thresholds τ would be the corresponding FAR point in this ROC. The Equal Error Rate (EER) of this single classifier ROC is 7.5%. On face verification with the OR rule classifier ensemble, each test image is correlated with each of the 13 UMACE filters of the claimant. 13 match scores (PSRs) are obtained from the UMACE ensemble. A threshold on the match scores produces a decision. The same threshold is used on all the 13 match scores from the 13 classifier ensemble to obtain 13 decisions. The 13 decisions from the classifier ensemble are fused with the OR fusion rule. In other words, if one of the decisions is an authentic decision, then the OR rule decision is authentic. Figure 5.3 shows the individual classifier ROCs for the 13 classifier OR rule ensemble on the test set, which are quite poor with an equal error rate (EER) of nearly 45%. The authentic scores of each classifier will be high for the pose for which it is trained for, and low for most of the other 12 poses. The impostor scores for each classifier are low for all poses of all persons. Hence, the individual classifier ROC has a poor performance. The classifier that is trained on authentic images of all the 123

141 10 0 PIE: ROC of one filter using all training images P FR P FA Figure 5.2: ROC of a single classifier per person trained on all 39 (13 poses * 3 illuminations) authentic training images. poses has a much better accuracy (EER of 7.5%), because of training on the entire authentic training set. Second order Q values on the classifier decisions are used as diversity measures for the 13 classifier set. The Q values are computed for a given threshold on the match score (PSR). The pairwise Q values for the 13 C 2 pairs of decisions are averaged for a given match score (PSR) threshold. Figure 5.4 shows the average authentic and impostor pair-wise Q values on the test set for different PSR thresholds. From the figure, it can be observed that for PSR thresholds greater than 10, the desired conditional dependence between classifiers for OR rule fusion, i.e., positive Q statistic for impostors and negative Q statistic for authentics is obtained. At a PSR value greater than 20, the impostor image decisions are all zero, which results in a pair-wise impostor Q value of 1. Above a PSR threshold of 20, the individual authentic image decision is 1 for images of the same pose as the pose used in training the classifier, and is 0 for most images of other poses. For a pair of classifiers, the first classifier trained on authentic images of pose i and the other classifier trained on authentic images of pose j, j i, the pair of classifier decisions would be (1,0) for most authentic images of pose i, (0, 1) for most authentic images of pose j and (0, 0) for most authentic images 124

142 of pose k i, j. From the definition of the pair-wise Q statistic in Appendix 7.4.2, this would result in an authentic Q value close to -1 because there are very few authentic images where the pair-wise authentic decisions is (1, 1). There would be only a small variation between the second order Q values of the 13 C 2 different pairs of classifiers because of the symmetry between the pairs of classifiers. The test set ROC on OR fusion of the designed ensemble is shown in Fig Some important points on the ROC curve are EER=0.75%, FRR=2.7% at FAR=0.1% and FRR=5.6% at 0.01% FAR. The EER on OR rule fusion of the 13 classifier ensemble is an order of magnitude lower than the EER of the single classifier trained on the entire authentic training set. The single classifier based on the entire authentic training set is constrained to have distortion tolerance for all poses of the authentic images. Increasing the distortion tolerance reduces the discrimination capability. Hence, the single classifier using all 39 authentic training images has a lower performance than the OR rule fusion of 13 classifiers, each trained on 3 images of a pose. ROC of individual classifiers for proposed design 0 10 P FR P FA Figure 5.3: Individual classifier ROCs of our designed ensemble on the entire PIE pose and illumination database. The legend refer to the labels given to different poses in the PIE database Adaboost on the PIE database Our ensemble design strategy is compared to Adaboost [9], which is a commonly used ensemble design and fusion strategy. The basic idea of this approach is: 1)design the first classifier by weighting all training images equally, 2)compute a weighted training error of the current classifier, 125

143 Average Q statistics of the classifiers for the PIE database Authentic Q Impostor Q Q value Threshold on PSR Figure 5.4: Average authentic and imposter Q values of our ensemble designed for the OR rule ROC of PIE database 10 1 P FR EER = 0.75% P FA Figure 5.5: ROC of the OR rule fusion using the designed classifiers. 126

144 3) design the next classifier by giving more weight to the misclassified samples by the last classifier, 4)repeat the process till desired number of classifiers are obtained and abort if either the weighted error is zero or greater than 0.5. The Adaboost ensemble decision on the test input is a weighted sum of the classifier decisions, where the weights are inversely proportional to the classifier accuracy. To make a fair comparison, the same training set of [3 illuminations * 13 poses]= 39 authentic images are used, and the Adaboost [9] algorithm for the UMACE base classifier described in Table 5.1 is set to design 13 classifiers. Each UMACE filter in the Adaboost ensemble is designed using all M = 39 authentic training images, with different weights on each image. Different pose images are registered with respect to each other so that at least one of the eyes is aligned. There are some intrinsic problems associated in applying Adaboost with UMACE classifiers. Adaboost is designed to work with weak learners; however, the UMACE classifier is highly tuned to the training data. The UMACE filter is designed to produce high PSRs, i.e., authentic decisions on the authentic training data. A single UMACE filter may have difficulty in fitting to all poses in the training data. However, in a few iterations, the classifier focuses on a subset (due to reweighting) of the authentic training images. Since the UMACE filter is highly tuned to the training images, the error on the weighted training set becomes zero in a few iterations. This results in a completion of the Adaboost training. Many faces result in fewer than 13 UMACE filters, with as few as one UMACE classifier. The other problem is deciding whether each UMACE should have the same PSR threshold to make decisions, and whether this threshold should be defined before designing the Adaboost ensemble. The remaining images are the test set; [(21-3 training) = 18 illuminations * 13 poses]= 234 authentic images and [21 illuminations * 13 poses * 66 impostor persons] = impostor images per person. The individual classifier ROCs for a sample person (40th person) shown in Fig. 5.6 are much more accurate (with EER between 5% and 12%) than the corresponding individual classifiers in our proposed ensemble design. However, the ROC of the weighted sum of the Adaboost classifier decisions shown in Fig. 5.7 for the sample (40th) person is worse than the OR fusion of our proposed ensemble. The EER of the ROC obtained by weighted decision fusion employed by Adaboost for this person is 5.2%, not much better than the best individual classifier obtained by the Adaboost algorithm. The first classifier in the Adaboost ensemble is the UMACE filter using all authentic training images weighted equally. The performance of this filter has been shown in Figure 5.2. The average 127

145 Table 5.1: The Adaboost algorithm for the UMACE base classifier. Let the lexicographically ordered Fourier transform of the authentic training images be x i, i = 1, 2,.., M. Let D be a diagonal matrix having the average spectral density of the training images along the diagonal. Initialize the training image weights wi 1 = 1/M, i = 1, 2,.., M. For l = 1, 2,.., N 1. Compute the lexicographically ordered UMACE filter in the frequency domain h l = D 1 M wi lx i. i=1 2. Calculate the PSR p i on correlating the filter h l with each of the training images x i. 3. Let the decision of { h l on the training image x i, i = 1,...,M be 1, pi τ d i = h l (x i ) = 0, p i < τ 4. Calculate the weighted error of h l : ǫ l = M wi l(1 d i) i=1 5. If ǫ l = 0, stop the Adaboost algorithm. Set N=l. Set β l = δ 1 M 1, δ > Otherwise, if ǫ l > 0, Set β l = ǫ l 1 ǫ l 7. Set the new weights of the training images to be wi l+1 = wl i βdi l Output the final decision 1, if N N log(1/β d f (x) = l )d l (x) 1 2 log(1/β l ) l=1 l=1 0, otherwise MP i=1 w l i βd i l ROC of all persons for the Adaboost is shown in Fig It should be noted that the average ROC for the Adaboost is obtained differently: the Adaboost ROCs of each person are combined by averaging the FRRs of each person s ROC at a given FAR. This is done because the weights for the weighted decision fusion vary from person to person. Further, there are different numbers of classifiers obtained by Adaboost for each person. The EER of the Adaboost algorithm for the UMACE base classifier is 6.2%, which is only about 1% smaller than the EER of the first UMACE classifier in the Adaboost ensemble, which is 7.5% (as observed from Figure 5.2). In other words, the Adaboost algorithm only makes a small improvement over the first classifier in the ensemble. Further, the EER of the Adaboost algorithm is an order of magnitude larger than the EER of the proposed OR rule ensemble, which is 0.75%. This proves the superiority of our proposed ensemble design. 128

146 The average pair-wise Q values for a sample person (40th) for the Adaboost ensemble as seen in Fig. 5.4 shows positive dependence for both authentics and impostors, because of which there is not a significant improvement on Adaboost fusion. The average pair-wise classifier authentic and impostor Q values over all persons are shown in Figure 5.9. This is shown as a function of the PSR thresholds at which each classifier is thresholded to obtain decisions. At low PSR thresholds, most authentic PSRs are accepted as authentics and hence there is a high positive conditional dependence on authentics. When the PSR threshold is increased, there are differences in the authentic decisions for outliers, which lowers the conditional dependence on authentics. This trend is observed in authentic Q values between PSR thresholds of 5 and 20. While the authentic Q value reduces for PSR thresholds between 5 and 20, it is still positive. As the PSR threshold increases further, after a point, the authentic outliers start getting rejected by all classifiers, thus increasing the conditional dependence on authentics. This trend is observed between PSR thresholds of 20 and 50. At low PSR thresholds, a lot of impostors are falsely accepted. When classifiers make different decisions on the impostors, the impostor conditional dependence is low. As the PSR threshold is increased, the impostor outliers start getting correctly rejected by all classifiers, which increases the impostor conditional dependence. This is observed between PSR thresholds of 5 and 15. When the PSR threshold is sufficiently high, all classifiers correctly reject all of the impostors, thus providing a constant Q value. This is reflected in PSR thresholds above 15. The impostor Q value is positive and increases for PSR thresholds between 5 and 15 and remains constant thereafter. The Adaboost classifier fusion is a weighted majority rule. For the majority rule, the pair-wise Q values should be negative for both authentics and impostors to be a favorable (lower error than independent) classifier ensemble. Since this is not the case, the classifier ensemble designed by Adaboost is not favorable for the fusion rule it is designed for. Possible Adaboost improvement by monitoring classifier diversity: The Adaboost generation of classifiers can be improved by monitoring the classifier ensemble diversity. A cross-validation set is needed to estimate the classifier ensemble diversity. As each new classifier is generated in the sequence, the pair-wise Q values between the previously generated classifiers in the set can be computed from the classifier decisions made on the cross-validation set. The classifier decisions can be made based on a pre-defined threshold on the classifier scores. The average authentic and impostor 129

147 ROCs of Individual Classifiers of Adaboost for a Sample Person 10 0 P FR P FA Figure 5.6: ROCs of individual classifiers of Adaboost for a sample (40th) person. The EERs of the individual classifiers are between 5% and 12%. Adaboost Weighted Decison Fusion ROC for a Sample Person P FR P FA Figure 5.7: Sample (40th) person s ROC of weighted decision fusion of individual classifiers by Adaboost. The EER is 5.2%. 130

148 Adaboost ROC: Average FRR for given FAR over all classes 10 1 P FR P FA Figure 5.8: Average ROC of Adaboost applied on all the authentic training images of a person. The averaging is done by averaging FRR across all persons for a given FAR. The EER is 6.2%. Adaboost classifiers: Average Q values for the PIE database Q value Authentic Q Impostor Q Threshold on PSR Figure 5.9: Average pair-wise classifiers Q values for the Adaboost ensemble on the PIE database. This is obtained by first averaging pair-wise Q values for each person, and then averaging over all persons. 131

149 Q values can be used to decide whether the Adaboost iterations to generate further classifiers should continue. At each new iteration, if the authentic and impostor Q values move closer to one of the optimal points (Q a, Q i ) = (1, 1), ( 1, 1), ( 1, 1), then the iterations should continue. If they move farther away from one of these points from the previous iteration, Adaboost training should stop. In other words, if the iterations increase favorable conditional statistical dependence, they should continue; otherwise, the iterations should stop Bagging on the PIE database The proposed ensemble design for the OR rule is also compared to Bagging [8]. The same training set (composed of only authentic images) and the test set used in the proposed ensemble design is also used here. UOTF filters are trained on 13 bootstrap samples of the training set. Each of these 13 UOTF filters are tested on all the test set images. For each test image, there is a set of 13 PSRs obtained by correlating the test image with each of the UOTF filters. These 13 PSRs are thresholded to obtain 13 decisions. These decisions are combined using a decision fusion rule to obtain the global decision for the test image. Bagging uses the Majority decision rule to fuse the 13 decisions. Bagging assumes that the Majority decision fusion is optimal. If the classifiers in the ensemble were independent, this would be a correct assumption. However, the classifiers in the Bagging ensemble are not independent in practice [10]. Hence, some other decision rule may be optimal for the Bagging ensemble. As the AND, OR, Majority are the major decision fusion rules, the OR, AND rules are also tested in addition to the Majority rule on the Bagging ensemble. The average ROCs for the Majority, OR, AND rules for bagging on the PIE database are displayed in Figure The decision fusion ROCs of each person are first obtained. The mean of the FRR of all persons for a given FAR produces the average ROCs shown in the figure. The EERs of the Majority, OR, AND rules for bagging ensembles are 9.3%, 4.7%, 27.5%, respectively. Although Bagging uses the Majority fusion rule, it is found that the OR rule is the best decision fusion rule for these Bootstrap classifiers. Hence, the assumption made in Bagging that the Majority rule would be optimal is an incorrect decision. Depending on the statistical dependence between classifiers, other decision fusion rules may be the best. The OR fusion is found to be best for bagging with UOTF filters on the PIE database. The OR rule EER for bagging is an order of magnitude larger than the 132

150 10 0 Average ROCs of major rules on the Bagging ensemble 10 1 P FR Or Majority And P FA Figure 5.10: database Average ROC for major decision rules applied on the bagging ensemble for the PIE OR rule EER for the proposed OR rule ensemble design. The average Q values between classifier pairs as a function of PSR threshold are shown in Figure 5.11 for bagging on the PIE database. The average Q values of a person are obtained by taking the mean of the pair-wise Q values of all pairs of classifiers at a given PSR threshold for that person. Averaging the Q values at a given PSR threshold over all persons gives the plots shown in Figure As observed, Bagging does not produce independent classifiers. Both the authentic and impostor Q values are positive, which is not favorable for any decision rule. The classifiers have positive dependence, providing lower accuracy than independent classifiers on their fusion Choosing the threshold set on classifier scores In the evaluation of the performance of decision fusion of classifier ensembles on the PIE database, the same threshold is used on all the classifier scores to make decisions. Choosing the same threshold is not optimal in general. The performance changes when a different set of thresholds are applied on the classifier scores to make decisions. By choosing an optimal set of thresholds on the classifier scores, a lower error rate can be obtained. Hence, the performance will be better than those displayed for equal thresholds on the scores for this database, as well as any other 133

151 Figure 5.11: Average pair-wise Q values of the bagging ensemble on the PIE database. The averaging is done over all pairs of classifiers of a person, and then over all persons. database in general. We have shown that the proposed ensemble design for the OR rule is an order of magnitude better than bagging and boosting for the PIE database. In the next subsection, ensemble design for the OR rule on fingerprints is investigated NIST 24 plastic distortion dataset evaluation OVERVIEW: The proposed classifier ensemble generation for the OR rule is compared to Bagging on the NIST 24 plastic distortion fingerprint database. The Unconstrained Optimal Tradeoff (UOTF) correlation filters are used as base classifiers because of their distortion tolerance and discrimination capability. The proposed classifier ensemble is generated by grouping authentic fingerprint images of similar distortion into the training of each classifier for a finger. The conditional dependence of the proposed ensemble is favorable to the OR rule. The EER of the proposed ensemble is 1.8%, which is significantly lower than the Bagging EER of 3.1% with the same number of UOTF classifiers per person. Boosting is not possible on this database as the initial UOTF classifier does not make any errors on the training set. 134

152 Classifier ensemble design for the OR rule on the NIST 24 [111] plastic distortion fingerprint database is evaluated here [108]. The NIST 24 database contains 10 second videos at 30 frames per second of 10 fingers of 10 people, i.e., 100 fingers in total. The images are captured using an optical sensor at 500 dpi resolution and are of size pixels. Here, the images are padded to pixels. A brief description of the databases followed by the experimental evaluation is given in following sections. The fingerprints in the plastic distortion dataset have a lot of distortion since the fingers are rolled and twisted here. Some sample images of a finger showing distortion as well as partial fingerprints are shown in Fig All 300 images from each finger are used here, without any pre-processing done on the images. Downsampled (by averaging) images of size are used for evaluation for faster processing time as this resolution results in reasonable accuracy [74]. Figure 5.12: Distorted and partial fingerprints of a sample finger in the NIST 24 plastic distortion dataset Twenty uniformly sampled images from the 300 images of a finger, starting from the 1st image, are used as the authentic training set and the 1st image from each of the remaining 99 fingers are used as the impostor training set to design Unconstrained Optimal Trade-off (UOTF) Filters [114]. The UOTF filters have been shown to have better performance than the MACE filters [119], [74], and hence are used here. The UOTF filter is discriminative even when designed without the impostor training images and has shown good performance when designed with just the authentic training 135

153 images [74]. All the training images are normalized to unit energy. While more details of the filter can be obtained from [114], the UOTF filter provides a trade-off between distortion tolerance and discrimination. The equation for the UOTF filter in the frequency domain is given below. h = (αc + βd) 1 m, α 2 + β 2 = 1, α > 0β > 0 (5.1) Here, m is the mean of the authentic training images in the frequency domain, C is the noise spectral density, assumed to be an identity matrix, D is the average training image spectral density, α is the noise tolerance coefficient and β represents the peak sharpness coefficient. α and β are varied to trade-off between distortion tolerance and discrimination. Typically, a cross-validation set is used to evaluate the performance of the UOTF filter at a range of values of α and β. The value of α and β is picked at which the performance in the cross-validation set is the best. An α of 10 6 is chosen here since it gave the best performance for a single UOTF filter among a range of α values for this database in [74]. In this experiment, a set of UOTF filters are used. It may be possible that there is another value of α for which the performance of the UOTF ensemble is optimal. However, this evaluation is not done due to lack of time. To show the accuracy improvement of multiple classifier fusion over the best single classifier, a comparison of performance of a single correlation filter using all training images is made with the fusion performance of multiple correlation filters, each trained with a subset of the training images. For improved OR rule fusion, the multiple classifiers are designed to classify different regions of distortion in the authentic space. The following guidelines can be used to design a set of multiple UOTF filters for the OR fusion rule by partitioning the authentic training set. The entire impostor training set is used to design each of the multiple filters. The authentic training images are divided into multiple subsets of similar plastic distortion in the following way. 1. Pick an image, say the 1st image of the training set, to build a filter. 2. Build a filter and cross-correlate the filter with the rest of the training images. 3. Pick the image which is most different from the current filter(s) by choosing the one with lowest PSR to build the next filter. 4. Cross-correlate the rest of the training set with all the current filters. For each image, store 136

154 the maximum PSR across different filters (in order to compare between different images in step 3). 5. Repeat step 3 till required number of filters have been built or when all images have a sufficiently high PSR (greater than a specified threshold). 6. The remaining images are used to update the closest filter (the filter for which the max. PSR is obtained) All filters of each finger are tested against the 279 remaining authentic images of that finger and 20 randomly chosen impostor images from each of the 99 impostor fingers. Only 20 images per impostor are chosen to reduce processing time and since the impostors are well rejected by the UOTF filters, this subset is a good representation of the accuracy on impostors. There are a total of 279 authentic images per finger 100 fingers = 27,900 authentic images and a total of impostor images per finger 100 fingers = 198,000 impostor images. Zhang et. al [116] proved that for N independent classifiers, the best ROC for the AND fusion of classifiers can be obtained by searching over a one-dimensional space, instead of a brute force search over an N dimensional threshold space of the classifiers. However, in our work, the classifiers are statistically dependent. A brute force search to find the best set of thresholds on the classifier scores is performed to obtain optimal ROCs for each of the decision fusion rules here. Fig shows the authentic and impostor Q values for each pair of classifiers for the best set of thresholds found for each point on the ROC curve for the OR rule. The x axis for the plot is the index of the threshold set. The impostor Q values are positive, which is favorable for the OR rule. It can be seen that the authentic Q values are negative at the higher indices of the threshold set, which is favorable for the OR rule. Only three classifiers are used in this OR rule ensemble because the authentic Q values for each of the three classifier pairs is negative only at the higher indices of the threshold set. Using more classifiers in the ensemble would result in a positive authentic Q value for at least one pair of classifiers. In other words, additional classifiers may be similar to one of the three classifiers present in the ensemble, which is not desirable for fusion. The three subsets of authentic images that make each of the three UOTF filters are shown in Figures 5.14 to 5.16for a sample finger (right thumb of person 10). These three subsets of authentic images are obtained from the algorithm mentioned in this section. Each subset consists of images of 137

Figure 5.13: Average authentic and impostor Q values of pair-wise classifiers in our ensemble on the NIST 24 plastic distortion set. A set of best thresholds on each of the classifiers is found for a given FAR/FRR point on the OR fusion ROC. The x-axis in this figure represents the index for these threshold sets.

Figure 5.14: Set 1 of the three authentic training image subsets of a sample finger. Each training subset is used to make one UOTF filter in the OR rule ensemble.

Figure 5.15: Set 2 of the three authentic training image subsets of a sample finger. This is used in building the second UOTF filter in the ensemble.

Figure 5.16: Set 3 of the three authentic training image subsets of a sample finger. This is used in building the third UOTF filter in the ensemble.

Figures 5.17 and 5.18 show the Receiver Operating Characteristic (ROC) curves for the three classifiers using partitions of the authentic training set and the ROCs for all possible monotonic decision fusion rules [19] for 3 classifiers. As a comparison, the ROC of a single classifier using the entire training set is also shown. It is to be noted that each of the 3 classifiers in the proposed ensemble uses only a subset of the authentic training images. It is observed that while the individual classifiers in the ensemble have poor accuracy, the OR rule fusion has good accuracy. It is observed that the OR rule using all three classifiers has the best ROC among all monotonic fusion rules. The OR rule fusion has better accuracy than the single UOTF classifier using all training images. This shows that fusion can improve accuracy not just over the individual UOTF classifiers in the ensemble but also over the best possible single (UOTF) classifier.

Figure 5.17: Comparison of ROCs for the NIST 24 plastic distortion set: three individual classifiers in our ensemble (each trained on a subset of the authentic training set), a single UOTF classifier using the entire authentic training set, and the OR fusion of our ensemble. This shows that classifier ensemble fusion can be better than the best individual classifier.

Bagging on the NIST 24 plastic distortion set

A classifier ensemble obtained through bagging is used as a comparison to our proposed classifier ensemble design for the OR rule. The unconstrained optimal trade-off (UOTF) correlation filters [114], with a noise tolerance coefficient of 10^−6 and a peak sharpness coefficient of 1, are the base classifiers used here. The same training set used in our proposed ensemble is used here, i.e., 21 authentic images and 99 impostor images (the first image from each of the 99 impostor fingers). The test set for each finger consists of the 279 authentic images other than the training set and 20 randomly sampled images from each of the 99 impostor fingers, since the UOTF filter is shown to be discriminative [74]. Three bootstrap [79] classifiers are obtained by training on a random subset of the authentic training data and a random subset of the impostor training data. The random subsets are obtained by random sampling with replacement from the training set. The best ROCs for all three-classifier decision fusion rules, obtained through an exhaustive search for the best thresholds on the three classifiers of the bagging classifier ensemble, are shown in Figure 5.19.
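For comparison, the bootstrap classifiers used in the bagging baseline are built from random samples drawn with replacement. The sketch below assumes the training images are held in Python lists and that filter training is hidden behind a hypothetical train_classifier() helper; it is an illustration rather than the exact implementation used in the thesis.

import random

def bagged_ensemble(authentic_imgs, impostor_imgs, train_classifier,
                    n_classifiers=3, seed=0):
    """Build a bagging ensemble: each classifier is trained on a bootstrap
    sample (random sampling with replacement) of the authentic and impostor
    training sets."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_classifiers):
        boot_auth = [rng.choice(authentic_imgs) for _ in authentic_imgs]
        boot_imp = [rng.choice(impostor_imgs) for _ in impostor_imgs]
        ensemble.append(train_classifier(boot_auth, boot_imp))
    return ensemble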

Figure 5.18: Test ROCs of all fusion rules for the three classifier ensemble designed for the OR rule on the NIST 24 plastic distortion dataset.

Bagging uses the majority rule for fusion of the classifier decisions. Figure 5.20 shows the ROCs for just the Majority, OR and AND rules. From these we see that the EER of bagging, i.e., using the majority rule on the three bootstrap classifiers, is 3.1%. The EER of the OR rule on the three bootstrap classifiers is 2.9%, and the OR rule is the best rule in terms of the EER for this ensemble. As we can see, the classifier ensemble produced by bagging has poorer performance than our proposed classifier ensemble for the OR rule on the NIST 24 plastic distortion set.

Adaboost on the NIST 24 plastic distortion set

Adaboost, or any kind of boosting, on the NIST 24 plastic distortion set using the same training set does not produce an ensemble of classifiers. This is because the initial classifier, i.e., the UOTF classifier built on all the training images, makes no errors on the training set. The performance of the single UOTF classifier using all training images is shown in Figure 5.17.

NIST 24 rotation dataset evaluation

Comparison of a 5 classifier ensemble to the best single classifier

OVERVIEW: The Optimal Tradeoff Circular Harmonic Function (OTCHF) filter is used as a base classifier on the NIST 24 fingerprint rotation database as it is designed to be tolerant to in-plane rotation. The proposed OR rule classifier ensemble generation is somewhat different from the previous ensemble generation techniques.

Figure 5.19: Three classifier decision fusion ROCs for the bagging classifier ensemble on the NIST 24 plastic distortion set.

Figure 5.20: ROCs for the And, Or, and Majority rules for the bagging classifier ensemble on the NIST 24 plastic distortion set.

Feature partitioning is done here rather than data partitioning. Each of the OTCHF classifiers is trained to have rotation tolerance in a different rotation range. The EER of the proposed ensemble with OR rule fusion is 14.6%. This is more accurate than a single OTCHF designed to have rotation tolerance in the entire rotation angle range present in the test set, which has an EER of 26%. Bagging and Boosting classifier ensembles are not possible here because the number of training images is kept constant in each OTCHF classifier.

In the rotation dataset, the finger is lifted and placed at different rotation angles from approximately −45° to 45°. While there is less distortion than in the plastic distortion dataset, there is elastic distortion because the finger is not rigid. Fig. 5.21 shows some sample images of a finger at different rotation angles. Because of the lifting of the finger, some of the 300 frames of each finger's video are blank or have very faint or very small area fingerprints. Some samples of such images are shown in Fig. 5.22. These images are removed, and the rest are kept, unlike in [120] where only the best images of a rotation angle are kept. There are some images which are faint or have a small fingerprint area, as shown in Fig. 5.22. No pre-processing is done on the images, unlike [120] where the images are de-rotated to zero rotation angle and processed to have good contrast and the same fingerprint area for all fingers. Approximately 150 images per finger are kept. All 100 fingers of the dataset are used for evaluation. For faster processing, the images are downsampled (by averaging) to 125 dpi resolution. The Optimal Trade-off Circular Harmonic Function (OTCHF) filter [121] is used here because it can be designed to provide tolerance to geometric rotation of the image for a specified rotation angle range with a specified target output [108]. The OTCHF filter is also designed to provide a trade-off between distortion tolerance and discrimination. Just as in the plastic distortion dataset, here too, a noise tolerance parameter of 10^−6 and a discrimination parameter of 1 were used. Multiple authentic images are used for training; however, no impostor images are used for training. In [120], it is found that eight or more training images are needed to design a good classifier for this dataset. We use 8 uniformly sampled training images for each finger, which are de-rotated to a single rotation angle and registered, for the purpose of training the OTCHF filter. Since the training images need to be registered in rotation as well as translation before computing the filter, the training images are de-rotated with respect to one of the images that is most upright. The rotation is done in the original 512 dpi resolution in steps of 1°, and the rotated image is then cross-correlated with the reference image.

Figure 5.21: Sample images of rotated fingerprints of the same finger in the NIST 24 rotation dataset.

Figure 5.22: Sample faint fingerprints of the same finger as in Figure 5.21 in the NIST 24 rotation dataset.

The rotation angle having the highest cross-correlation is considered as the rotation required to register that image. A single OTCHF filter is designed to provide a constant peak value of 1 in the rotation range [−50°, 50°] and a zero peak value for all other rotation angles of authentic images. This is expected to have rotation tolerance for all authentic fingerprint rotations in the test set. Hence, this OTCHF filter will be the best possible single classifier (for the OTCHF base classifier). The proposed OTCHF classifier ensemble attempts to improve verification performance over this classifier. The proposed OR rule classifier ensemble consists of five OTCHF filters, each designed to provide a constant peak of 1 for the rotation ranges [−50°, −30°], [−30°, −10°], [−10°, 10°], [10°, 30°] and [30°, 50°], respectively. All filters are designed using all 8 authentic training images. The difference between the filters is the rotation angle range in which they are designed to have rotation tolerance. The test images are not de-rotated. All the images in the rotation dataset are used for testing. Evaluation is done on all images of the 100 fingers for each of the filters, including the training images, because they were de-rotated before computing the filter. There were approximately 15,000 joint authentic scores and 1.5 million joint impostor scores in total from all fingers. Fig. 5.24 shows the ROCs for the OTCHF filter designed for ±50° tolerance and the five OTCHF filters, each having a rotation tolerance range of 20°. Figure 5.25 shows the ROCs for the OR, AND and MAJORITY fusion of the 5 OTCHF filters, using a one-dimensional search space on the thresholds, i.e., the same threshold is used for all 5 OTCHFs. While the ROC should ideally be computed using the best set of thresholds for each classifier, that ROC would be no worse than the one shown. As expected, it is observed in Fig. 5.24 that the OR rule fusion leads to a better ROC than the five individual OTCHF classifiers. It is also much more accurate than the single OTCHF designed for ±50° rotation tolerance. Since there are too many possible monotonic fusion rules for 5 classifiers [19], the ROCs for only the AND and MAJORITY rules are shown in Fig. 5.25. Since the ensemble is designed for the OR decision rule, the OR rule fusion leads to a better ROC than the other fusion rules. The OR rule EER is 14.6% and the EER of the OTCHF designed for ±50° is 26%. The error rates for the classifier ensemble fusion are high. This is probably because the resolution of the images used here is too low (125 dpi). The training images are de-rotated to a specified angle. These images get blurred after de-rotation at 125 dpi resolution. This results in a reduction in the discrimination capability of the OTCHF filter. The error may be lower at the original 500 dpi resolution of the fingerprints.

The accuracy on the NIST rotation dataset reported by Casasent et al. [120] is higher than the accuracy reported above. This is because the evaluation in [120] was done against pure elastic distortion, with the rotation removed. Further, the evaluation was performed on the best 55 fingers of a larger 200 finger dataset, and on the best images of each finger. The accuracy we obtain on the NIST plastic distortion set, which has no rotation but more distortion than the rotation set, is of the same order of magnitude as the accuracy obtained in [120]. The minutiae matching performance for the same images in this NIST 24 rotation set is also reported by Casasent et al. [120]: at an FAR of 0.1%, the minutiae matching FRR is 15%. Hao et al. [122] perform fingerprint matching through error propagation on the NIST 24 database. They report an EER of 4.5% for this database. However, their evaluation dataset is not clear. It is not clear whether they use one or both of the plastic distortion and rotation sets. They also only evaluate performance on 50 fingers instead of the 100 fingers present in this database.

The performance improvement of the OR rule fusion over the best single classifier OTCHF can be explained with the help of Fig. 5.23. The left plot of Figure 5.23 shows the PSRs for the 5 individual classifiers, while the right plot of Figure 5.23 shows the authentic and impostor PSRs for the OTCHF having ±50° rotation tolerance, for one finger. The best single classifier is a complex one and is unable to handle the distortion present in the entire rotation range. Each of the individual classifiers has higher authentic PSR values and smaller error across the rotation range it is designed for, because it is designed to solve a simpler problem. The fusion of multiple simpler classifiers is thus more accurate. Figure 5.26 shows the authentic and impostor Q values between pairs of classifiers as a function of the PSR threshold on the first classifier. For all the classifier pairs, the impostor Q values are positive and close to 1 for the higher thresholds, which is favorable for the OR rule. Except for one classifier pair (filter 4 and filter 5), the Q values on authentics between all classifier pairs are negative, which is favorable for the OR rule. The authentic match scores (PSRs) of filters 4 and 5 are shown for a sample finger in Figure 5.27. For many of the authentic images where the PSRs of filter 5 are high (> 15), the PSRs of filter 4 are also high. Hence, the pair-wise authentic Q values between classifiers 4 and 5 are positive above a PSR threshold of 15. Figure 5.28 shows the pair-wise classifier scores of filters 4 and 5 for all authentic images of the 100 fingers.

Figure 5.23: Comparison of PSRs of the designed OTCHF classifier ensemble for the OR rule (left, multiple classifiers) vs. the PSRs of a single OTCHF designed for the entire rotation range (right, single classifier).

Figure 5.24: Comparison of ROCs on the NIST 24 rotation set: ROC of a single OTCHF designed for rotation tolerance to all rotations in the test set, ROCs of the 5 individual classifiers in the proposed ensemble, and OR fusion of the proposed ensemble.

Figure 5.25: Test ROCs of the OR, AND, and MAJORITY rules on the designed classifier ensemble for the OR rule on the NIST 24 rotation dataset.

Figure 5.26: Authentic and impostor Q values of pair-wise classifiers in our ensemble on the NIST 24 rotation set, as a function of the PSR threshold of classifier 1.
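For reference, the pair-wise Q values plotted in Figures 5.13 and 5.26 can be computed with Yule's Q statistic on the joint binary decisions of two classifiers. The following sketch is an assumed implementation (the thesis does not list one); it is applied separately to the authentic and the impostor test images to obtain the authentic and impostor Q values.

import numpy as np

def yule_q(decisions_a, decisions_b):
    """Yule's Q diversity measure for two classifiers.

    decisions_a, decisions_b : binary arrays (1 = authentic decision,
    0 = impostor decision) for the same set of test images.
    Q = (N11*N00 - N01*N10) / (N11*N00 + N01*N10), where Nxy counts images
    on which classifier A decided x and classifier B decided y.
    """
    a = np.asarray(decisions_a, dtype=bool)
    b = np.asarray(decisions_b, dtype=bool)
    n11 = np.sum(a & b)
    n00 = np.sum(~a & ~b)
    n10 = np.sum(a & ~b)
    n01 = np.sum(~a & b)
    denom = n11 * n00 + n01 * n10
    return 0.0 if denom == 0 else (n11 * n00 - n01 * n10) / denom

A Q value near +1 indicates that the two classifiers tend to agree, a value near −1 indicates that their errors occur on disjoint images, and 0 corresponds to statistically independent decisions.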

Figure 5.27: Authentic PSRs of classifiers 4 and 5 in our ensemble for a sample finger (Finger 16) on the NIST 24 rotation set.

There are many points where both scores are high (corresponding to authentic decisions by both classifiers). This results in a positive authentic Q. The authentic scores of classifiers 4 and 5 can be compared to the authentic scores of classifiers 1 and 5 in Figure 5.29. In that case, when one classifier's score is high, the other classifier's score is low, which results in a negative Q value between classifiers 1 and 5. Filter 5 is designed for the edge of the possible rotation angle range of the test set ([30°, 50°]). There are possibly fewer test images that have these rotation angles, and most of these may be closer to 30°. Filter 4 (designed for rotation tolerance in [10°, 30°]) provides a high PSR for many authentic images in this set (close to 30°). This is likely the reason for the positive Q between these classifiers.

Improvement of the classifier ensemble performance

The ensemble of OTCHF filters is modified in order to improve the accuracy on the NIST 24 rotation database. The ensemble performance is enhanced by adding training images, using a higher fingerprint image resolution, and increasing the number of OTCHF filters in the ensemble. Details of the ensemble are presented below. It was found that elastic distortion in the test fingerprint images is the main cause of the low accuracy. To increase the tolerance to elastic distortion in the fingerprints, the number of training images is increased.

Figure 5.28: Authentic PSRs of classifiers 4 and 5 in our ensemble for all fingers in the NIST 24 rotation set.

Figure 5.29: Authentic PSRs of classifiers 1 and 5 in our ensemble for all fingers in the NIST 24 rotation set.

Twenty-one authentic images per finger form the training set. Impostor images are not used in training each classifier. A reference image that is used for training image registration is manually selected from each finger. The remaining twenty training images are obtained by uniform sampling of the images of each finger, starting from the 5th image.

The second cause of the poor performance is the low resolution of the fingerprint images. In the last section, downsampled images of 125 dpi resolution are used. This is a much lower resolution than the original fingerprint image resolution of 500 dpi. At 125 dpi resolution, there is only a one pixel wide ridge/valley in the fingerprint image. The training image registration (before OTCHF filter generation) requires that all training images be present at the same rotation angle. Since the training images are present at different rotation angles, registration by rotating all fingerprints to the same angle causes blurring in the fingerprints. This affects the discrimination capability of the OTCHF filter. Hence, fingerprint images at the original 500 dpi resolution are used here.

The training image registration is done as follows. A reference fingerprint image that is approximately upright is manually selected from each finger. If the major axis of the fingerprint is at 0° to the vertical axis, it is considered to be upright. If this reference fingerprint is oriented by more than 10° to the vertical axis, it is manually rotated to be at 0° to the vertical axis. The 20 training images of each finger are rotated to align with this reference image x_r. This is done through an automatic process, described in Table 5.2.

Table 5.2: Training image registration.
1. Build a reference UOTF filter h with a noise tolerance coefficient of 10^−5 using the reference fingerprint x_r.
2. For each training image x_i, i = 1, 2, ..., M, compute the orientation φ to the reference image as follows. For θ = −55° to 55°: (a) rotate the training image x_i by θ to get x_{i,θ}; (b) correlate x_{i,θ} with the filter h and compute the PSR p_θ. Find the angle φ corresponding to the maximum PSR, and rotate x_i by the angle φ to get x_{i,φ}.
3. Compute the translation necessary to align x_{i,φ} with x_r as follows. Find the peak location of the correlation output obtained by matching x_{i,φ} with h. The location of the peak (Δh, Δv) relative to the center of the correlation output determines the translation required. Translate x_{i,φ} by (−Δh, −Δv) to get the registered image x_{i,a} that is aligned with the reference image.
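A rough code version of the registration procedure in Table 5.2 is given below. It is only a sketch: the UOTF filter construction, the image rotation, and the correlation (returning the PSR and the peak offset from the center) are abstracted behind the hypothetical helpers build_uotf_filter(), rotate(), and correlate(), none of which are specified in the thesis.

import numpy as np

def register_training_images(reference, images, build_uotf_filter,
                             rotate, correlate, angles=range(-55, 56)):
    """Register training images to the reference fingerprint (Table 5.2).

    correlate(filt, img) is assumed to return (psr, (dh, dv)), where (dh, dv)
    is the correlation peak location relative to the center of the output.
    """
    h = build_uotf_filter([reference])          # reference UOTF filter
    registered = []
    for img in images:
        # Orientation: try every angle and keep the one with the highest PSR.
        best_phi, best_psr = 0, -np.inf
        for theta in angles:
            psr_theta, _ = correlate(h, rotate(img, theta))
            if psr_theta > best_psr:
                best_phi, best_psr = theta, psr_theta
        img_phi = rotate(img, best_phi)
        # Translation: shift so the correlation peak moves to the center
        # (np.roll is a circular shift, used here as a simple stand-in).
        _, (dh, dv) = correlate(h, img_phi)
        registered.append(np.roll(img_phi, shift=(-dh, -dv), axis=(0, 1)))
    return registered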

An increase in the number of classifiers is the third aspect of the enhanced OTCHF ensemble used here. A smaller range of rotation tolerance constraints in the OTCHF filter provides more room for an improvement in discrimination. In other words, increasing the rotation angle tolerance range constraints decreases the discrimination capability, and vice versa. Twenty OTCHF filters, with overlapping rotation angle range constraints, form the classifier ensemble. The ensemble as a whole is designed to provide rotation tolerance for the [−50°, 50°] angle range. Each OTCHF filter is constrained to provide rotation tolerance in a 7° range, with a 2° overlap with each of its two neighboring OTCHF filters. The rotation angle tolerance range constraints for the twenty filters are [−51°, −45°], [−46°, −40°], [−41°, −35°], ..., [44°, 50°]. The overlapping ranges are used to avoid a drop in performance of the ensemble at the edges of the rotation angle ranges. All 20 registered images per finger, along with the reference image of that finger, are used in training each of the OTCHF filters.

The test images are present at different angles to the vertical axis. They are not processed, i.e., they are not rotated to align with the vertical axis. All the images of the authentic finger form the authentic test set. To reduce the time taken for impostor image evaluation, only 10 randomly chosen images from each of the 99 impostor fingers are used for the impostor test set. Each test image is correlated with the 20 OTCHF filters of the claimed finger's ensemble. The PSR of each of the 20 correlation outputs is calculated. 20 decisions are obtained by thresholding each PSR with the same threshold. The same threshold is chosen for simplicity; however, if an optimal choice of the thresholds is made, the accuracy will improve. The 20 decisions are combined using the OR rule, i.e., if any of the 20 decisions is authentic, the OR rule declares authentic.

The improvement in performance over the previous subsection can be seen from Figure 5.30. The authentic PSRs in Figure 5.30 are much larger than the corresponding authentic PSRs of the same finger in Figure 5.23a. There is a smaller deviation in the maximum authentic score in Figure 5.30 as compared to Figure 5.23a. There is more tolerance to elastic distortion because of the larger number of authentic training images (20 as compared to 8). This results in a high authentic score (> 20) from at least one of the 20 filters. The maximum authentic score is larger due to the higher resolution of the images (500 dpi as compared to 125 dpi), and to a lesser degree because of the larger number of filters (20 as compared to 5). The impostor PSRs also increase from Figure 5.23a; however, the scale of increase for the impostor PSRs is smaller than that of the authentic PSRs.
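Assuming the twenty tolerance ranges follow the stated pattern (a 7° span placed every 5°, which yields the 2° overlap between neighbors), the range constraints and the OR rule fusion of the thresholded PSRs can be sketched as follows; the function names and the range-generation rule are illustrative assumptions rather than the thesis's exact code.

import numpy as np

def rotation_ranges(start=-51, stop=50, width=6, step=5):
    """Overlapping rotation-tolerance ranges: [-51,-45], [-46,-40], ..., [44,50]."""
    ranges = []
    lo = start
    while lo + width <= stop:
        ranges.append((lo, lo + width))
        lo += step
    return ranges                      # 20 ranges for the default arguments

def or_rule_decision(psrs, threshold):
    """OR fusion: declare authentic if any of the 20 PSRs exceeds the threshold."""
    return bool(np.any(np.asarray(psrs) > threshold))

With the default arguments, rotation_ranges() reproduces the twenty ranges listed above.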

Figure 5.30: Authentic and impostor PSRs (match scores) for all 20 classifiers in the ensemble for a sample finger in the NIST rotation database.

The authentic correlation coefficient of pairwise classifier scores is plotted in Figure 5.31. For neighboring filters, there is a high authentic PSR (> 20) for an overlapping set of images. This is due to two reasons. There is some overlap in the rotation angle tolerance between neighboring filters (2° overlap). Also, high authentic PSRs are observed for a few degrees (< 5°) outside the constrained rotation range due to the distortion tolerance present in the filter design. This results in a positive pairwise authentic correlation coefficient between neighboring classifiers. When the classifier indices differ by a large number, there is a negative authentic correlation coefficient of pairwise classifier scores. This negative correlation coefficient is favorable to OR rule fusion. The impostor correlation coefficients of pairwise classifier scores are all positive, as observed in Figure 5.32. This is favorable to OR rule fusion. There is a lower impostor correlation coefficient between classifier 11 and all other classifiers. Classifier 11 is constrained to have rotation tolerance in [−1°, 5°]. The ROCs of the 20 classifiers in the ensemble are compared to the ROC of the OR fusion of this ensemble in Figure 5.33. The individual classifier ROCs have less accuracy than in the previous subsection. This is because the rotation tolerance constraints for each classifier are for a much smaller range (7° as compared to 20°). As expected, the OR rule fusion has a much higher accuracy than the ensemble in the previous subsection.

Figure 5.31: Authentic correlation coefficient between each pair of classifier scores in the 20 classifier ensemble in the NIST rotation database.

Figure 5.32: Impostor correlation coefficient between each pair of classifier scores in the 20 classifier ensemble in the NIST rotation database.

The EER of the OR rule fusion reduces to 2.7% here, from 14.6% in the previous subsection. The performance obtained here is more accurate than the performance observed for parts of this database by Casasent et al. [120] and Hao et al. [122].

Figure 5.33: ROCs of the 20 individual classifiers in the ensemble generated for the NIST rotation database along with their OR rule fusion.

Since the ensemble design with the OTCHF has been based on the extent of the rotation range, a straightforward comparison to Bagging and Boosting is not possible. Both Bagging and Boosting require a selection or reweighting of the training set, which is not of much use here. Hence, we do not show a comparison to Bagging or Boosting here.

5.2 Classifier ensemble design for the AND rule

OVERVIEW: Classifier ensembles are generated to be favorable to AND decision rule fusion in this section. The same base classifiers used in OR rule ensemble generation are not useful for AND rule ensemble generation. This is demonstrated on the NIST 24 plastic distortion fingerprint database. Base classifiers capable of making diverse decisions on impostors and similar correct decisions on authentics are required here. Classifiers that are affected by impostor data training are used here.

Due to the extreme pose and rotation variations in the authentic images in the PIE database and the NIST 24 fingerprint rotation database, desirable correct decisions on most of the authentics cannot be obtained; hence ensemble generation is not done for these databases. The Support Vector Machine (SVM) is evaluated as a base classifier for the NIST 24 plastic distortion database. The Linear Discriminant base classifier is found to be more successful than the SVM on the AR face database. The proposed AND rule ensembles have superior performance to Bagging ensembles. Similar decisions on authentics and diverse decisions on impostors are favorable for AND decision rule fusion. However, the obtained diversity on impostor decisions is not high for the proposed ensembles (although higher than the Bagging classifier decision diversity). The difficulty lies in getting an accurate estimate of the number of impostor clusters and the cluster elements.

The AND rule is complementary to the OR rule. Hence, the conditional dependence that is favorable for the AND rule is the reverse of that favorable for the OR rule, and the classifier ensemble design strategies are also reversed. For the OR rule ensemble design, clusters of authentic data are separated by each classifier from most of the impostor data. By this design, OR rule fusion means each cluster of authentic data will be declared authentic (since at least one classifier, the one trained on that cluster, declares it authentic), and the impostor data is declared impostor (since all classifiers declare it impostor). For the AND rule ensemble design, clusters of impostor data are separated by each classifier from most of the authentic data. By this design, AND rule fusion means each cluster of impostor data will be declared impostor (since at least one classifier declares it impostor), and the authentic data is declared authentic (since all classifiers declare it authentic).

NIST 24 plastic distortion dataset evaluation

OVERVIEW: AND rule classifier ensembles for the NIST 24 plastic distortion fingerprint database are generated in this subsection. While UOTF filters are good base classifiers for the OR rule ensemble, they are shown to be poor base classifiers for the AND rule ensemble. The SVM base classifier is used here since it is affected by impostor training. The proposed AND rule ensemble has an EER of 1.2%, while Bagging with the same number of SVM classifiers has an EER of 2.6%. The diversity on impostor decisions is higher than that of Bagging for the proposed AND rule ensemble, but it is not the desired negative conditional dependence on impostors.
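As a point of reference for this complementarity, the error rates of the AND and OR rules for N statistically independent classifiers combine in opposite directions (a standard result that also underlies the independent-classifier reference values quoted later in this chapter; the thesis's own analysis covers the dependent case):

FAR_AND = FAR_1 × FAR_2 × ... × FAR_N,        FRR_AND = 1 − (1 − FRR_1)(1 − FRR_2)...(1 − FRR_N)
FAR_OR  = 1 − (1 − FAR_1)(1 − FAR_2)...(1 − FAR_N),        FRR_OR = FRR_1 × FRR_2 × ... × FRR_N

The AND rule therefore trades a lower FAR for a higher FRR, and the OR rule does the opposite, which is why the two rules call for reversed ensemble designs.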

In Section 5.1.2, we used the UOTF filters [114] to separate authentic clusters from most of the impostors in the NIST 24 fingerprint plastic distortion set [111], which made the UOTF filter a good base classifier for the OR rule ensemble design. For the same database, a UOTF filter would not be able to separate most of the authentic data from a cluster of impostor data, which is what is necessary for a good ensemble design for the AND rule. In general, a base classifier that is capable of separating most of the impostors from a cluster of authentics would not be able to do the reverse, i.e., separate most of the authentics from a cluster of impostors. Hence, we need some other base classifier (other than the UOTF filter) that is capable of separating most of the authentics from clusters of impostors for successful ensemble design for the AND rule on the NIST 24 plastic distortion database. If this is not possible, we need to demonstrate ensemble design for the AND rule on some other database that has some clusters of impostors where classifiers (which are different from UOTF filters) are able to separate most of the authentics from a cluster of impostors.

The same training set used for the OR rule classifier ensemble is used here: 21 uniformly sampled authentic training images and one (the first) image from each of the 99 impostor fingers are used as the training set. To design a classifier ensemble for the AND rule, the classifiers should be designed to produce different decisions on the impostors while providing similar (correct) decisions on authentics. It is not realistic to separate all authentics from all impostors. An attempt to divide this complex problem into a set of simpler problems is made. Separating all authentics from a subset of impostors is a simpler problem. When each classifier separates all authentics from a different subset of impostors than the other classifiers, fusion with the AND decision rule would reduce the error. However, separating all authentics from different subsets of impostors for each classifier is not easy. Hence, classifier ensemble generation for the AND rule is tougher than OR rule ensemble generation. We propose to use all the authentic training images and a different subset of impostor training images for each classifier in order to obtain favorable diversity in the decisions for the AND rule. The training algorithm for the correlation filters is outlined below. The impostor training images are divided into three subsets in the following way.

1. Pick one impostor image, say the 1st image of the impostor training set, to build a filter along with the entire authentic training set.

2. Build a filter and cross-correlate the filter with the rest of the impostor training images.
3. Pick the impostor image with the highest PSR among the current filter(s) to build the next filter (with that impostor image and the entire authentic training set).
4. Cross-correlate the rest of the impostor training set with all the current filters. For each image, store the minimum PSR across the different filters (in order to compare between different images in step 3).
5. Repeat step 3 till the required number of filters have been built.
6. The remaining impostor images are used to update the filter that most easily rejects them (the filter for which the minimum PSR is obtained).

The test is done on the same test set used for the OR rule classifier ensemble. 260 authentic images and 1980 impostor images (20 randomly sampled images from each of the 99 impostor fingers) are tested for each filter. Figure 5.34 shows the ROCs of each of the UOTF filters in the ensemble. We notice that each of the filters has similar performance. The average EER of the 3 individual classifiers is 2.82%. Table 5.3 shows the pair-wise authentic and impostor correlation coefficients between the three UOTF PSRs. The authentic and impostor correlation coefficients are positive and close to 1. This means the filters are providing similar decisions on authentic as well as impostor images, and hence their fusion with any decision fusion rule would not be very effective. The AND decision fusion EER is 2.71%, which is only slightly smaller than the average individual classifier EER.

Table 5.3: Correlation coefficients between pair-wise classifier scores (classifier pairs (1,2), (2,3), (1,3); authentic ρ and impostor ρ). The ensemble is designed for the AND rule using UOTF filters on the NIST 24 plastic distortion database.

Figure 5.34 shows the ROC for the AND rule applied to the three UOTF filter ensemble. An exhaustive search on the three-dimensional PSR threshold space is done to find the best set of thresholds for the AND rule on the three UOTF PSRs. We observe that the AND rule ROC has performance similar to each of the three filters in the ensemble.
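The brute-force search over the three-dimensional threshold space can be illustrated as follows for the AND rule. This is a simple sketch: candidate thresholds are taken from the observed scores, and the multi-resolution refinement that the thesis uses (Chapter 3) to reduce the computation is not reproduced.

import itertools
import numpy as np

def and_rule_roc_point(auth_scores, imp_scores, target_far):
    """Brute-force search for the threshold triple minimizing FRR at a given FAR.

    auth_scores, imp_scores : arrays of shape (num_images, 3) holding the
    three classifier scores (e.g., PSRs) for authentic and impostor images.
    """
    auth = np.asarray(auth_scores)
    imp = np.asarray(imp_scores)
    candidates = [np.unique(np.concatenate([auth[:, k], imp[:, k]])) for k in range(3)]
    best = None
    # Exhaustive product over candidate thresholds; in practice a coarse,
    # multi-resolution grid keeps this tractable.
    for t in itertools.product(*candidates):
        t = np.asarray(t)
        far = np.mean(np.all(imp > t, axis=1))      # AND rule: all three accept
        if far > target_far:
            continue
        frr = np.mean(~np.all(auth > t, axis=1))
        if best is None or frr < best[0]:
            best = (frr, far, tuple(t))
    return best   # (FRR, FAR, thresholds)

Sweeping target_far over a range of values traces out the ROC for the fusion rule.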

Figure 5.34: ROCs of the classifier ensemble designed for the AND rule and the ROC of their AND fusion. An exhaustive search for the best thresholds on the PSRs of the three classifiers is used to find the best ROC. The average EER of the 3 individual classifiers is 2.82% and the AND decision fusion EER is 2.71%.

The results are not surprising because the UOTF filters are not good base classifiers for the AND rule on this database. Even if impostor images are not included in the training set, the filters reject most of the impostors correctly. Including impostor images does not provide much benefit. Hence, changing the impostor training set for the UOTF filter would not affect the filter outputs. For this reason, all the UOTF filters make similar decisions and hence their fusion is not useful. A base classifier that is affected by a change in the impostor training set is needed for AND rule ensemble design. UOTF filters do not satisfy this criterion and hence will not be used for AND rule ensemble design for other databases either.

AND rule ensemble design with Support Vector Machines

A two-class Support Vector Machine (SVM) is a base classifier that uses impostor training data. Due to this fact, we will now design a classifier ensemble for the AND rule on the NIST 24 plastic distortion set using the SVM as a base classifier. The training and test sets for the NIST 24 database remain unchanged from the previous subsection. For the proposed ensemble design, the cosine distance metric is used for dividing the impostor training data into different subsets. For each finger, the k-means clustering algorithm is used to obtain three clusters from the impostor training set consisting of 99 images.

The initial cluster centroids are chosen at random. During the k-means clustering iterations, if one of the clusters loses all its members, it is removed. The k-means clustering is repeated five times with a different set of initial cluster centroids in order to obtain a good set of clusters. One impostor cluster and the entire authentic training set (consisting of 20 authentic images) are used to train an SVM classifier. Three SVM classifiers, one for each impostor cluster, form the AND rule ensemble.

The proposed AND rule ensemble design is compared to Bagging. For Bagging, a random subset of authentic training images and a random subset of impostor training images are used to train an SVM classifier. Three such SVM classifiers form the ensemble for Bagging. As explained earlier, Adaboost is not feasible for the NIST 24 plastic distortion dataset even for the SVM base classifier. This is because there are no training errors made by an SVM classifier that uses all training images. Hence no further classifiers are made in Adaboost.

The ROCs of the 15 monotonic decision rules applied to the three SVM classifiers in the proposed And rule ensemble are shown in Figure 5.35. The ROCs are obtained by finding a set of optimal thresholds on the three classifier scores for each decision fusion rule. The optimal sets of thresholds are obtained through an exhaustive search over the 3D threshold space. It is observed that the And rule is the best rule, with an EER of 1.2%. The performance of the monotonic fusion rules on the Bagging classifier ensemble is displayed in Figure 5.36. Here, the ROCs of many of the decision fusion rules are comparable. The EERs of the Or rule and the Majority rule are 2.4% and 2.6%, respectively, and hence are comparable. The proposed And rule classifier ensemble improves the EER over Bagging by about 1%.

The diversity in the classifier decisions is measured using the pair-wise classifier Q values. The Q values are given in Figure 5.37 and Figure 5.38 for the proposed And rule ensemble and the Bagging ensemble, respectively. The authentic and impostor pair-wise classifier Q values are shown as a function of the optimal set of thresholds for the And rule. The authentic and impostor Q values are positive in both figures. The Bagging ensemble has authentic and impostor Q values close to 1, which implies similar decisions by the classifiers and low diversity. The impostor Q values of the proposed ensemble are lower and close to 0.5, which signifies more diversity on the impostor decisions than the Bagging ensemble. Negative impostor Q values and positive authentic Q values are favorable for the And rule. While this target has not been reached, more diversity than the Bagging classifier ensemble has been achieved by our proposed design.
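A minimal sketch of the clustering-based AND rule SVM ensemble described above is given below, using scikit-learn. Two assumptions are worth flagging: scikit-learn's KMeans supports only Euclidean distance, so the impostor images are unit-normalized first, which makes Euclidean clustering behave like clustering on the cosine distance; and the removal of empty clusters and the exact SVM settings used in the thesis are not reproduced.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def and_rule_svm_ensemble(authentic, impostor, n_clusters=3, seed=0):
    """Train one linear SVM per impostor cluster (plus all authentics).

    authentic, impostor : arrays of shape (num_images, num_pixels) holding the
    vectorized training images.
    """
    # Unit-normalize so that Euclidean k-means approximates cosine-distance clustering.
    imp_unit = impostor / np.linalg.norm(impostor, axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init=5, random_state=seed).fit_predict(imp_unit)

    ensemble = []
    for c in range(n_clusters):
        cluster = impostor[labels == c]
        X = np.vstack([authentic, cluster])
        y = np.concatenate([np.ones(len(authentic)), np.zeros(len(cluster))])
        ensemble.append(LinearSVC(C=1.0, max_iter=10000).fit(X, y))
    return ensemble

def and_rule_fuse(ensemble, x, thresholds):
    """AND rule: authentic only if every SVM's decision score exceeds its threshold."""
    scores = [clf.decision_function(x.reshape(1, -1))[0] for clf in ensemble]
    return all(s > t for s, t in zip(scores, thresholds))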

Figure 5.35: Performance of the proposed And rule ensemble with SVM classifiers on the NIST 24 plastic distortion database. ROCs of 15 monotonic decision fusion rules.

Figure 5.36: Performance of Bagging with SVM classifiers on the NIST 24 plastic distortion database. ROCs of 15 monotonic decision fusion rules.

Table 5.4: Pair-wise score correlation coefficients of the proposed AND rule SVM ensemble (columns: ρ(1, 2), ρ(2, 3), ρ(3, 1), average ρ; rows: authentic and impostor).

The proposed ensemble generation used k-means clustering with the Cosine distance metric for obtaining impostor training subsets. The Cosine distance metric does not capture the classification strategy of the SVM and is hence a poor metric for clustering images. Due to the difficulty in obtaining good clusters of impostors, it is tough to obtain very diverse decisions on impostors.

Figure 5.37: Pair-wise classifier Q values of the proposed SVM And rule ensemble on the NIST 24 plastic distortion database. These are shown as a function of the best threshold set for the And rule. Dashed (solid) lines are impostor (authentic) Q values.

An alternative way to measure diversity is to use correlation coefficients of scores. The authentic and impostor pair-wise score correlation coefficients for the proposed AND rule SVM ensemble are given in Table 5.4. This ensemble has more pair-wise classifier diversity than the Bagging ensemble, the diversity of which is shown in Table 5.5.

Figure 5.38: Pair-wise Q values of the Bagging ensemble of SVM classifiers generated on the NIST 24 plastic distortion database. An optimal set of thresholds is selected for the And rule, for which the Q values are shown. Dashed (solid) lines are impostor (authentic) Q values.

Table 5.5: Pair-wise score correlation coefficients of the Bagging SVM ensemble (columns: ρ(1, 2), ρ(2, 3), ρ(3, 1), average ρ; rows: authentic and impostor).

In the next subsection, we demonstrate examples of ensemble design for the AND rule on the AR face database [112] with a few different base classifiers.

AR database evaluation

OVERVIEW: Ensemble generation favorable to AND decision fusion is evaluated on the AR face database. Among the few base classifiers evaluated, the Linear Discriminant classifier is the best. A two classifier ensemble based on male-female partitioning of the impostor training set has a reasonable improvement over the Bagging classifier ensemble. The FAR/FRR for the proposed ensemble is 3.3% / 2.4%, as compared to 4.3% / 3.8% for the Bagging ensemble. Improved diversity in impostor decisions is obtained when each classifier in the ensemble is trained on just one impostor class. Even with poor individual classifier accuracy, the EER of this 94 classifier ensemble is lowered to 0.77%, as compared to the Bagging EER of 3.77% with a 94 classifier ensemble. However, the AND rule fusion is also the best rule for the Bagging ensemble, leading to a comparable EER of 0.81%. This is probably due to the large number of classifiers in the Bagging ensemble.

The AR face database [112] is used here for performance evaluation. It contains color images of expression, illumination and occlusion variations taken at two sessions separated by two weeks. There is some slight pose variation also present in the images. Registered and cropped grayscale images of 95 people are used for evaluation here because of missing data for some of the people. Performance on 20 images of expression, illumination and scarf occlusion per class is evaluated here, since the registration of sunglass images is difficult. Sample images from the AR database are shown in Figure 5.39. Two classifiers are designed and their decisions are fused with the AND rule. One common way to design an ensemble is Bagging [8]. Bootstrap [79] samples (randomly chosen subsets) of the training set are used to train the different classifiers in the ensemble. A random subset of the authentic training data and a random subset of the impostor training data are generated by random sampling with replacement on the authentic and impostor training sets, respectively. We compare this method to a more informed way of generating the two classifiers for the AND rule based on the knowledge obtained from Chapter 4.

Figure 5.39: Sample images of the variations present in the AR database.

The AND fusion rule is beneficial in general when there is a high FAR at low FRR. The FAR on AND fusion is lower than the FAR of the individual classifiers. The AND rule FRR is in general higher than the individual classifier FRRs. Assuming that the AND decision fusion FRR is acceptable at low individual classifier FRRs, we hope that the FAR is lowered by the AND fusion to an acceptable value. In the most favorable conditional dependence for the AND rule, the classifiers partition the impostor space (or the correct classifier decisions are disjoint) and all the classifiers make correct decisions on the authentics. In order to design classifiers with diverse decisions on impostors, we need to use impostor data for training the classifier. Three base classifiers, Linear Discriminants [115], Support Vector Machines (SVM) [51] and Distance Classifier Correlation Filters (DCCF) [123], all of which use authentic and impostor data for training, are evaluated here. We use three different clustering techniques on impostor data to design two classifiers for the AND fusion rule. K-means clustering is used to obtain two clusters of impostors from the impostor training set based on two distance metrics, the Euclidean distance ||x − y||_2

and the cosine distance 1 − x^T y / (||x||_2 ||y||_2), where ||x||_2 represents the L2 norm of the vector x. A third set of two classifiers is based on dividing the impostors into male and female impostor clusters, because the male and female means are different. Each impostor cluster along with the entire authentic training set is used to train a classifier. These methods of partitioning the training set are compared to bootstrapping.

The Fisher Discriminant [41] finds a projection direction where the distance between the two class means is increased while the scatter of the two classes is reduced. When the number of training images is smaller than the dimensionality of the images, the within-class scatter matrix used in finding the projection direction is singular and classical Linear Discriminant Analysis (LDA) fails. To avoid this, a Gram-Schmidt (GS) orthogonalization based approach for LDA proposed in [115] is used here. While implementation details can be found in [115], this method finds the projection direction that is in the null space of the within-class scatter matrix and in the non-null space of the scatter matrix. The dimension of the projection vector is the same as the image dimension. By setting a threshold on the projection of the test image, a decision is made. The Support Vector Machine (SVM) [51] finds a separating hyperplane that maximizes the separation between the support vectors, i.e., the closest training points from the two classes. When the data are non-separable, a penalty paid by the points that are misclassified is taken into account while finding the hyperplane. A linear SVM is used here as a base classifier. The distance of the test image to the separating hyperplane is computed and a decision is made by setting a threshold on this distance. The Distance Classifier Correlation Filter [123] finds a transformation that increases the Euclidean distance between the two class means while reducing the scatter of the two classes. This is different from the Fisher Discriminant since the transformation does not change the dimension of the images, whereas in the Fisher Discriminant, the projection reduces the dimension. Moreover, the scatter matrix in DCCF is diagonal and of full rank, whereas in Fisher LDA, the rank is less than the number of training images. More details of the DCCF are in [123]. A decision is made by setting a threshold on the difference of the Euclidean distances of the test image to the two class means in the transformed domain.

Three images from each person (images 1, 4 and 5, corresponding to two expression variations (neutral and scream) and one lighting variation (left lighting) from the first session) are used for

184 Table 5.6: Performance (with 95% confidence intervals) of single classifiers on the AR database. EER(%) LDA 4.4±0.92 SVM 10.8±1.39 DCCF 12.3±1.48 Table 5.7: Performance (with 95% confidence intervals) of two classifier LDA fused with the AND rule LDA EER 1 EER 2 FAR AND(1,2) FRR AND(1,2) Bootstrapping 7.4± ± ± ±.86 Male-Female clusters 5.5± ± ± ±0.68 Euclidean dist clusters 6.2± ± ± ±0.81 Cosine dist clusters 6.2± ± ± ±0.76 training. By using training data from all 95 classes, the authentic training set per class consists of 3 images and the impostor training set per class consists of 94*3 images. The test set consists of all 20 images of all classes, i.e., 20 authentic images and 20*94 impostor images per class. When a single classifier (per class) is built using all training images, the average performance of Fisher Discriminant [107], SVM and DCCF on all classes are given in Table 5.6. We hope to reduce the error by designing two classifiers from the training set. The rationale is that a single classifier may not be able to handle all possible variations in the training set, but multiple classifiers built on smaller subsets may be better able to handle the reduced variations and their fusion would lead to an overall improvement in performance. Tables 5.7, 5.8, and 5.9 compare the performance of the different clustering methods to bootstrapping using LDA, SVM and DCCF as the base classifiers, respectively. The thresholds were tuned for each person as opposed to global thresholds on all people. The thresholds on the outputs of the two classifiers were chosen so that the FAR is equal or close to FRR for the AND rule. Table 5.8: Performance (with 95% confidence intervals) of two classifier SVMs fused with the AND rule. SVM EER 1 EER 2 FAR AND(1,2) FRR AND(1,2) Bootstrapping 20.0± ± ± ±1.54 Male-Female clusters 11.7± ± ± ±1.29 Euclidean dist clusters 12.8± ± ± ±1.30 Cosine dist clusters 11.6± ± ± ±

Table 5.9: Performance (with 95% confidence intervals) of two classifier DCCFs fused with the AND rule.
DCCF                       EER 1     EER 2     FAR AND(1,2)     FRR AND(1,2)
Bootstrapping              12.6±     ±         ±                ±1.33
Male-Female clusters       16.5±     ±         ±                ±1.22
Euclidean dist clusters    14.7±     ±         ±                ±1.31
Cosine dist clusters       17.3±     ±         ±                ±1.26

It can be seen that two classifier AND rule fusion generally improves performance over a single classifier that uses all training images. The tables also display the theoretical values for the AND rule FAR and FRR if the two classifiers were independent. These are simply used to check whether the designed classifiers have the desired conditional dependence statistics as given in Chapter 3 and do not represent a practical design of classifiers. From these tables, it is observed that for every classifier set there is positive dependence between the two classifiers on authentics, which is favorable for the AND rule. However, there is also positive dependence between the two classifiers on impostors, which is unfavorable for the AND rule. Nevertheless, it can be seen that the proposed methods of designing classifiers by partitioning the impostor training set and utilizing the entire authentic training set have lower error rates than bootstrapping, where the classifiers are not designed for the AND rule. For the DCCF based classifiers, all three clustering methods have comparable performance. For the SVM and LDA based classifiers, more conclusive inferences can be made. Clustering based on male and female impostors has the best performance for LDA and SVM. Euclidean distance and Cosine distance were not useful as distance metrics for the sake of clustering. The distance metric should be based on the classification strategy of the base classifier. In other words, different base classifiers should use different distance metrics for clustering. Finally, it is noted that LDA has the best performance on this database.

Multiple Clusters

An extension beyond a two classifier set is desired for the AND rule on the AR database. LDA is considered as the base classifier since it was found to be the best classifier for the AR database. Male-Female clustering of impostors in classifier set training was found to yield the best two classifier set for the AND rule on the AR database. However, Male-Female clustering does not lend itself to an extension to more than two clusters. The other extreme clustering method on impostors is to use the images of each impostor as a cluster.

Each classifier is a two-class classifier, separating the authentic class from one impostor. The number of classifiers is the number of impostors in the training set. For this dataset, this is a better choice than using images of one type of distortion, say the left illumination images, as a cluster. This is because the LDA classifier has been shown to be fairly tolerant to illumination variations and distortions. The difference in LDA classification scores between images of the same person under different illumination and expression is smaller than the difference in scores between images of different persons. However, this statement will not be valid for face pose distortion. The difference in LDA classifier scores may be larger between large pose variations of a person's face than between images of different persons at the same pose.

The same training set of 3 authentic images and 94*3 impostor images is used to create 94 LDA classifiers for each person. Images of one impostor in the training set and the entire authentic training set are used to train an LDA classifier using the Gram-Schmidt (GS) orthogonalization approach [115]. 94 LDA classifiers are formed, since there are 94 impostors in the training set. This proposed ensemble generation method is compared to Bagging with 94 LDA classifiers. The training set remains the same. A random subset of the three authentic image training set and a random subset of the 94*3 impostor image training set are chosen to form an LDA classifier. 94 such classifiers from different random subsets of training data form the classifier ensemble in Bagging. The test set is the entire AR dataset, which is the same as before. For each ensemble, each of the 94 classifier scores on each test image is thresholded to obtain 94 decisions. The AND rule is applied on these 94 decisions from the proposed AND rule classifier ensemble to obtain a global decision. Bagging uses the Majority rule on the 94 decisions obtained from the 94 LDA classifiers to obtain the final decision. To check for optimality, the performance of the OR and AND rules on these decisions is also compared.

The average EERs of the two classifier ensembles for the And, Or, and Majority rules are shown in Table 5.10. The EER for each person is found for the different decision fusion rules, and averaged over the number of persons. For the proposed ensemble, it is observed that the And rule is the best decision rule among the three major decision fusion rules. This is as expected since the ensemble was designed for the And rule. The Majority and Or rule EERs are higher than the And rule EER, since the classifier ensemble diversity is favorable to the And rule and unfavorable to the Or and Majority decision rules.

Table 5.10: Comparison between the proposed AND rule ensemble with 94 LDA classifiers and Bagging with 94 LDA classifiers (columns: And EER(%), Majority EER(%), Or EER(%); rows: Proposed, Bagging).

Since Bagging uses Majority rule fusion, the EER of Bagging with the 94 classifier ensemble, 3.77%, is significantly higher than the And rule EER of our proposed ensemble, 0.77%. It is also of significance that the Majority rule is not the best rule for the Bagging classifier ensemble. This is because, in practice, randomly generated classifiers are dependent, which violates the independence assumption in Bagging. And rule fusion of the Bagging classifier ensemble has a comparable EER to the And rule fusion of the proposed classifier ensemble.

5.3 Classifier ensemble design for the MAJORITY rule

OVERVIEW: Ensembles favorable to MAJORITY rule fusion are generated in this section. The classifiers in the ensemble are expected to make diverse decisions on both authentics and impostors. Further, the individual classifiers are expected to make a majority of correct decisions on both authentics and impostors. Hence the classifier ensemble design for the MAJORITY rule is tougher than the design for the OR and AND rules. Ensembles are generated for the NIST 24 fingerprint database. For the plastic distortion set, diversity in the authentic decisions is obtained but the diversity in the impostor decisions is insufficient for favorable conditional dependence for the MAJORITY rule. The stringent requirements for the MAJORITY rule ensemble result in poor individual classifier discrimination capability for the rotation set. Due to the extreme pose variations in the authentic images in the PIE database, desirable individual classifier accuracy for the MAJORITY rule ensemble cannot be obtained; hence ensemble generation is not done there.

We apply classifier ensemble design techniques for the majority rule on the NIST 24 fingerprint database, both on the plastic distortion dataset and the rotation dataset, in the next two subsections. For the majority decision rule to be correct, more than half the classifiers should make a correct decision. If the training data can be divided into N subsets, each classifier should be trained on a different set of (N+1)/2 subsets for an optimal coverage of the training set. This provides the most significant Majority decision rule improvement over the individual classifier and the optimal diversity for the majority rule.
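The subset-to-classifier assignment just described can be enumerated mechanically. The sketch below assumes the simple scheme of giving each classifier a distinct combination of ceil((N+1)/2) of the N training subsets; for N = 3 it reproduces the three subset pairs of Table 5.11 below.

from itertools import combinations
from math import ceil

def majority_subset_assignment(n_subsets):
    """Each classifier trains on a distinct combination of ceil((N+1)/2) of the
    N training subsets.  For N = 3 this yields the three subset pairs used in
    Table 5.11: (1, 2), (1, 3) and (2, 3), one pair per classifier."""
    k = ceil((n_subsets + 1) / 2)
    return list(combinations(range(1, n_subsets + 1), k))

print(majority_subset_assignment(3))   # [(1, 2), (1, 3), (2, 3)]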

Table 5.11 shows an example of the training subsets used in training each classifier for Majority fusion of 3 classifiers. This principle is used in training classifier ensembles for the NIST 24 fingerprint database.

Table 5.11: Training of each classifier for Majority decision fusion of a three classifier set. Each training subset is used in the training sets of 2 classifiers. This results in maximum accuracy since at least two classifiers produce a correct decision on that training subset. Each classifier is trained on a different set of two training subsets for maximum diversity and the most significant improvement over the single classifier.

               Subset 1    Subset 2    Subset 3
Classifier 1   Used        Used        Not used
Classifier 2   Not used    Used        Used
Classifier 3   Used        Not used    Used

NIST 24 plastic distortion dataset evaluation

OVERVIEW: UOTF classifiers for the MAJORITY rule ensemble are trained on a majority of the authentic class plastic distortions in the training set. For the three-classifier ensemble, each set of authentic plastic distortion is trained on by two classifiers. A similar division of the impostor training set is also done. The desirable negative conditional dependence on authentics is achieved for two of the three classifier pairs. Better impostor decision diversity than the OR rule ensemble is achieved. However, since the UOTF classifier is not affected much by impostor training, the favorable negative conditional dependence on impostors is not obtained. The accuracy of this ensemble is good, with an EER of 0.5% for the best monotonic decision fusion rule for this ensemble. This accuracy is higher than that of the Bagging classifier ensemble.

A three classifier ensemble is designed for the majority rule on the NIST 24 plastic distortion dataset. A few different training methods are used, which are described below. The UOTF classifier [114] is used here. The training is done on the authentic training images, since that is most effective for the UOTF filter. The impostor training images hardly affect the UOTF outputs and hence are not used. The authentic training set is composed of twenty uniformly sampled images, i.e., every 15th image of the 300 authentic images, starting from the 1st image.

Three training subsets based on plastic distortion, the same subsets used in designing classifiers for the OR rule, are taken. In the given authentic training set, the three images that are most different from each other are found first. The rest of the images in the training set are grouped with the subset that they are most similar to. The metric used for measuring similarity is the PSR, which is also used in performance evaluation for correlation filters. Two authentic training subsets are used to train each UOTF classifier, following the assignment described in Table 5.11. The authentic images that are not used in training are used for testing. There are 300 − 20 = 280 authentic test images. For the impostor test, twenty randomly sampled images from each of the 99 impostor fingers are used. This is a representative test for impostors, since the UOTF filter has a low PSR for impostor images and thus rejects them effectively. There are 99 × 20 = 1980 impostor test images for each finger.

Each of the three UOTF classifiers is applied to a test image, and the PSR of the correlation filter output is evaluated. A decision is made for each classifier by thresholding the PSR value. The three decisions are fused using a decision rule to get a global decision for the test image. The threshold applied to the PSR can be different for each classifier. For a given FAR of the decision fusion rule, there is an optimal three-classifier threshold set, namely the one with the minimum FRR of the decision fusion rule at that FAR. The optimal threshold set is found through an exhaustive search of the 3D space of thresholds. The computation for the search is reduced by using a multi-resolution method, as described in Chapter 3. The individual classifier ROCs of the proposed ensemble for the Majority rule are displayed in Figure 5.40. The EERs of the individual classifiers are 6.1%, 9.8% and 6.8%.
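The threshold search described above can be sketched as follows. This is an illustrative Python sketch, not the code used in the thesis: a single coarse grid per classifier replaces the multi-resolution refinement of Chapter 3, and the function and variable names are hypothetical.

```python
import itertools
import numpy as np

def far_frr(auth_scores, imp_scores, thresholds, fuse):
    """FAR and FRR of a fused decision rule for one set of per-classifier PSR thresholds.
    auth_scores, imp_scores: arrays of shape (num_images, num_classifiers)."""
    auth_dec = auth_scores >= thresholds            # boolean accept decisions
    imp_dec = imp_scores >= thresholds
    frr = 1.0 - float(np.mean([fuse(d) for d in auth_dec]))
    far = float(np.mean([fuse(d) for d in imp_dec]))
    return far, frr

def best_thresholds(auth_scores, imp_scores, fuse, target_far, grid_size=20):
    """Coarse exhaustive search of the 3D threshold space for the threshold set with
    minimum FRR subject to FAR <= target_far (multi-resolution refinement omitted)."""
    all_scores = np.vstack([auth_scores, imp_scores])
    grids = [np.linspace(col.min(), col.max(), grid_size) for col in all_scores.T]
    best_thr, best_frr = None, 1.0
    for thr in itertools.product(*grids):
        far, frr = far_frr(auth_scores, imp_scores, np.array(thr), fuse)
        if far <= target_far and frr < best_frr:
            best_thr, best_frr = thr, frr
    return best_thr, best_frr

# Majority fusion of the three per-classifier decisions.
majority = lambda d: int(np.sum(d) > d.size / 2)
```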

Figure 5.40: Individual classifier ROCs (P_FA versus P_FR) of the proposed Majority rule ensemble on the NIST 24 plastic distortion set.

Figure 5.41: PSRs of a sample finger when trained with every 15th authentic training image, starting from the 1st, divided into the same three groups as used in the OR rule design.
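Since the PSR plotted in Figure 5.41 is the match score thresholded by every classifier in this chapter, a short sketch of how a PSR can be computed from a correlation output is given below. The sidelobe-region sizes (a 41 × 41 window centred on the peak, excluding an 11 × 11 central mask) are common choices assumed here for illustration, not values taken from the thesis.

```python
import numpy as np

def psr(corr_plane, sidelobe=20, mask=5):
    """Peak-to-sidelobe ratio of a correlation output:
    (peak - mean(sidelobe region)) / std(sidelobe region).
    The sidelobe region is a (2*sidelobe+1)^2 window centred on the peak,
    excluding a (2*mask+1)^2 central area around the peak."""
    corr_plane = np.asarray(corr_plane, dtype=float)
    h, w = corr_plane.shape
    py, px = np.unravel_index(np.argmax(corr_plane), corr_plane.shape)
    peak = corr_plane[py, px]
    y0, y1 = max(py - sidelobe, 0), min(py + sidelobe + 1, h)
    x0, x1 = max(px - sidelobe, 0), min(px + sidelobe + 1, w)
    region = corr_plane[y0:y1, x0:x1]
    # Exclude the central mask around the peak from the sidelobe statistics.
    keep = np.ones(region.shape, dtype=bool)
    keep[max(py - mask, 0) - y0:min(py + mask + 1, h) - y0,
         max(px - mask, 0) - x0:min(px + mask + 1, w) - x0] = False
    side = region[keep]
    return (peak - side.mean()) / side.std()
```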

Figure 5.41 shows the PSRs for the three classifiers on a sample finger (finger 7). It is observed that two of the three classifiers have a high PSR on most of the authentic images, which results in the desired authentic conditional dependence for the Majority rule. However, most of the impostors are rejected by all classifiers. While this is a desirable trait, it does not provide negative diversity on impostor decisions. Hence the optimal diversity for the Majority rule is not obtained here. The correlation coefficients between pair-wise classifier scores are given in Table 5.12. The authentic score correlation coefficients are negative between classifiers 1 and 2, as well as between classifiers 2 and 3. This is favorable for the majority rule. However, there is positive correlation between classifiers 1 and 3. The decisions made by classifiers 1 and 3 are therefore not sufficiently different for an optimal ensemble. The impostor correlation coefficients are all positive. For the Majority rule, all pair-wise correlation coefficients should be negative for authentics as well as for impostors. The conditional dependence on authentics is promising for the majority rule. However, the desired conditional dependence on impostors is not achieved for the Majority rule. The UOTF filters reject impostors well; because of this, the impostor correlation coefficients between all classifiers will be positive. For these reasons, the majority rule will not be the best decision fusion rule. Since the authentic correlation coefficients are not negative for all pairs of classifiers, OR fusion of all three classifiers will not be the best rule either. The ROCs for all monotonic three-classifier fusion rules are shown in Figure 5.42. The best decision fusion rule is a combination of OR/AND fusion among the classifiers: classifiers 2 and 3 are combined by the OR rule, and the result is then combined with classifier 1 by the AND rule. This result is reasonable. Since classifiers 2 and 3 have negative dependence on authentics and positive dependence on impostors, OR fusion is best for them (Or_23). The authentic conditional dependence between Or_23 and classifier 1 will be positive, because of the positive correlation coefficient between classifiers 1 and 3. Hence Or_1,Or_23 = Or_123 will not be the best rule. The only other monotonic rule is And_1,Or_23, which turns out to be the best decision fusion rule.

Table 5.12: Pair-wise correlation coefficients of UOTF filter PSRs, listing the authentic ρ and impostor ρ for the classifier pairs (1,2), (2,3) and (1,3).
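The composite monotonic rules discussed above are simple Boolean combinations of the three per-classifier decisions. A small illustrative sketch follows, using the notation of the text; this is not code from the thesis.

```python
import numpy as np

# d is a length-3 sequence of binary decisions (1 = accept) from classifiers 1, 2, 3.
AND_123      = lambda d: int(d[0] and d[1] and d[2])
OR_123       = lambda d: int(d[0] or d[1] or d[2])
MAJORITY_123 = lambda d: int(sum(d) >= 2)
# Best rule found for this ensemble: OR of classifiers 2 and 3, then AND with classifier 1.
AND_1_OR_23  = lambda d: int(d[0] and (d[1] or d[2]))

for rule in (AND_123, OR_123, MAJORITY_123, AND_1_OR_23):
    print([rule(d) for d in ([1, 0, 1], [0, 1, 1], [1, 0, 0])])
```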

Figure 5.42: Test results on the ensemble designed for the Majority rule on the NIST 24 plastic distortion set. ROCs (P_FA versus P_FR) of all monotonic three-classifier decision rules are shown, e.g., And_123: And fusion of all three classifiers; And_1,Or_23: Or fusion of classifiers 2 and 3 is done first, and this result is then fused with classifier 1 by the And rule.

NIST 24 rotation dataset evaluation

OVERVIEW: OTCHF classifiers for the MAJORITY rule ensemble are trained for rotation tolerance of authentics over a large angle range. The test set rotation angle range is divided into bins. The idea here is to have each classifier make correct authentic decisions for a majority of the rotation angle bins. However, these constraints result in poor discrimination capability and hence poor individual classifier accuracy.

A classifier ensemble is designed for the majority rule on the NIST 24 fingerprint rotation dataset. The optimal tradeoff circular harmonic function (OTCHF) [121] filters are used here, as they were in the classifier ensemble for the OR rule on this dataset; the ensemble design procedure is modified to suit the majority rule. If the training data can be divided into N subsets, each classifier should be trained on a different set of (N+1)/2 subsets for the optimal diversity for the majority rule, the maximal accuracy of the classifier ensemble under the majority rule, and the most significant improvement over the accuracy of the individual classifier. In this section, we divide the rotation range of −50° to 50°, in which the fingerprint images lie, into N bins, and each of the N classifiers is designed to produce a peak for authentic images in a different set of (N+1)/2 rotation range bins. The required rotation range for a single classifier is therefore (100°/N) · (N+1)/2, which for odd values of N equals 50°(N+1)/N, slightly more than 50°. It is to be noted that an OTCHF filter designed to produce a peak over a smaller rotation range is more discriminative. Increasing the number of classifiers, N, reduces the required rotation range for a single classifier. Hence, we use a larger set of classifiers here to reduce the constraints on a single OTCHF filter and thereby increase its discriminative capability. We design a classifier ensemble with 9 classifiers for the majority rule here. Table 5.13 shows the rotation ranges for which each of the nine OTCHF filters is trained to produce a peak on authentic images; a short sketch of this bin assignment follows the table.

Table 5.13: The desired rotation tolerance range (in degrees) for each of the nine classifiers used in Majority fusion.

Classifier 1: [−50, 5]                    Classifier 2: [−39, 16]                   Classifier 3: [−28, 27]
Classifier 4: [−17, 38]                   Classifier 5: [−6, 49]                    Classifier 6: [−50, −38] and [5, 55]
Classifier 7: [−50, −27] and [16, 55]     Classifier 8: [−50, −16] and [27, 55]     Classifier 9: [−50, −7] and [38, 55]
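The bin assignment behind Table 5.13 can be approximated as follows. This is an idealized sketch assuming N equal-width bins over [−50°, 50°] and circular assignment of (N+1)/2 consecutive bins per classifier; the exact endpoints used in Table 5.13 (for example the wrap to 55°) differ slightly, so the sketch is illustrative only and the names are hypothetical.

```python
import numpy as np

def rotation_bin_ranges(n_classifiers, lo=-50.0, hi=50.0):
    """Idealized assignment of (N+1)/2 consecutive rotation bins (with wraparound)
    to each of N equal-width bins over the range [lo, hi] degrees."""
    edges = np.linspace(lo, hi, n_classifiers + 1)
    bins = list(zip(edges[:-1], edges[1:]))          # N equal-width bins
    per_classifier = (n_classifiers + 1) // 2
    return [[bins[(c + k) % n_classifiers] for k in range(per_classifier)]
            for c in range(n_classifiers)]

# Consecutive bins per classifier can be merged into ranges similar to Table 5.13.
for i, covered in enumerate(rotation_bin_ranges(9), start=1):
    span = ", ".join(f"[{a:.1f}, {b:.1f}]" for a, b in covered)
    print(f"Classifier {i}: {span}")
```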

We use an OTCHF classifier with a noise tolerance parameter α = 10^−6, an average correlation energy (ACE) minimization parameter β ≈ 1, and a zero average (dis)similarity metric (ASM) minimization parameter γ = 0. The same eight authentic training images per finger used in the ensemble design for the OR rule are also used here. These images are approximately de-rotated to zero degree angle (upright image) at the original image size, and are then downsampled. No impostor images were used in training. The training and testing are done on the downsampled images. The test images are not de-rotated. All the images from all fingers in the rotation dataset are used for testing. Evaluation is done on all images of the 100 fingers for each of the filters, including the training images, because they were de-rotated before computing the filter. There were approximately 15,000 joint authentic matches and 1.5 million joint impostor matches in total from all fingers.

Figure 5.43 shows the authentic and impostor PSRs for a sample finger (finger 7) with the nine classifier ensemble designed for the majority rule with the eight authentic training images. We can see the relatively high authentic PSRs over the rotation range that each classifier was designed for. However, we note that there is only a small difference (or large overlap) between the values of the authentic and impostor PSRs. We check whether the difference between authentic and impostor PSRs increases with more authentic training images. Figures 5.44, 5.45 and 5.46 show the authentic and some impostor PSRs for the same sample finger (finger 7) with eleven, thirteen and sixteen authentic training images, respectively. From these figures, we note the trend that some of the authentic image PSR values reduce while the impostor image PSRs increase as more training images are added, reducing the difference between authentic and impostor PSR values.

This may be because adding more training images increases the constraints on the OTCHF filter, which reduces its discriminating capability.

Figure 5.43: PSRs of authentic test images of a sample finger (finger 7) and some impostor test images for the nine classifier ensemble designed for the Majority rule. Eight training images are used.

The OTCHF filter has a trade-off between distortion tolerance and discrimination. The rotation range for which each OTCHF filter is designed to produce an authentic peak is greater than 50°, which imposes too much distortion tolerance constraint on the OTCHF filter and reduces its discrimination capability. Adding more training images only increases the distortion tolerance constraints, which further reduces the discrimination capability. This ensemble design for the majority rule using the OTCHF filter has therefore been unsuccessful in providing sufficient discrimination. Designing the OTCHF filter at the original image size increases the number of variables (the number of frequencies in the filter), which may help increase the discrimination capability.
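The separation between authentic and impostor PSRs discussed above can be quantified, for example, by the single-classifier EER computed from the two sets of scores. A minimal sketch follows, using a hypothetical helper that approximates the EER as the smallest max(FAR, FRR) over candidate thresholds; it is not code from the thesis.

```python
import numpy as np

def eer(auth_scores, imp_scores):
    """Approximate equal error rate of a single-threshold classifier on PSR scores:
    the smallest max(FAR, FRR) over all candidate thresholds."""
    auth = np.asarray(auth_scores, dtype=float)
    imp = np.asarray(imp_scores, dtype=float)
    best = 1.0
    for t in np.unique(np.concatenate([auth, imp])):
        frr = np.mean(auth < t)    # authentics rejected below the threshold
        far = np.mean(imp >= t)    # impostors accepted at or above the threshold
        best = min(best, max(far, frr))
    return best

# Well-separated scores give a low EER; heavily overlapping scores do not.
print(eer([12.0, 15.0, 18.0, 20.0], [4.0, 5.0, 6.0, 7.0]))   # 0.0
print(eer([8.0, 9.0, 10.0, 11.0], [7.5, 9.5, 10.5, 12.0]))   # 0.5
```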

Figure 5.44: Eleven authentic training images are used to design the nine classifier ensemble for the majority rule for a sample finger (finger 7). PSRs of authentic test images and some impostor test images for the nine classifier ensemble are shown.

Figure 5.45: Thirteen authentic training images are used to design the nine classifier ensemble for the majority rule for a sample finger (finger 7). PSRs of authentic test images and some impostor test images for the nine classifier ensemble are shown.

Figure 5.46: Sixteen authentic training images are used to design the nine classifier ensemble for the majority rule for a sample finger (finger 7). PSRs of authentic test images and some impostor test images for the nine classifier ensemble are shown.
