Biometric scores fusion based on total error rate minimization


Pattern Recognition 41 (2008)

Biometric scores fusion based on total error rate minimization

Kar-Ann Toh, Jaihie Kim, Sangyoun Lee

Biometrics Engineering Research Center, School of Electrical & Electronic Engineering, Yonsei University, 134 Shinchon-dong, Seodaemun-gu, Seoul, 120-749, Korea

Received 3 October 2006; received in revised form 23 June 2007; accepted 25 July 2007

Abstract

This paper addresses the biometric scores fusion problem from the error rate minimization point of view. In contrast to the conventional approach, which treats fusion classifier design and performance evaluation as a two-stage process, this work directly optimizes the target performance with respect to the fusion classifier design. Based on a smooth approximation to the total error rate of identity verification, a deterministic solution is proposed to solve the fusion optimization problem. The proposed method is applied to a face and iris verification fusion problem, addressing the demand for high security in the modern networked society. Our empirical evaluations show promising potential in terms of decision accuracy and computing efficiency. © 2007 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Multimodal biometrics; Decision fusion; Equal error rate; Pattern classification; Machine learning

1. Introduction

1.1. Background

Owing to the identity-based nature of authentication, biometrics has gained much attention over recent years, particularly for its potential role in information and forensic security. However, many problems remain to be resolved before biometrics can achieve pervasive application. For instance, due to inherent limitations as well as external sensing factors, no single biometric method can by itself warrant 100% authentication accuracy or universality of usage. Since combining multiple biometric methods can alleviate many of these problems, multimodal biometrics has become a focused field of research.
Existing means to combine or fuse multiple biometric modalities can operate either before matching or after matching. For fusion before matching, two levels can be identified, namely the sensor level and the feature level (see e.g. Refs. [1,2]). For fusion after matching, three levels can be identified, namely the abstract level (see e.g. Ref. [3]), the rank level and the match score level (see e.g. Ref. [4]). Concerning the central module of fusion, either non-training-based or training-based methods can be adopted. Non-training-based methods often assume that the outputs of individual biometric classifiers are the probabilities that the input pattern belongs to a certain class label (see e.g. Refs. [5–8]). Training-based methods do not require this assumption and can operate directly on the match scores generated by biometric verification modules (see e.g. Refs. [3,9–11]). Our work here belongs to the training-based approach working at the match score level.

Since the outcome of biometric verification consists of only two labels, i.e. the query identity is recognized as either a genuine-user or an imposter, the verification process can be treated as a two-category classification problem. This classification treatment holds well for multimodal scores fusion because similar decision labels are anticipated.

(Corresponding author: K.-A. Toh; e-mail: katoh@yonsei.ac.kr.)

1.2. Motivation

Apart from the receiver operating characteristic (ROC) curves, the false acceptance rate (FAR), the false rejection rate (FRR), and the equal error rate (EER) have been used

extensively for comparison of biometric verification performance. These error rates have their own reasons for being widely used: (i) each is a single-index measure and thus simple and direct to interpret as compared with the ROC; (ii) the EER is a compact term indicating both the FAR and the FRR at the same time; and, more importantly, (iii) the EER is based on a projected optimal operating point (of the total error rate, TER) where the FAR curve meets the FRR curve.

The error rate is a percentage count of misclassified samples, and this makes it difficult to analyze directly without imposing strong assumptions regarding the data distribution. Except for Poh and Bengio [12], who approached the problem from the theoretical EER point of view based on a Gaussian assumption, there has been a lack of literature from the biometric community solving, or even acknowledging, this problem. A common practice in the multimodal biometrics community is to perform decision fusion and performance evaluation separately (see all cited references in the previous subsection). For instance, the fusion module is first designed using a certain distance criterion (e.g. the least squares error, LSE) and then the performance is evaluated using the FAR, FRR or EER. Although a certain correlation exists between the learning distance (e.g. LSE) and the decision error rates (percentage counts of incorrectly classified samples), in practice the FAR and FRR outcomes are frequently found to behave rather differently from the optimally learned fusion classifier (optimal with respect to the LSE). This is due to a mismatch between the learning objective (LSE) and the authentication objectives (FAR, FRR or EER). Support vector machines (SVMs) [13,14] have largely advanced the state of decision boundary design. However, they offer no direct clue regarding the error rates without going through the error counting process.
In view of the above problem, we present an attempt to approximate the error counting of FAR and FRR and then optimize the approximated total error rate (TER, which is equal to FAR + FRR) directly with respect to fusion classifier design. Based on extensive experiments on fusing several biometrics, we shall observe the empirical behavior of this formulation.

1.3. Contributions and organization

The contributions of this work are enumerated as follows: (i) formulation of an approximate optimization objective which includes a decision threshold for empirical TER estimation, (ii) proposal of a novel fast closed-form solution to TER minimization, and (iii) provision of empirical evidence using several biometric data sets for fusion study.

The paper is organized as follows: the next section provides several definitions of error rates and a brief account of related linear estimation models. A direct means to compute the TER given a decision model is presented in Section 3. With the TER objective in place, Section 4 presents our proposed method to minimize the TER. In Section 5, we introduce three non-intrusive biometrics from the face and the eye for verification scores fusion. This is followed by extensive experimentation using data from these three biometrics in Section 6. The proposed method is further benchmarked using publicly available data sets. Section 7 summarizes the results and observations, and finally some concluding remarks are given in Section 8.

2. Definitions and preliminaries

Consider the binary classification setting in biometric decision. Suppose we have m learning examples $\{x_i\}_{i=1}^{m} \subset \mathbb{R}^l$ (each an l-dimensional biometric feature vector) and their corresponding class labels $y_i \in \{0, 1\}$, where 0 denotes an imposter and 1 denotes a genuine-user. Let $g : \mathbb{R}^l \rightarrow \mathbb{R}$ be the hypothesis function (biometric classifier) mapping these pattern features onto a scalar measure for decision inference.
Suppose g(x) produces a continuous output; then the output must be thresholded in order to label each example as positive class (genuine-user) or negative class (imposter). Given a decision threshold τ, the class label associated with a new example $x_n$ can be written as

$$\mathrm{cls}(g(x_n)) = \begin{cases} 1\ (\text{genuine-user}) & \text{if } g(x_n) \geq \tau, \\ 0\ (\text{imposter}) & \text{if } g(x_n) < \tau. \end{cases} \qquad (1)$$

For each operational setting of τ, a true positive rate (TP) and a true negative rate (TN) are sufficient to describe the classifier's performance. Alternatively, a false positive rate (FP) and a false negative rate (FN) can also be defined. It is noted that TP + FN = 1 and TN + FP = 1. The relation between these recognition rates can be tabulated as a confusion matrix, as shown in Table 1.

2.1. Equal error rate

In biometric verification, where the basic task is to distinguish between two classes of users, namely the genuine-user and the imposter, the FP rate is also called the FAR, and the FN rate is also called the FRR. By varying the threshold τ from −∞ to +∞ (or from 0 to 1 in a normalized case), the FRR shows an increasing trend while the FAR shows a decreasing trend with respect to this change of τ. Along the variation of threshold τ, there is a point (say, at τ*) where the two curves (FAR and FRR) cross each other, and this point is called the EER. In other words,

$$\mathrm{EER} = \mathrm{FAR}\big|_{\mathrm{FAR}=\mathrm{FRR}} = \mathrm{FRR}\big|_{\mathrm{FRR}=\mathrm{FAR}} = \mathrm{FAR}\big|_{\tau^*} = \mathrm{FRR}\big|_{\tau^*}.$$

2.2. Total error rate

The TER is defined as the sum of the false acceptance and false rejection rates (TER = FAR + FRR).

Table 1
Confusion matrix for two-class problems

Estimate\Truth     P      N
P̂                  TP     FP
N̂                  FN     TN

The EER

mentioned above is frequently used as a performance index for biometric systems because, at this particular operating point, the TER is frequently found to be at its minimum. This is particularly true when the genuine-user and imposter score distributions are normal (see Fig. 1). As such, the EER is frequently approximated by TER/2 at τ* [12], and minimization of the EER may be treated as minimization of the minimum TER. We shall minimize the empirical TER and observe its impact on the observed EER in this development.

Fig. 1. Relations among FAR, FRR, TER and EER (panels: genuine-user and imposter score distributions; FAR, FRR and TER curves, both plotted against the normalized score).

2.3. Linear parametric models

Linear parametric models have been widely used due to their tractability in optimization and related analysis. The embedding of nonlinearities such as kernels and other basis functions into linear regression models has further widened their scope of application (see e.g. Refs. [15–17]). The importance of linear parametric models that embed nonlinearities is thus obvious, and we shall limit our scope to such models in this paper.

A good example of a linear parametric model is multivariate polynomial (MP) regression, which has been shown to possess the capability of describing arbitrary complex nonlinear input-output relationships, attributed to the theoretical ground of Weierstrass's approximation theorem (see e.g. Ref. [18]). However, the number of independent adjustable parameters in MP grows like $l^r$ for an rth-order model with input dimension l [19]. This limitation has recently been addressed by Toh [20] and Toh et al. [21] by reducing the number of polynomial expansion terms for classification applications.
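As an illustration of how the number of MP terms grows like $l^r$, a full multivariate polynomial expansion can be sketched as follows. This is our own illustrative sketch, not the RM model itself (RM keeps far fewer terms), and all names are ours.

```python
# Illustrative sketch: a full multivariate polynomial expansion of an
# l-dimensional input up to total order r.  The term count grows rapidly
# with l and r, which motivates the reduced (RM) model discussed above.
import itertools
import numpy as np

def poly_expand(x, r):
    """Row vector of all monomials of x = (x_1, ..., x_l) up to total order r."""
    terms = [1.0]
    for order in range(1, r + 1):
        for idx in itertools.combinations_with_replacement(range(len(x)), order):
            terms.append(float(np.prod([x[i] for i in idx])))
    return np.array(terms)

x = np.array([0.5, 2.0])     # l = 2, illustrative input
p = poly_expand(x, r=2)      # terms: 1, x1, x2, x1^2, x1*x2, x2^2
```

For l = 2 and r = 2 this already produces six terms; the count is $\binom{l+r}{r}$ in general, which is why a reduced expansion is preferred in practice.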
We shall adopt this reduced multivariate polynomial model (RM) in our experiments, even though the proposed formulation can easily be adapted to other types of linear parametric models with embedded basis functions. Consider an l-dimensional input x and an rth-order polynomial operating on x which gives rise to K polynomial expansion terms. A linear parametric model in this context can be written as

$$g(\alpha, x) = \sum_{k=1}^{K} \alpha_k\, p_k(x) = p(x)\,\alpha, \qquad (2)$$

where $p_k(x)$ corresponds to the kth polynomial expansion term of the row vector $p(x) = [p_1(x), p_2(x), \ldots, p_K(x)]$ and $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_K]^{\mathrm{T}}$ is a column parameter vector. When each input $x_i \in \mathbb{R}^l$ has a known label $y_i \in \mathbb{R}$, giving rise to m learning data pairs $(x_i, y_i)$, $i = 1, 2, \ldots, m$, the learning problem can be supervised. In biometric verification problems, these target labels are known as genuine-user and imposter. Learning of the target labels (packed as $y = [y_1, y_2, \ldots, y_m]^{\mathrm{T}}$) can be accomplished by minimizing an LSE criterion. To stabilize the solution for estimation, a weight decay regularization can be performed [21]. The criterion function to be minimized

is thus

$$J = \frac{1}{2} \sum_{i=1}^{m} [y_i - p(x_i)\alpha]^2 + \frac{b}{2}\,\alpha^{\mathrm{T}}\alpha = \frac{1}{2}\,\| y - P\alpha \|_2^2 + \frac{b}{2}\,\| \alpha \|_2^2, \qquad (3)$$

where b controls the weighting of the regularization and P packs the training samples in matrix form:

$$P = \begin{bmatrix} p(x_1) \\ p(x_2) \\ \vdots \\ p(x_m) \end{bmatrix}. \qquad (4)$$

The estimated training output is given by $\hat{y} = P\alpha$, where the solution for α which minimizes J is

$$\text{LSE}: \quad \alpha = (P^{\mathrm{T}} P + bI)^{-1} P^{\mathrm{T}} y, \qquad (5)$$

with b chosen to be a small value for stability without introducing much bias [21]. I is an identity matrix of the same dimension as $P^{\mathrm{T}}P$. For unseen test data $x_t$, another polynomial matrix $P_t$ can be generated using $p(x_t)$. Prediction of the class label $\hat{y}_t$ can then be performed using the learned α (i.e. $\hat{y}_t = P_t \alpha$) and the classification decision given by Eq. (1). With this background in place, we are ready to discuss the TER and related issues in the sequel.

3. Direct computation of TER

It is noted here that minimization of the minimum TER would be a two-step process if classifier optimization (locating the classifier parameters) and threshold optimization (locating the minimum TER at τ* from FAR and FRR computations) were treated separately. We present a direct method to compute the TER here, and then in the next section we propose a method to minimize this minimum TER directly through fusion classifier design.

Without loss of generality, consider the decision distributions illustrated in Fig. 1, where the genuine-user scores are normally centered at a higher value than the imposter scores. Denoting the variables (x, m) related to positive (genuine-user) and negative (imposter) examples by respective superscripts + and −, it is not difficult to see that the FAR and FRR are merely the averaged counts of decision scores falling within the opposite pattern categories:

$$\mathrm{FAR} = \frac{1}{m^-} \sum_{j=1}^{m^-} \mathbb{1}\{ g(x_j^-) \geq \tau \}, \qquad (6)$$

$$\mathrm{FRR} = \frac{1}{m^+} \sum_{i=1}^{m^+} \mathbb{1}\{ g(x_i^+) < \tau \}, \qquad (7)$$

where the indicator term $\mathbb{1}\{ g(x) \geq \tau \}$ (respectively $\mathbb{1}\{ g(x) < \tau \}$) equals 1 whenever $g(x) \geq \tau$ (respectively $g(x) < \tau$), and 0 otherwise.
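The averaged error counts of Eqs. (6) and (7) can be sketched directly on two score arrays; the data and variable names below are illustrative stand-ins, not from the paper.

```python
# Empirical FAR and FRR as averaged step-function counts, Eqs. (6)-(7).
import numpy as np

def far_frr(g_neg, g_pos, tau):
    """g_neg holds imposter scores g(x_j^-); g_pos holds genuine-user
    scores g(x_i^+).  Each rate is an averaged indicator count."""
    far = np.mean(g_neg >= tau)   # fraction of imposters accepted
    frr = np.mean(g_pos < tau)    # fraction of genuine-users rejected
    return far, frr

g_neg = np.array([0.1, 0.2, 0.6, 0.3])   # imposter scores (illustrative)
g_pos = np.array([0.7, 0.9, 0.4, 0.8])   # genuine-user scores (illustrative)
far, frr = far_frr(g_neg, g_pos, tau=0.5)   # -> 0.25, 0.25
```

The TER that the following development works with is then simply `far + frr`.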
Defining a step function $\delta(\varepsilon_j^-) \triangleq \mathbb{1}\{\varepsilon_j^- \geq 0\}$ with $\varepsilon_j^- = g(x_j^-) - \tau$ for $j = 1, 2, \ldots, m^-$, Eq. (6) can be rewritten as

$$\mathrm{FAR} = \frac{1}{m^-} \sum_{j=1}^{m^-} \delta(\varepsilon_j^-). \qquad (8)$$

We can use the same definition of the step function δ for Eq. (7) when we write $\varepsilon_i^+ = \tau - g(x_i^+) - \Delta$ (Δ, which can be ignored in practice, accounts for the strict inequality in Eq. (7)) for $i = 1, 2, \ldots, m^+$:

$$\mathrm{FRR} = \frac{1}{m^+} \sum_{i=1}^{m^+} \delta(\varepsilon_i^+). \qquad (9)$$

With the FAR and FRR in place, the TER can be written as

$$\mathrm{TER} = \mathrm{FAR} + \mathrm{FRR} = \frac{1}{m^-} \sum_{j=1}^{m^-} \delta(\varepsilon_j^-) + \frac{1}{m^+} \sum_{i=1}^{m^+} \delta(\varepsilon_i^+). \qquad (10)$$

Here we note that the TER can be related to the commonly known accuracy of Eq. (11) when $s = m^-/m^+$ and the normalization factor ($\frac{1}{2}$) is ignored:

$$\text{Accuracy} = \frac{\mathrm{TP} + s\,(1 - \mathrm{FP})}{1 + s}. \qquad (11)$$

Suppose the fusion classifier g consists of some K adjustable parameters $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_K]^{\mathrm{T}}$ operating on the feature vector x, i.e. $g(\alpha, x)$; then the goal of improving the classifier's discrimination performance can be treated as minimizing the TER given by

$$\mathrm{TER}(\alpha, x^+, x^-) = \frac{1}{m^-} \sum_{j=1}^{m^-} \delta(\varepsilon(\alpha, x_j^-)) + \frac{1}{m^+} \sum_{i=1}^{m^+} \delta(\varepsilon(\alpha, x_i^+)), \qquad (12)$$

where $\varepsilon_j^- = \varepsilon(\alpha, x_j^-) = g(\alpha, x_j^-) - \tau$ for $j = 1, 2, \ldots, m^-$ and $\varepsilon_i^+ = \varepsilon(\alpha, x_i^+) = \tau - g(\alpha, x_i^+)$ for $i = 1, 2, \ldots, m^+$.

Remark 1. We note that $g(\alpha, x)$ need not be linear with respect to x. However, when $g(\alpha, x)$ is linear with respect to both α and x, Eq. (12) may be formulated as a perceptron criterion function where only the total number of misclassified samples is accumulated (see Ref. [22, Chapter 5.5]). Due to the piecewise nature of such a perceptron formulation, the criterion function could be ill-posed. In particular, the error-correcting procedure may never cease in the linearly non-separable case. We shall attempt a smooth approximation approach (where $g(\alpha, x)$ can be nonlinear with respect to x), which accumulates all training samples, to overcome this problem.

4. Minimizing the TER

To solve the problem in Eq. (12), an approximation to the non-differentiable step function δ is often adopted. A natural choice for approximating the step function is the sigmoid function [23], whereby the minimization problem becomes

$$\arg\min_{\alpha}\, \mathrm{TER}(\alpha, x^+, x^-) = \arg\min_{\alpha}\, \frac{1}{m^-} \sum_{j=1}^{m^-} \sigma(\varepsilon(\alpha, x_j^-)) + \frac{1}{m^+} \sum_{i=1}^{m^+} \sigma(\varepsilon(\alpha, x_i^+)), \qquad (13)$$

where

$$\sigma(x) = \frac{1}{1 + e^{-\gamma x}}, \quad \gamma > 0, \qquad (14)$$

and $\varepsilon(\alpha, x_j^-) = g(\alpha, x_j^-) - \tau$ for $j = 1, 2, \ldots, m^-$ and $\varepsilon(\alpha, x_i^+) = \tau - g(\alpha, x_i^+)$ for $i = 1, 2, \ldots, m^+$.

There are two problems associated with this approximation. The first is that the formulation is nonlinear with respect to the learning parameters. Although an iterative search can be employed to find local solutions, different initializations may end up at different local solutions, incurring laborious trial-and-error efforts to select an appropriate setting. The second is that the objective function could be ill-conditioned due to the many local plateaus resulting from summing the flat regions of the sigmoid. Much search effort may be spent making little progress in locally flat regions.

4.1. Quadratic approximation

In this section we seek a possible deterministic closed-form solution from matching of the link-loss functional pair [24,25]. Since we adopt a linear link function (polynomials p), a quadratic loss functional matches well to arrive at the desired convexity for the link-loss pair. However, some consideration regarding goodness of approximation to the step function δ is necessary. When all inputs are normalized within [0, 1], the step function can be approximated by centering a quadratic function at the origin. To cater for inputs which go beyond this range, an offset η can be introduced such that only the right arm of the quadratic function is activated for the approximation (see Fig. 2).
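The three loss shapes compared in Fig. 2 can be evaluated numerically; the sketch below is our own illustration, with γ = 5 as in the figure and an assumed offset η = 1.

```python
# Step loss, its sigmoid approximation (Eq. (14) applied to the margin), and
# the shifted quadratic loss used in the sequel.  gamma and eta are assumed.
import numpy as np

def step(e):
    """delta(e): 1 if e >= 0, else 0."""
    return (e >= 0).astype(float)

def sigmoid(e, gamma=5.0):
    """Smooth approximation of the step, Eq. (14)."""
    return 1.0 / (1.0 + np.exp(-gamma * e))

def quad(e, eta=1.0):
    """Shifted quadratic loss [e + eta]^2; only its right arm is active
    over the normalized operating range."""
    return (e + eta) ** 2

e = np.linspace(-1.0, 1.0, 5)              # margins eps in [-1, 1]
losses = np.stack([step(e), sigmoid(e), quad(e)])
```

Unlike the step and sigmoid, the quadratic grows past 1 for large positive margins, but it is convex and leads to the closed-form solution derived next.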
With this idea in mind, the following regularized quadratic TER approximation is proposed:

$$\mathrm{TER}(\alpha, x^+, x^-) = \frac{b}{2}\,\|\alpha\|_2^2 + \frac{1}{2m^-} \sum_{j=1}^{m^-} [\varepsilon(\alpha, x_j^-) + \eta]^2 + \frac{1}{2m^+} \sum_{i=1}^{m^+} [\varepsilon(\alpha, x_i^+) + \eta]^2, \qquad (15)$$

where η > 0 and

$$\varepsilon(\alpha, x_j^-) = g(\alpha, x_j^-) - \tau = p(x_j^-)\alpha - \tau, \qquad (16)$$

$$\varepsilon(\alpha, x_i^+) = \tau - g(\alpha, x_i^+) = \tau - p(x_i^+)\alpha, \qquad (17)$$

for $j = 1, 2, \ldots, m^-$ and $i = 1, 2, \ldots, m^+$.

Fig. 2. Sigmoidal (γ = 5) and quadratic approximations to the step function (step loss and its approximations plotted against the output g).

4.2. Optimizing parameter α

Our first task is to solve for the parameter vector α which minimizes Eq. (15). The optimality condition requires that

$$\frac{\partial\, \mathrm{TER}(\alpha, x^+, x^-)}{\partial \alpha} = 0, \qquad (18)$$

which implies that

$$b\alpha + \frac{1}{m^-} \sum_{j=1}^{m^-} p^{\mathrm{T}}(x_j^-)\,[p(x_j^-)\alpha - \tau + \eta] - \frac{1}{m^+} \sum_{i=1}^{m^+} p^{\mathrm{T}}(x_i^+)\,[\tau - p(x_i^+)\alpha + \eta] = 0,$$

i.e.

$$\left[ bI + \frac{1}{m^-} \sum_{j=1}^{m^-} p^{\mathrm{T}}(x_j^-)\,p(x_j^-) + \frac{1}{m^+} \sum_{i=1}^{m^+} p^{\mathrm{T}}(x_i^+)\,p(x_i^+) \right] \alpha + \frac{\eta - \tau}{m^-} \sum_{j=1}^{m^-} p^{\mathrm{T}}(x_j^-) - \frac{\eta + \tau}{m^+} \sum_{i=1}^{m^+} p^{\mathrm{T}}(x_i^+) = 0. \qquad (19)$$

Abbreviating the row polynomial vectors $p_j^- = p(x_j^-) \in \mathbb{R}^K$ and $p_i^+ = p(x_i^+) \in \mathbb{R}^K$, the solution for α which minimizes Eq. (15) can be written as

$$\alpha = \left[ bI + \frac{1}{m^-} \sum_{j=1}^{m^-} (p_j^-)^{\mathrm{T}} p_j^- + \frac{1}{m^+} \sum_{i=1}^{m^+} (p_i^+)^{\mathrm{T}} p_i^+ \right]^{-1} \left[ \frac{\tau - \eta}{m^-} \sum_{j=1}^{m^-} (p_j^-)^{\mathrm{T}} + \frac{\tau + \eta}{m^+} \sum_{i=1}^{m^+} (p_i^+)^{\mathrm{T}} \right], \qquad (20)$$

where I is an identity matrix of size K × K. In a more compact matrix form, Eq. (20) can be written as

$$\alpha = \left( bI + \frac{1}{m^-} P_-^{\mathrm{T}} P_- + \frac{1}{m^+} P_+^{\mathrm{T}} P_+ \right)^{-1} \left( \frac{\tau - \eta}{m^-}\, P_-^{\mathrm{T}} \mathbf{1}_- + \frac{\tau + \eta}{m^+}\, P_+^{\mathrm{T}} \mathbf{1}_+ \right), \qquad (21)$$

where

$$P_+ = \begin{bmatrix} p(x_1^+) \\ p(x_2^+) \\ \vdots \\ p(x_{m^+}^+) \end{bmatrix}, \qquad P_- = \begin{bmatrix} p(x_1^-) \\ p(x_2^-) \\ \vdots \\ p(x_{m^-}^-) \end{bmatrix}, \qquad (22)$$

and $\mathbf{1}_+ = [1, \ldots, 1]^{\mathrm{T}} \in \mathbb{R}^{m^+}$, $\mathbf{1}_- = [1, \ldots, 1]^{\mathrm{T}} \in \mathbb{R}^{m^-}$.

Remark 2. It is noted here that the decision threshold τ is included in the optimization process when determining α. This differentiates the method from many conventional classifiers (such as neural networks) which do not include an explicit decision threshold during classifier design. Moreover, the solution for minimizing the TER is deterministic, as it does not require initialization. The learning solution in Eq. (21) appears similar in structure to Eq. (5), but with separate normalized covariate and regressor matrices ($P_+$ and $P_-$) corresponding to each class label. This differentiates it from the LSE in Eq. (5), which lumps the two class-specific regressor matrices into a single matrix P. Apart from the inclusion of the threshold, bias and regularization terms, the structure of Eq. (21) also appears analogous to that of the solution to Fisher linear discriminant analysis (see e.g. Ref. [22]). However, since $g(\alpha, x) = p(x)\alpha$ can be nonlinear with respect to x, a nonlinear decision boundary in the x-plane can be obtained, and this is a main advantage over linear classifiers. The proposed formulation can thus be considered a nonlinear discriminant function, and we shall explore an expansion of p(x) using a recently proposed reduced polynomial model, since the full polynomial has an explosive number of parameters as the input dimension and model order increase.
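The compact solution of Eq. (21) can be sketched directly in NumPy for a fixed threshold. The regressor matrices below are random stand-ins, and the values of η, b and τ are illustrative assumptions, not the paper's settings.

```python
# Closed-form TER-minimizing parameter vector, Eq. (21), for a fixed tau.
import numpy as np

def solve_alpha(P_pos, P_neg, tau, eta=1.0, b=1e-3):
    """P_pos: rows p(x_i^+); P_neg: rows p(x_j^-); returns alpha of Eq. (21)."""
    m_pos, K = P_pos.shape
    m_neg = P_neg.shape[0]
    M = b * np.eye(K) + P_neg.T @ P_neg / m_neg + P_pos.T @ P_pos / m_pos
    rhs = ((tau - eta) / m_neg) * P_neg.sum(axis=0) \
        + ((tau + eta) / m_pos) * P_pos.sum(axis=0)   # P^T 1 = column sums
    return np.linalg.solve(M, rhs)

rng = np.random.default_rng(1)
P_pos = rng.random((20, 4)) + 0.5    # illustrative genuine-user regressors
P_neg = rng.random((30, 4)) - 0.5    # illustrative imposter regressors
alpha = solve_alpha(P_pos, P_neg, tau=0.5)
```

The result can be checked against the stationarity condition of Eq. (19): substituting the returned α back into the left-hand side of Eq. (19) yields the zero vector.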
The quadratic approximation here may also be related to a quadratic relaxation technique for the perceptron criterion (see Ref. [22, Chapter 5.6]). However, as mentioned in Ref. [22], the quadratic relaxation suffers from problems with the error-correcting procedure even in linearly non-separable cases, particularly regarding convergence and boundary points. The main advantage of our formulation is that the solution α for classification decision can be obtained in closed form, and it is also least squares optimal in the TER sense.

4.3. Optimizing threshold τ

The threshold value τ appearing in Eqs. (20) and (21) can also be optimized. This is obtained by setting $\partial\,\mathrm{TER}(\alpha, \tau)/\partial \tau = 0$, which gives

$$\frac{1}{m^-} \sum_{j=1}^{m^-} [(p_j^- \alpha - \tau) + \eta] = \frac{1}{m^+} \sum_{i=1}^{m^+} [(\tau - p_i^+ \alpha) + \eta]$$

$$\Rightarrow\quad 2\tau = \frac{1}{m^-} \sum_{j=1}^{m^-} p_j^- \alpha + \frac{1}{m^+} \sum_{i=1}^{m^+} p_i^+ \alpha \quad\Rightarrow\quad \tau = \frac{1}{2m^-}\,(\mathbf{1}_-^{\mathrm{T}} P_- \alpha) + \frac{1}{2m^+}\,(\mathbf{1}_+^{\mathrm{T}} P_+ \alpha). \qquad (23)$$

To solve for τ in Eq. (23), we need the solution for α. Let

$$M = bI + \frac{1}{m^-} P_-^{\mathrm{T}} P_- + \frac{1}{m^+} P_+^{\mathrm{T}} P_+ \qquad (24)$$

so that α from Eq. (21) can be packed as

$$\alpha = \frac{\tau - \eta}{m^-}\, M^{-1} P_-^{\mathrm{T}} \mathbf{1}_- + \frac{\tau + \eta}{m^+}\, M^{-1} P_+^{\mathrm{T}} \mathbf{1}_+. \qquad (25)$$

This compact Eq. (25) can be substituted into Eq. (23), and τ can be solved as

$$\tau = \frac{\eta\,(B + D - A - C)}{2 - (A + B + C + D)}, \qquad (26)$$

with A, B, C, D defined as

$$A = \frac{1}{(m^-)^2}\,\mathbf{1}_-^{\mathrm{T}} P_- M^{-1} P_-^{\mathrm{T}} \mathbf{1}_-, \qquad B = \frac{1}{m^- m^+}\,\mathbf{1}_-^{\mathrm{T}} P_- M^{-1} P_+^{\mathrm{T}} \mathbf{1}_+,$$

$$C = \frac{1}{m^+ m^-}\,\mathbf{1}_+^{\mathrm{T}} P_+ M^{-1} P_-^{\mathrm{T}} \mathbf{1}_-, \qquad D = \frac{1}{(m^+)^2}\,\mathbf{1}_+^{\mathrm{T}} P_+ M^{-1} P_+^{\mathrm{T}} \mathbf{1}_+. \qquad (27)$$
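The closed-form threshold of Eq. (26) can likewise be sketched in NumPy. The regressor matrices are random stand-ins and η, b are assumed values.

```python
# Closed-form threshold via the scalars A, B, C, D of Eqs. (26)-(27).
import numpy as np

def solve_tau(P_pos, P_neg, eta=1.0, b=1e-3):
    """Return the stationary threshold tau of Eq. (26); no iteration needed."""
    m_pos, K = P_pos.shape
    m_neg = P_neg.shape[0]
    M = b * np.eye(K) + P_neg.T @ P_neg / m_neg + P_pos.T @ P_pos / m_pos
    a = P_neg.sum(axis=0) / m_neg        # (1/m^-) 1^T P_-
    c = P_pos.sum(axis=0) / m_pos        # (1/m^+) 1^T P_+
    Minv = np.linalg.inv(M)
    A, B = a @ Minv @ a, a @ Minv @ c
    C, D = c @ Minv @ a, c @ Minv @ c
    return eta * (B + D - A - C) / (2.0 - (A + B + C + D))

rng = np.random.default_rng(2)
P_pos = rng.random((20, 4)) + 0.5    # illustrative genuine-user regressors
P_neg = rng.random((30, 4)) - 0.5    # illustrative imposter regressors
tau = solve_tau(P_pos, P_neg)
```

As a consistency check, substituting this τ into Eq. (25) to obtain α and then evaluating Eq. (23) reproduces the same τ, confirming the fixed point.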

Remark 3. Similar to the estimation of α, the optimal threshold τ can also be obtained in closed form without the need for an iterative process. This τ can in turn be fed into Eq. (21) for optimal estimation of α. Here we note that the bias parameter η in the equation cannot be optimized, due to its uniform contribution to all components of the error rates.

4.4. Summary of proposed algorithm (TER_Q)

For clarity, the procedure to implement the algorithm is summarized as follows.

Training: Set η = 1 and b = 10⁻³.

(1) Generate the regression matrices P₊ and P₋ from the respective genuine-user and imposter training data using Eq. (4).
(2) Generate the matrices M, A, B, C and D using the P₊ and P₋ obtained in the above step.
(3) Compute the optimal decision threshold τ using Eq. (26). Alternatively, τ can be fixed at the mid-point of the design output range.
(4) Compute the optimal fusion classifier parameter α using Eq. (25) and τ.

To test or predict a verification outcome from new data:

(1) Generate the regression polynomial matrix P_t from the test data using Eq. (4).
(2) Compute the decision fusion output ŷ = P_t α.
(3) Decision: if ŷ ≥ τ then the new data is from a genuine-user, else from an imposter.

For convenience, we shall refer to this algorithm as TER_Q. With the algorithm in place, we are ready to perform fusion experiments in the following sections.

5. Biometrics from the eye and face

Attributed to the pioneering work of John Daugman [26,27], the iris is now recognized as a biometric for high-security applications. Apart from its high accuracy, an iris verification system is non-intrusive in terms of physical contact. However, a visual (RGB) iris recognition system can easily be fooled by a high-resolution picture when it is not equipped with anti-spoofing solutions.
Apart from using infra-red iris imaging solutions, fusing several biometric modalities can raise the level of matching engagement, thereby deterring a certain amount of imposter attacks. As part of our continual effort to fuse several modalities in a natural manner considering ease of use, in this study we combine infra-red iris verification scores with two face verification scores from the visual and infra-red spectrums.

5.1. Infra-red iris verification

The infra-red iris images were captured using a monochrome CCD camera (WAT-902A from Watec Co., Ltd.) with infra-red LED illumination. Fig. 3 shows five image samples from five different identities. The raw infra-red iris images were first localized by means of the interior boundary (between pupil and iris) and the exterior boundary (between iris and sclera), which were found using an edge detection algorithm. The localized iris region was then transformed into polar coordinates by a rubber sheet model, whereby a normalization was performed to generate an iris signal for feature extraction. Based on independent component analysis (ICA), a set of basis functions was estimated to represent the iris signal. The coefficients of the ICA expansions were adopted as feature vectors, which were then fed into a cosine distance measure for comparison of two identities. Interested readers are referred to Ref. [28] for more technical details of the infra-red iris verification system.

5.2. Visual face verification

The face is the most common biometric used by humans: we inherently use it to recognize people in our daily interactions. Face recognition is thus an important area in biometrics, for it can be covert as well as non-intrusive. The main approaches to viewer-centered 2D face recognition include holistic, analytic and hybrid methods [29]. The holistic approach uses subspace techniques to reduce the image dimension and then compares image similarity in this subspace.
A very widely used technique for this subspace reduction is principal component analysis (PCA). The analytic approach uses geometrical features, such as the distances between face objects like the eyes, nose and mouth, for similarity measurement. The hybrid approach combines various means, including the holistic and analytic approaches.

The visual face images used in this study were captured under various illumination and pose conditions using a Bumblebee CCD camera produced by Point Grey Research, Inc. (see Ref. [30] for details). The top row of Fig. 4 shows some visual image samples for an identity under various illumination and pose conditions. In this work, we adopted the holistic approach using PCA. To compare the similarity between two face images, the Euclidean distance over the leading eigen-coefficients was used [30].

5.3. Infra-red face verification

Due to the relatively high instrumental cost, the infra-red face has been less studied than the visual face. The infra-red face images used in this study were captured using a ThermoVision S65 produced by FLIR Systems, Inc. As in the visual face case, the images were captured under varying illumination, expression and pose conditions at a fixed image resolution. The bottom row of Fig. 4 shows some infra-red image samples for the same identity under various conditions. As for the visual face, we adopted the holistic approach using PCA for the infra-red face, with the Euclidean distance over the leading eigen-coefficients used to compare two face images [30].
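The holistic PCA matching described above can be sketched as follows; the data, dimensions and number of retained components are illustrative assumptions, not the paper's settings.

```python
# Sketch of holistic PCA matching: project flattened face images onto the
# leading principal components and compare with Euclidean distance.
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 64))     # 50 flattened "images", 64-dim (illustrative)
mean = X.mean(axis=0)
Xc = X - mean
# Leading principal directions via SVD of the centered data matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:10].T                         # retain 10 components (assumed number)

def match_score(img_a, img_b):
    """Euclidean distance between subspace projections; smaller = better match."""
    fa, fb = (img_a - mean) @ W, (img_b - mean) @ W
    return float(np.linalg.norm(fa - fb))
```

Identical images yield a distance of zero; the distance then serves as the raw match-score that is normalized and fused in the experiments below.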

Fig. 3. Infra-red iris samples for five different identities.

Fig. 4. Top row: visual face samples for an identity under different lighting and pose conditions; bottom row: infra-red face samples for the same identity under different lighting and pose conditions.

These two face data sets constitute a true multimodal system, since they were acquired from the same pool of identities.

6. Experiments

6.1. Data sets

In the following experiments, each data set corresponding to infra-red iris (iris-ir), visual face (face-vs) and infra-red face (face-ir) verification consists of 96 identities, with 10 image samples per identity. For training and test purposes, each of these biometric data sets is partitioned into two equal sets, S_train and S_test, each with 96 × 5 samples. The genuine-user and imposter match-scores are generated from these two sets by intra-identity and inter-identity matching among the visual/infra-red image samples for each biometric. A total of 960 (96 × 10) match-scores are thus available for the genuine-user class in each training set and test set for each biometric. As for the imposter scores, there are 114,000 (96 × 95/2 × 25) match-scores over the 96 identities. Since all three biometrics have the same number of genuine-user and imposter samples, an arbitrary one-to-one identity correspondence was assumed among the three biometric data sets. This is a reasonable assumption since our focus here is output scores fusion and not the correlation among different modalities for each identity.

6.2. Preprocessing

In the following experiments, the match-scores for all biometrics are normalized to within the interval [0, 1], all having a higher match-score for a genuine-user than for an imposter, before data fusion is performed. Fig. 5(a)-(b), Fig. 5(c)-(d) and Fig.
5(e)-(f) show the matching performance on the training and test sets, respectively, for the individual iris-ir, face-vs and face-ir verifications before scores fusion. From the match-score distribution plots shown in Fig. 5(a), (c) and (e), the verification performance depends strongly on the overlap zone between the genuine-user and imposter classes. Among the three biometrics, iris-ir has the smallest overlap region and hence the best verification performance. Face-vs has the largest overlap region, giving rise to the worst verification performance. This observation from the score distribution plots is further confirmed by the corresponding ROC plots in Fig. 5(b), (d) and (f).

6.3. Fusion experiments and evaluation setups

Based on the biometric data described above and a publicly available database, the following three sets of fusion experiments are performed:

(i) Fusion of face biometrics (face-vs and face-ir): Due to the possibly large variation of illumination conditions in ground applications, we believe that visual and infra-red images can complement each other. In this experiment, we perform fusion by combining the verification decision scores from face-vs and face-ir, where the images were captured simultaneously. Since face-vs performs poorly due to the large variation of illumination conditions in the database, we shall observe the effect of fusing it with the much better performing face-ir.

(ii) Fusion of faces-and-eye biometrics (iris-ir, face-vs and face-ir): Similar to the face fusion above, an advantage of fusing iris-ir, face-vs and face-ir is that their images can be captured simultaneously. This is important in application because simultaneously presenting all three biometrics is a much more difficult task for an imposter attack than presenting a single biometric. This is especially

true when both the visual and infra-red light frequencies are exploited. We shall observe the impact on the proposed algorithm of the additional dimensionality from combining low- and high-performance biometrics.

Fig. 5. Matching performance for the iris (IR), face (visual) and face (IR) verification systems on the training (solid lines) and test (dashed lines) sets: (a) match-score distribution (iris IR), (b) ROC (iris IR), (c) match-score distribution (face visual), (d) ROC (face visual), (e) match-score distribution (face IR), (f) ROC (face IR).

(iii) Publicly available data sets: Apart from the above experiments using in-house data sets, the proposed TER_Q is further tested on publicly available fusion data sets (the XM2VTS face and speaker verification database, which contains 32 fusion cases [31,32]) so that further comparison can be made by other researchers.

Regarding the algorithm settings and comparison measures, the following items are observed in our experiments:

Comparison platform: To compare the conventional LSE learning and the proposed TER_Q learning, we adopt the RM model of Ref. [21] for decision score fusion, since its number of polynomial coefficients does not explode with respect to model order and feature dimension. Different model orders r ∈ {2, 3, 4, 5, 6} are experimented with for both LSE and TER_Q, such that the experiments give a good overview over different operational settings. For TER_Q, the bias was fixed at η = 1 in all experiments, since it was found to be inert to estimation within the intended operating range.
In all the following fusion experiments, we set b = 3 for the RM model since: (i) it does not introduce much bias in regularization, (ii) it gives a standardized setting for both LSE and the proposed TER Q, and (iii) we found empirically that this setting produces good training and test results in both cases.

Performance evaluation criterion: The EER is adopted as the performance comparison measure in experiments (i) and (ii). There are two reasons for this choice of criterion: (1) it is a single-value index with a clear indication of high and low performance, which is an advantage over ROC or DET curves, where the curves for different algorithms may cross each other; and (2) it is related to our optimization objective (minimization of the TER). For experiment (iii), the HTER (half total error rate) is adopted according to Ref. [31] for direct comparison. Here we note that the HTER can be related to the EER under certain conditions (e.g. Fig. 1).
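The EER used above is the operating point where the false acceptance rate (FAR) equals the false rejection rate (FRR). A minimal sketch of estimating it from genuine and impostor score lists (synthetic scores, hypothetical helper name; averaging FAR and FRR at the nearest crossing, since the two empirical curves need not intersect exactly):

```python
import numpy as np

def eer(genuine, impostor):
    """Estimate the equal error rate by sweeping a decision threshold
    over the pooled scores (higher score means more genuine-like)."""
    best_gap, best_rate = 1.0, 0.5
    for t in np.sort(np.concatenate([genuine, impostor])):
        frr = np.mean(genuine < t)      # genuine users rejected
        far = np.mean(impostor >= t)    # impostors accepted
        if abs(far - frr) < best_gap:
            # average FAR and FRR at the closest crossing point
            best_gap, best_rate = abs(far - frr), 0.5 * (far + frr)
    return best_rate

rng = np.random.default_rng(0)
gen = rng.normal(0.7, 0.1, 1000)   # synthetic genuine scores
imp = rng.normal(0.3, 0.1, 1000)   # synthetic impostor scores
print(round(eer(gen, imp), 3))     # small: the two distributions barely overlap
```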

Fig. 6. Combining face-VS and face-IR verifications: (a) EER plotted over different polynomial orders and RBF kernel widths, (b) a zoom-in view.

Computing effort: In order to observe the computational effort, the CPU time is recorded. All experiments were run on the PC Windows Matlab platform using a Pentium-M 1.73 GHz computer. Although there can be small differences among different runs of the same algorithm, the CPU time provides some indication of the order of difference between two algorithms' run-times. For instance, if Algorithm A uses 1 s and Algorithm B uses 10 s, then we can say that A is approximately 10 times faster than B.

Benchmarking: In order to gauge whether the best possible performance has been attained in experiments (i) and (ii), we conduct similar experiments using SVMs [14,15] as implemented by Ma et al. [33]. Both the polynomial kernel (SVM-poly) and the radial basis function kernel (SVM-Rbf) are experimented over various polynomial orders (r ∈ {2, 3, 4, 5, 6}) and kernel widths (Gamma [.,,...,, ] [33]). We believe these kernel settings provide a reasonable benchmark of the achievable performance. For experiment (iii), an experimental protocol similar to that of Ref. [31] is adopted for the proposed TER Q so that the results can be directly compared in future.

Results (i): fusion of face-VS and face-IR scores

Fig. 6 shows the training (solid lines) and test (dashed lines) EER for the experimented LSE, TER Q, SVM-poly, and SVM-Rbf.
Both the LSE and the TER Q adopted the RM model, and their EER results are plotted over different model orders for the range mentioned in the previous section. For SVM-poly, the same model order range (r ∈ {2, …, 6}) was used, and for SVM-Rbf, the kernel width (Gamma [33]) was chosen from [.,, 2,...,9,, ]. From the overall plot in Fig. 6(a), we see that SVM-Rbf has a low EER at a large kernel width (small Gamma value of 0.1) and a high EER for small kernel widths. The EER values of SVM-Rbf for Gamma [, 2,...,9,, ] are seen to be lower than that of face-VS but higher than that of face-IR. This indicates that a kernel size appropriate for the distributions must be chosen for SVM-Rbf to perform well. From the zoom-in plot in Fig. 6(b), we see that the training and test fusion results crowd, respectively, around the training and test results of the better-performing face-IR (relative to face-VS). For the training cases, LSE, TER Q and SVM-poly show better performance (lower EER) than face-IR at many model orders (particularly r = 3, 4, 5). However, for the test cases, only TER Q shows a clear EER superiority over face-IR. Fig. 7 shows an instance of EER performance at r = 3 using the DET curves.

Fig. 8 shows the CPU times incurred for training (solid lines) and test (dashed lines) when running LSE, TER Q, SVM-poly and SVM-Rbf. From Fig. 8(a), we see that TER Q and LSE have a clear advantage in their low CPU requirements. From Fig. 8(b), we see that the test CPU times are similar for TER Q and LSE since they have similar polynomial expansion terms. The training CPU time for TER Q is seen to be lower than that of LSE due to a vectorized implementation of TER Q, whereas LSE did not capitalize on such a facility. This shows that significant CPU time can be saved through efficient implementation.

Results (ii): fusion of iris-IR, face-VS and face-IR scores

For fusion of three biometrics (iris-IR, face-VS and face-IR), Fig. 9 shows the training (solid lines) and test (dashed lines) EER for the experimented LSE, TER Q, SVM-poly, and SVM-Rbf. Similar to the previous experiment, both the LSE and the TER Q adopted the RM model and their EER results

Fig. 7. DET curves comparing different classifiers for fusion of face biometrics (r = 3 for LSE, TER Q and SVM-poly; Gamma = 0.1 for SVM-Rbf): (a) train data, (b) test data.

Fig. 8. (a) CPU times incurred for fusion of two biometrics (the CPU time for SVM-Rbf-train at Gamma = is s), (b) a zoom-in view.

are plotted over different model orders ranging from 2 to 6. For SVM-poly, the same model order range was plotted. For SVM-Rbf, the kernel width (Gamma) was chosen from [.,, 2,...,9,, ]. From the training results of Fig. 9, we see that only TER Q, LSE and SVM-Rbf at Gamma = have better performance than iris-IR. This is reasonable since the SVMs are not aimed at training the EER directly and require a trial-and-error effort to tune the classifier for the best EER performance. TER Q has the best training results because its training is based on optimization of the TER, which is related to the EER. From the test results of Fig. 9, all four algorithms (LSE, TER Q, SVM-poly, and SVM-Rbf) show improved accuracy compared to the single biometric iris-IR on the test samples. The proposed TER Q is seen to perform best at all model orders. Fig. 10 shows a sample DET plot at r = 5 for

Fig. 9. Combining face-VS, face-IR and iris-IR verifications: EER plotted over different polynomial orders and RBF kernel widths.

Fig. 10. DET curves comparing different classifiers for fusion of all three biometrics (r = 5 for LSE, TER Q and SVM-poly; Gamma = for SVM-Rbf): (a) train data, (b) test data.
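The DET curves in these figures plot the FRR against the FAR over a sweep of decision thresholds. A rough sketch (hypothetical helper, synthetic scores) of computing the points of such a curve:

```python
import numpy as np

def det_points(genuine, impostor, n=100):
    """FAR/FRR pairs over a sweep of decision thresholds: the raw material
    of a DET plot (FRR vs. FAR, conventionally on normal-deviate axes)."""
    lo = min(genuine.min(), impostor.min())
    hi = max(genuine.max(), impostor.max())
    ts = np.linspace(lo, hi, n)
    far = np.array([np.mean(impostor >= t) for t in ts])  # impostors accepted
    frr = np.array([np.mean(genuine < t) for t in ts])    # genuine rejected
    return far, frr

rng = np.random.default_rng(4)
far, frr = det_points(rng.normal(0.7, 0.1, 500), rng.normal(0.3, 0.1, 500))
# far is non-increasing and frr non-decreasing as the threshold is raised
```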

Fig. 11. CPU times incurred for fusion of all three biometrics (the CPU times for SVM-Rbf-train and SVM-Rbf-test at Gamma = are, respectively, s and s).

Fig. 12. Experiments on publicly available data sets: (a) average HTER plotted over different polynomial orders; (b) HTER (LSE and TER Q with r = 6, b = 3) plotted over 32 cases of fusion combinations according to Ref. [31].

LSE, TER Q and SVM-poly. In the same plot, the best DET for SVM-Rbf at kernel width Gamma = is shown. Fig. 11 shows the CPU times incurred for training (solid lines) and test (dashed lines) for LSE, TER Q, SVM-poly and SVM-Rbf. Again, TER Q and LSE show the lowest computing effort in terms of both training and testing. The EER line in Fig. 10 does not cut the DET curves due to the resolution of the relatively small number of genuine-user scores (96). For such cases, an approximation to the EER is adopted, similar to Ref. [34], by averaging the FAR and the FRR at the nearest resolution.

Results (iii): experiments on publicly available data sets

Fig. 12(a) shows the average test HTER obtained from the 32 fusion cases for LSE and TER Q with model orders r ∈ {2, 3, 4, 5, 6} and b = 3. The average test HTER of the mean operator, the weighted-sum-Fisher and the weighted-sum-brute-force from Refs. [31,32] are also included in the figure for comparison. Here, we see that both LSE and TER Q show a decreasing trend in HTER values for r ∈ {2, 3, 4, 5}. However, at r = 6, LSE shows a deterioration of performance (due to serious over-fitting in one case as seen in

Fig. 12(b)) while TER Q maintains the performance improvement trend. At r = 5, 6, TER Q outperforms all the compared fusion techniques. For a per-case view of performance, Fig. 12(b) shows the detailed HTERs for the individual 32 fusion cases [31] at r = 6, b = 3 for LSE and TER Q.

7. Summary of results and discussion

7.1. Summary of observations

The following observations can be made regarding the performance of the proposed TER Q in the previous experiments:

(1) Fusion of face biometrics: when fusing a low-performance system with a high-performance system, inappropriate tuning of an algorithm's adjustable parameters may deteriorate the fusion performance. This is particularly evident in the SVM-Rbf experiments. By comparison, TER Q appears less sensitive to model parameter change (r is its only model parameter).

(2) Fusion of faces-and-iris biometrics: when two biometrics are much stronger than a third one, the chance of a better fusion performance increases compared with merely using one strong and one weak biometric. This is evident from all the test results.

(3) Experiments on publicly available data sets: TER Q shows a more stable prediction output than LSE, particularly at high order settings.

(4) To summarize, TER Q shows: (i) good training and test EER performance, and (ii) fast training and testing in terms of computational effort.

Fig. 13.
Decision contours of different classifiers when combining two biometrics: (a) least-squares error minimization using third-order RM model, (b) SVM using third-order polynomial kernel, (c) SVM using RBF kernel with Gamma = , (d) TER minimization using third-order RM model.

Fig. 14. Decision contours of different classifiers on the visual face and infra-red iris plane (with the infra-red face fixed at a score of 0.5) when combining three biometrics: (a) least-squares error minimization using fifth-order RM model, (b) SVM using fifth-order polynomial kernel, (c) SVM using RBF kernel with Gamma = 0.1, (d) TER minimization using fifth-order RM model.

7.2. Decision landscapes

The good performance can probably be understood from the decision-landscape point of view. Fig. 13 shows the score distributions and decision contours for combining the visual and the infra-red face biometrics. The genuine-user and imposter scores show large overlapping regions and rather indistinguishable distributions, giving rise to a difficult classification problem. In Fig. 13(a)-(d), we show the decision contours for LSE, SVM-poly, SVM-Rbf, and TER Q, respectively. For the LSE method, the decision contours are much affected by the density of the data: they rarely cut through the high-density imposter region (i.e. they fit the data density). For the SVM methods, the decision contours appear to be determined by the distribution structure of the data. This is particularly obvious for the SVM adopting the RBF kernel with a small kernel width. As for the proposed TER minimization, two phenomena are observed: (i) the decision contours appear unaffected by, and run through, the high-density imposter regions as in the SVM-poly case, and (ii) the orientations of the contours appear to follow the direction of the two cluster distributions.
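Contour plots like these are produced by evaluating a trained fusion classifier over a grid of match scores and tracing its 0.5 level set. A minimal sketch (hypothetical helper and a toy linear fuser, not the paper's classifiers):

```python
import numpy as np

def decision_grid(fuser, xlim=(0.0, 1.0), ylim=(0.0, 1.0), n=200):
    """Evaluate a fusion classifier on an n-by-n grid of 2-D match scores.
    The 0.5 level set of the returned array is the decision contour."""
    xs = np.linspace(xlim[0], xlim[1], n)
    ys = np.linspace(ylim[0], ylim[1], n)
    gx, gy = np.meshgrid(xs, ys)
    pairs = np.column_stack([gx.ravel(), gy.ravel()])
    return fuser(pairs).reshape(n, n)

# toy linear fuser with hypothetical weights; a real contour would come
# from e.g. matplotlib's plt.contour(z, levels=[0.5])
z = decision_grid(lambda s: 0.6 * s[:, 0] + 0.4 * s[:, 1])
```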
This suggests that the decision boundary is determined by the classification error distribution rather than by the data density. Fig. 14 shows part of the decision contours for fusion of three biometrics. The plot was obtained on the face-VS and iris-IR plane with face-IR fixed at a score of 0.5. Here, SVM-poly is seen to be rather affected by the high imposter density, which is perhaps due to its local solution property. Both TER Q and SVM-Rbf appear rather insensitive to the distribution density, and their decision contours are seen to cut across the imposter zones.

8. Conclusion

In this paper, an approach to directly optimize the decision total error rate with respect to a fusion classifier design

is proposed for multimodal biometric scores fusion. Through a quadratic approximation of the error counts, a closed-form solution is proposed to solve the optimization problem. Although starting from a different derivation point, the structure of the proposed solution can be related to that of Fisher linear discriminant analysis, except for the inclusion of nonlinear decision capability, normalization and several adjustable terms. This suggests that the proposed formulation is a nonlinear discriminant function, which has advantages over linear functions for complex decision hyper-surfaces. From the error minimization viewpoint, the significance of this formulation is the inclusion of the decision threshold in the optimization solution, which resolves the complexity of TER evaluation and threshold setting.

With ground application scenarios in mind, the proposed method (TER Q) is applied to fuse two face biometrics (based on visual and infra-red images) and an infra-red iris biometric. Extensive experiments were performed over various settings of the algorithm's model order, which is its only major tuning parameter. The performance is benchmarked on a publicly available database, as well as compared with the commonly adopted least-squares error criterion and two support vector machines adopting different kernels. The empirical findings are very encouraging. Our immediate task is to generalize the method to multiple-category problems for wider applications.

Acknowledgements

The authors would like to thank the following colleagues for assistance in the collection and generation of biometrics decision output data: Mr. Sang-Ki Kim and Mr. Kwang-Hyuk Bae. Special thanks go to Dr. Norman Poh for sharing the XM2VTS face and speaker verification data sets for fusion benchmarking.
This work was supported by the Korea Science and Engineering Foundation (KOSEF) through the Biometrics Engineering Research Center (BERC) at Yonsei University.

References

[1] A. Ross, R. Govindarajan, Feature level fusion using hand and face biometrics, in: Proceedings of the SPIE Conference on Biometric Technology for Human Identification II, Orlando, USA, March 2005.
[2] A.A. Ross, K. Nandakumar, A.K. Jain, Handbook of Multibiometrics, vol. 6, International Series on Biometrics, Springer, Berlin, 2006.
[3] Y.S. Huang, C.Y. Suen, A method of combining multiple experts for the recognition of unconstrained handwritten numerals, IEEE Trans. Pattern Anal. Mach. Intell. 17 (1) (1995).
[4] A.K. Jain, K. Nandakumar, A. Ross, Score normalization in multimodal biometric systems, Pattern Recognition 38 (2005).
[5] J. Kittler, M. Hatef, R.P.W. Duin, J. Matas, On combining classifiers, IEEE Trans. Pattern Anal. Mach. Intell. 20 (3) (1998).
[6] L. Hong, A. Jain, S. Pankanti, Can multibiometrics improve performance?, in: Proceedings of AutoID, Summit, NJ, 1999.
[7] A. Ross, A. Jain, Information fusion in biometrics, Pattern Recognition Lett. 24 (2003).
[8] L.I. Kuncheva, C.J. Whitaker, C.A. Shipp, R.P.W. Duin, Limits on the majority vote accuracy in classifier fusion, Pattern Anal. Appl. 6 (2003).
[9] J. Kittler, K. Messer, Fusion of multiple experts in multimodal biometric personal identity verification systems, in: Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, 2002.
[10] L.I. Kuncheva, J.C. Bezdek, R. Duin, Decision templates for multiple classifier design: an experimental comparison, Pattern Recognition 34 (2) (2001).
[11] K.-A. Toh, W.-Y. Yau, X. Jiang, A reduced multivariate polynomial model for multimodal biometrics and classifiers fusion, IEEE Trans. Circuits Systems Video Technol. (Special Issue on Image- and Video-Based Biometrics) 14 (2) (2004).
[12] N. Poh, S. Bengio, How do correlation and variance of base-experts affect fusion in biometric authentication tasks?, IEEE Trans. Signal Process. 53 (2005).
[13] B.E. Boser, I.M. Guyon, V.N. Vapnik, A training algorithm for optimal margin classifiers, in: Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, ACM, New York, 1992.
[14] V.N. Vapnik, Statistical Learning Theory, Wiley-Interscience, New York, 1998.
[15] B.E. Boser, I.M. Guyon, V.N. Vapnik, A training algorithm for optimal margin classifier, in: Proceedings of the Fifth ACM Workshop on Computational Learning Theory, Pittsburgh, PA, 1992.
[16] T. Poggio, F. Girosi, Networks for approximation and learning, Proc. IEEE 78 (9) (1990).
[17] B. Schölkopf, A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, MA, 2002.
[18] W.R. Wade, An Introduction to Analysis, second ed., Prentice-Hall, Upper Saddle River, NJ, 2000.
[19] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, New York, 1995.
[20] K.-A. Toh, Fingerprint and speaker verification decisions fusion, in: International Conference on Image Analysis and Processing (ICIAP), Mantova, Italy, 2003.
[21] K.-A. Toh, Q.-L. Tran, D. Srinivasan, Benchmarking a reduced multivariate polynomial pattern classifier, IEEE Trans. Pattern Anal. Mach. Intell. 26 (6) (2004).
[22] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., Wiley, New York, 2001.
[23] K.-A. Toh, Learning from target knowledge approximation, in: Proceedings of the First IEEE Conference on Industrial Electronics and Applications, Singapore, May 2006.
[24] G.J. Gordon, Generalized² linear² models, in: Advances in Neural Information Processing Systems (NIPS 2002), Vancouver, British Columbia, Canada, December 2002.
[25] P. McCullagh, J.A. Nelder, Generalized Linear Models, second ed., Chapman & Hall, London, 1989.
[26] J. Daugman, High confidence visual recognition of persons by a test of statistical independence, IEEE Trans. Pattern Anal. Mach. Intell. 15 (11) (1993).
[27] J. Daugman, Biometric personal identification system based on iris analysis, U.S. Patent 5,291,560, 1994.
[28] K. Bae, S. Noh, J. Kim, Iris feature extraction using independent component analysis, in: Proceedings of the 4th International Conference on Audio- and Video-Based Person Authentication (AVBPA), Guildford, UK, June 2003.
[29] S.Z. Li, A.K. Jain (Eds.), Handbook of Face Recognition, Springer, New York, 2004.
[30] S.-K. Kim, H. Lee, S. Yu, S. Lee, Robust face recognition by fusion of visual and infrared cues, in: Proceedings of the First IEEE Conference on Industrial Electronics and Applications, Singapore, May 2006.
[31] N. Poh, S. Bengio, Database protocol and tools for evaluating score-level fusion algorithms in biometric authentication, Pattern Recognition 39 (2) (2006).
[32] N. Poh, Multi-system biometrics: optimal fusion and user-specific information, Ph.D. dissertation, Swiss Federal Institute of Technology in Lausanne, 2006.


More information

Learning features by contrasting natural images with noise

Learning features by contrasting natural images with noise Learning features by contrasting natural images with noise Michael Gutmann 1 and Aapo Hyvärinen 12 1 Dept. of Computer Science and HIIT, University of Helsinki, P.O. Box 68, FIN-00014 University of Helsinki,

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Introduction to Support Vector Machines

Introduction to Support Vector Machines Introduction to Support Vector Machines Hsuan-Tien Lin Learning Systems Group, California Institute of Technology Talk in NTU EE/CS Speech Lab, November 16, 2005 H.-T. Lin (Learning Systems Group) Introduction

More information

Automatic Identity Verification Using Face Images

Automatic Identity Verification Using Face Images Automatic Identity Verification Using Face Images Sabry F. Saraya and John F. W. Zaki Computer & Systems Eng. Dept. Faculty of Engineering Mansoura University. Abstract This paper presents two types of

More information

Multivariate statistical methods and data mining in particle physics

Multivariate statistical methods and data mining in particle physics Multivariate statistical methods and data mining in particle physics RHUL Physics www.pp.rhul.ac.uk/~cowan Academic Training Lectures CERN 16 19 June, 2008 1 Outline Statement of the problem Some general

More information

SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels

SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels Karl Stratos June 21, 2018 1 / 33 Tangent: Some Loose Ends in Logistic Regression Polynomial feature expansion in logistic

More information

Lecture 4 Discriminant Analysis, k-nearest Neighbors

Lecture 4 Discriminant Analysis, k-nearest Neighbors Lecture 4 Discriminant Analysis, k-nearest Neighbors Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University. Email: fredrik.lindsten@it.uu.se fredrik.lindsten@it.uu.se

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Kernels for Multi task Learning

Kernels for Multi task Learning Kernels for Multi task Learning Charles A Micchelli Department of Mathematics and Statistics State University of New York, The University at Albany 1400 Washington Avenue, Albany, NY, 12222, USA Massimiliano

More information

Classification of handwritten digits using supervised locally linear embedding algorithm and support vector machine

Classification of handwritten digits using supervised locally linear embedding algorithm and support vector machine Classification of handwritten digits using supervised locally linear embedding algorithm and support vector machine Olga Kouropteva, Oleg Okun, Matti Pietikäinen Machine Vision Group, Infotech Oulu and

More information

Face Recognition Using Laplacianfaces He et al. (IEEE Trans PAMI, 2005) presented by Hassan A. Kingravi

Face Recognition Using Laplacianfaces He et al. (IEEE Trans PAMI, 2005) presented by Hassan A. Kingravi Face Recognition Using Laplacianfaces He et al. (IEEE Trans PAMI, 2005) presented by Hassan A. Kingravi Overview Introduction Linear Methods for Dimensionality Reduction Nonlinear Methods and Manifold

More information

Analytical Study of Biometrics Normalization and Fusion Techniques For Designing a Multimodal System

Analytical Study of Biometrics Normalization and Fusion Techniques For Designing a Multimodal System Volume Issue 8, November 4, ISSN No.: 348 89 3 Analytical Study of Biometrics Normalization and Fusion Techniques For Designing a Multimodal System Divya Singhal, Ajay Kumar Yadav M.Tech, EC Department,

More information

INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY

INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY [Gaurav, 2(1): Jan., 2013] ISSN: 2277-9655 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY Face Identification & Detection Using Eigenfaces Sachin.S.Gurav *1, K.R.Desai 2 *1

More information

PATTERN CLASSIFICATION

PATTERN CLASSIFICATION PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

More information

CITS 4402 Computer Vision

CITS 4402 Computer Vision CITS 4402 Computer Vision A/Prof Ajmal Mian Adj/A/Prof Mehdi Ravanbakhsh Lecture 06 Object Recognition Objectives To understand the concept of image based object recognition To learn how to match images

More information

EXTRACTING BIOMETRIC BINARY STRINGS WITH MINIMAL AREA UNDER THE FRR CURVE FOR THE HAMMING DISTANCE CLASSIFIER

EXTRACTING BIOMETRIC BINARY STRINGS WITH MINIMAL AREA UNDER THE FRR CURVE FOR THE HAMMING DISTANCE CLASSIFIER 17th European Signal Processing Conference (EUSIPCO 29) Glasgow, Scotland, August 24-28, 29 EXTRACTING BIOMETRIC BINARY STRINGS WITH MINIMA AREA UNER THE CURVE FOR THE HAMMING ISTANCE CASSIFIER Chun Chen,

More information

Support Vector Machine Regression for Volatile Stock Market Prediction

Support Vector Machine Regression for Volatile Stock Market Prediction Support Vector Machine Regression for Volatile Stock Market Prediction Haiqin Yang, Laiwan Chan, and Irwin King Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin,

More information

Subcellular Localisation of Proteins in Living Cells Using a Genetic Algorithm and an Incremental Neural Network

Subcellular Localisation of Proteins in Living Cells Using a Genetic Algorithm and an Incremental Neural Network Subcellular Localisation of Proteins in Living Cells Using a Genetic Algorithm and an Incremental Neural Network Marko Tscherepanow and Franz Kummert Applied Computer Science, Faculty of Technology, Bielefeld

More information

Eigenface-based facial recognition

Eigenface-based facial recognition Eigenface-based facial recognition Dimitri PISSARENKO December 1, 2002 1 General This document is based upon Turk and Pentland (1991b), Turk and Pentland (1991a) and Smith (2002). 2 How does it work? The

More information

Nonlinear Support Vector Machines through Iterative Majorization and I-Splines

Nonlinear Support Vector Machines through Iterative Majorization and I-Splines Nonlinear Support Vector Machines through Iterative Majorization and I-Splines P.J.F. Groenen G. Nalbantov J.C. Bioch July 9, 26 Econometric Institute Report EI 26-25 Abstract To minimize the primal support

More information

Characterization of Jet Charge at the LHC

Characterization of Jet Charge at the LHC Characterization of Jet Charge at the LHC Thomas Dylan Rueter, Krishna Soni Abstract The Large Hadron Collider (LHC) produces a staggering amount of data - about 30 petabytes annually. One of the largest

More information

Machine Learning 2017

Machine Learning 2017 Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section

More information

Multimodal Biometric Fusion Joint Typist (Keystroke) and Speaker Verification

Multimodal Biometric Fusion Joint Typist (Keystroke) and Speaker Verification Multimodal Biometric Fusion Joint Typist (Keystroke) and Speaker Verification Jugurta R. Montalvão Filho and Eduardo O. Freire Abstract Identity verification through fusion of features from keystroke dynamics

More information

SVM TRADE-OFF BETWEEN MAXIMIZE THE MARGIN AND MINIMIZE THE VARIABLES USED FOR REGRESSION

SVM TRADE-OFF BETWEEN MAXIMIZE THE MARGIN AND MINIMIZE THE VARIABLES USED FOR REGRESSION International Journal of Pure and Applied Mathematics Volume 87 No. 6 2013, 741-750 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu doi: http://dx.doi.org/10.12732/ijpam.v87i6.2

More information

Learning Kernel Parameters by using Class Separability Measure

Learning Kernel Parameters by using Class Separability Measure Learning Kernel Parameters by using Class Separability Measure Lei Wang, Kap Luk Chan School of Electrical and Electronic Engineering Nanyang Technological University Singapore, 3979 E-mail: P 3733@ntu.edu.sg,eklchan@ntu.edu.sg

More information

Selection of Classifiers based on Multiple Classifier Behaviour

Selection of Classifiers based on Multiple Classifier Behaviour Selection of Classifiers based on Multiple Classifier Behaviour Giorgio Giacinto, Fabio Roli, and Giorgio Fumera Dept. of Electrical and Electronic Eng. - University of Cagliari Piazza d Armi, 09123 Cagliari,

More information

Information Fusion for Local Gabor Features Based Frontal Face Verification

Information Fusion for Local Gabor Features Based Frontal Face Verification Information Fusion for Local Gabor Features Based Frontal Face Verification Enrique Argones Rúa 1, Josef Kittler 2, Jose Luis Alba Castro 1, and Daniel González Jiménez 1 1 Signal Theory Group, Signal

More information

Biometrics: Introduction and Examples. Raymond Veldhuis

Biometrics: Introduction and Examples. Raymond Veldhuis Biometrics: Introduction and Examples Raymond Veldhuis 1 Overview Biometric recognition Face recognition Challenges Transparent face recognition Large-scale identification Watch list Anonymous biometrics

More information

Support Vector Machines: Maximum Margin Classifiers

Support Vector Machines: Maximum Margin Classifiers Support Vector Machines: Maximum Margin Classifiers Machine Learning and Pattern Recognition: September 16, 2008 Piotr Mirowski Based on slides by Sumit Chopra and Fu-Jie Huang 1 Outline What is behind

More information

Iterative Laplacian Score for Feature Selection

Iterative Laplacian Score for Feature Selection Iterative Laplacian Score for Feature Selection Linling Zhu, Linsong Miao, and Daoqiang Zhang College of Computer Science and echnology, Nanjing University of Aeronautics and Astronautics, Nanjing 2006,

More information

COS 429: COMPUTER VISON Face Recognition

COS 429: COMPUTER VISON Face Recognition COS 429: COMPUTER VISON Face Recognition Intro to recognition PCA and Eigenfaces LDA and Fisherfaces Face detection: Viola & Jones (Optional) generic object models for faces: the Constellation Model Reading:

More information

Unsupervised Anomaly Detection for High Dimensional Data

Unsupervised Anomaly Detection for High Dimensional Data Unsupervised Anomaly Detection for High Dimensional Data Department of Mathematics, Rowan University. July 19th, 2013 International Workshop in Sequential Methodologies (IWSM-2013) Outline of Talk Motivation

More information

Small sample size generalization

Small sample size generalization 9th Scandinavian Conference on Image Analysis, June 6-9, 1995, Uppsala, Sweden, Preprint Small sample size generalization Robert P.W. Duin Pattern Recognition Group, Faculty of Applied Physics Delft University

More information

Support Vector Ordinal Regression using Privileged Information

Support Vector Ordinal Regression using Privileged Information Support Vector Ordinal Regression using Privileged Information Fengzhen Tang 1, Peter Tiňo 2, Pedro Antonio Gutiérrez 3 and Huanhuan Chen 4 1,2,4- The University of Birmingham, School of Computer Science,

More information

COMS 4771 Introduction to Machine Learning. Nakul Verma

COMS 4771 Introduction to Machine Learning. Nakul Verma COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW1 due next lecture Project details are available decide on the group and topic by Thursday Last time Generative vs. Discriminative

More information

Kernel Methods. Barnabás Póczos

Kernel Methods. Barnabás Póczos Kernel Methods Barnabás Póczos Outline Quick Introduction Feature space Perceptron in the feature space Kernels Mercer s theorem Finite domain Arbitrary domain Kernel families Constructing new kernels

More information

Linear Discrimination Functions

Linear Discrimination Functions Laurea Magistrale in Informatica Nicola Fanizzi Dipartimento di Informatica Università degli Studi di Bari November 4, 2009 Outline Linear models Gradient descent Perceptron Minimum square error approach

More information

Two-Layered Face Detection System using Evolutionary Algorithm

Two-Layered Face Detection System using Evolutionary Algorithm Two-Layered Face Detection System using Evolutionary Algorithm Jun-Su Jang Jong-Hwan Kim Dept. of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology (KAIST),

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Pattern Recognition and Machine Learning. Perceptrons and Support Vector machines

Pattern Recognition and Machine Learning. Perceptrons and Support Vector machines Pattern Recognition and Machine Learning James L. Crowley ENSIMAG 3 - MMIS Fall Semester 2016 Lessons 6 10 Jan 2017 Outline Perceptrons and Support Vector machines Notation... 2 Perceptrons... 3 History...3

More information

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 www.cs.ubc.ca/~schmidtm/svan16 Some images from this lecture are

More information

Issues and Techniques in Pattern Classification

Issues and Techniques in Pattern Classification Issues and Techniques in Pattern Classification Carlotta Domeniconi www.ise.gmu.edu/~carlotta Machine Learning Given a collection of data, a machine learner eplains the underlying process that generated

More information

Face Recognition Using Eigenfaces

Face Recognition Using Eigenfaces Face Recognition Using Eigenfaces Prof. V.P. Kshirsagar, M.R.Baviskar, M.E.Gaikwad, Dept. of CSE, Govt. Engineering College, Aurangabad (MS), India. vkshirsagar@gmail.com, madhumita_baviskar@yahoo.co.in,

More information

Machine Learning for Signal Processing Bayes Classification and Regression

Machine Learning for Signal Processing Bayes Classification and Regression Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For

More information

6.867 Machine learning

6.867 Machine learning 6.867 Machine learning Mid-term eam October 8, 6 ( points) Your name and MIT ID: .5.5 y.5 y.5 a).5.5 b).5.5.5.5 y.5 y.5 c).5.5 d).5.5 Figure : Plots of linear regression results with different types of

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Linear models: the perceptron and closest centroid algorithms. D = {(x i,y i )} n i=1. x i 2 R d 9/3/13. Preliminaries. Chapter 1, 7.

Linear models: the perceptron and closest centroid algorithms. D = {(x i,y i )} n i=1. x i 2 R d 9/3/13. Preliminaries. Chapter 1, 7. Preliminaries Linear models: the perceptron and closest centroid algorithms Chapter 1, 7 Definition: The Euclidean dot product beteen to vectors is the expression d T x = i x i The dot product is also

More information

Outline. Basic concepts: SVM and kernels SVM primal/dual problems. Chih-Jen Lin (National Taiwan Univ.) 1 / 22

Outline. Basic concepts: SVM and kernels SVM primal/dual problems. Chih-Jen Lin (National Taiwan Univ.) 1 / 22 Outline Basic concepts: SVM and kernels SVM primal/dual problems Chih-Jen Lin (National Taiwan Univ.) 1 / 22 Outline Basic concepts: SVM and kernels Basic concepts: SVM and kernels SVM primal/dual problems

More information

Support Vector Machine for Classification and Regression

Support Vector Machine for Classification and Regression Support Vector Machine for Classification and Regression Ahlame Douzal AMA-LIG, Université Joseph Fourier Master 2R - MOSIG (2013) November 25, 2013 Loss function, Separating Hyperplanes, Canonical Hyperplan

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

When enough is enough: early stopping of biometrics error rate testing

When enough is enough: early stopping of biometrics error rate testing When enough is enough: early stopping of biometrics error rate testing Michael E. Schuckers Department of Mathematics, Computer Science and Statistics St. Lawrence University and Center for Identification

More information

Pramod K. Varshney. EECS Department, Syracuse University This research was sponsored by ARO grant W911NF

Pramod K. Varshney. EECS Department, Syracuse University This research was sponsored by ARO grant W911NF Pramod K. Varshney EECS Department, Syracuse University varshney@syr.edu This research was sponsored by ARO grant W911NF-09-1-0244 2 Overview of Distributed Inference U i s may be 1. Local decisions 2.

More information

Support Vector Regression with Automatic Accuracy Control B. Scholkopf y, P. Bartlett, A. Smola y,r.williamson FEIT/RSISE, Australian National University, Canberra, Australia y GMD FIRST, Rudower Chaussee

More information

Polyhedral Computation. Linear Classifiers & the SVM

Polyhedral Computation. Linear Classifiers & the SVM Polyhedral Computation Linear Classifiers & the SVM mcuturi@i.kyoto-u.ac.jp Nov 26 2010 1 Statistical Inference Statistical: useful to study random systems... Mutations, environmental changes etc. life

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Stefanos Zafeiriou, Anastasios Tefas, and Ioannis Pitas

Stefanos Zafeiriou, Anastasios Tefas, and Ioannis Pitas GENDER DETERMINATION USING A SUPPORT VECTOR MACHINE VARIANT Stefanos Zafeiriou, Anastasios Tefas, and Ioannis Pitas Artificial Intelligence and Information Analysis Lab/Department of Informatics, Aristotle

More information

Predicting Time of Peak Foreign Exchange Rates. Charles Mulemi, Lucio Dery 0. ABSTRACT

Predicting Time of Peak Foreign Exchange Rates. Charles Mulemi, Lucio Dery 0. ABSTRACT Predicting Time of Peak Foreign Exchange Rates Charles Mulemi, Lucio Dery 0. ABSTRACT This paper explores various machine learning models of predicting the day foreign exchange rates peak in a given window.

More information

Score calibration for optimal biometric identification

Score calibration for optimal biometric identification Score calibration for optimal biometric identification (see also NIST IBPC 2010 online proceedings: http://biometrics.nist.gov/ibpc2010) AI/GI/CRV 2010, Ottawa Dmitry O. Gorodnichy Head of Video Surveillance

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

Benchmarking Non-Parametric Statistical Tests

Benchmarking Non-Parametric Statistical Tests R E S E A R C H R E P O R T I D I A P Benchmarking Non-Parametric Statistical Tests Mikaela Keller a Samy Bengio a Siew Yeung Wong a IDIAP RR 05-38 January 5, 2006 to appear in Advances in Neural Information

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information