Correspondence. The Effect of Correlation and Performances of Base-Experts on Score Fusion

IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS

Yishu Liu, Lihua Yang, and Ching Y. Suen

Abstract: In the field of biometric authentication, performing score fusion to improve authentication accuracy is a promising trend. Many empirical studies have shown the effectiveness of score fusion; however, other researchers assert that fusion is not always beneficial. Despite considerable empirical effort, to the best of our knowledge the only research devoted to the theoretical analysis of fusion is the paper by Poh and Bengio published in 2005. Unfortunately, we find that the variance reduction-equal error rate (VR-EER) model, which is the theoretical basis of that reference, is incorrect and that the resulting conclusions are arguable. We also find that the conclusions of several other empirical studies are arguable. In this paper, using Fermat's theorem and the connection between F-ratio and EER, we conduct a systematic theoretical study of how the correlation and performances of base-experts affect fusion, and we give the underlying reason why the VR-EER model and the above conclusions are wrong. Contrary to these existing conclusions, we prove that, provided the fusion weights are selected according to our proposed criterion, the combined system will definitely be superior to all the base-experts, regardless of the correlation, performances, or variances of the base-experts. Experiments are carried out to validate our conclusions and to construct counter-examples to the existing ones.

Index Terms: Correlation, F-ratio, optimal weight vector, performance, score fusion.

I. Introduction

Biometric authentication (BA) refers to the automatic verification of an identity claim using a person's behavioral or physiological traits. Fingerprints, face, voice, iris, gait, and signature are among the most popular biometric modalities.
In terms of accuracy and security, unimodal biometric systems are usually inferior to multimodal biometric systems, which fuse information from different sources. The most common fusion approach in multimodal systems is score fusion. Here, score refers to the scalar output of an individual matcher, measuring the similarity/difference between the person to be authenticated and his/her claimed identity. We call the individual matchers that output these scores experts or base-experts.

The idea of fusing scores to perform biometric authentication has been widely described in the literature, for example in [1]–[5]. All these empirical studies conclude, from their experimental results, that score fusion leads to improved performance compared with the individual base-experts. However, some other researchers have expressed the opinion that fusion is not always advantageous.

Manuscript received July 16, 01; revised October 19, 01; accepted September 1, 01. This work was supported in part by the NSFC under Grant , Grant , Grant , Grant , and Grant , the Computational Science Innovative Research Team Program, Guangdong Province Key Laboratory of Computational Science, Sun Yat-Sen University, and the Cross-Cutting Research Project of South China Normal University under Grant . This paper was recommended by Associate Editor W. Pedrycz.
Y. Liu is with the School of Mathematics and Computational Science, Sun Yat-Sen University, Guangzhou 51075, China, and also with the School of Geography, South China Normal University, Guangzhou , China (yishuliu@1cn.com).
L. Yang is with the Guangdong Province Key Laboratory of Computational Science, School of Mathematics and Computational Science, Sun Yat-Sen University, Guangzhou 51075, China (mcsylh@mail.sysu.edu.cn).
C. Y. Suen is with the Center for Pattern Recognition and Machine Intelligence, Concordia University, Montreal, QC H3G 1M8, Canada (parmidir@cenparmi.concordia.ca).
Digital Object Identifier /TSMC
As stated by Daugman [6] and Proença [3], performance improvement due to fusion is guaranteed only when two experts with very similar performances are combined; in contrast, combining two experts with very different performances will always result in inferior performance compared with the stronger expert (hereafter, we refer to this statement as existing conclusion 1). It is also claimed in [3] that when two base-experts are highly correlated, the performance of fusion is usually not as good as that of the stronger base-expert (hereafter, we refer to this statement as existing conclusion 2); a similar conclusion is drawn in [7]. We find that existing conclusions 1 and 2 are arguable.

It should be stressed that the above conclusions are all founded on experimental results. Under what conditions is the performance of the combined system superior or inferior to those of the base-experts? How do the correlation, performances, and variances of base-experts affect fusion? None of the works mentioned above made a theoretical or quantitative analysis of these issues. To the best of our knowledge, the only work that attempts a theoretical study of these problems is [8]. Using a theoretical model called variance reduction-equal error rate (VR-EER) analysis, Poh and Bengio [8] analyzed the impact of the correlation and variances of base-experts on fusion performance and came to several conclusions, two of which are: 1) if the weaker base-expert has a variance three times larger than the stronger base-expert's variance, the performance of the combined system is lower than that of the stronger base-expert (hereafter, we refer to this statement as existing conclusion 3); and 2) in any case, positive correlation hurts fusion while negative correlation improves fusion (hereafter, we refer to this statement as existing conclusion 4). Unfortunately, we find that the VR-EER model used in [8] is wrong, and the resulting conclusions are also incorrect.
Moreover, when dealing with linear fusion, previous studies only consider the case where the fusion weights are all positive [3], [6]–[8]. In our opinion, this restriction is not supported by any theory and, hence, is unreasonable. When, then, should the weights all be positive, and when should both positive and negative weights be taken into account?

A systematic theoretical analysis of these problems is made in this paper. We link the concept of EER to the so-called F-ratio, and use the combined system's F-ratio as the criterion for seeking optimal fusion weights. We also investigate the relationship between the combined system's F-ratio and the weight ratio (i.e., the ratio of one weight to the other), and analyze how the correlation between base-experts affects fusion in terms of performance. As a matter of fact, provided that the fusion weights are appropriately chosen, linear fusion will definitely lead to better performance than the best individual expert, and how to choose the weights appropriately depends on the performances and variances of, and the correlation between, the individual base-experts.

The rest of the paper is organized as follows. Section II provides some preliminary concepts and notation. Section III points out the errors made in the theoretical derivation of the VR-EER model in [8], proving that this model and the resulting conclusions are both incorrect. Using Fermat's theorem [9], Section IV analyzes fusion in detail, mainly addressing how optimal fusion weights can be chosen and how the correlation and performances of base-experts affect fusion. In that section, we reach our conclusions by analyzing, contradicting, or correcting existing conclusions 1–4. In Section V, numerical experiments are carried out to validate our statements. Finally, Section VI summarizes this paper.

© 2013 IEEE

II. Preliminary Concepts

In this section some important concepts, such as F-ratio, EER, and the correlation between base-experts, are defined.

A. F-Ratio and EER

On the basis of the ideas and contents of [8], we give the definition of F-ratio.
Definition 1 (F-ratio): For a given BA system, consider the client scores as realizations of a random variable $X^C$ with expected value $\mu^C$ and standard deviation $\sigma^C$, and consider the impostor scores as realizations of a random variable $X^I$ with expected value $\mu^I$ (throughout the paper we assume that $\mu^C \ge \mu^I$, as in [8]) and standard deviation $\sigma^I$. We refer to

\[ f = \frac{\mu^C - \mu^I}{\sigma^C + \sigma^I} \tag{1} \]

as the BA system's F-ratio.

F-ratio measures the separability between the client score distribution and its impostor counterpart. In the context of BA, for a given threshold $\Delta \in \mathbb{R}$, the person whose identity claim is to be verified is assigned to class client if the corresponding score is greater than $\Delta$, and to class impostor otherwise. The system commits two types of errors, measured by the false acceptance rate (FAR) and the false rejection rate (FRR), which are both functions of the threshold $\Delta$:

\[ \mathrm{FAR}(\Delta) = \int_{\Delta}^{+\infty} p^I(x)\,dx \tag{2} \]

\[ \mathrm{FRR}(\Delta) = \int_{-\infty}^{\Delta} p^C(x)\,dx \tag{3} \]

where $p^C(x)$ and $p^I(x)$ are the probability density functions of $X^C$ and $X^I$, respectively. If there exists $\Delta^* \in (\mu^I, \mu^C)$ such that $\mathrm{FAR}(\Delta^*) = \mathrm{FRR}(\Delta^*)$, then we refer to $\mathrm{EER} = \mathrm{FAR}(\Delta^*) = \mathrm{FRR}(\Delta^*)$ as the EER of the system. EER is one of the most commonly used measures of classification performance.

Now we investigate the relationship between EER and F-ratio. Suppose that both $X^C$ and $X^I$ follow Gaussian distributions. It is easy to show that when the threshold takes the value

\[ \Delta^* = \frac{\mu^C \sigma^I + \mu^I \sigma^C}{\sigma^C + \sigma^I} \tag{4} \]

the EER is given by

\[ \mathrm{EER} = \mathrm{FAR}(\Delta^*) = \mathrm{FRR}(\Delta^*) = \frac{1}{2} - \frac{1}{\sqrt{\pi}} \int_0^{f/\sqrt{2}} e^{-x^2}\,dx \tag{5} \]

where $f$ is defined in (1).

As $p^C(x)$ and $p^I(x)$ are generally unknown, in practice FAR and FRR are computed as follows:

\[ \mathrm{FAR}(\Delta) = \frac{\text{\# of falsely accepted accesses}}{\text{\# of impostor accesses}} \tag{6} \]

\[ \mathrm{FRR}(\Delta) = \frac{\text{\# of falsely rejected accesses}}{\text{\# of client accesses}} \tag{7} \]

In this situation there may not exist a $\Delta$ that strictly satisfies $\mathrm{FAR}(\Delta) = \mathrm{FRR}(\Delta)$; we then seek

\[ \Delta^* = \arg\min_{\Delta \in \mathbb{R}} \left| \mathrm{FAR}(\Delta) - \mathrm{FRR}(\Delta) \right| \tag{8} \]

instead, and the half total error rate (HTER)

\[ \mathrm{HTER} = \frac{\mathrm{FAR}(\Delta^*) + \mathrm{FRR}(\Delta^*)}{2} \tag{9} \]

replaces EER as the measure of a BA system's authentication performance.

From (5), we can infer that, under the assumption that both $X^C$ and $X^I$ are normally distributed, the larger the F-ratio is, the smaller the EER is. Thus, F-ratio can act as a measure of classification performance. A great number of experiments are conducted in [8], showing that formula (5) is insensitive to deviation from the Gaussian assumption. In other words, (5) generally holds in the non-Gaussian case as well. After repeating the experiments ourselves, we come to similar conclusions. Therefore, EER can be predicted from F-ratio regardless of whether or not $X^k$ ($k \in \{C, I\}$) conforms to the Gaussian assumption.

When fusing scores, we hope that the fusion coefficients (weights) are chosen in such a way that the combined system's EER (F-ratio) reaches its minimum (maximum). This gives the criterion for choosing weights used in Section IV.

B. Correlation Between Two Base-Experts

Since more than one BA system is dealt with below, the notation above is slightly modified: a subscript is added where necessary; e.g., $X_i^C$, instead of $X^C$, is used hereafter. The correlation coefficient between $X_i^k$ and $X_j^k$ is denoted as $\rho_{ij}^k$ ($k \in \{C, I\}$). We find that the value of $\rho_{ij}^C$ is very close to that of $\rho_{ij}^I$, so in practical applications one can use $\rho_{ij} = \frac{1}{2}(\rho_{ij}^C + \rho_{ij}^I)$ as a measure of the correlation between the $i$th and $j$th BA systems, just as we do in Section V.

III. The VR-EER Model and Its Incorrectness

Score fusion can be viewed as a two-step process: score normalization and fusion itself. The zero-mean unit-variance normalization method is used in [8]; score(s) throughout this section means the normalized score(s). Poh and Bengio [8] try to show that the F-ratio of the combined system obtained via a mean operator is, on average, greater than those of the base-experts when used separately. Unfortunately, this statement is incorrect, and the reasons are given below.

A. VR-EER Model and Our Queries

In [8] only one fusion method is considered, namely, the mean operator. More specifically, the combined scores are obtained by averaging the scores of all the base-experts. Poh and Bengio [8] call their proposed theoretical model VR-EER; it includes two parts, the first dealing with VR and the second relating F-ratio to EER.

First we recapitulate the theoretical framework of the VR-EER model. To keep our notation consistent, we make some slight adjustments to the notation of [8]. Suppose there are $N$ base-experts, and regard the $i$th ($1 \le i \le N$) base-expert's scores as realizations of a random variable $X_i^k$ ($k \in \{C, I\}$). Denote by $\mu_i^k$ and $\sigma_i^k$ the expected value and standard deviation of $X_i^k$, respectively, and by $X_{com}^k$ the random variable corresponding to the combined scores; then

\[ X_{com}^k = \frac{1}{N} \sum_{i=1}^{N} X_i^k, \quad k \in \{C, I\}. \tag{10} \]

The average of the variances of $X_i^k$ over all $i = 1, 2, \ldots, N$, denoted as $(\sigma_{av}^k)^2$, is

\[ (\sigma_{av}^k)^2 = \frac{1}{N} \sum_{i=1}^{N} (\sigma_i^k)^2, \quad k \in \{C, I\}. \tag{11} \]

It is easy to show that the variance of $X_{com}^k$, denoted as $(\sigma_{com}^k)^2$, satisfies

\[ (\sigma_{com}^k)^2 \le (\sigma_{av}^k)^2, \quad k \in \{C, I\}. \tag{12} \]
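As a quick sanity check of the variance inequality (12), the following minimal sketch draws random covariance matrices and compares the variance of the mean-fused score with the average of the individual variances. The helper names (`variance_of_mean`, `average_variance`, `random_covariance`) are hypothetical and are not taken from [8]; the matrices are synthetic and only illustrate the inequality.

```python
import random

def variance_of_mean(cov):
    """Variance of (1/N) * sum_i X_i, given the covariance matrix of (X_1, ..., X_N)."""
    n = len(cov)
    return sum(cov[i][j] for i in range(n) for j in range(n)) / (n * n)

def average_variance(cov):
    """Average of the individual variances, i.e. (sigma_av)^2 in (11)."""
    n = len(cov)
    return sum(cov[i][i] for i in range(n)) / n

def random_covariance(n, rng):
    """Random symmetric positive semi-definite matrix A A^T (synthetic, for illustration)."""
    a = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    return [[sum(a[i][k] * a[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

rng = random.Random(0)
for _ in range(1000):
    cov = random_covariance(4, rng)
    # Inequality (12): the mean-fused variance never exceeds the average variance.
    assert variance_of_mean(cov) <= average_variance(cov) + 1e-9
```

Equality is reached when all base-experts output identical scores, which is why the check uses a small tolerance rather than a strict inequality.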
It follows from (12) that by combining base-experts using a mean operator, the resultant variance is assured to be no larger than the average variance of the base-experts involved; hence the term variance reduction. This is the so-called VR in the VR-EER model.

The F-ratio of the combined system is

\[ f_{com} = \frac{\mu_{com}^C - \mu_{com}^I}{\sigma_{com}^C + \sigma_{com}^I} \tag{13} \]

where

\[ \mu_{com}^k = \frac{1}{N} \sum_{i=1}^{N} \mu_i^k, \quad k \in \{C, I\}. \tag{14} \]

Denote by $\mu_{av}^k$ the average over the $N$ base-experts when used separately, that is

\[ \mu_{av}^k = \frac{1}{N} \sum_{i=1}^{N} \mu_i^k, \quad k \in \{C, I\} \tag{15} \]

then the F-ratio of the average of the scores of the $N$ base-experts is

\[ f_{av} = \frac{\mu_{av}^C - \mu_{av}^I}{\sigma_{av}^C + \sigma_{av}^I}. \tag{16} \]

Using (12), (14), and (15), we arrive at the following inequality:

\[ f_{com} \ge f_{av}. \tag{17} \]

Let $\mathrm{EER}_{com}$ be the EER of the combined scores and let $\mathrm{EER}_{av}$ be the EER of the average of the scores of the $N$ base-experts. Combining (1) and (5) with (17) yields

\[ \mathrm{EER}_{com} \le \mathrm{EER}_{av}. \tag{18} \]

Thus, it can be concluded that, compared with the case where scores are used separately, fusion yields a reduction of variance and hence a reduction of EER. This is the rationale behind the VR-EER model.

Our opinions on the above reasoning are as follows.

1) The F-ratio of the average of the scores of the $N$ base-experts, denoted as $f_{av}$ and defined in (16), is not an F-ratio at all. If $f_{av}$ were an F-ratio then, according to Definition 1, $\mu_{av}^k$ and $\sigma_{av}^k$ ($k \in \{C, I\}$) would have to be the expected value and standard deviation, respectively, of some random variable, say $X_{av}^k$. However, it is impossible to explain what $X_{av}^k$ is; strictly speaking, there does not exist a random variable whose expected value and standard deviation satisfy (15) and (11), respectively. Therefore, $f_{av}$ is not an F-ratio.

2) The EER of the average of the scores of the $N$ base-experts, denoted as $\mathrm{EER}_{av}$ and used in (18), is not an EER. If $\mathrm{EER}_{av}$ were an EER, it would have to depend on some random variables but, as stated above, it is impossible to explain what such random variables mean.
$\mathrm{EER}_{av}$ does not mean the average of the EERs of the $N$ base-experts either (see below for the reasons). In our opinion, it is impossible to explain what $\mathrm{EER}_{av}$ can be used to measure.

3) Since $f_{av}$ is not an F-ratio and $\mathrm{EER}_{av}$ is not an EER, expression (5), which defines the relationship between F-ratio and EER, should not be used. Consequently, the derivation of the inequality (18) is incorrect.

4) Since the derivation of (18) is wrong, we think that the final conclusion of the VR-EER model [i.e., (18)] is meaningless.

In light of our opinion expressed in 1) above on $\mu_{av}^k$ and $\sigma_{av}^k$, and from the context of [2] (where it is said that $\mathrm{EER}_{av}$ measures the average performance of the baselines), we guess that $f_{av}$ in (16) may mean the average of the F-ratios of the $N$ base-experts. If so, (16) should instead be

\[ f_{av} = \frac{1}{N} \sum_{i=1}^{N} f_i = \frac{1}{N} \sum_{i=1}^{N} \frac{\mu_i^C - \mu_i^I}{\sigma_i^C + \sigma_i^I} \tag{19} \]

where $f_i$ denotes the $i$th base-expert's F-ratio. Correspondingly, we have

\[ \mathrm{EER}_{av} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{EER}_i \tag{20} \]

where $\mathrm{EER}_i$ denotes the $i$th base-expert's EER. However, even if $f_{av}$ is defined as in (19), (17) still does not hold (it is easy to give counter-examples). Even if (17) holds and $\mathrm{EER}_{av}$ is defined as in (20), (18) still cannot be derived, because EER depends on F-ratio through the nonlinear decreasing relation (5), not linearly. Therefore, $f_{av}$ and $\mathrm{EER}_{av}$ in [8] have no explainable mathematical or physical meaning and, hence, the statements and conclusions concerning them are unintelligible. We believe that using the VR-EER model to investigate the influence of the correlation and performances of base-experts on the combined system's performance is not workable.

B. Arguable Conclusions Drawn Under the VR-EER Model

In Section VI of [8], the VR-EER model is used to analyze four commonly encountered scenarios, and several conclusions are reached. To simplify the analysis, the authors of [8] make two assumptions, the first of which is that the two base-experts have the same F-ratio numerator. It is said in [8] that this assumption is reasonable because scores can be normalized to have canonical client and impostor means. For instance, after performing a linear transformation on $X_i^k$ (note that all the scores have previously been normalized using the zero-mean unit-variance method), the means of the client distribution and the impostor distribution become 1 and $-1$, respectively. However, we think that this assumption is unjustified, because performing a linear transformation on $X_i^k$ before fusion to ensure that the two base-experts have the same F-ratio numerator may lead to degraded fusion performance (although the F-ratios of the individual base-experts are not changed). Here is an example. Suppose 1) $\mu_1^C - \mu_1^I = \frac{3}{2}$; 2) $\mu_2^C - \mu_2^I = \frac{1}{2}$; 3) $\rho_{12}^I = \rho_{12}^C = 0$; and 4) $(\sigma_1^I)^2 = (\sigma_1^C)^2 = 2(\sigma_2^I)^2 = 2(\sigma_2^C)^2$.
In order to have canonical client and impostor means, say, 1 and $-1$, we have to perform the linear transformation

\[ \tilde{X}_i^k = \frac{2}{\mu_i^C - \mu_i^I} \left( X_i^k - \frac{\mu_i^C + \mu_i^I}{2} \right), \quad i = 1, 2;\ k \in \{C, I\}. \tag{21} \]

Using a mean operator to combine $\tilde{X}_i^k$ yields $\tilde{f}_{com}^2 = \frac{9}{44(\sigma_2^C)^2}$. If (21) is not performed at all, we have $f_{com}^2 = \frac{1}{3(\sigma_2^C)^2}$. Obviously, $\tilde{f}_{com} < f_{com}$, so performing (21) to meet the first assumption is unnecessary and even disadvantageous.

Since the analysis of Section VI of [8] is made on this unjustified assumption and under the incorrect VR-EER model, the resulting conclusions (i.e., existing conclusions 3 and 4) are arguable. In Sections IV-B to IV-D, more theoretical analyses and proofs based on our proposed optimal fusion approach are given to further rebut these conclusions.

IV. Theoretical Analysis of Fusion Performance Using Fermat's Theorem

In this section, we study the problem of how to select the weights of linear fusion. Since F-ratio can act as a measure of a BA system's performance, we use the combined system's F-ratio as the optimality criterion; that is, we search for the weights that maximize the F-ratio. We solve this problem using Fermat's theorem and other results from mathematical analysis, and analyze how the combined system's F-ratio changes with the ratio of one weight to the other. We also consider the case where the two base-experts' F-ratios are fixed and the linear fusion weights used are optimal, and study how the resulting combined system's performance relates to the correlation between the two base-experts. In addition, we point out the limitations of, and even errors in, existing conclusions 1–4 drawn by some researchers [3], [6]–[8], and make an in-depth theoretical analysis of what happened within their experimental setups and why these incorrect conclusions were drawn. Unlike in [8], the scores involved in this section may be regarded as unnormalized.

A. Derivation of Optimal Weight Vector of Linear Fusion

As before, suppose that we have $N$ ($N \ge 2$) base-experts, and regard the $i$th ($1 \le i \le N$) base-expert's scores as realizations of a random variable $X_i^k$ ($k \in \{C, I\}$), with $\mu_i^k$ and $\sigma_i^k$ the expected value and standard deviation of $X_i^k$, respectively. Let $\mathbf{d} = (d_1, d_2, \ldots, d_N)^T$ ($T$ means transpose throughout the paper), where

\[ d_i = \mu_i^C - \mu_i^I, \quad i = 1, 2, \ldots, N. \tag{22} \]

Let

\[ \boldsymbol{\xi}^k = (X_1^k, X_2^k, \ldots, X_N^k)^T, \quad k \in \{C, I\} \tag{23} \]

be an $N$-dimensional random vector. Denote by $\mathbf{w} = (w_1, w_2, \ldots, w_N)^T$ the weight vector of linear fusion; then $X_{com}^k$ (the random variable corresponding to the combined scores) can be represented by

\[ X_{com}^k = \mathbf{w}^T \boldsymbol{\xi}^k, \quad k \in \{C, I\}. \tag{24} \]

It is easy to check that the combined system's F-ratio is

\[ f_{com}(\mathbf{w}) = \frac{\mathbf{w}^T \mathbf{d}}{\sqrt{\mathbf{w}^T \Sigma^C R^C \Sigma^C \mathbf{w}} + \sqrt{\mathbf{w}^T \Sigma^I R^I \Sigma^I \mathbf{w}}} \tag{25} \]

where $R^k$ is the correlation matrix of $\boldsymbol{\xi}^k$ and $\Sigma^k$ is an $N \times N$ diagonal matrix whose diagonal entries are $\sigma_i^k$ ($k \in \{C, I\}$).

From (25), we can easily see that it is unnecessary to perform any linear normalization of scores before linear fusion, because the normalization coefficients can be absorbed by the weight vector $\mathbf{w}$. According to (25), the combined system's performance depends on 1) the fusion weights; 2) the expectations and variances of the individual base-experts; and 3) the correlations among the base-experts. Note that these three factors are interrelated: analyzing any one of them alone cannot lead to a convincing conclusion; instead, they should be considered as a whole.
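The two-expert case of (25) can be sketched in a few lines. The function name `f_ratio_linear` and the concrete numbers are illustrative assumptions: the script takes the Section III-B example to mean $d_1 = 3/2$, $d_2 = 1/2$, uncorrelated experts, and $\sigma_1^2 = 2\sigma_2^2$ in both classes, with $\sigma_2 = 1$ chosen arbitrarily. Under that reading it evaluates the mean operator with and without the canonical-mean normalization, and shows that the normalization scales can be folded into the weights.

```python
import math

def f_ratio_linear(w, d, sigma_c, sigma_i, rho_c, rho_i):
    """F-ratio (25) of the fused score w[0]*X_1 + w[1]*X_2 for two base-experts."""
    def fused_std(sigma, rho):
        # sqrt(w^T Sigma R Sigma w), written out for N = 2
        var = ((w[0] * sigma[0]) ** 2 + (w[1] * sigma[1]) ** 2
               + 2.0 * rho * w[0] * w[1] * sigma[0] * sigma[1])
        return math.sqrt(var)
    numerator = w[0] * d[0] + w[1] * d[1]
    return numerator / (fused_std(sigma_c, rho_c) + fused_std(sigma_i, rho_i))

# Assumed reading of the Section III-B example (sigma_2 = 1 is an arbitrary choice).
d = (1.5, 0.5)
sigma = (math.sqrt(2.0), 1.0)      # same standard deviations for clients and impostors
mean_w = (0.5, 0.5)                # mean operator

# Mean operator applied to the raw scores.
f_plain = f_ratio_linear(mean_w, d, sigma, sigma, 0.0, 0.0)

# Mean operator after mapping each expert to client mean 1 and impostor mean -1.
# The shift does not affect the F-ratio, so only the scale 2 / d_i matters.
scale = (2.0 / d[0], 2.0 / d[1])
d_norm = (d[0] * scale[0], d[1] * scale[1])
sigma_norm = (sigma[0] * scale[0], sigma[1] * scale[1])
f_norm = f_ratio_linear(mean_w, d_norm, sigma_norm, sigma_norm, 0.0, 0.0)

# Same result on the raw scores when the scales are absorbed into the weights.
absorbed_w = (mean_w[0] * scale[0], mean_w[1] * scale[1])
f_absorbed = f_ratio_linear(absorbed_w, d, sigma, sigma, 0.0, 0.0)

print(f_plain ** 2, f_norm ** 2, f_absorbed ** 2)
```

With these assumed values the squared F-ratios are 1/3 without normalization and 9/44 with it, so normalizing before averaging is simply a less favorable choice of weights on the raw scores.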

For convenience, we consider the case where $N = 2$ and assume that

\[ \rho_{12}^I = \rho_{12}^C \tag{26} \]

\[ \sigma_i^I = K \sigma_i^C, \quad i = 1, 2 \tag{27} \]

where $K \in \mathbb{R}^+$ is a constant. To simplify the notation, we drop the class label $C$ and the subscript 12 where this does not lead to confusion. In other words, we replace $\rho_{12}^C$ with $\rho$, and $\sigma_i^C$ with $\sigma_i$ ($i = 1, 2$). The $i$th base-expert's F-ratio $f_i$ then satisfies

\[ (K + 1) f_i = \frac{d_i}{\sigma_i}, \quad i = 1, 2. \tag{28} \]

Let

\[ t = \frac{w_2}{w_1} \tag{29} \]

and, when $w_1 = 0$, take $t = +\infty$ or $t = -\infty$. In this way, $t \in \overline{\mathbb{R}}$. Dividing the numerator and the denominator of (25) simultaneously by $w_1$, then substituting (29) into the resulting expression, and noting our assumptions (26) and (27), we get

\[ (K + 1)^2 [f_{com}(t)]^2 = \frac{d_1^2 + 2 d_1 d_2 t + d_2^2 t^2}{\sigma_1^2 + 2\rho\sigma_1\sigma_2 t + \sigma_2^2 t^2}. \tag{30} \]

Denote the right-hand side of (30) as

\[ H(t) = \frac{d_1^2 + 2 d_1 d_2 t + d_2^2 t^2}{\sigma_1^2 + 2\rho\sigma_1\sigma_2 t + \sigma_2^2 t^2}. \tag{31} \]

Since many properties of $f_{com}(t)$, such as extreme points and monotonicity intervals, are the same as those of $H(t)$, we study $H(t)$ instead of $f_{com}(t)$ below, which leads to the same conclusions and simplifies the analysis.

When $\rho \ne \pm 1$, (31) makes sense for any $t \in \mathbb{R}$. The following analysis is based on the assumption that $\rho \ne \pm 1$. Differentiating (31) leads to

\[ H'(t) = \frac{2\left[ d_2\sigma_2(\rho A - B)\, t^2 + (A^2 - B^2)\, t + d_1\sigma_1(A - \rho B) \right]}{(\sigma_1^2 + 2\rho\sigma_1\sigma_2 t + \sigma_2^2 t^2)^2} \tag{32} \]

where

\[ A = d_2\sigma_1, \quad B = d_1\sigma_2. \tag{33} \]

Setting $H'(t) = 0$ and using Fermat's theorem, we get the extreme points of $H(t)$:

\[ t_1 = -\frac{d_1}{d_2}, \quad t_2 = \frac{\sigma_1(\rho B - A)}{\sigma_2(\rho A - B)}. \tag{34} \]

When $\rho A - B = 0$ (i.e., $\rho = \frac{B}{A} = \frac{f_1}{f_2}$), we can treat it as the case where $t_2 = -\infty$ (equivalent to the case where $w_1 = 0$, $w_2 < 0$). It can be shown that $t_1$ is the minimum point and $t_2$ is the maximum point [9]. Hence, $t_2$ is what we have been searching for. More formally, we have the following definition.

Definition 2 (Optimal Weight Vector): For any $w_1 \in \mathbb{R}$, we call $\mathbf{w} = (w_1, t_2 w_1)^T$ the optimal weight vector of linear fusion, or optimal weight vector for short, where the expression for $t_2$ is given in (34).

Fig. 1. Combined system's performance versus weight ratio.

B. How Does the Combined System's Performance Vary With Fusion Weights?

According to the sign of $H'(t)$, the monotonicity intervals of $H(t)$ can be determined [9]; consequently, a qualitative graph of $H(t)$ can be drawn (see Fig. 1, which is drawn under the assumptions that $f_1 \le f_2$ and $d_i > 0$, $i = 1, 2$).

1) When $-1 < \rho < \frac{f_1}{f_2}$: The qualitative graph is displayed in Fig. 1(a). It is worth noting that $\rho$ is unlikely to closely approximate $-1$: if $\rho$ is close to $-1$, $d_1$ and $d_2$ are unlikely to be simultaneously greater than 0. We are unable to give the minimum value of $\rho$ in theory, so we let $-1$ be its lower limit for the moment.

2) When $\frac{f_1}{f_2} \le \rho < 1$: The qualitative graph is displayed in Fig. 1(b). It can be seen that when $\frac{f_1}{f_2} \le \rho < 1$, to maximize the combined system's performance we must ensure that $t_2 < 0$, i.e., that the weights $w_1$ and $w_2$ have different signs. As far as we know, almost all linear fusion methods [1]–[5] use only positive weights. Fig. 1(b) demonstrates that this common practice cannot maximize the combined system's performance and hence is not reasonable. Existing conclusions 1–4 are all founded on the practice of taking only positive weights into account; this is the underlying reason for their incorrectness. Experimental validation of this can be found in Section V-B, and we shall return to this point later (see Section IV-D).

Substituting $t_2$ of (34) into (31) and noting (28) and (30), we obtain

\[ \max_{t \in \overline{\mathbb{R}}} [f_{com}(t)]^2 = \frac{H(t_2)}{(K + 1)^2} = \frac{f_1^2 - 2\rho f_1 f_2 + f_2^2}{1 - \rho^2}. \tag{35} \]

For two base-experts satisfying (26) and (27), (35) gives the best performance of the linearly combined system.

C. How Does Correlation Between Base-Experts Affect the Combined System's Performance?

Now we address the following problem. When we use the optimal weight vector to combine two base-experts whose F-ratios are $f_1$ and $f_2$, respectively, how does the combined system's performance relate to the correlation between the two base-experts?

Treat the right-hand side of (35) as a new function

\[ J(\rho) = \frac{f_1^2 - 2\rho f_1 f_2 + f_2^2}{1 - \rho^2} \tag{36} \]

the rough graph of which is shown in Fig. 2 (suppose $f_1 \le f_2$). The minimum of $J(\rho)$ is obtained for $\rho = \frac{f_1}{f_2}$ and equals

\[ \min_{\rho \in (-1, 1)} J(\rho) = \min_{\rho \in (-1, 1)} \frac{f_1^2 - 2\rho f_1 f_2 + f_2^2}{1 - \rho^2} = f_2^2. \tag{37} \]

Fig. 2. Combined system's best performance versus correlation between base-experts.

From Fig. 2 and (37), it can be seen that as long as the optimal weight vector of Definition 2 is used as the fusion coefficients, the resulting combined system is always at least as good as each individual base-expert in terms of performance, regardless of the magnitude of the difference between the two base-experts' performances or of the correlation between them. This contradicts existing conclusions 1–4 in theory, and the experimental results in Section V-B support our findings.

D. Further Explanations of Incorrectness of the Existing Conclusions

We return to the issue of how the combined system's performance changes with $t$, focusing on further explanations of why the existing conclusions are incorrect.

When the two base-experts' performances are very similar, we have $\frac{f_1}{f_2} \approx 1$, so in all probability $\rho < \frac{f_1}{f_2}$, corresponding to Fig. 1(a). Therefore, for $t = \frac{w_2}{w_1} \in (0, +\infty)$, in all probability the combined system is better than the stronger expert in terms of performance. On the contrary, when the difference between the two base-experts' performances is quite significant, $\frac{f_1}{f_2}$ is rather small and we might as well take $\frac{f_1}{f_2} \approx 0$, so in all probability $\frac{f_1}{f_2} \le \rho < 1$, corresponding to Fig. 1(b). Therefore, for any $w_1, w_2 > 0$, the combined system is worse than the stronger expert (though better than the weaker base-expert) in terms of performance. By considering only positive weights, the researchers in [3] and [6] drew the incorrect existing conclusion 1.

When two base-experts are highly correlated, we have $1 \approx \rho \ge \frac{f_1}{f_2}$. In this case Fig. 1(b) tells us that fusion with positive weights is not as good as the stronger base-expert acting alone. This explains the experimental results on which existing conclusion 2 is based.

Finally, we again consider the case where the weights used are optimal. From Fig. 2, we know that when $\rho \in [\frac{f_1}{f_2}, 1)$, $J(\rho)$ is monotonically increasing. This disproves in theory the first part of existing conclusion 4, i.e., that in any case positive correlation hurts fusion.

V. Numerical Experiments

In this section experiments are performed to validate our statements and to refute the conclusions drawn by other researchers in the previous literature. We first present the experimental setup, explaining the database and the specific score files used. Then the experimental results are shown and analyzed.

A. Experimental Setup

For convenience of comparison, we use the same database as in [8], namely, BANCA [10]. There are seven experimental protocols on the BANCA database, and we number them 1, 2, ..., 7. Marcel [11] used BANCA to conduct some experiments, creating a number of score files, 970 of which can be freely downloaded (see footnote 1). For each protocol we consider five score files:

1) IDIAP voice gmm auto scale scores (speech expert);
2) SURREY face svm man.scores (face expert);
3) UUC3M voice gmm auto scale scores (speech expert);
4) UCL face lda man.scores (face expert); and
5) UC3M voice gmm auto scale 10 3.scores (speech expert).

Among these score files, the first four were also used in [8]. In order to cover all the scenarios discussed in Section IV, we have added the fifth one. For each protocol, there are two subgroups, called g1 and g2. As in [8], we use g1 as the development sets and g2 as the evaluation sets.

B. Fusion Using Optimal Weight Vector Versus Fusion Using Mean Operator

Some base-experts that basically satisfy (26) and (27) are picked from the 35 score files ($7 \times 5 = 35$) described in Section V-A and are combined.
The relevant parameters are estimated from the development sets and used to evaluate the optimal weight vector $\mathbf{w} = (1, t_2)^T$ according to (34). The resulting optimal weight vector is then used to perform linear fusion both on the development sets and on the evaluation sets. For comparison purposes, we also use the mean operator to perform fusion as in [8]; in that case all the scores are normalized beforehand. The experimental results are shown in Table I. Based on these data, our observations and analyses are listed below.

Footnote 1: [Online]. Available: at Poh/web/fusion/download.php.
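The weight-selection step can be illustrated with a minimal sketch of (31) and (34)–(36), together with the Gaussian EER prediction (5). This is not the BANCA experiment: the function names and the parameter values ($K = 1$, $d = (1, 3)$, $\sigma = (1, 1)$, $\rho = 0.6$) are hypothetical choices picked so that $\rho > f_1/f_2$, the regime in which the optimal weights have different signs.

```python
import math

def optimal_weight_ratio(d1, d2, s1, s2, rho):
    """t_2 from (34); assumes |rho| < 1 and rho * A != B."""
    A, B = d2 * s1, d1 * s2
    return s1 * (rho * B - A) / (s2 * (rho * A - B))

def H(t, d1, d2, s1, s2, rho):
    """Right-hand side of (31), i.e. (K + 1)^2 * f_com(t)^2."""
    return (d1 + d2 * t) ** 2 / (s1 ** 2 + 2.0 * rho * s1 * s2 * t + (s2 * t) ** 2)

def J(rho, f1, f2):
    """Best squared F-ratio of the combined system, (36)."""
    return (f1 ** 2 - 2.0 * rho * f1 * f2 + f2 ** 2) / (1.0 - rho ** 2)

def eer_from_f_ratio(f):
    """EER predicted by (5) under the Gaussian assumption (erf form of the integral)."""
    return 0.5 - 0.5 * math.erf(f / math.sqrt(2.0))

# Hypothetical parameters (not taken from the BANCA score files).
K = 1.0
d1, d2, s1, s2, rho = 1.0, 3.0, 1.0, 1.0, 0.6
f1 = d1 / ((K + 1.0) * s1)       # weaker base-expert, from (28)
f2 = d2 / ((K + 1.0) * s2)       # stronger base-expert

t_opt = optimal_weight_ratio(d1, d2, s1, s2, rho)
f_opt_sq = H(t_opt, d1, d2, s1, s2, rho) / (K + 1.0) ** 2    # optimal weights
f_mean_sq = H(1.0, d1, d2, s1, s2, rho) / (K + 1.0) ** 2     # equal weights (t = 1)

print("weight ratio t_2:", t_opt)
print("squared F-ratios (mean, stronger expert, optimal):", f_mean_sq, f2 ** 2, f_opt_sq)
print("predicted EERs  (mean, stronger expert, optimal):",
      eer_from_f_ratio(math.sqrt(f_mean_sq)),
      eer_from_f_ratio(f2),
      eer_from_f_ratio(math.sqrt(f_opt_sq)))
```

Because $H(t)$ is a squared quantity, the sign of $w_1$ is not determined by $t_2$ alone; in practice one would pick it so that $\mathbf{w}^T \mathbf{d} > 0$. In this toy setting the equal-weight fusion falls below the stronger expert while the optimal ratio exceeds it, which mirrors the qualitative picture of Fig. 1(b).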

TABLE I
Fusion Using Optimal Weights Versus Fusion Using Mean Operator

Column groups:
Base-experts: No. (note 1), Corr. (note 2), Critical Value (note 3), F-ratio1 (note 4), F-ratio2 (note 5), HTER1 (note 6), HTER2 (note 7)
Fusion using optimal weight vector: $t_2$ (note 8), F-ratio1 (note 4), F-ratio2 (note 5), HTER1 (note 6), HTER2 (note 7)
Fusion using mean operator: F-ratio1 (note 4), F-ratio2 (note 5), HTER1 (note 6), HTER2 (note 7)

Notes:
1 Base-experts' serial numbers; for example, 1.3 represents the third base-expert of protocol 1.
2 Correlation between two base-experts, computed as $\rho_{12} = \frac{1}{2}(\rho_{12}^C + \rho_{12}^I)$.
3 Ratio of one F-ratio to the other, i.e., $\frac{f_1}{f_2}$ $\left(= \frac{d_1\sigma_2}{d_2\sigma_1}\right)$; if $\rho_{12}$ is less than it, $t_2$ is positive; otherwise $t_2$ is negative.
4 F-ratios computed on development sets.
5 F-ratios computed on evaluation sets.
6 HTER computed on development sets (multiplied by 100).
7 HTER computed on evaluation sets (multiplied by 100).
8 Ratio of the second weight to the first, i.e., $\frac{w_2}{w_1}$.

1) Regardless of whether the two base-experts are positively correlated, negatively correlated, or uncorrelated, and no matter how significant the performance difference between them is, the HTER of the combined system is smaller than (at most equal to) that of the stronger base-expert as long as the optimal weight vector is used. On the development sets, the combined systems are clearly better than each individual base-expert in terms of authentication performance. On the evaluation sets, the combined system and the stronger expert sometimes exhibit almost the same performance (see the results of combining 1.1 and 1.3, those of combining 1.1 and 1.5, and those of combining 7.3 and 7.5). The reason is that the optimal weight vector has been estimated from the development sets and hence is optimal only for the development sets; it may not be optimal for the evaluation sets.
The values of F-ratio also give a fair indication of this non-optimality: when 1.1 and 1.5 (also 7.3 and 7.5) are combined, the F-ratio of the combined system is a bit lower than that of base-expert 1.1 (7.3).

2) Fusion using the optimal weight vector generally outperforms fusion via the mean operator. F-ratio1, F-ratio2, HTER1, and HTER2 of optimally weighted fusion are better than (at least as good as) their counterparts for mean-operator fusion, with the exception of the F-ratio on the evaluation set (i.e., F-ratio2) when combining 3.1 and 3.2.

3) When two base-experts exhibit very different performances (for example, 1.1 versus 1.5, 7.3 versus 7.5) or are highly correlated (for example, 1.1 versus 1.3), we indeed cannot benefit from linear fusion with positive weights, as stated in [3], [6], and [7].

4) The optimal weights used to combine base-experts 1.1 and 1.5, and also those used to combine 7.3 and 7.5, have different signs, and they result in much better performance than the mean operator. This indicates that considering only positive weights is unreasonable.

5) Finally, theoretically speaking, the greater the F-ratio is, the smaller the EER (HTER) is. Our numerical results basically reflect this trend; however, there are a few exceptions. When base-experts 7.2 and 7.4 are combined via the mean operator, F-ratio1 and F-ratio2 of the resulting combined system are both greater than their counterparts for the stronger base-expert, whereas HTER1 and HTER2 of the resulting combined system are also both greater than their counterparts for the stronger base-expert. A possible explanation is that the mathematical connection between F-ratio and EER [i.e., (5)] is established under the condition that both client and impostor scores follow Gaussian distributions, and this condition may not be entirely satisfied in our experiments.

VI. Conclusion

This paper investigated the effect of the correlation and performances of base-experts on score fusion in biometric authentication tasks.
First, we pointed out that several existing conclusions are incorrect, in particular the VR-EER model and the conclusions derived from it. Then, making use of Fermat's theorem and the F-ratio, we theoretically analyzed the properties of score fusion and proposed an optimality criterion for selecting linear fusion weights that makes the combined system superior to all the base-experts, regardless of the correlation, performances, or variances of the base-experts. Experimental results validated our statements and disconfirmed the conclusions previously drawn by other researchers in the literature.

References
[1] L. Hong and A. K. Jain, "Integrating faces and fingerprints for personal identification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 12, Dec. 1998.
[2] N. Poh and S. Bengio, "EER of fixed and trainable fusion classifiers: A theoretical study with application to biometric authentication tasks," in Proc. 6th Int. Workshop Multiple Classifier Syst., 2005.
[3] H. Proença, "Biometric recognition: When is evidence fusion advantageous?" in Proc. 5th Int. Symp. Adv. Visual Comput. Part II, 2009.
[4] A. Meraoumia, S. Chitroub, and A. Bouridane, "Fusion of finger-knuckle-print and palmprint for an efficient multi-biometric system of person recognition," in Proc. IEEE Int. Conf. Commun., Jun. 2011.
[5] U. Gupta, J. Fukane, V. Ramanan, and R. Thakur, "Score level fusion of face and finger traits in multimodal biometric authentication system," in Proc. IJCA ICWET.
[6] J. Daugman, "Biometric decision landscapes," Comput. Lab., Univ. Cambridge, Cambridge, U.K., Tech. Rep. TR482, 2000.
[7] J. Kittler, M. Hatef, R. P. Duin, and J. Matas, "On combining classifiers," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226–239, 1998.
[8] N. Poh and S. Bengio, "How do correlation and variance of base-experts affect fusion in biometric authentication tasks?" IEEE Trans. Signal Process., vol. 53, no. 11, Nov. 2005.
[9] T. M. Apostol, Mathematical Analysis, 2nd ed. New York, NY, USA: Addison-Wesley, 1974.
[10] E. Bailly-Baillière, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler, J. Mariéthoz, J. Matas, K. Messer, F. Poree, and B. Ruiz, "The BANCA database and evaluation protocol," in Proc. Int. Conf. Audio-Video-Based Biometric Person Authentication, Berlin, Germany, 2003.
[11] C. Marcel, "Multimodal identity verification at IDIAP," IDIAP, Tech. Rep. Idiap-Com, 2003.
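As a rough numerical illustration of observations 3) and 4) on Table I, the following Python sketch draws synthetic, strongly correlated Gaussian scores for one strong and one weak base-expert. It then compares the stronger expert, mean-operator fusion, and fusion with the weight ratio t = w2/w1 that maximizes the F-ratio. It assumes the F-ratio is defined as (mu_C - mu_I)/(sigma_C + sigma_I) and that, for Gaussian scores, EER = 1/2 - 1/2 erf(F-ratio/sqrt(2)). The synthetic means, the correlation of 0.8, the grid search over t, and all function names are assumptions made for this sketch; the grid search stands in for the closed-form criterion of the paper, and the data are not the BANCA scores.

```python
import math
import random


def f_ratio(client, impostor):
    """F-ratio of a score set: (mu_C - mu_I) / (sigma_C + sigma_I)."""
    def mean_std(xs):
        m = sum(xs) / len(xs)
        return m, math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    mu_c, sd_c = mean_std(client)
    mu_i, sd_i = mean_std(impostor)
    return (mu_c - mu_i) / (sd_c + sd_i)


def gaussian_eer(f):
    """EER implied by an F-ratio, assuming Gaussian client/impostor scores."""
    return 0.5 - 0.5 * math.erf(f / math.sqrt(2.0))


def fuse(pairs, t):
    """Linear fusion x1 + t * x2, with t = w2 / w1 and w1 fixed to 1."""
    return [x1 + t * x2 for x1, x2 in pairs]


def best_t(client, impostor, grid):
    """Grid search (an assumption, not a closed form) for the best t."""
    return max(grid, key=lambda t: f_ratio(fuse(client, t), fuse(impostor, t)))


def sample(rng, mean1, mean2, rho, n):
    """Draw n score pairs with unit variances and correlation rho."""
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x2 = mean2 + rho * z1 + math.sqrt(1.0 - rho ** 2) * z2
        pairs.append((mean1 + z1, x2))
    return pairs


def demo(seed=0, n=2000, rho=0.8):
    """Synthetic example: expert 1 is strong, expert 2 is weak."""
    rng = random.Random(seed)
    client = sample(rng, 1.0, 0.2, rho, n)
    impostor = sample(rng, -1.0, -0.2, rho, n)
    grid = [k / 20.0 for k in range(-60, 61)]  # t from -3 to 3, step 0.05
    t_opt = best_t(client, impostor, grid)
    return {
        "t_opt": t_opt,
        "f_expert1": f_ratio(fuse(client, 0.0), fuse(impostor, 0.0)),
        # t = 1 gives x1 + x2; the F-ratio is scale-invariant, so this
        # equals the F-ratio of the mean operator (x1 + x2) / 2.
        "f_mean": f_ratio(fuse(client, 1.0), fuse(impostor, 1.0)),
        "f_opt": f_ratio(fuse(client, t_opt), fuse(impostor, t_opt)),
    }


if __name__ == "__main__":
    for name, value in demo().items():
        print(f"{name}: {value:.3f}, Gaussian EER: {gaussian_eer(value):.3f}"
              if name != "t_opt" else f"{name}: {value:.2f}")
```

In this synthetic setting the selected t should be negative, and the mean operator should score below the stronger expert while the searched weights score above it, which is the pattern described in observations 3) and 4). Because t is chosen on the same samples used to report the F-ratio, this plays the role of a development set only; a separate evaluation sample would be needed to mimic F-ratio2.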
