A STATISTICAL SIGNIFICANCE TEST FOR PERSON AUTHENTICATION. IDIAP CP 592, rue du Simplon Martigny, Switzerland

Samy Bengio, Johnny Mariéthoz
IDIAP, CP 592, rue du Simplon 4, 1920 Martigny, Switzerland
{bengio,marietho}@idiap.ch

ABSTRACT
Assessing whether two models are statistically significantly different from each other is a very important step in research, although it has unfortunately not received enough attention in the field of person authentication. Several performance measures are often used to compare models, such as half total error rates (HTERs) and equal error rates (EERs), but since most of them are aggregates of two measures, such as the false acceptance rate and the false rejection rate, simple statistical tests cannot be used as is. We show in this paper how to adapt one of these tests in order to compute a confidence interval around one HTER measure or to assess the statistical significance of the difference between two HTER measures. We also compare our technique with other solutions that are sometimes used in the literature and show why they often yield overly optimistic results, leading to false statements about statistical significance.

1. INTRODUCTION
The general field of biometric person authentication is concerned with the use of several biometric traits, such as the voice, the face, the signature, or the fingerprints of persons, in order to assess their identity [1]. In all these cases, researchers tend to use the same performance measures to estimate and compare their models. Most of them, such as the half total error rate (HTER) or the detection cost function (DCF), are in fact aggregates of other measures such as false acceptance rates (FARs) and false rejection rates (FRRs).
However, when it is time to compare a novel model to existing solutions on the same problem, a quick review of the current literature in person authentication shows that either no statistical test is used to assess the difference between models or, worse, statistical tests are wrongly used. The latter often ends up in over-optimistic results, tending to show, for instance, that the new model is statistically significantly better than the state of the art while this might not in fact be the case.

In this paper, we present a proper method to compute a simple statistical test, known as the test of two proportions, or z-test, adapted to the problem of aggregate measures such as HTER and DCF. In section 2, we first review the main performance measures used in verification tasks; in section 3, we recall the purpose of the z-test, based on the Binomial distribution, and some of its variants. In section 4, we extend this test to the case of aggregate measures such as HTER, while in section 5 we present other possible solutions which, as explained, can lead to improper results. Section 6 compares our solution to these other methods and shows why they yield over-optimistic results. Section 7 concludes this paper with some proposed future work.

2. PERSON AUTHENTICATION MEASURES
A verification system has to deal with two kinds of events: either the person claiming a given identity is the one he claims to be, in which case he is called a client, or he is not, in which case he is called an impostor. Moreover, the system may generally take two decisions: either accept the client, or reject him and decide he is an impostor. Thus, the system may make two types of errors: a false acceptance, when the system accepts an impostor, and a false rejection, when the system rejects a client. Let FA be the total number of false acceptances made by the system, FR be the total number of false rejections, NC be the number of client accesses, and NI be the number of impostor accesses.
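In order to be independent of the specific dataset distribution, the performance of the system is often measured in terms of rates of these two different errors, as follows:

$$\mathrm{FAR} = \frac{\mathrm{FA}}{\mathrm{NI}}, \qquad \mathrm{FRR} = \frac{\mathrm{FR}}{\mathrm{NC}}. \tag{1}$$

A unique measure often used combines these two ratios into the so-called detection cost function (DCF) [2] as follows:

$$\mathrm{DCF} = \mathrm{Cost}(\mathrm{FR}) \cdot P(\mathrm{client}) \cdot \mathrm{FRR} + \mathrm{Cost}(\mathrm{FA}) \cdot P(\mathrm{impostor}) \cdot \mathrm{FAR} \tag{2}$$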

where P(client) is the prior probability that a client will use the system, P(impostor) is the prior probability that an impostor will use the system, Cost(FR) is the cost of a false rejection, and Cost(FA) is the cost of a false acceptance. A particular case of the DCF is known as the half total error rate (HTER), where the costs are equal to 1 and the probabilities are 0.5 each:

$$\mathrm{HTER} = \frac{\mathrm{FAR} + \mathrm{FRR}}{2}. \tag{3}$$

Most authentication systems are measured and compared using HTERs or variations of it. The main question we address in this paper is thus: how can we compute a confidence interval around an HTER, or assess the difference between two systems yielding different HTERs? Note that in most benchmark databases used in the authentication literature, there is a significant imbalance between the number of client accesses and the number of impostor accesses. This is probably due to the relatively higher cost of obtaining the former with respect to the latter. This imbalance is the main reason why people use HTER to compare models and not the usual classification error used in the machine learning literature.

3. THE Z-TEST ON PROPORTIONS
Several statistical tests are available in the literature. For standard classification tasks, a simple yet often used test is known as the z-test, or test between two proportions. The rationale of this test is the following: given a set of examples, each drawn independently and identically distributed (i.i.d.) from an unknown distribution, our system is going to take a decision for each example, and this decision will be correct or not. Let us now look at the distribution of the number of errors that will be made by our classification system. Since each decision is independent from the others and is binary, it is reasonable to assume that the random variable X (footnote 1) representing the number of errors should follow a Binomial distribution $B(n, p)$, where $n$ is the number of examples and $p$ is the proportion of errors.

Moreover, it is known that a Binomial $B(n, p)$ can be approximated by a Normal distribution $N(\mu, \sigma^2)$ with $\mu = np$ and $\sigma^2 = np(1-p)$ when $n$ is large enough (footnote 2).
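The measures of section 2 (equations 1 to 3) can be sketched in a few lines of Python. This is only an illustration: the function names and the access counts are hypothetical and do not come from the paper.

```python
def error_rates(fa, fr, ni, nc):
    """FAR and FRR from error counts and access counts (equation 1)."""
    return fa / ni, fr / nc


def dcf(far, frr, cost_fr, p_client, cost_fa, p_impostor):
    """Detection cost function (equation 2)."""
    return cost_fr * p_client * frr + cost_fa * p_impostor * far


def hter(far, frr):
    """Half total error rate (equation 3)."""
    return (far + frr) / 2


def classification_error(fa, fr, ni, nc):
    """Usual classification error; it is dominated by the larger class."""
    return (fa + fr) / (ni + nc)


if __name__ == "__main__":
    # Hypothetical, deliberately unbalanced test set (10 impostors per client).
    fa, fr, ni, nc = 50, 10, 1000, 100
    far, frr = error_rates(fa, fr, ni, nc)
    print(f"FAR={far:.3f}  FRR={frr:.3f}  HTER={hter(far, frr):.3f}")
    print(f"classification error={classification_error(fa, fr, ni, nc):.3f}")
    # With unit costs and priors of 0.5, the DCF reduces to the HTER.
    print(f"DCF(1, 0.5, 1, 0.5)={dcf(far, frr, 1, 0.5, 1, 0.5):.3f}")
```

On such unbalanced counts the classification error stays close to the FAR, while the HTER weights both error types equally.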
Finally, if $X \sim N(np, np(1-p))$, then the distribution of the proportion of errors is

$$Y = \frac{X}{n} \sim N\left(p, \frac{p(1-p)}{n}\right).$$

Footnote 1: In this paper we use the following notation: bold letters such as FA represent random variables, while normal letters such as FA represent a particular value of the underlying random variable.
Footnote 2: A rule of thumb often used is to have $np(1-p)$ larger than 10.

[Figure 1: Normal curve of $Y$ centred on $p$; the area under the curve between $p - \beta$ and $p + \beta$ equals $P(|Y - p| \le \beta) = \delta$.]
Fig. 1. Confidence intervals are computed using the area under the Normal curve.

3.1. Confidence Intervals
In order to compute a confidence interval around $p$, we can search for bounds $\{p - \beta, p + \beta\}$ such that

$$P(p - \beta < Y < p + \beta) = \delta \tag{4}$$

where $\delta$ represents our confidence. This is called a two-sided test since we are searching for two bounds around $p$. Fortunately, finding $\beta$ in (4) for a given $\delta$ can be done efficiently for the Normal distribution. Figure 1 illustrates the problem graphically and Figure 2 summarizes the procedure to obtain the confidence interval.

3.2. Difference Between Proportions
Alternatively, if one wants to verify whether a given proportion of errors $p_A$ is statistically significantly different from another proportion $p_B$, a similar test can be performed. In the case where we already know that $p_A$ cannot be lower than $p_B$, a one-sided test is used; otherwise we use a two-sided test. Denoting by $Y_A$ and $Y_B$ the random variables representing the distributions of $p_A$ and $p_B$ respectively, the one-sided test is based on

$$P(Y_A - Y_B < p_A - p_B) = \delta \tag{5}$$

while the two-sided test is based on

$$P(|Y_A - Y_B| < |p_A - p_B|) = \delta \tag{6}$$

which can be solved using the fact that the difference between two independent Normal distributions is a Normal distribution where the mean is the difference between the two Normal means and the variance is the sum of the two Normal variances. Hence, if $Y_A$ is not statistically different from $Y_B$, then

$$Y_A - Y_B \sim N\left(0, \frac{p_A(1-p_A) + p_B(1-p_B)}{n}\right) \tag{7}$$

and if $\delta$ is higher than a predefined value such as 95%, then one can state that $p_A$ is significantly different from $p_B$. Note that a better estimate of the variance of (7) can be obtained when assuming $p_A = p_B$, which should be the case if they are not significantly different. In that case, equation 7 becomes

$$Y_A - Y_B \sim N\left(0, \frac{2p(1-p)}{n}\right) \quad \text{with} \quad p = \frac{p_A + p_B}{2}. \tag{8}$$

Note however that using this test to verify whether two models give statistically significantly different results on the same test database makes a wrong hypothesis, since $Y_A$ and $Y_B$ are not really independent: they correspond to decisions taken on the same test set.

3.3. Dependent Case
One possible solution proposed in [3] is to only take into account the examples for which the two models disagree. Let $p_{AB}$ be the proportion of examples correctly classified by model A and incorrectly classified by model B, and similarly $p_{BA}$ be the proportion of examples correctly classified by model B and incorrectly classified by model A. In that case, the distribution $Y_{AB}$ of the difference between the proportions of errors committed by each model is still Normal and, assuming the two models are not different from each other, should follow

$$Y_{AB} \sim N\left(0, \frac{p_{AB} + p_{BA}}{n}\right) \tag{9}$$

with the corresponding two-sided test

$$P(|Y_{AB}| < |p_{AB} - p_{BA}|) = \delta. \tag{10}$$

This test is in fact very similar to the well-known McNemar test, based on a $\chi^2$ distribution. In the literature, most people adopt equation 8 and some adopt equation 9; remember that in order to use equation 9, one needs to have access to all the scores of both models, and not just the numbers of errors. When possible, we will look at both solutions here, for the case of person authentication.

4. A STATISTICAL TEST FOR HTERS
HTERs are not proportions, but they are an average of two well-defined proportions, FAR and FRR.
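Before the extension to HTERs, the proportion tests of section 3 (equations 4, 8, 9 and 10) can be sketched as follows. This is a minimal sketch assuming the Normal approximation holds and the variance is non-zero; the function names are illustrative, and `statistics.NormalDist` from the Python standard library stands in for a table of the Normal distribution.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal N(0, 1)


def proportion_ci(p, n, delta=0.95):
    """Two-sided interval (p - beta, p + beta) solving equation 4."""
    z = STD_NORMAL.inv_cdf(0.5 + delta / 2)
    beta = z * sqrt(p * (1 - p) / n)
    return p - beta, p + beta


def confidence_independent(p_a, p_b, n):
    """Confidence delta of equation 6, with the pooled variance of equation 8."""
    p = (p_a + p_b) / 2
    sigma = sqrt(2 * p * (1 - p) / n)
    return 2 * STD_NORMAL.cdf(abs(p_a - p_b) / sigma) - 1


def confidence_dependent(p_ab, p_ba, n):
    """Confidence delta of equation 10, with the variance of equation 9."""
    sigma = sqrt((p_ab + p_ba) / n)
    return 2 * STD_NORMAL.cdf(abs(p_ab - p_ba) / sigma) - 1


if __name__ == "__main__":
    # Hypothetical values: 1000 examples, error proportions of 10% and 12%.
    print(proportion_ci(0.10, 1000, delta=0.95))
    print(confidence_independent(0.10, 0.12, 1000))
    # Disagreements: 3% only B wrong, 1% only A wrong.
    print(confidence_dependent(0.03, 0.01, 1000))
```

The two models would be declared significantly different when the returned confidence exceeds the chosen level, for instance 0.95.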
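Given this, and assuming some hypotheses regarding FAR and FRR (footnote 3), we propose here to extend the test between two proportions to the case of HTERs as follows.

Footnote 3: such that the distributions of FAR and FRR should be independent, which may look false since they are both linked by the same model and threshold; in fact, given a model and associated threshold, these two quantities are indeed independent since they are computed on separate data (the client accesses and the impostor accesses), assuming the model was estimated on a separate training set, as it should be.

4.1. Confidence Intervals
Let the random variable FA represent the number of false acceptances. We can model it by a Binomial, and hence by a Normal, as follows:

$$\mathrm{FA} \sim B\left(\mathrm{NI}, \frac{\mathrm{FA}}{\mathrm{NI}}\right) \approx N\left(\mathrm{NI} \cdot \frac{\mathrm{FA}}{\mathrm{NI}},\ \mathrm{NI} \cdot \frac{\mathrm{FA}}{\mathrm{NI}}\left(1 - \frac{\mathrm{FA}}{\mathrm{NI}}\right)\right) = N\big(\mathrm{FA},\ \mathrm{FA}(1 - \mathrm{FAR})\big). \tag{11}$$

The random variable FR can be modeled accordingly. We can now write the distribution of the random variable FAR representing the ratio of false acceptances:

$$\mathrm{FAR} = \frac{\mathrm{FA}}{\mathrm{NI}} \sim N\left(\frac{\mathrm{FA}}{\mathrm{NI}},\ \frac{\mathrm{FA}(1 - \mathrm{FAR})}{\mathrm{NI}^2}\right) = N\left(\mathrm{FAR},\ \frac{\mathrm{FAR}(1 - \mathrm{FAR})}{\mathrm{NI}}\right) \tag{12}$$

and similarly for the random variable FRR. Given the distributions of FAR and FRR, we can estimate the distribution of the random variable HTER as follows:

$$\mathrm{HTER} = \frac{\mathrm{FAR} + \mathrm{FRR}}{2} \sim N\left(\frac{\mathrm{FAR} + \mathrm{FRR}}{2},\ \frac{1}{4}\left[\frac{\mathrm{FAR}(1 - \mathrm{FAR})}{\mathrm{NI}} + \frac{\mathrm{FRR}(1 - \mathrm{FRR})}{\mathrm{NC}}\right]\right) = N\left(\mathrm{HTER},\ \frac{\mathrm{FAR}(1 - \mathrm{FAR})}{4\,\mathrm{NI}} + \frac{\mathrm{FRR}(1 - \mathrm{FRR})}{4\,\mathrm{NC}}\right) \tag{13}$$

Using this last definition, we can now easily compute confidence intervals around HTERs using the methodology presented in section 3 and summarized in Figure 2 for classical confidence values used in the scientific literature. Moreover, the test can be easily extended to variations of HTER, such as the DCF. For instance, the well-known NIST evaluations, performed yearly to compare speaker verification systems, use the DCF measure described by equation 2 with Cost(FR) = 10, P(client) = 0.01, Cost(FA) = 1 and P(impostor) = 0.99, so that DCF = 0.1 FRR + 0.99 FAR and the underlying Normal becomes:

$$\mathrm{DCF} \sim N\left(\mathrm{DCF},\ 0.99^2\,\frac{\mathrm{FAR}(1 - \mathrm{FAR})}{\mathrm{NI}} + \frac{\mathrm{FRR}(1 - \mathrm{FRR})}{100\,\mathrm{NC}}\right). \tag{14}$$

4.2. Difference Between HTERs
The distribution of the difference between two HTERs, assuming independence between the two underlying distributions, is

$$\mathrm{HTER}_A - \mathrm{HTER}_B \sim N\left(0, \sigma^2_{\mathrm{INDEP}}\right) \tag{15}$$

with

$$\sigma^2_{\mathrm{INDEP}} = \frac{\mathrm{FAR}_A(1 - \mathrm{FAR}_A) + \mathrm{FAR}_B(1 - \mathrm{FAR}_B)}{4\,\mathrm{NI}} + \frac{\mathrm{FRR}_A(1 - \mathrm{FRR}_A) + \mathrm{FRR}_B(1 - \mathrm{FRR}_B)}{4\,\mathrm{NC}}$$

while the distribution of the difference between two HTERs, assuming dependence between the two underlying distributions, becomes

$$\mathrm{HTER}_A - \mathrm{HTER}_B \sim N\left(0, \sigma^2_{\mathrm{DEP}}\right) \tag{16}$$

with

$$\sigma^2_{\mathrm{DEP}} = \frac{\mathrm{FAR}_{AB} + \mathrm{FAR}_{BA}}{4\,\mathrm{NI}} + \frac{\mathrm{FRR}_{AB} + \mathrm{FRR}_{BA}}{4\,\mathrm{NC}}$$

where $\mathrm{FAR}_{AB} = \mathrm{NI}_{AB} / \mathrm{NI}$ and $\mathrm{NI}_{AB}$ is the number of impostor accesses correctly rejected by model A and incorrectly accepted by model B, with similar definitions for $\mathrm{FAR}_{BA}$, $\mathrm{FRR}_{AB}$, and $\mathrm{FRR}_{BA}$. Hence, in summary, and using the standard confidence values used in the scientific literature, we obtain the simple methodology described in Figure 2 in order to compute statistical tests for person authentication tasks (footnote 4).

5. OTHER STATISTICAL TESTS
While several researchers have pointed out the use of the z-test to compute statistical tests around values such as FAR or FRR (see for instance [4]), we are not aware of any similar attempt for aggregate measures such as HTER, EER, or DCF. However, most people publishing results in verification use HTERs or DCF to assess the quality of their methods.

Footnote 4: While this summary concerns HTERs, it should now be straightforward to extend it to the general DCF function.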
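The tests of section 4 (equations 13, 15 and 16) can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, the dependent test takes raw disagreement counts as input, and the Normal approximation is assumed to hold.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def hter_sigma(far, frr, ni, nc):
    """Standard deviation of the HTER distribution (equation 13)."""
    return sqrt(far * (1 - far) / (4 * ni) + frr * (1 - frr) / (4 * nc))


def hter_ci(far, frr, ni, nc, delta=0.95):
    """Two-sided confidence interval around HTER = (FAR + FRR) / 2."""
    z = STD_NORMAL.inv_cdf(0.5 + delta / 2)
    centre = (far + frr) / 2
    half_width = z * hter_sigma(far, frr, ni, nc)
    return centre - half_width, centre + half_width


def z_independent(far_a, frr_a, far_b, frr_b, ni, nc):
    """z statistic for HTER_A - HTER_B under independence (equation 15)."""
    var = ((far_a * (1 - far_a) + far_b * (1 - far_b)) / (4 * ni)
           + (frr_a * (1 - frr_a) + frr_b * (1 - frr_b)) / (4 * nc))
    diff = (far_a + frr_a) / 2 - (far_b + frr_b) / 2
    return abs(diff) / sqrt(var)


def z_dependent(ni_ab, ni_ba, nc_ab, nc_ba, ni, nc):
    """z statistic under dependence (equation 16).

    ni_ab: impostor accesses rejected by A but accepted by B (ni_ba: the reverse).
    nc_ab: client accesses accepted by A but rejected by B (nc_ba: the reverse).
    """
    far_ab, far_ba = ni_ab / ni, ni_ba / ni
    frr_ab, frr_ba = nc_ab / nc, nc_ba / nc
    # The factor 1/2 of the HTER cancels between numerator and denominator.
    numerator = abs(far_ab - far_ba + frr_ab - frr_ba)
    denominator = sqrt((far_ab + far_ba) / ni + (frr_ab + frr_ba) / nc)
    return numerator / denominator


def two_sided_confidence(z):
    """Confidence delta that the two HTERs differ, given a z statistic."""
    return 2 * STD_NORMAL.cdf(abs(z)) - 1


if __name__ == "__main__":
    # Hypothetical system: FAR 5%, FRR 10%, 1000 impostor and 100 client accesses.
    print(hter_ci(0.05, 0.10, 1000, 100, delta=0.95))
    print(two_sided_confidence(z_independent(0.05, 0.10, 0.07, 0.12, 1000, 100)))
    print(two_sided_confidence(z_dependent(30, 10, 5, 3, 1000, 100)))
```

For the DCF of equation 14, only the weights in front of the two variance terms would change.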
The confidence interval (CI) around an HTER is $\mathrm{HTER} \pm \sigma \cdot Z_{\alpha/2}$ with

$$\sigma = \sqrt{\frac{\mathrm{FAR}(1 - \mathrm{FAR})}{4\,\mathrm{NI}} + \frac{\mathrm{FRR}(1 - \mathrm{FRR})}{4\,\mathrm{NC}}}$$

$$Z_{\alpha/2} = \begin{cases} 1.645 & \text{for a 90\% CI} \\ 1.960 & \text{for a 95\% CI} \\ 2.576 & \text{for a 99\% CI} \end{cases}$$

and similarly, $\mathrm{HTER}_A$ and $\mathrm{HTER}_B$ are statistically significantly different if $|z| > Z_{\alpha/2}$ with

$$z = \frac{\mathrm{HTER}_A - \mathrm{HTER}_B}{\sqrt{\dfrac{\mathrm{FAR}_A(1 - \mathrm{FAR}_A) + \mathrm{FAR}_B(1 - \mathrm{FAR}_B)}{4\,\mathrm{NI}} + \dfrac{\mathrm{FRR}_A(1 - \mathrm{FRR}_A) + \mathrm{FRR}_B(1 - \mathrm{FRR}_B)}{4\,\mathrm{NC}}}}$$

in the independent case, and

$$z = \frac{\mathrm{FAR}_{AB} - \mathrm{FAR}_{BA} + \mathrm{FRR}_{AB} - \mathrm{FRR}_{BA}}{\sqrt{\dfrac{\mathrm{FAR}_{AB} + \mathrm{FAR}_{BA}}{\mathrm{NI}} + \dfrac{\mathrm{FRR}_{AB} + \mathrm{FRR}_{BA}}{\mathrm{NC}}}}$$

in the dependent case.
Fig. 2. Methodology for statistical tests around HTERs.

One simple solution could be to consider the classification error instead of the HTER and compute statistical tests around it. Since the classification error is a well-defined proportion, we can apply the z-test as well. Let CLASS be defined as the following random variable:

$$\mathrm{CLASS} = \frac{\mathrm{FA} + \mathrm{FR}}{\mathrm{NI} + \mathrm{NC}}$$

then the corresponding underlying Normal becomes:

$$\mathrm{CLASS} \sim N\left(\frac{\mathrm{FA} + \mathrm{FR}}{\mathrm{NI} + \mathrm{NC}},\ \frac{\mathrm{FA} + \mathrm{FR}}{(\mathrm{NI} + \mathrm{NC})^2}\left(1 - \frac{\mathrm{FA} + \mathrm{FR}}{\mathrm{NI} + \mathrm{NC}}\right)\right) \tag{17}$$

but remember that while this test is correct to assess models according to their respective classification error, it does not say anything about the confidence one has in the corresponding HTER, which is the measure of interest in person authentication. In fact, we will show in section 6.1 that, under reasonable assumptions, the variance of CLASS in equation 17 is always smaller than the variance of HTER in equation 13; hence confidence tests using 17 will always result in over-confident statistical significance or smaller confidence intervals. This will be explored further in the following section.

Another possible solution is to consider the HTER itself as a proportion (which it is not directly) and compute the statistical test on it. Let NAIVE be the random variable of this value; the underlying Normal becomes:

$$\mathrm{NAIVE} \sim N\left(\mathrm{HTER},\ \frac{\mathrm{HTER}(1 - \mathrm{HTER})}{\mathrm{NI} + \mathrm{NC}}\right) \tag{18}$$

Again, we will show in section 6.1 that, under reasonable assumptions, the variance of NAIVE in equation 18 is always smaller than the variance of HTER in equation 13; hence confidence tests using 18 should always result in over-confident statistical significance or smaller confidence intervals.

Yet another solution that has been proposed by some researchers (see for instance [5]) is to compute a statistical test for FAR and FRR separately and then combine the results (footnote 5). For instance, in order to compute a confidence interval for HTER, one would average both upper bounds and both lower bounds found separately by the FAR and FRR tests. On top of the fact that there is no theoretical ground to justify such an approach, there is an evident problem with all approaches that consider FARs and FRRs separately. Two models could yield very similar HTERs but, for some reason (in general linked to the choice of the threshold, which is usually made on a separate data set), one could be slightly biased toward FRRs and the other one slightly biased toward FARs. In such a case, these tests would consider them statistically significantly different while they would not be when considering their respective HTERs globally instead. For this reason, we will not consider this solution further here.

6. ANALYSIS
In this section, we compare the proposed statistical test for HTERs with the two other tests presented in section 5.
We will first show that, under some reasonable conditions, increasing the ratio between NI and NC increases the difference between the variance of the Normal of the proposed test and the variance of the Normal of the other tests. Afterward, we present two real case studies where the use of the proposed statistical test would have yielded a different conclusion with regard to the confidence intervals and the difference between the compared models.

Footnote 5: The well-known NIST evaluation campaigns have also apparently recently investigated the use of the McNemar test to assess speaker verification methods, but have considered FARs and FRRs separately [6].

6.1. Theoretical Analysis
Let us first look at the conditions under which $\sigma^2_{13}$, the variance of HTER as written in equation 13, is higher than $\sigma^2_{18}$, the variance of NAIVE as written in equation 18:

$$\sigma^2_{13} > \sigma^2_{18} \tag{19}$$

implies that

$$\frac{\mathrm{FAR}(1 - \mathrm{FAR})}{4\,\mathrm{NI}} + \frac{\mathrm{FRR}(1 - \mathrm{FRR})}{4\,\mathrm{NC}} > \frac{\mathrm{HTER}(1 - \mathrm{HTER})}{\mathrm{NI} + \mathrm{NC}}$$

which can be simplified and yields

$$0 > (\mathrm{FAR} \cdot \mathrm{NC} - \mathrm{FRR} \cdot \mathrm{NI})\,\big(\mathrm{NI}(1 - \mathrm{FRR}) - \mathrm{NC}(1 - \mathrm{FAR})\big)$$

which means that inequality 19 will be true when either NC is much less or much higher than NI (which is in general the case) and FAR is similar to FRR (again, when the threshold is chosen such that we have equal error rate (EER) on a separate validation set, as is often done, this is reasonable).

Let us now look at the conditions under which $\sigma^2_{13}$ is higher than $\sigma^2_{17}$, the variance of CLASS, representing the classification error:

$$\sigma^2_{13} > \sigma^2_{17} \tag{20}$$

implies that

$$\frac{\mathrm{FAR}(1 - \mathrm{FAR})}{4\,\mathrm{NI}} + \frac{\mathrm{FRR}(1 - \mathrm{FRR})}{4\,\mathrm{NC}} > \frac{\mathrm{FA} + \mathrm{FR}}{(\mathrm{NI} + \mathrm{NC})^2} - \frac{(\mathrm{FA} + \mathrm{FR})^2}{(\mathrm{NI} + \mathrm{NC})^3}$$

which can be rewritten as

$$(1 - \mathrm{FRR})(3\,\mathrm{NC} + \mathrm{NI})\,\mathrm{NI} > (1 - \mathrm{FAR})(\mathrm{NC} + 3\,\mathrm{NI})\,\mathrm{NC}$$

and, assuming FAR is similar to FRR, it can be simplified into

$$\mathrm{NI}^2 > \mathrm{NC}^2 \tag{21}$$

which is true as long as NI is higher than NC, which is again in general the case.

In order to verify these relations graphically, we have fixed some variables to reasonable values (FAR = 0.1, FRR = 0.2, NC = 100) and have varied NI, the number of impostor accesses. Figure 3 shows the relation between the standard deviation of the underlying Normal distributions and the ratio between NI and NC.
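The graphical check described above can be approximated numerically. The sketch below uses the same fixed values (FAR = 0.1, FRR = 0.2, NC = 100) and evaluates the standard deviations of equations 13, 17 and 18 for a few NI/NC ratios; the function names and the choice of ratios are illustrative assumptions.

```python
from math import sqrt


def sigma_hter(far, frr, ni, nc):
    """Standard deviation of the proposed HTER distribution (equation 13)."""
    return sqrt(far * (1 - far) / (4 * ni) + frr * (1 - frr) / (4 * nc))


def sigma_naive(far, frr, ni, nc):
    """Standard deviation when HTER is treated as a proportion (equation 18)."""
    h = (far + frr) / 2
    return sqrt(h * (1 - h) / (ni + nc))


def sigma_class(far, frr, ni, nc):
    """Standard deviation of the classification error (equation 17)."""
    e = (far * ni + frr * nc) / (ni + nc)  # (FA + FR) / (NI + NC)
    return sqrt(e * (1 - e) / (ni + nc))


if __name__ == "__main__":
    FAR, FRR, NC = 0.1, 0.2, 100
    print("NI/NC    HTER     NAIVE    CLASS")
    for ratio in (1, 3, 10, 30, 100):
        ni = ratio * NC
        print(f"{ratio:5d}  {sigma_hter(FAR, FRR, ni, NC):.5f}  "
              f"{sigma_naive(FAR, FRR, ni, NC):.5f}  "
              f"{sigma_class(FAR, FRR, ni, NC):.5f}")
```

With these settings the HTER standard deviation cannot fall below sqrt(FRR (1 - FRR) / (4 NC)) = 0.02, however large NI becomes, whereas the NAIVE and CLASS values keep shrinking as NI grows.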
As expected, the higher the ratio NI/NC, the bigger the difference between the standard deviations of the Normal distributions related to the three statistical tests. Moreover, we see that the standard deviation of the proposed HTER distribution stays close to that of the FRR distribution, which is mostly influenced by NC, the number of client accesses, and does not decrease with the increase of NI, contrary to the two other solutions. Since the size of the confidence interval is directly related to the standard deviation, this figure essentially shows that, under the conditions discussed above, the confidence interval computed using the proposed technique will be larger than that of the two other techniques. Hence two verification methods yielding two different HTERs could easily be considered statistically significantly different using one of the methods described in section 5, while they would not be considered statistically significantly different using the proposed technique. In fact, the figure shows that the confidence interval is directly influenced by the minimum of NC and NI and not by their sum. In the next two subsections, we present two real case studies where the use of the proposed statistical test would have yielded a different conclusion.

[Figure 3: plot for FAR = 0.1, FRR = 0.2, NC = 100. Horizontal axis: ratio NI/NC (1 to 100). Vertical axis: standard deviation (0.001 to 0.1). Curves: HTER, NAIVE, CLASS, FAR, FRR.]
Fig. 3. Standard deviation of the Normal distributions underlying the three different choices of distributions for a statistical test on HTERs. Also shown: standard deviations of both the FAR and FRR distributions. All curves are in log-log scale.

6.2. Empirical Analysis on XM2VTS
In the first case, the well-known text-independent audio-visual verification database XM2VTS [7] was used. In this database, the test set consists of up to 112000 impostor accesses and only 400 client accesses, for a total of 112400 accesses. In a recent competition [8], several models were compared (footnote 6) on a face verification task, and we will look here at the results of the best model, hereafter called model A, and the third best model, hereafter called model B, apparently significantly worse. Table 1 shows the difference of performance in terms of HTER between models A and B. Having up to 112400 examples, one could indeed expect the difference between the two models to be statistically significant.

Footnote 6: While this is not the topic of this paper, since it should apply to any data/model, people interested in knowing more about the problem tackled in this case study are referred to [8]; we used results of the models of IDIAP and UniS-NC on the automatic registration task, using Lausanne Protocol I.
Furthermore, note that the results of UniS-NC are slightly different from those published in [8], but correspond to the list of scores provided by one of the authors of the method.

Method     FAR (%)   FRR (%)   HTER (%)
Model A    1.15      2.50      1.82
Model B    1.95      2.75      2.35
Table 1. HTER performance comparison on the test set between models A and B when the threshold was selected according to the Equal Error Rate (EER) criterion on a separate validation set.

δ      HTER (eq 13)   NAIVE (eq 18)   CLASS (eq 17)
90%    1.285%         0.131%          0.105%
95%    1.531%         0.156%          0.125%
99%    2.013%         0.206%          0.164%
Table 2. Confidence intervals around results of model A, computed using three different hypotheses and their respective equation.

Table 2 shows the size of the confidence intervals computed around the result (HTER or classification error) obtained by model A, for the three methods and for three different values of δ (90%, 95% and 99%). As we can see, for all values of δ, the size of the interval is about one order of magnitude larger for the proposed method than for the two other methods.

      HTER DEP (eq 16)   HTER INDEP (eq 15)   NAIVE (eq 18)   CLASS (eq 17)
δ     69.2%              64.7%                100.0%          100.0%
σ     0.0052             0.0057               0.0006          0.0005
Table 3. Confidence value δ on the fact that model A is statistically significantly different from model B, according to their respective performance (HTER or classification error), and computed using four different hypotheses and their respective equation. For each method, we also give σ, the standard deviation of the corresponding statistical test.

Table 3 verifies whether the HTER obtained by model A gives statistically significantly different results than the one obtained by model B, using the two-sided test of equation 6 for the independent cases and of equation 10 for the dependent case. According to both proposed HTER methods (independent and dependent cases), both models are equivalent (the confidence on their difference is much less than, say, 90%), while according to both other methods, the models would be different with 100% confidence! Remember that there were only 400 client accesses during the test, hence it is reasonable that a single error on these accesses makes a visible difference in HTER while it cannot seriously be considered statistically significant. This is well captured by our technique, but not by the other ones. Moreover, in this case, the dependence/independence assumption did not have any impact on the final decision.

6.3. Empirical Analysis on NIST 2000
In the second case, the well-known text-independent speaker verification benchmark database NIST 2000 was used. Here, the test set consists of 57748 impostor accesses and 5825 client accesses, for a total of 63573 accesses. We compared the performance of two models (footnote 7), hereafter called models C and D. Note that, while on XM2VTS the ratio between the number of impostor and client accesses was very high (280 times more), for the NIST database the ratio is more reasonable, but still high (around 10).

Footnote 7: Once again, while this is not the topic of this paper, people interested in knowing more about the problem tackled in this case study are referred to [9].

Method     FAR (%)   FRR (%)   HTER (%)
Model C    13.1      9.6       11.4
Model D    15.8      7.8       11.8
Table 4. HTER performance comparison on the test set between models C and D when the threshold was selected according to the Equal Error Rate (EER) criterion on a separate validation set.

δ      HTER (eq 13)   NAIVE (eq 18)   CLASS (eq 17)
90%    0.676%         0.414%          0.436%
95%    0.805%         0.493%          0.519%
99%    1.058%         0.648%          0.682%
Table 5. Confidence intervals around results of model C, computed using three different hypotheses and their respective equation.

      HTER DEP (eq 16)   HTER INDEP (eq 15)   NAIVE (eq 18)   CLASS (eq 17)
δ     98.8%              89.1%                98.9%           100.0%
σ     0.0016             0.0028               0.0018          0.0019
Table 6. Confidence value δ on the fact that model C is statistically significantly different from model D, according to their respective performance (HTER or classification error), and computed using four different hypotheses and their respective equation. For each method, we also give σ, the standard deviation of the corresponding statistical test.

We now present the same kinds of results as for the XM2VTS case. Table 4 shows the difference of performance in terms of HTER between models C and D. Table 5 shows the size of the confidence intervals computed around the result obtained by model C; as we can see, given a ratio of impostor to client accesses around 10 instead of 280, the difference between the confidence intervals is less drastic but still exists. Table 6 verifies whether the HTER obtained by model C gives statistically significantly different results than the one obtained by model D. For each test, we show both the confidence value δ and the standard deviation σ of the corresponding statistical test.

As can be seen, in the DEP case, σ is very small, even smaller than for the NAIVE and CLASS solutions, hence yielding a very high confidence that the two models are different. In order to explain this unexpected result, note that none of the tests takes into account the possible dependence existing between the compared models. Indeed, if the two models are based on the same technique (which is often the case; for instance, in speaker verification, most systems are based on Gaussian Mixture Models, but trained with slightly different assumptions), then both systems will have a natural tendency to give very correlated scores on the same example. The two models trained on the XM2VTS database were very different (one was based on a Gaussian Mixture Model, while the other one was based on Linear Discriminant Analysis and Normalized Correlation), while the models trained on the NIST database were both in fact variations of Gaussian Mixture Models, and hence are probably very correlated. Unfortunately, there exists no test that takes this dependency into account. Hence, for instance, the variance $(p_{AB} + p_{BA})/n$ of equation 9 will quickly become very small simply because the models are correlated, and not just because the examples are the same. Using this equation will thus result in an underestimate of the true variance when models are very correlated, as empirically shown in Table 6.
On the other hand, the INDEP case does not take into account the dependency between the data, but it is reasonable to expect that the effect of this error may be balanced by the fact that it does not take into account the dependency between the models either. The correct solution probably lies somewhere between these two solutions; hence, one should probably favor the most difficult test, so as to only declare a statistical difference when both tests agree on it (hence, here, with only 89.1% confidence).
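As a worked check of the XM2VTS case study, the sketch below recomputes the Table 2 interval sizes and the σ and δ values of Table 3 for the INDEP, NAIVE and CLASS columns, starting from the rounded FAR and FRR of Table 1. Two points are assumptions rather than statements from the paper: the interval "size" is taken as the full width 2 σ Z, and the NAIVE and CLASS difference tests are taken to sum the two models' variances. Small deviations from the published tables are expected because the inputs are rounded.

```python
from math import sqrt
from statistics import NormalDist

NI, NC = 112000, 400            # XM2VTS test set sizes (section 6.2)
MODEL_A = (0.0115, 0.0250)      # (FAR, FRR) of model A, from Table 1
MODEL_B = (0.0195, 0.0275)      # (FAR, FRR) of model B, from Table 1


def hter(far, frr):
    return (far + frr) / 2


def class_error(far, frr):
    return (far * NI + frr * NC) / (NI + NC)


def var_hter(far, frr):         # equation 13
    return far * (1 - far) / (4 * NI) + frr * (1 - frr) / (4 * NC)


def var_naive(far, frr):        # equation 18
    h = hter(far, frr)
    return h * (1 - h) / (NI + NC)


def var_class(far, frr):        # equation 17
    e = class_error(far, frr)
    return e * (1 - e) / (NI + NC)


def ci_width(var, delta):
    """Full width of the two-sided interval (assumed meaning of 'size')."""
    return 2 * NormalDist().inv_cdf(0.5 + delta / 2) * sqrt(var)


def difference_test(var_fn, value_fn):
    """Return (sigma, delta) for model A versus model B, variances summed."""
    sigma = sqrt(var_fn(*MODEL_A) + var_fn(*MODEL_B))
    z = abs(value_fn(*MODEL_A) - value_fn(*MODEL_B)) / sigma
    return sigma, 2 * NormalDist().cdf(z) - 1


if __name__ == "__main__":
    for delta in (0.90, 0.95, 0.99):
        widths = [100 * ci_width(v(*MODEL_A), delta)
                  for v in (var_hter, var_naive, var_class)]
        print(f"{delta:.0%}  " + "  ".join(f"{w:.3f}%" for w in widths))
    for name, var_fn, value_fn in (("INDEP", var_hter, hter),
                                   ("NAIVE", var_naive, hter),
                                   ("CLASS", var_class, class_error)):
        sigma, delta = difference_test(var_fn, value_fn)
        print(f"{name}: sigma={sigma:.4f}  delta={delta:.1%}")
```

The DEP column cannot be recomputed this way, since it needs the per-access decisions of both models.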

7. CONCLUSION
In this paper, we have proposed a proper method to compute statistical tests on aggregate measures such as HTER or DCF, which are often used in person authentication. We have also shown why using other approximations, such as tests on the classification error instead, would result in over-optimistic decisions. We have given some empirical evidence using two benchmark databases. It is important to note that the test of two proportions is not the ultimate statistical test, and there exist other tests that are known to be sometimes more appropriate for classification tasks, such as complex cross-validation techniques (see for instance [10]). However, none of these tests has so far addressed the problem of dependence between the tested models. Nevertheless, an important finding of this paper is that when people design new databases for person authentication, they should keep in mind that it is probably not worth having a huge imbalance between client and impostor access numbers, since the statistical significance of the results will mainly depend on the smaller of these two numbers (provided the costs of false acceptances and false rejections are equal).

8. ACKNOWLEDGMENTS
This research has been carried out in the framework of the Swiss NCCR project IM2. The authors would like to thank Ivan Magrin-Chagnolleau for suggesting the problem, and Mohammed Sadeghi for providing the scores of Model A.

9. REFERENCES
[1] P. Verlinde, G. Chollet, and M. Acheroy, Multimodal identity verification using expert fusion, Information Fusion, vol. 1, pp. 17-33, 2000.
[2] A. Martin and M. Przybocki, The NIST 1999 speaker recognition evaluation - an overview, Digital Signal Processing, vol. 10, pp. 1-18, 2000.
[3] G. W. Snedecor and W. G. Cochran, Statistical Methods, Iowa State University Press, 1989.
[4] J. L. Wayman, Confidence interval and test size estimation for biometric data, in Proceedings of the IEEE AutoID Conference, 1999.
[5] J. Koolwaaij, Automatic Speaker Verification in Telephony: a probabilistic approach, PrintPartners Ipskamp B.V., Enschede, 2000.
[6] A. Martin, Personal communication, 2004.
[7] J. Lüttin, Evaluation protocol for the XM2FDB database (Lausanne protocol), Tech. Rep. COM-05, IDIAP, 1998.
[8] K. Messer, J. Kittler, M. Sadeghi, S. Marcel, C. Marcel, S. Bengio, F. Cardinaux, C. Sanderson, J. Czyz, L. Vandendorpe, S. Srisuk, M. Petrou, W. Kurutach, A. Kadyrov, R. Paredes, B. Kepenekci, F. B. Tek, G. B. Akar, F. Deravi, and N. Mavity, Face verification competition on the XM2VTS database, in 4th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Springer-Verlag, 2003.
[9] J. Mariéthoz and S. Bengio, An alternative to silence removal for text-independent speaker verification, Technical Report IDIAP-RR 03-51, IDIAP, Martigny, Switzerland, 2003.
[10] T. G. Dietterich, Approximate statistical tests for comparing supervised classification learning algorithms, Neural Computation, vol. 10, no. 7, pp. 1895-1924, 1998.