A STATISTICAL SIGNIFICANCE TEST FOR PERSON AUTHENTICATION. IDIAP CP 592, rue du Simplon Martigny, Switzerland

Samy Bengio, Johnny Mariéthoz
IDIAP, CP 592, rue du Simplon 4, 1920 Martigny, Switzerland
{bengio,marietho}@idiap.ch

ABSTRACT

Assessing whether two models are statistically significantly different from each other is a very important step in research, although it has unfortunately not received enough attention in the field of person authentication. Several performance measures are often used to compare models, such as half total error rates (HTERs) and equal error rates (EERs), but since most of them are aggregates of two measures, such as the false acceptance rate and the false rejection rate, simple statistical tests cannot be used as is. We show in this paper how to adapt one of these tests in order to compute a confidence interval around one HTER measure or to assess the statistical significance of the difference between two HTER measures. We also compare our technique with other solutions that are sometimes used in the literature and show why they often yield overly optimistic results, leading to false statements about statistical significance.

1. INTRODUCTION

The general field of biometric person authentication is concerned with the use of biometric traits such as the voice, the face, the signature, or the fingerprints of persons in order to assess their identity [1]. In all these cases, researchers tend to use the same performance measures to estimate and compare their models. Most of them, such as the half total error rate (HTER) or the detection cost function (DCF), are in fact aggregates of other measures such as false acceptance rates (FARs) and false rejection rates (FRRs).
However, when it is time to compare a novel model to existing solutions on the same problem, a quick review of the current literature in person authentication shows that either no statistical test is used to assess the difference between models or, worse, statistical tests are wrongly used. This often ends up in over-optimistic results, tending to show, for instance, that the new model is statistically significantly better than the state of the art while this might not in fact be the case.

In this paper, we present a proper method to compute a simple statistical test, known as the test of two proportions, or z-test, adapted to the problem of aggregate measures such as HTER and DCF. In section 2, we first review the main performance measures used in verification tasks; then in section 3 we recall the purpose of the z-test, based on the Binomial distribution, and some of its variants. In section 4, we extend this test to the case of aggregate measures such as HTER, while in section 5 we present other possible solutions which, as explained, can lead to improper results. Section 6 compares our solution to these other methods and shows why they yield over-optimistic results. Section 7 concludes this paper with some proposed future work.

2. PERSON AUTHENTICATION MEASURES

A verification system has to deal with two kinds of events: either the person claiming a given identity is the one he claims to be (in which case he is called a client), or he is not (in which case he is called an impostor). Moreover, the system may generally take two decisions: either accept the client, or reject him and decide he is an impostor. Thus, the system may make two types of errors: a false acceptance, when the system accepts an impostor, and a false rejection, when the system rejects a client. Let FA be the total number of false acceptances made by the system, FR be the total number of false rejections, NC be the number of client accesses, and NI be the number of impostor accesses.
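In order to be independent of the specific dataset distribution, the performance of the system is often measured in terms of rates of these two different errors, as follows:

$$\mathrm{FAR} = \frac{\mathrm{FA}}{\mathrm{NI}}, \qquad \mathrm{FRR} = \frac{\mathrm{FR}}{\mathrm{NC}}. \tag{1}$$

A unique measure often used combines these two ratios into the so-called detection cost function (DCF) [2] as follows:

$$\mathrm{DCF} = \mathrm{Cost(FR)} \cdot P(\mathrm{client}) \cdot \mathrm{FRR} + \mathrm{Cost(FA)} \cdot P(\mathrm{impostor}) \cdot \mathrm{FAR} \tag{2}$$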

where P(client) is the prior probability that a client will use the system, P(impostor) is the prior probability that an impostor will use the system, Cost(FR) is the cost of a false rejection, and Cost(FA) is the cost of a false acceptance. A particular case of the DCF is known as the half total error rate (HTER), where the costs are equal to 1 and the probabilities are 0.5 each:

$$\mathrm{HTER} = \frac{\mathrm{FAR} + \mathrm{FRR}}{2}. \tag{3}$$

Most authentication systems are measured and compared using HTERs or variations of it. The main question we address in this paper is thus: how can we compute a confidence interval around an HTER, or assess the difference between two systems yielding different HTERs?

Note that in most benchmark databases used in the authentication literature, there is a significant imbalance between the number of client accesses and the number of impostor accesses. This is probably due to the relatively higher cost of obtaining the former with respect to the latter. This imbalance is the main reason why people use HTER to compare models and not the usual classification error used in the machine learning literature.

3. THE Z-TEST ON PROPORTIONS

Several statistical tests are available in the literature. For standard classification tasks, a simple yet often used test is known as the z-test, or test between two proportions. The rationale of this test is the following: given a set of examples, each drawn independently and identically distributed (i.i.d.) from an unknown distribution, our system is going to take a decision for each example, and this decision will be correct or not. Let us now look at the distribution of the number of errors that will be made by our classification system. Since each decision is independent from the others and is binary, it is reasonable to assume that the random variable $\mathbf{X}$ (footnote 1) representing the number of errors follows a Binomial distribution $B(n, p)$, where $n$ is the number of examples and $p$ is the proportion of errors. Moreover, it is known that a Binomial $B(n, p)$ can be approximated by a Normal distribution $\mathcal{N}(\mu, \sigma^2)$ with $\mu = np$ and $\sigma^2 = np(1-p)$ when $n$ is large enough (footnote 2).
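The Normal approximation of the Binomial stated above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names, the example values (n = 400, p = 0.1, threshold 45) and the continuity correction of 0.5 are illustrative assumptions and do not come from the paper.

```python
import math


def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ B(n, p)."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def normal_cdf(x, mu, sigma):
    """P(Z <= x) for Z ~ N(mu, sigma^2), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def rule_of_thumb_ok(n, p):
    """Rule of thumb quoted in the text: n * p * (1 - p) should exceed 10."""
    return n * p * (1.0 - p) > 10.0


if __name__ == "__main__":
    n, p = 400, 0.1                      # hypothetical test set size and error rate
    mu = n * p                           # mean of the approximating Normal
    sigma = math.sqrt(n * p * (1.0 - p))  # its standard deviation
    exact = binomial_cdf(45, n, p)
    # The +0.5 continuity correction is an assumption of this sketch.
    approx = normal_cdf(45 + 0.5, mu, sigma)
    print(rule_of_thumb_ok(n, p), exact, approx)
```

With these values the rule of thumb is satisfied (n p (1 - p) = 36), and the two probabilities should agree closely; with a much smaller n the gap is expected to widen.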
Finally, if $\mathbf{X} \sim \mathcal{N}(np, np(1-p))$, then the distribution of the proportion of errors is $\mathbf{Y} = \mathbf{X}/n \sim \mathcal{N}(p, p(1-p)/n)$.

Footnote 1. In this paper we use the following notation: bold letters such as $\mathbf{FA}$ represent random variables, while normal letters such as FA represent a particular value of the underlying random variable.

Footnote 2. A rule of thumb often used is to have $np(1-p)$ larger than 10.

[Figure 1: Normal curve of $\mathbf{Y}$; the area under the curve between $p - \beta$ and $p + \beta$ equals $\delta$.]
Fig. 1. Confidence intervals are computed using the area under the Normal curve.

3.1. Confidence Intervals

In order to compute a confidence interval around $p$, we can search for bounds $\{p - \beta, p + \beta\}$ such that

$$P(p - \beta < \mathbf{Y} < p + \beta) = \delta \tag{4}$$

where $\delta$ represents our confidence. This is called a two-sided test since we are searching for two bounds around $p$. Fortunately, finding $\beta$ in (4) for a given $\delta$ can be done efficiently for the Normal distribution. Figure 1 illustrates the problem graphically and Figure 2 summarizes the procedure to obtain the confidence interval.

3.2. Difference Between Proportions

Alternatively, if one wants to verify whether a given proportion of errors $p_A$ is statistically significantly different from another proportion $p_B$, a similar test can be performed. In the case where we already know that $p_A$ cannot be lower than $p_B$, a one-sided test is used; otherwise we use a two-sided test. Denoting by $\mathbf{Y}_A$ and $\mathbf{Y}_B$ the random variables representing the distributions of $p_A$ and $p_B$ respectively, the one-sided test is based on

$$P(\mathbf{Y}_A - \mathbf{Y}_B < p_A - p_B) = \delta \tag{5}$$

while the two-sided test is based on

$$P(|\mathbf{Y}_A - \mathbf{Y}_B| < |p_A - p_B|) = \delta \tag{6}$$

which can be solved using the fact that the difference between two independent Normal random variables is itself Normal, with a mean equal to the difference between the two Normal means and a variance equal to the sum of the two Normal variances. Hence, if $\mathbf{Y}_A$ is not statistically different from $\mathbf{Y}_B$, then

$$\mathbf{Y}_A - \mathbf{Y}_B \sim \mathcal{N}\left(0, \frac{p_A(1-p_A) + p_B(1-p_B)}{n}\right) \tag{7}$$

and if $\delta$ is higher than a predefined value such as 95%, then one can state that $p_A$ is significantly different from $p_B$. Note that a better estimate of the variance in (7) can be obtained by assuming $p_A = p_B$, which should be the case if they are not significantly different. In that case, equation (7) becomes

$$\mathbf{Y}_A - \mathbf{Y}_B \sim \mathcal{N}\left(0, \frac{2p(1-p)}{n}\right) \quad \text{with} \quad p = \frac{p_A + p_B}{2}. \tag{8}$$

Note however that using this test to verify whether two models give statistically significantly different results on the same test database relies on a wrong hypothesis, since $\mathbf{Y}_A$ and $\mathbf{Y}_B$ are not really independent: they correspond to decisions taken on the same test set.

3.3. Dependent Case

One possible solution proposed in [3] is to only take into account the examples for which the two models disagree. Let $p_{AB}$ be the proportion of examples correctly classified by model A and incorrectly classified by model B, and similarly let $p_{BA}$ be the proportion of examples correctly classified by model B and incorrectly classified by model A. In that case, the distribution $\mathbf{Y}_{AB}$ of the difference between the proportions of errors committed by each model is still Normal and, assuming the two models are not different from each other, should follow

$$\mathbf{Y}_{AB} \sim \mathcal{N}\left(0, \frac{p_{AB} + p_{BA}}{n}\right) \tag{9}$$

with the corresponding two-sided test

$$P(|\mathbf{Y}_{AB}| < |p_{AB} - p_{BA}|) = \delta. \tag{10}$$

This test is in fact very similar to the well-known McNemar test, based on a $\chi^2$ distribution.

In the literature, most people adopt equation (8) and some adopt equation (9); remember that in order to use equation (9), one needs to have access to all the scores of both models, and not just the numbers of errors. When possible, we will look at both solutions here, for the case of person authentication.

4. A STATISTICAL TEST FOR HTERS

HTERs are not proportions, but they are an average of two well-defined proportions, FAR and FRR.
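Given this, and assuming some hypotheses regarding FAR and FRR (footnote 3), we propose here to extend the test between two proportions to the case of HTERs as follows.

Footnote 3. Namely, that the distributions of FAR and FRR are independent. This may look false since they are both linked by the same model and threshold; in fact, given a model and associated threshold, these two quantities are indeed independent since they are computed on separate data (the client accesses and the impostor accesses), assuming the model was estimated on a separate training set, as it should be.

4.1. Confidence Intervals

Let the random variable $\mathbf{FA}$ represent the number of false acceptances. We can model it by a Binomial, and hence by a Normal, as follows:

$$\mathbf{FA} \sim B\left(\mathrm{NI}, \frac{\mathrm{FA}}{\mathrm{NI}}\right) \approx \mathcal{N}\left(\mathrm{NI}\,\frac{\mathrm{FA}}{\mathrm{NI}},\; \mathrm{NI}\,\frac{\mathrm{FA}}{\mathrm{NI}}\left(1 - \frac{\mathrm{FA}}{\mathrm{NI}}\right)\right) = \mathcal{N}\left(\mathrm{FA},\; \mathrm{FA}\,(1 - \mathrm{FAR})\right). \tag{11}$$

The random variable $\mathbf{FR}$ can be modeled accordingly. We can now write the distribution of the random variable $\mathbf{FAR}$ representing the ratio of false acceptances:

$$\mathbf{FAR} \sim \mathcal{N}\left(\frac{\mathrm{FA}}{\mathrm{NI}},\; \frac{\mathrm{FA}\,(1 - \mathrm{FAR})}{\mathrm{NI}^2}\right) = \mathcal{N}\left(\mathrm{FAR},\; \frac{\mathrm{FAR}\,(1 - \mathrm{FAR})}{\mathrm{NI}}\right) \tag{12}$$

and similarly for the random variable $\mathbf{FRR}$. Given the distributions of $\mathbf{FAR}$ and $\mathbf{FRR}$, we can estimate the distribution of the random variable $\mathbf{HTER}$ as follows:

$$\mathbf{FAR} + \mathbf{FRR} \sim \mathcal{N}\left(\mathrm{FAR} + \mathrm{FRR},\; \frac{\mathrm{FAR}(1 - \mathrm{FAR})}{\mathrm{NI}} + \frac{\mathrm{FRR}(1 - \mathrm{FRR})}{\mathrm{NC}}\right)$$

$$\frac{\mathbf{FAR} + \mathbf{FRR}}{2} \sim \mathcal{N}\left(\frac{\mathrm{FAR} + \mathrm{FRR}}{2},\; \frac{\mathrm{FAR}(1 - \mathrm{FAR})}{4\,\mathrm{NI}} + \frac{\mathrm{FRR}(1 - \mathrm{FRR})}{4\,\mathrm{NC}}\right)$$

$$\mathbf{HTER} \sim \mathcal{N}\left(\mathrm{HTER},\; \frac{\mathrm{FAR}(1 - \mathrm{FAR})}{4\,\mathrm{NI}} + \frac{\mathrm{FRR}(1 - \mathrm{FRR})}{4\,\mathrm{NC}}\right) \tag{13}$$

Using this last definition, we can now easily compute confidence intervals around HTERs using the methodology presented in section 3 and summarized in Figure 2 for classical confidence values used in the scientific literature. Moreover, the test can be easily extended to variations of HTER, such as the DCF. For instance, in the case of the well-known NIST evaluations performed yearly to compare speaker verification systems, which use the DCF measure described by equation (2) with Cost(FR) = 10, P(client) = 0.01, Cost(FA) = 1 and P(impostor) = 0.99, the underlying Normal becomes:

$$\mathbf{DCF} \sim \mathcal{N}\left(\mathrm{DCF},\; 0.99^2\,\frac{\mathrm{FAR}(1 - \mathrm{FAR})}{\mathrm{NI}} + \frac{\mathrm{FRR}(1 - \mathrm{FRR})}{100\,\mathrm{NC}}\right). \tag{14}$$

4.2. Difference Between HTERs

The distribution of the difference between two HTERs, assuming independence between the two underlying distributions, is

$$\mathbf{HTER}_A - \mathbf{HTER}_B \sim \mathcal{N}\left(0, \sigma^2_{\mathrm{INDEP}}\right) \tag{15}$$

with

$$\sigma^2_{\mathrm{INDEP}} = \frac{\mathrm{FAR}_A(1 - \mathrm{FAR}_A) + \mathrm{FAR}_B(1 - \mathrm{FAR}_B)}{4\,\mathrm{NI}} + \frac{\mathrm{FRR}_A(1 - \mathrm{FRR}_A) + \mathrm{FRR}_B(1 - \mathrm{FRR}_B)}{4\,\mathrm{NC}}$$

while the distribution of the difference between two HTERs, assuming dependence between the two underlying distributions, becomes

$$\mathbf{HTER}_A - \mathbf{HTER}_B \sim \mathcal{N}\left(0, \sigma^2_{\mathrm{DEP}}\right) \tag{16}$$

with

$$\sigma^2_{\mathrm{DEP}} = \frac{\mathrm{FAR}_{AB} + \mathrm{FAR}_{BA}}{4\,\mathrm{NI}} + \frac{\mathrm{FRR}_{AB} + \mathrm{FRR}_{BA}}{4\,\mathrm{NC}}$$

where $\mathrm{FAR}_{AB} = \mathrm{NI}_{AB} / \mathrm{NI}$ and $\mathrm{NI}_{AB}$ is the number of impostor accesses correctly rejected by model A and incorrectly accepted by model B, with similar definitions for $\mathrm{FAR}_{BA}$, $\mathrm{FRR}_{AB}$, and $\mathrm{FRR}_{BA}$. Hence, in summary, and using the standard confidence values used in the scientific literature, we obtain the simple methodology described in Figure 2 in order to compute statistical tests for person authentication tasks (footnote 4).

5. OTHER STATISTICAL TESTS

While several researchers have pointed out the use of the z-test to compute statistical tests around values such as FAR or FRR (see for instance [4]), we are not aware of any similar attempt for aggregate measures such as HTER, EER, or DCF. However, most people publishing results in verification use HTERs or DCF to assess the quality of their methods.

Footnote 4. While this summary concerns HTERs, it should now be straightforward to extend it to the general DCF function.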
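The tests of section 4 can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names and the example numbers are hypothetical, the variances carry the factor 1/4 that follows from HTER being the average of FAR and FRR, and the dependent case assumes the disagreement counts between the two models are available.

```python
import math

# Standard two-sided Normal quantiles for common confidence levels.
Z_VALUES = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}


def hter(far, frr):
    """Half total error rate: the average of FAR and FRR."""
    return (far + frr) / 2.0


def hter_variance(far, frr, n_imp, n_cli):
    """Variance of the Normal underlying one HTER (independent FAR and FRR)."""
    return far * (1.0 - far) / (4.0 * n_imp) + frr * (1.0 - frr) / (4.0 * n_cli)


def hter_confidence_interval(far, frr, n_imp, n_cli, level=0.95):
    """Two-sided confidence interval HTER +/- Z * sigma."""
    half = Z_VALUES[level] * math.sqrt(hter_variance(far, frr, n_imp, n_cli))
    center = hter(far, frr)
    return center - half, center + half


def two_sided_confidence(z):
    """delta = P(|N(0, 1)| < |z|)."""
    return math.erf(abs(z) / math.sqrt(2.0))


def z_independent(far_a, frr_a, far_b, frr_b, n_imp, n_cli):
    """|z| for two HTERs whose underlying distributions are treated as independent."""
    sigma = math.sqrt(hter_variance(far_a, frr_a, n_imp, n_cli)
                      + hter_variance(far_b, frr_b, n_imp, n_cli))
    return abs(hter(far_a, frr_a) - hter(far_b, frr_b)) / sigma


def z_dependent(imp_ab, imp_ba, cli_ab, cli_ba, n_imp, n_cli):
    """|z| using only accesses on which the two models disagree.

    imp_ab: impostor accesses rejected by A but accepted by B (imp_ba: reverse).
    cli_ab: client accesses accepted by A but rejected by B (cli_ba: reverse).
    """
    far_ab, far_ba = imp_ab / n_imp, imp_ba / n_imp
    frr_ab, frr_ba = cli_ab / n_cli, cli_ba / n_cli
    variance = (far_ab + far_ba) / (4.0 * n_imp) + (frr_ab + frr_ba) / (4.0 * n_cli)
    if variance == 0.0:
        raise ValueError("the two models never disagree; the test is undefined")
    return abs(far_ab - far_ba + frr_ab - frr_ba) / 2.0 / math.sqrt(variance)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to exercise the functions.
    print(hter_confidence_interval(0.10, 0.20, n_imp=1000, n_cli=100, level=0.95))
    z = z_independent(0.10, 0.20, 0.12, 0.22, n_imp=1000, n_cli=100)
    print(z, two_sided_confidence(z))
    print(z_dependent(30, 10, 4, 2, n_imp=1000, n_cli=100))
```

Two models would be declared significantly different at a given level when the returned |z| exceeds the matching entry of `Z_VALUES`, or equivalently when `two_sided_confidence(z)` exceeds that level.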
The confidence interval (CI) around an HTER is $\mathrm{HTER} \pm \sigma \cdot Z_{\alpha/2}$ with

$$\sigma = \sqrt{\frac{\mathrm{FAR}(1 - \mathrm{FAR})}{4\,\mathrm{NI}} + \frac{\mathrm{FRR}(1 - \mathrm{FRR})}{4\,\mathrm{NC}}}$$

and

$$Z_{\alpha/2} = \begin{cases} 1.645 & \text{for a 90\% CI} \\ 1.960 & \text{for a 95\% CI} \\ 2.576 & \text{for a 99\% CI} \end{cases}$$

Similarly, $\mathrm{HTER}_A$ and $\mathrm{HTER}_B$ are statistically significantly different if $|z| > Z_{\alpha/2}$ with

$$z = \frac{\mathrm{HTER}_A - \mathrm{HTER}_B}{\sqrt{\dfrac{\mathrm{FAR}_A(1 - \mathrm{FAR}_A) + \mathrm{FAR}_B(1 - \mathrm{FAR}_B)}{4\,\mathrm{NI}} + \dfrac{\mathrm{FRR}_A(1 - \mathrm{FRR}_A) + \mathrm{FRR}_B(1 - \mathrm{FRR}_B)}{4\,\mathrm{NC}}}}$$

in the independent case, and

$$z = \frac{\left(\mathrm{FAR}_{AB} - \mathrm{FAR}_{BA} + \mathrm{FRR}_{AB} - \mathrm{FRR}_{BA}\right)/2}{\sqrt{\dfrac{\mathrm{FAR}_{AB} + \mathrm{FAR}_{BA}}{4\,\mathrm{NI}} + \dfrac{\mathrm{FRR}_{AB} + \mathrm{FRR}_{BA}}{4\,\mathrm{NC}}}}$$

in the dependent case.

Fig. 2. Methodology for statistical tests around HTERs.

One simple solution could be to consider the classification error instead of the HTER and compute statistical tests around it. Since the classification error is a well-defined proportion, we can apply the z-test as well. Let $\mathbf{CLASS}$ be defined as the following random variable:

$$\mathbf{CLASS} = \frac{\mathbf{FA} + \mathbf{FR}}{\mathrm{NI} + \mathrm{NC}}$$

then the corresponding underlying Normal becomes:

$$\mathbf{CLASS} \sim \mathcal{N}\left(\frac{\mathrm{FA} + \mathrm{FR}}{\mathrm{NI} + \mathrm{NC}},\; \frac{\mathrm{FA} + \mathrm{FR}}{(\mathrm{NI} + \mathrm{NC})^2}\left(1 - \frac{\mathrm{FA} + \mathrm{FR}}{\mathrm{NI} + \mathrm{NC}}\right)\right) \tag{17}$$

but remember that while this test is correct to assess models according to their respective classification error, it does not say anything about the confidence one has over the corresponding HTER, which is the measure of interest in person authentication. In fact, we will show in section 6.1 that, under reasonable assumptions, the variance of $\mathbf{CLASS}$ in equation (17) is always smaller than the variance of $\mathbf{HTER}$ in equation (13); hence confidence tests using (17) will always result in over-confident statistical significance or smaller confidence intervals. This will be explored further in the following section.

Another possible solution is to consider the HTER itself as a proportion (which it is not directly) and compute the statistical test on it. Let $\mathbf{NAIVE}$ be the random variable of this value; the underlying Normal becomes:

$$\mathbf{NAIVE} \sim \mathcal{N}\left(\mathrm{HTER},\; \frac{\mathrm{HTER}(1 - \mathrm{HTER})}{\mathrm{NI} + \mathrm{NC}}\right) \tag{18}$$

Again, we will show in section 6.1 that under reasonable assumptions, the variance of $\mathbf{NAIVE}$ in equation (18) is always smaller than the variance of $\mathbf{HTER}$ in equation (13); hence confidence tests using (18) should always result in over-confident statistical significance or smaller confidence intervals.

Yet another solution that has been proposed by some researchers (see for instance [5]) is to compute a statistical test for FAR and FRR separately and then combine the results (footnote 5). For instance, in order to compute a confidence interval for HTER, one would average both upper bounds and both lower bounds found separately by the FAR and FRR tests. Besides the fact that there is no theoretical ground to justify such an approach, there is an evident problem with all approaches that consider FARs and FRRs separately. Two models could yield very similar HTERs, but for some reason (in general linked to the choice of the threshold, which is usually made on a separate data set) one could be slightly biased toward FRRs and the other one slightly biased toward FARs. In such a case, these tests would consider them statistically significantly different while they would not be when considering their respective HTERs globally instead. For this reason, we will not consider this solution further here.

6. ANALYSIS

We would like to compare in this section the use of the proposed statistical test for HTERs with the two other tests presented in section 5.
We will first show that under some reasonable conditions, increasing the ratio between NI and NC will increase the difference between the variance of the Normal of the proposed test and the variance of the Normal of the other tests. Afterward, we present two real case studies where the use of the proposed statistical test would have yielded a different conclusion with regard to the confidence intervals and the difference between the compared models.

Footnote 5. The well-known NIST evaluation campaigns have also apparently recently investigated the use of the McNemar test to assess speaker verification methods, but have considered FARs and FRRs separately [6].

6.1. Theoretical Analysis

Let us first look at the conditions under which $\sigma^2_{13}$, the variance of $\mathbf{HTER}$ as written in equation (13), is higher than $\sigma^2_{18}$, the variance of $\mathbf{NAIVE}$ as written in equation (18):

$$\sigma^2_{13} > \sigma^2_{18} \tag{19}$$

implies that

$$\frac{\mathrm{FAR}(1 - \mathrm{FAR})}{4\,\mathrm{NI}} + \frac{\mathrm{FRR}(1 - \mathrm{FRR})}{4\,\mathrm{NC}} > \frac{\mathrm{HTER}(1 - \mathrm{HTER})}{\mathrm{NI} + \mathrm{NC}}$$

which can be simplified and yields

$$0 > \left(\mathrm{FAR} \cdot \mathrm{NC} - \mathrm{FRR} \cdot \mathrm{NI}\right)\left((1 - \mathrm{FRR}) \cdot \mathrm{NI} - \mathrm{NC} \cdot (1 - \mathrm{FAR})\right)$$

which means that inequality (19) will be true when either NC is much less or much higher than NI (which is in general the case), and FAR is similar to FRR (again, when the threshold is chosen such that we have equal error rate (EER) on a separate validation set, as is often done, this is reasonable).

Let us now look at the conditions under which $\sigma^2_{13}$ is higher than $\sigma^2_{17}$, the variance of $\mathbf{CLASS}$, representing the classification error:

$$\sigma^2_{13} > \sigma^2_{17} \tag{20}$$

implies that

$$\frac{\mathrm{FAR}(1 - \mathrm{FAR})}{4\,\mathrm{NI}} + \frac{\mathrm{FRR}(1 - \mathrm{FRR})}{4\,\mathrm{NC}} > \frac{\mathrm{FA} + \mathrm{FR}}{(\mathrm{NI} + \mathrm{NC})^2} - \frac{(\mathrm{FA} + \mathrm{FR})^2}{(\mathrm{NI} + \mathrm{NC})^3}$$

which can be re-written and, assuming FAR is similar to FRR, simplified into

$$\mathrm{NI}^2 > \mathrm{NC}^2 \tag{21}$$

which is true as long as NI is higher than NC, which is in general the case, again.

In order to verify these relations graphically, we have fixed some variables to reasonable values (FAR = 0.1, FRR = 0.2, NC = 100) and have varied NI, the number of impostor accesses. Figure 3 shows the relation between the standard deviation of the underlying Normal distributions and the ratio between NI and NC.
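The quantities plotted in Figure 3 can be recomputed directly from equations (13), (17) and (18). The following is a minimal sketch using the fixed values quoted above (FAR = 0.1, FRR = 0.2, NC = 100); the function names and the three sampled ratios are illustrative choices, not taken from the paper.

```python
import math


def sigma_hter(far, frr, n_imp, n_cli):
    """Standard deviation of the proposed HTER distribution, equation (13)."""
    return math.sqrt(far * (1.0 - far) / (4.0 * n_imp)
                     + frr * (1.0 - frr) / (4.0 * n_cli))


def sigma_naive(far, frr, n_imp, n_cli):
    """Standard deviation when HTER is treated as a proportion, equation (18)."""
    h = (far + frr) / 2.0
    return math.sqrt(h * (1.0 - h) / (n_imp + n_cli))


def sigma_class(far, frr, n_imp, n_cli):
    """Standard deviation of the classification error, equation (17)."""
    p = (far * n_imp + frr * n_cli) / (n_imp + n_cli)  # (FA + FR) / (NI + NC)
    return math.sqrt(p * (1.0 - p) / (n_imp + n_cli))


if __name__ == "__main__":
    FAR, FRR, NC = 0.1, 0.2, 100
    for ratio in (1, 10, 100):  # sampled values of NI / NC
        ni = ratio * NC
        print(ratio,
              sigma_hter(FAR, FRR, ni, NC),
              sigma_naive(FAR, FRR, ni, NC),
              sigma_class(FAR, FRR, ni, NC))
```

At a ratio of 1 the three values nearly coincide; the gap between the HTER standard deviation and the two others opens as NI grows, because only the HTER variance keeps a term that depends on NC alone.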
As expected, the higher the ratio NI/NC, the bigger the difference between the standard deviations of the Normal distributions related to the three statistical tests. Moreover, we see that the standard deviation of the proposed HTER distribution stays close to that of the FRR distribution, which is mostly influenced by NC, the number of client accesses, and does not decrease as NI increases, contrary to the two other solutions. Since the size of the confidence interval is directly related to the standard deviation, this figure essentially shows that the confidence interval computed using the proposed technique will always be larger than that of the two other techniques. Hence two verification methods yielding two different HTERs could easily be considered statistically significantly different using one of the methods described in section 5, while they would not be considered statistically significantly different using the proposed technique. In fact, the figure shows that the confidence interval is directly influenced by the minimum of NC and NI and not their sum.

[Figure 3: standard deviation (log scale, 0.001 to 0.1) versus ratio NI/NC (log scale, 1 to 100), with curves for HTER, NAIVE, CLASS, FAR and FRR; FAR = 0.1, FRR = 0.2, NC = 100.]
Fig. 3. Standard deviation of the Normal distributions underlying the three different choices of distributions for a statistical test on HTERs. Also shown: standard deviations of both the FAR and FRR distributions. All curves are in log-log scale.

In the next two subsections, we present two real case studies where the use of the proposed statistical test would have yielded a different conclusion.

6.2. Empirical Analysis on XM2VTS

In the first case, the well-known text-independent audio-visual verification database XM2VTS [7] was used. In this database, the test set consists of up to 112000 impostor accesses and only 400 client accesses, for a total of 112400 accesses. In a recent competition [8], several models were compared (footnote 6) on a face verification task and we will look here at the results of the best model, hereafter called model A, and the third best model, hereafter called model B, apparently significantly worse. Table 1 shows the difference of performance in terms of HTER between models A and B. Having up to 112400 examples, one could indeed expect the difference between the two models to be statistically significant.

Footnote 6. While this is not the topic of this paper since it should apply to any data/model, people interested in knowing more about the problem tackled in this case study are referred to [8]; we used results of the models of IDIAP and UniS-NC on the automatic registration task, using Lausanne Protocol I.
Furthermore, note that the results of UniS-NC are slightly different from those published in [8], but correspond to the list of scores provided by one of the authors of the method.

Method     FAR (%)   FRR (%)   HTER (%)
Model A    1.15      2.50      1.82
Model B    1.95      2.75      2.35

Table 1. HTER performance comparison on the test set between models A and B when the threshold was selected according to the Equal Error Rate (EER) criterion on a separate validation set.

δ      HTER (eq. 13)   NAIVE (eq. 18)   CLASS (eq. 17)
90%    1.285%          0.131%           0.105%
95%    1.531%          0.156%           0.125%
99%    2.013%          0.206%           0.164%

Table 2. Confidence intervals around results of model A, computed using three different hypotheses and their respective equations.

Table 2 shows the size of the confidence intervals computed around the result (HTER or classification error) obtained by model A for the three methods and for three different values of δ (90%, 95% and 99%). As we can see, for all values of δ, the size of the interval is about one order of magnitude larger for the proposed method than for the two other methods.

       HTER DEP (eq. 16)   HTER INDEP (eq. 15)   NAIVE (eq. 18)   CLASS (eq. 17)
δ      69.2%               64.7%                 100.0%           100.0%
σ      0.0052              0.0057                0.0006           0.0005

Table 3. Confidence value δ on the fact that model A is statistically significantly different from model B, according to their respective performance (HTER or classification error), and computed using four different hypotheses and their respective equations. For each method, we also give σ, the standard deviation of the corresponding statistical test.

Table 3 verifies whether the HTER obtained by model A gives statistically significantly different results than the one obtained by model B, using the two-sided test of equation (6) for the independent cases and (10) for the dependent case. According to both proposed HTER methods (independent and dependent cases), both models are equivalent (the confidence on their difference is much less than, say, 90%), while according to both other methods, the models would be different with 100% confidence! Remember that there were only 400 client accesses during the test, hence it is reasonable that only one error on these accesses makes a visible difference in HTER while it cannot seriously be considered statistically significant. This is well captured by our technique, but not by the other ones. Moreover, in this case, the dependence/independence assumption did not have any impact on the final decision.

6.3. Empirical Analysis on NIST 2000

In the second case, the well-known text-independent speaker verification benchmark database NIST 2000 was used. Here, the test set consists of 57748 impostor accesses and 5825 client accesses, for a total of 63573 accesses. We compared the performance of two models (footnote 7), hereafter called models C and D. Note that, while on XM2VTS the ratio between the number of impostor and client accesses was very high (280 times more), for the NIST database the ratio is more reasonable, but still high (around 10).

Footnote 7. Once again, while this is not the topic of this paper, people interested in knowing more about the problem tackled in this case study are referred to [9].

Method     FAR (%)   FRR (%)   HTER (%)
Model C    13.1      9.6       11.4
Model D    15.8      7.8       11.8

Table 4. HTER performance comparison on the test set between models C and D when the threshold was selected according to the Equal Error Rate (EER) criterion on a separate validation set.

δ      HTER (eq. 13)   NAIVE (eq. 18)   CLASS (eq. 17)
90%    0.676%          0.414%           0.436%
95%    0.805%          0.493%           0.519%
99%    1.058%          0.648%           0.682%

Table 5. Confidence intervals around results of model C, computed using three different hypotheses and their respective equations.

       HTER DEP (eq. 16)   HTER INDEP (eq. 15)   NAIVE (eq. 18)   CLASS (eq. 17)
δ      98.8%               89.1%                 98.9%            100.0%
σ      0.0016              0.0028                0.0018           0.0019

Table 6. Confidence value δ on the fact that model C is statistically significantly different from model D, according to their respective performance (HTER or classification error), and computed using four different hypotheses and their respective equations. For each method, we also give σ, the standard deviation of the corresponding statistical test.

We now present the same kinds of results as for the XM2VTS case. Table 4 shows the difference of performance in terms of HTER between models C and D. Table 5 shows the size of the confidence intervals computed around the result obtained by model C; as we can see, given a ratio of impostor to client accesses around 10 instead of 280, the difference between the confidence intervals is less drastic but still exists. Table 6 verifies whether the HTER obtained by model C gives statistically significantly different results than the one obtained by model D. For each test, we show both the confidence value δ and the standard deviation σ of the corresponding statistical test.

As can be seen, in the DEP case, σ is very small, even smaller than for the NAIVE and CLASS solutions, hence yielding a very high confidence that the two models are different. In order to explain this unexpected result, note that none of the tests takes into account the possible dependence existing between the compared models. Indeed, if the two models are based on the same technique (which is often the case; for instance, in speaker verification, most systems are based on Gaussian Mixture Models, but trained with slightly different assumptions), then both systems will have a natural tendency to give very correlated scores on the same example. In the case of the two models trained on the XM2VTS database, they were very different (one was based on a Gaussian Mixture Model, while the other one was based on Linear Discriminant Analysis and Normalized Correlation), while for the models trained on the NIST database, both were in fact variations of Gaussian Mixture Models, hence are probably very correlated. Unfortunately, there exists no test that takes this dependency into account. Hence, for instance, the variance $(p_{AB} + p_{BA})/n$ of equation (9) will quickly become very small simply because the models are correlated and not just because the examples are the same. Using this equation will thus result in an underestimate of the true variance when models are very correlated, as empirically shown in Table 6.
On the other hand, the INDEP case does not take into account the dependency between the data, but it is reasonable to expect that the effect of this error may be balanced by the fact that it does not take into account the dependency between the models either. The correct solution probably lies somewhere between these two solutions; hence, one should probably favor the most conservative test, so as to only assert a statistical difference when both tests agree on it (hence, here, with only 89.1% confidence).
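As a sanity check, the HTER columns of the tables above can be recomputed from the reported error rates and access counts. The following is a minimal sketch, assuming that the reported interval sizes are full widths (2 · Z · σ) and that the rounded FAR and FRR percentages of Tables 1 and 4 are close enough to the unrounded values; the function and variable names are illustrative.

```python
import math


def hter(far, frr):
    """Half total error rate."""
    return (far + frr) / 2.0


def hter_variance(far, frr, n_imp, n_cli):
    """Variance of the Normal underlying one HTER, as in equation (13)."""
    return far * (1.0 - far) / (4.0 * n_imp) + frr * (1.0 - frr) / (4.0 * n_cli)


def ci_width(far, frr, n_imp, n_cli, z_value):
    """Full width of the two-sided confidence interval around an HTER."""
    return 2.0 * z_value * math.sqrt(hter_variance(far, frr, n_imp, n_cli))


def indep_test(model_a, model_b, n_imp, n_cli):
    """Return (sigma, delta) for the independent-case test between two HTERs."""
    far_a, frr_a = model_a
    far_b, frr_b = model_b
    sigma = math.sqrt(hter_variance(far_a, frr_a, n_imp, n_cli)
                      + hter_variance(far_b, frr_b, n_imp, n_cli))
    z = abs(hter(far_a, frr_a) - hter(far_b, frr_b)) / sigma
    delta = math.erf(z / math.sqrt(2.0))  # two-sided confidence P(|N(0,1)| < z)
    return sigma, delta


# (FAR, FRR) pairs and access counts as reported in sections 6.2 and 6.3.
XM2VTS = {"n_imp": 112000, "n_cli": 400, "A": (0.0115, 0.0250), "B": (0.0195, 0.0275)}
NIST = {"n_imp": 57748, "n_cli": 5825, "C": (0.131, 0.096), "D": (0.158, 0.078)}

if __name__ == "__main__":
    print(ci_width(*XM2VTS["A"], XM2VTS["n_imp"], XM2VTS["n_cli"], 1.645))
    print(indep_test(XM2VTS["A"], XM2VTS["B"], XM2VTS["n_imp"], XM2VTS["n_cli"]))
    print(ci_width(*NIST["C"], NIST["n_imp"], NIST["n_cli"], 1.645))
    print(indep_test(NIST["C"], NIST["D"], NIST["n_imp"], NIST["n_cli"]))
```

Under these assumptions the outputs should land close to the 90% rows of Tables 2 and 5 and to the INDEP columns of Tables 3 and 6. The DEP columns cannot be recomputed this way, since they require the per-access decisions of both models.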

7. CONCLUSION

In this paper, we have proposed a proper method to compute statistical tests on aggregate measures such as HTER or DCF, often used in person authentication. We have also shown why using other approximations, such as tests on the classification error instead, would result in over-optimistic decisions. We have given some empirical evidence using two benchmark databases. It is important to note that the test of two proportions is not the ultimate statistical test, and there exist other tests that are known to be sometimes more appropriate for classification tasks, such as complex cross-validation techniques (see for instance [10]). However, none of these tests has so far addressed the problem of dependence between the tested models. Nevertheless, an important finding of this paper is that when people design new databases for person authentication, they should keep in mind that it is probably not worth having a huge imbalance between client and impostor access numbers, since the statistical significance of the results will mainly depend on the smallest of these two numbers (provided equal costs for false acceptances and false rejections).

8. ACKNOWLEDGMENTS

This research has been carried out in the framework of the Swiss NCCR project IM2. The authors would like to thank Ivan Magrin-Chagnolleau for suggesting the problem, and Mohammed Sadeghi for providing the scores of Model A.

9. REFERENCES

[1] P. Verlinde, G. Chollet, and M. Acheroy, Multimodal identity verification using expert fusion, Information Fusion, vol. 1, pp. 17-33, 2000.
[2] A. Martin and M. Przybocki, The NIST 1999 speaker recognition evaluation - an overview, Digital Signal Processing, vol. 10, pp. 1-18, 2000.
[3] G. W. Snedecor and W. G. Cochran, Statistical Methods, Iowa State University Press, 1989.
[4] J. L. Wayman, Confidence interval and test size estimation for biometric data, in Proceedings of the IEEE AutoID Conference, 1999.
[5] J. Koolwaaij, Automatic Speaker Verification in Telephony: a probabilistic approach, PrintPartners Ipskamp B.V., Enschede, 2000.
[6] A. Martin, Personal communication, 2004.
[7] J. Lüttin, Evaluation protocol for the XM2FDB database (Lausanne protocol), Tech. Rep. COM-05, IDIAP, 1998.
[8] K. Messer, J. Kittler, M. Sadeghi, S. Marcel, C. Marcel, S. Bengio, F. Cardinaux, C. Sanderson, J. Czyz, L. Vandendorpe, S. Srisuk, M. Petrou, W. Kurutach, A. Kadyrov, R. Paredes, B. Kepenekci, F. B. Tek, G. B. Akar, F. Deravi, and N. Mavity, Face verification competition on the XM2VTS database, in 4th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, 2003, Springer-Verlag.
[9] J. Mariéthoz and S. Bengio, An alternative to silence removal for text-independent speaker verification, Technical Report IDIAP-RR 03-51, IDIAP, Martigny, Switzerland, 2003.
[10] T. G. Dietterich, Approximate statistical tests for comparing supervised classification learning algorithms, Neural Computation, vol. 10, no. 7, pp. 1895-1924, 1998.