Speaker Recognition Using Artificial Neural Networks: RBFNNs vs. EBFNNs
BALASKA Nawel
Member of the Systems & Control Research Group within the LRES Lab., University 20 Août 1955 of Skikda, BP: 26, Skikda, 21000, Algeria
nabalaska@yahoo.fr

AHMIDA Zahir
Director of the Systems & Control Research Group within the LRES Lab., University 20 Août 1955 of Skikda, BP: 26, Skikda, 21000, Algeria
zahirahmida@yahoo.fr

GOUTAS Ahcène
Director of the Signal & Image Processing Research Group within the LRES Lab., University 20 Août 1955 of Skikda, BP: 26, Skikda, 21000, Algeria
a.goutas@yahoo.fr

Abstract- This paper deals with the application of Radial Basis Function Neural Networks (RBFNNs) and Elliptical Basis Function Neural Networks (EBFNNs) to text-independent speaker recognition experiments. These include both closed-set and open-set speaker identification and speaker verification. The database used is a subset of the TIMIT database consisting of 60 speakers from different dialect regions. LP-derived Cepstral Coefficients (LPCC) are used as the speaker-specific features. Simulation results show that EBFNNs outperform RBFNNs in several speaker recognition experiments.

Keywords- Speaker recognition, speaker identification, speaker verification, Radial Basis Function Neural Networks (RBFNNs), Elliptical Basis Function Neural Networks (EBFNNs).

I. INTRODUCTION

In general, speaker recognition systems fall into two main categories: speaker identification systems and speaker verification systems. In speaker identification, the goal is to identify an unknown voice from a set of known voices, whereas the objective of speaker verification is to verify whether an unknown voice matches the voice of a speaker whose identity is being claimed [1], [2], [3]. Speaker identification systems are mainly used in criminal investigation, while speaker verification systems are used in security access control. A generic speaker recognition system is shown in Fig. 1.

Fig. 1. Speaker recognition system (Speech Signal → Feature Extraction → Classification → Speaker Identification or Speaker Verification)

In Fig. 1, the desired features are first extracted from the speech signal [4], [5]. The extracted features are then used as inputs to a classifier, which makes the final decision regarding identification or verification.

Speaker identification systems can be closed-set or open-set. Closed-set speaker identification refers to the case where the speaker is known a priori to be a member of a set of speakers. Open-set speaker identification includes the additional possibility that the speaker may not be a member of the set of speakers. Thresholding is often used to determine whether a speaker belongs to the open set in speaker identification and/or speaker verification.

Another distinguishing feature of speaker recognition systems is whether they are text-dependent or text-independent. Text-dependent speaker recognition systems require that the speaker utter a specific phrase or a given password. Text-independent speaker recognition systems identify the speaker regardless of his utterance [1]. This paper focuses on the text-independent speaker identification and speaker verification tasks.

The organization of this paper is as follows. Section II reviews the RBFNNs and EBFNNs for text-independent speaker recognition. Section III describes the database used in this paper and the speech analysis. Section IV reviews the training procedure, Section V describes the recognition procedure, Section VI discusses the conducted experiments and, finally, Section VII gives the conclusions.

II. RBF AND EBF NEURAL NETWORKS

Radial Basis Function Neural Networks and Elliptical Basis Function Neural Networks can be viewed as feed-forward neural networks with a single hidden layer. An RBF or EBF network with D inputs, M hidden units and K outputs is shown in Fig. 2. The output layer forms a linear combiner which calculates the weighted sum of the outputs of the hidden units [6].
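As an illustration of the network computation described here (made precise in Eqs. (1)-(3) below), the forward pass can be sketched in Python with NumPy. This is our own sketch, not the authors' code (the paper's experiments were run in Matlab), and all function and variable names are illustrative.

```python
import numpy as np

def rbf_activations(Y, centres, widths):
    # Phi_j(Y) = exp(-||Y - c_j||^2 / (2 sigma_j^2))   -- Eq. (2)
    # Y: (D,) input frame; centres: (M, D); widths: (M,)
    d2 = ((Y - centres) ** 2).sum(axis=1)
    return np.exp(-d2 / (2.0 * widths ** 2))

def ebf_activations(Y, centres, covariances, gammas):
    # Phi_j(Y) = exp(-(Y - c_j)' Sigma_j^{-1} (Y - c_j) / (2 gamma_j)) -- Eq. (3)
    # covariances: list of (D, D) matrices Sigma_j; gammas: (M,) smoothing terms
    phi = np.empty(len(centres))
    for j, (c, S, g) in enumerate(zip(centres, covariances, gammas)):
        d = Y - c
        phi[j] = np.exp(-d @ np.linalg.solve(S, d) / (2.0 * g))
    return phi

def network_output(phi, weights, bias):
    # f_k(Y) = w_k0 + sum_j w_kj Phi_j(Y)              -- Eq. (1)
    # weights: (K, M); bias: (K,)
    return bias + weights @ phi
```

Note that an RBF unit is the special case of an EBF unit in which Σ_j is implicitly σ_j²·I, which is why Eq. (2) needs only a scalar width per hidden unit.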
Fig. 2. RBF or EBF Neural Network structure (input layer with D inputs, hidden layer with basis functions Φ_1, ..., Φ_M, output layer with outputs f_1, ..., f_K)

The k-th output of an RBF or EBF neural network has the form:

f_k(Y) = w_k0 + Σ_{j=1}^{M} w_kj Φ_j(Y),  k = 1, ..., K  (1)

where the w_kj are the network weights. For an RBF network the activation function is:

Φ_j(Y) = exp( −‖Y − c_j‖² / (2σ_j²) ),  j = 1, ..., M  (2)

where ‖·‖ denotes the Euclidean distance. For an EBF network, on the other hand:

Φ_j(Y) = exp( −(Y − c_j)' Σ_j⁻¹ (Y − c_j) / (2γ_j) ),  j = 1, ..., M  (3)

In (2) and (3), the Φ_j are the activation functions, Y = {y_t, t = 1, ..., T} is the input vector of length T and dimension D, the c_j are the function centres, the σ_j are the function widths, the Σ_j are the covariance matrices and the γ_j are smoothing parameters controlling the spread of the basis functions [6].

III. DATABASE AND SPEECH ANALYSIS

The database for the experiments reported in this paper is a subset of the DARPA TIMIT database [7]. This set represents 60 speakers from the different dialect regions and includes 44 males and 16 females. These speakers were divided into three equal subsets: speaker set, anti-speaker set and impostor set.

The pre-processing of the TIMIT speech data consists of several steps. First, the speech data is processed by the application of a pre-emphasis filter H(z) = 1 − a·z⁻¹. A 30 ms Hamming window is applied to the speech every 10 ms. A 12th-order linear predictive (LP) analysis is performed for each speech frame. The features consist of the 12 cepstral coefficients (LPCC) derived from the LP coefficients [4].

There are ten utterances for each speaker in the selected set. Five of the utterances (SX) are concatenated and used for training. The remaining five sentences (SA, SI) are used individually for testing. The mean duration of the training data is … s per speaker, and the mean duration of each test utterance is 2.79 s.

IV. TRAINING PROCEDURE

Each speaker in the speaker set was assigned a personalized RBFNN or EBFNN modeling the characteristics of his or her own voice. Each network was trained to recognize the data derived from two classes, the speaker class and the anti-speaker class [6]. Therefore, each network was composed of 12 inputs, a varied number of hidden nodes (M), and two outputs, with each output representing one class. Only the first of these outputs was used in closed-set speaker identification.

For each RBF neural network, the K-means algorithm [8] was applied to the corresponding speaker and all anti-speakers separately to obtain the function centres [6]. Next, the P-nearest-neighbour algorithm with P set to 2 was applied to the resulting function centres to determine the function widths [6]. For the EBF neural networks, the function centres and covariance matrices were determined by the EM algorithm with diagonal covariance matrices or full covariance matrices, or by the K-means algorithm and the sample covariance [6]. Finally, the weights of the output layer of the RBFNNs and EBFNNs can be obtained using the Least Mean Squares (LMS) algorithm [8], [9]. Target values during training were [+1, 0] for a speaker frame and [0, +1] for an anti-speaker frame.

V. RECOGNITION PROCEDURE

A. Closed-set speaker identification

The identification test was done by comparing the outputs of all 20 speakers' RBF and EBF networks for a particular utterance, giving 100 speaker identification tests in total. The network with the highest first output was considered to belong to the true speaker [10], [11]. The structure of the speaker identification system with S speakers is shown in Fig. 3.
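The training steps described in Section IV can be sketched as follows. This is a hedged illustration rather than the authors' implementation: a naive Lloyd's iteration stands in for the K-means step, and the output-layer weights are solved in closed form by least squares instead of the iterative LMS rule cited in the text.

```python
import numpy as np

def kmeans(X, M, iters=20, seed=0):
    # Naive Lloyd's K-means: pick M data points as initial centres,
    # then alternate assignment and mean-update steps.
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), M, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None, :] - centres[None]) ** 2).sum(-1).argmin(1)
        for j in range(M):
            if (labels == j).any():
                centres[j] = X[labels == j].mean(0)
    return centres

def pnn_widths(centres, P=2):
    # P-nearest-neighbour heuristic: sigma_j is the mean distance from
    # c_j to its P nearest other centres (the paper uses P = 2).
    D = np.linalg.norm(centres[:, None] - centres[None], axis=-1)
    D.sort(axis=1)                 # column 0 is the zero self-distance
    return D[:, 1:P + 1].mean(1)

def fit_output_weights(Phi, targets):
    # Solve [1 | Phi] W ~= targets in the least-squares sense;
    # Phi: (N, M) hidden activations, targets: (N, K) e.g. [+1, 0] rows.
    A = np.hstack([np.ones((len(Phi), 1)), Phi])
    W, *_ = np.linalg.lstsq(A, targets, rcond=None)
    return W                       # row 0 holds the biases w_k0
```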
Fig. 3. Structure of the speaker identification system with S speakers (the input Y is fed to all S RBF/EBF networks; the speaker whose first output f_1^s(Y) is maximum is selected)

B. Speaker verification

As the ratio of training vectors between the speaker class and the anti-speaker class is about 1 to 20 (for each network, training vectors were derived from the corresponding speaker and 20 anti-speakers), the network will favor the anti-speaker during verification by always giving outputs close to one for the anti-speaker class and close to zero for the speaker class. [6] proposed to solve this problem by scaling the outputs during verification so that the new average outputs are approximately equal to 0.5 for both classes. In other words:

f̃_k(Y) = f_k(Y) / (2 P(C_k)),  k = 1, 2.

A way to estimate the prior probability P(C_k) is to divide the number of patterns in class C_k by the total number of patterns in the training set.

During verification, a vector Y = {y_t, t = 1, ..., T} corresponding to an utterance spoken by an unknown speaker was fed into the network. Then the scaled average outputs corresponding to the speaker and the anti-speaker were computed [6]:

z_k = (1/T) Σ_{t=1}^{T} exp(f̃_k(y_t)) / [ exp(f̃_1(y_t)) + exp(f̃_2(y_t)) ],  k = 1, 2  (4)

where T is the number of patterns in the test sequence Y. Verification decisions were based on the criterion:

z = z_1 − z_2:  z > ζ: accept the claimant;  z ≤ ζ: reject the claimant  (5)

where ζ ∈ [−1, +1] is a threshold controlling the false rejection rate and the false acceptance rate. The false rejection rate is the rate at which true speakers are falsely rejected, while the false acceptance rate measures the rate at which impostors are incorrectly accepted.

VI. EXPERIMENTAL RESULTS

Several speaker recognition experiments were performed to evaluate the RBFNN and EBFNN classifiers. These experiments include closed-set and open-set speaker identification and speaker verification.
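The verification rule of Section V.B (Eqs. (4)-(5)) can be sketched as below; an illustrative sketch with invented names, assuming the per-frame network outputs for an utterance are available as a T×2 array with the speaker output in column 0 and the anti-speaker output in column 1.

```python
import numpy as np

def verify(frame_outputs, priors, zeta):
    # frame_outputs: (T, 2) raw outputs [speaker, anti-speaker] per frame
    # priors: (P(C_1), P(C_2)) estimated from training-set class counts
    scaled = np.asarray(frame_outputs) / (2.0 * np.asarray(priors))  # f_k/(2 P(C_k))
    e = np.exp(scaled)
    z = (e / e.sum(axis=1, keepdims=True)).mean(axis=0)  # Eq. (4): softmax averaged over T frames
    score = z[0] - z[1]                                  # Eq. (5): z = z_1 - z_2
    return score > zeta, score
```

Because each frame's softmax terms sum to one, z_1 + z_2 = 1 and the score z_1 − z_2 always lies in [−1, +1], matching the range of the threshold ζ.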
The Identification Rate (IR) is used to evaluate the performance of the closed-set speaker identification systems. The Equal Error Rate (EER) is used to evaluate the performance of the open-set speaker identification and speaker verification systems. The EER is defined as the point at which the two errors, the False Rejection Rate (FRR) and the False Acceptance Rate (FAR), are equal.

Note that all operating curves (FRR versus FAR; ROC: Receiver Operating Characteristics) presented in this section for speaker verification and open-set speaker identification represent the posterior performance of the classifiers, given the speaker and impostor scores. The EER, on the other hand, is obtained by adjusting the threshold during verification to equalize the FAR and FRR. Though this adjustment is impractical in real systems, the EER indicates the potential of the networks. The threshold used is a posterior global threshold varying over [-1:0.01:1].

Each network is composed of M centres contributed by the corresponding speaker and anti-speakers (M = speaker centres + anti-speaker centres). EBF-Diag and EBF-Full denote the experiments in which the parameters were obtained by the EM algorithm with diagonal covariance matrices and full covariance matrices, respectively. EBF-Sample denotes the case where the K-means algorithm and the sample covariance were used to estimate the function centres and covariance matrices of the EBF networks.

Table 1 shows the number of speaker and impostor tests performed in each task.

TABLE 1
NUMBER OF SPEAKER AND IMPOSTOR TESTS PER TASK

Task                                # speaker tests   # impostor tests
Closed-set speaker identification
Closed-set speaker verification
Open-set speaker identification
Open-set speaker verification
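A posterior global-threshold EER search of the kind described above (threshold swept over [-1:0.01:1], as in the paper) might look like the sketch below; names are ours, and the EER is reported as the mean of FRR and FAR at the closest crossing.

```python
import numpy as np

def equal_error_rate(speaker_scores, impostor_scores):
    # Sweep the threshold zeta over [-1, 1] in steps of 0.01 and report
    # FRR/FAR at the point where false rejection and false acceptance
    # rates are closest; this approximates the EER on the given scores.
    speaker_scores = np.asarray(speaker_scores)
    impostor_scores = np.asarray(impostor_scores)
    best = None
    for zeta in np.arange(-1.0, 1.01, 0.01):
        frr = (speaker_scores <= zeta).mean()   # true speakers rejected
        far = (impostor_scores > zeta).mean()   # impostors accepted
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2.0, zeta)
    return best[1], best[2]   # EER estimate and the threshold achieving it
```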
A. Closed-set speaker identification

TABLE 2
IR FOR CLOSED-SET SPEAKER IDENTIFICATION

M            RBF     EBF-Sample   EBF-Diag   EBF-Full
8 (4+4)      83 %    98 %         91 %       97 %
16 (8+8)     96 %    99 %         98 %       100 %
32 (16+16)   100 %   100 %        100 %      100 %

Fig. 4. IR versus number of centres for closed-set speaker identification

Table 2 and Fig. 4 show the experimental results of closed-set speaker identification for the different network types (RBF or EBF), learning algorithms (EBF-Sample, EBF-Diag, EBF-Full) and network sizes (M = 8, 16, 32). These results reveal that the EBF network trained by the EM algorithm with full covariance matrices (EBF-Full) outperformed the other networks (IR = 100 % with 16 centres). We also note that the identification rate (IR) increases with the number of centres for all networks.

B. Closed-set speaker verification

TABLE 3
EER FOR CLOSED-SET SPEAKER VERIFICATION

M            RBF      EBF-Sample   EBF-Diag   EBF-Full
8 (4+4)      7.70 %   1.35 %       3.11 %     1.39 %
16 (8+8)     2.50 %   0.68 %       1.91 %     0.81 %
32 (16+16)   1.95 %   0.33 %       1.12 %     0.21 %

Fig. 5. FRR versus FAR for closed-set speaker verification, M = 32 (16+16)

C. Open-set speaker identification

TABLE 4
EER FOR OPEN-SET SPEAKER IDENTIFICATION

M            RBF      EBF-Sample   EBF-Diag   EBF-Full
8 (4+4)      … %      … %          24.5 %     … %
16 (8+8)     … %      … %          … %        7.00 %
32 (16+16)   … %      7.00 %       … %        5.75 %

Fig. 6. FRR versus FAR for open-set speaker identification, M = 32 (16+16)
D. Open-set speaker verification

TABLE 5
EER FOR OPEN-SET SPEAKER VERIFICATION

M            RBF      EBF-Sample   EBF-Diag   EBF-Full
8 (4+4)      7.00 %   1.79 %       3.70 %     1.82 %
16 (8+8)     2.52 %   1.16 %       2.91 %     1.00 %
32 (16+16)   2.06 %   0.58 %       1.43 %     0.52 %

Fig. 7. FRR versus FAR for open-set speaker verification, M = 32 (16+16)

Tables 3, 4 and 5 and Figs. 5, 6 and 7 summarize the equal error rate (EER) for the different network types (RBF or EBF), learning algorithms (EBF-Sample, EBF-Diag, EBF-Full) and network sizes (M = 8, 16, 32). We can see from these results that, for all network sizes, the EBF networks trained by the EM algorithm with full covariance matrices (EBF-Full) attain a lower EER (5.75 % in open-set speaker identification, 0.21 % in closed-set speaker verification, and 0.52 % in open-set speaker verification, with 32 centres) than the EBF networks trained by the EM algorithm with diagonal covariance matrices (EBF-Diag), the EBF networks with sample covariance (EBF-Sample) and the RBF networks. A comparison of the error rates of the EBF-Diag and EBF-Full networks reveals that diagonal covariance matrices are less capable of modeling speaker characteristics than full covariance matrices. These results demonstrate the capability of the EM algorithm and the advantage of using full covariance matrices in the basis functions. We also note that the equal error rate (EER) decreases as the number of centres increases for all networks.

We also compared the time spent by the different networks in the training phase (learning of 20 networks) and the recognition phase, using 32 centres per network (see Table 6). The pre-processing time is not taken into account in Table 6; the pre-processing of a 3.48 s signal takes 1 s.

TABLE 6
TRAINING AND RECOGNITION TIME WITH M = 32 (16+16) PER NETWORK

                          RBF     EBF-Sample   EBF-Diag   EBF-Full
Training time (min)       8.06    …            …          …
Identification time (s)   0.33    1.15         …          1.15
Verification time (s)     0.06    0.11         0.06       0.11

The training time of the networks varies between 8.06 min and … min; the smallest training duration is that of the RBF networks, and the longest is that of the EBF networks trained by the EM algorithm with full covariance matrices (EBF-Full). The identification time ranges from 0.33 s for the RBF networks to 1.15 s for the EBF networks with sample covariance matrices (EBF-Sample) and the EBF networks trained by the EM algorithm with full covariance matrices (EBF-Full). With regard to the verification time, the RBF networks and the EBF networks trained by the EM algorithm with diagonal covariance matrices (EBF-Diag) take 0.06 s, while the EBF-Full and EBF-Sample networks take 0.11 s. The Matlab commands tic and toc were used to measure the execution times, on a 2 GHz microprocessor.

VII. CONCLUSION

This paper has evaluated the use of RBFNNs and EBFNNs for text-independent speaker recognition. The performance of the EBFNNs with full covariance matrices is better than that of the RBFNNs, the EBFNNs with diagonal covariance matrices and the EBFNNs with sample covariance matrices. The results confirm the claim by [6] that using the EM algorithm to estimate the parameters of elliptical basis function networks achieves the best performance, and illustrate that the full covariance matrices of the EBF networks are capable of providing a better representation of the feature vectors.

ACKNOWLEDGMENT

This work was conducted in the Electronics Research Laboratory of Skikda (LRES) and was supported in part by the Algerian Ministry of Higher Education and Scientific Research under the CNEPRU project code J…

REFERENCES

[1] D. O'Shaughnessy, "Speaker Recognition," IEEE ASSP Magazine, Vol. 3, No. 4, Part 1, pp. 4-17, October.
[2] J. P. Campbell, Jr., "Speaker Recognition: A Tutorial," Proceedings of the IEEE, Vol. 85, No. 9, September 1997.
[3] N. Balaska, "Reconnaissance du locuteur par les méthodes statistiques et connexionnistes: étude comparative" (Speaker recognition by statistical and connectionist methods: a comparative study), Mémoire de Magistère, Université 20 Août 1955 of Skikda, Feb.
[4] S. Young, G. Evermann, T. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, P. Woodland, The HTK Book (for HTK Version 3.2.1), Cambridge University Engineering Department.
[5] D. A. Reynolds, "Experimental Evaluation of Features for Robust Speaker Identification," IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 4, October.
[6] M. W. Mak, S. Y. Kung, "Estimation of Elliptical Basis Function Parameters by the EM Algorithm with Application to Speaker Verification," IEEE Transactions on Neural Networks, Vol. 11, No. 4, July.
[7] The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus.
[8] D. R. Hush, B. G. Horne, "Progress in Supervised Neural Networks," IEEE Signal Processing Magazine, Vol. 10, No. 1, pp. 8-39, January.
[9] B. M. Wilamowski, "Neural Network Architectures and Learning," International Conference on Industrial Technology, Vol. 1, pp. TU1-TU12, IEEE, December.
[10] S. E. Fredrickson, L. Tarassenko, "Text-Independent Speaker Recognition Using Neural Network Techniques," Fourth International Conference on Artificial Neural Networks, Conference Publication No. 409, June.
[11] M. W. Mak, W. G. Allen, G. G. Sexton, "Speaker Identification Using Radial Basis Functions," Third International Conference on Artificial Neural Networks, May 1993.
More informationKernel Based Text-Independnent Speaker Verification
12 Kernel Based Text-Independnent Speaker Verification Johnny Mariéthoz 1, Yves Grandvalet 1 and Samy Bengio 2 1 IDIAP Research Institute, Martigny, Switzerland 2 Google Inc., Mountain View, CA, USA The
More informationINTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY
[Gaurav, 2(1): Jan., 2013] ISSN: 2277-9655 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY Face Identification & Detection Using Eigenfaces Sachin.S.Gurav *1, K.R.Desai 2 *1
More informationIBM Research Report. Training Universal Background Models for Speaker Recognition
RC24953 (W1003-002) March 1, 2010 Other IBM Research Report Training Universal Bacground Models for Speaer Recognition Mohamed Kamal Omar, Jason Pelecanos IBM Research Division Thomas J. Watson Research
More informationApplication of Fully Recurrent (FRNN) and Radial Basis Function (RBFNN) Neural Networks for Simulating Solar Radiation
Bulletin of Environment, Pharmacology and Life Sciences Bull. Env. Pharmacol. Life Sci., Vol 3 () January 04: 3-39 04 Academy for Environment and Life Sciences, India Online ISSN 77-808 Journal s URL:http://www.bepls.com
More informationDeep Neural Networks (1) Hidden layers; Back-propagation
Deep Neural Networs (1) Hidden layers; Bac-propagation Steve Renals Machine Learning Practical MLP Lecture 3 4 October 2017 / 9 October 2017 MLP Lecture 3 Deep Neural Networs (1) 1 Recap: Softmax single
More informationQUADRATIC AND CONVEX MINIMAX CLASSIFICATION PROBLEMS
Journal of the Operations Research Societ of Japan 008, Vol. 51, No., 191-01 QUADRATIC AND CONVEX MINIMAX CLASSIFICATION PROBLEMS Tomonari Kitahara Shinji Mizuno Kazuhide Nakata Toko Institute of Technolog
More informationFace Recognition Using Eigenfaces
Face Recognition Using Eigenfaces Prof. V.P. Kshirsagar, M.R.Baviskar, M.E.Gaikwad, Dept. of CSE, Govt. Engineering College, Aurangabad (MS), India. vkshirsagar@gmail.com, madhumita_baviskar@yahoo.co.in,
More informationHidden Markov Model and Speech Recognition
1 Dec,2006 Outline Introduction 1 Introduction 2 3 4 5 Introduction What is Speech Recognition? Understanding what is being said Mapping speech data to textual information Speech Recognition is indeed
More informationDeep Neural Networks (1) Hidden layers; Back-propagation
Deep Neural Networs (1) Hidden layers; Bac-propagation Steve Renals Machine Learning Practical MLP Lecture 3 2 October 2018 http://www.inf.ed.ac.u/teaching/courses/mlp/ MLP Lecture 3 / 2 October 2018 Deep
More informationRole of Assembling Invariant Moments and SVM in Fingerprint Recognition
56 Role of Assembling Invariant Moments SVM in Fingerprint Recognition 1 Supriya Wable, 2 Chaitali Laulkar 1, 2 Department of Computer Engineering, University of Pune Sinhgad College of Engineering, Pune-411
More informationModeling the creaky excitation for parametric speech synthesis.
Modeling the creaky excitation for parametric speech synthesis. 1 Thomas Drugman, 2 John Kane, 2 Christer Gobl September 11th, 2012 Interspeech Portland, Oregon, USA 1 University of Mons, Belgium 2 Trinity
More informationFantope Regularization in Metric Learning
Fantope Regularization in Metric Learning CVPR 2014 Marc T. Law (LIP6, UPMC), Nicolas Thome (LIP6 - UPMC Sorbonne Universités), Matthieu Cord (LIP6 - UPMC Sorbonne Universités), Paris, France Introduction
More informationReformulating the HMM as a trajectory model by imposing explicit relationship between static and dynamic features
Reformulating the HMM as a trajectory model by imposing explicit relationship between static and dynamic features Heiga ZEN (Byung Ha CHUN) Nagoya Inst. of Tech., Japan Overview. Research backgrounds 2.
More informationAn artificial neural networks (ANNs) model is a functional abstraction of the
CHAPER 3 3. Introduction An artificial neural networs (ANNs) model is a functional abstraction of the biological neural structures of the central nervous system. hey are composed of many simple and highly
More informationSignal Modeling Techniques in Speech Recognition. Hassan A. Kingravi
Signal Modeling Techniques in Speech Recognition Hassan A. Kingravi Outline Introduction Spectral Shaping Spectral Analysis Parameter Transforms Statistical Modeling Discussion Conclusions 1: Introduction
More informationClassification of handwritten digits using supervised locally linear embedding algorithm and support vector machine
Classification of handwritten digits using supervised locally linear embedding algorithm and support vector machine Olga Kouropteva, Oleg Okun, Matti Pietikäinen Machine Vision Group, Infotech Oulu and
More informationUpper Bound Kullback-Leibler Divergence for Hidden Markov Models with Application as Discrimination Measure for Speech Recognition
Upper Bound Kullback-Leibler Divergence for Hidden Markov Models with Application as Discrimination Measure for Speech Recognition Jorge Silva and Shrikanth Narayanan Speech Analysis and Interpretation
More informationMulti-Layer Boosting for Pattern Recognition
Multi-Layer Boosting for Pattern Recognition François Fleuret IDIAP Research Institute, Centre du Parc, P.O. Box 592 1920 Martigny, Switzerland fleuret@idiap.ch Abstract We extend the standard boosting
More informationClassification and Pattern Recognition
Classification and Pattern Recognition Léon Bottou NEC Labs America COS 424 2/23/2010 The machine learning mix and match Goals Representation Capacity Control Operational Considerations Computational Considerations
More informationSpeaker recognition by means of Deep Belief Networks
Speaker recognition by means of Deep Belief Networks Vasileios Vasilakakis, Sandro Cumani, Pietro Laface, Politecnico di Torino, Italy {first.lastname}@polito.it 1. Abstract Most state of the art speaker
More informationChapter 9. Linear Predictive Analysis of Speech Signals 语音信号的线性预测分析
Chapter 9 Linear Predictive Analysis of Speech Signals 语音信号的线性预测分析 1 LPC Methods LPC methods are the most widely used in speech coding, speech synthesis, speech recognition, speaker recognition and verification
More informationApplication of hopfield network in improvement of fingerprint recognition process Mahmoud Alborzi 1, Abbas Toloie- Eshlaghy 1 and Dena Bazazian 2
5797 Available online at www.elixirjournal.org Computer Science and Engineering Elixir Comp. Sci. & Engg. 41 (211) 5797-582 Application hopfield network in improvement recognition process Mahmoud Alborzi
More informationShankar Shivappa University of California, San Diego April 26, CSE 254 Seminar in learning algorithms
Recognition of Visual Speech Elements Using Adaptively Boosted Hidden Markov Models. Say Wei Foo, Yong Lian, Liang Dong. IEEE Transactions on Circuits and Systems for Video Technology, May 2004. Shankar
More informationA Small Footprint i-vector Extractor
A Small Footprint i-vector Extractor Patrick Kenny Odyssey Speaker and Language Recognition Workshop June 25, 2012 1 / 25 Patrick Kenny A Small Footprint i-vector Extractor Outline Introduction Review
More informationComparing Robustness of Pairwise and Multiclass Neural-Network Systems for Face Recognition
Comparing Robustness of Pairwise and Multiclass Neural-Network Systems for Face Recognition J. Uglov, V. Schetinin, C. Maple Computing and Information System Department, University of Bedfordshire, Luton,
More informationarxiv: v1 [cs.sd] 25 Oct 2014
Choice of Mel Filter Bank in Computing MFCC of a Resampled Speech arxiv:1410.6903v1 [cs.sd] 25 Oct 2014 Laxmi Narayana M, Sunil Kumar Kopparapu TCS Innovation Lab - Mumbai, Tata Consultancy Services, Yantra
More informationGeneralized Cyclic Transformations in Speaker-Independent Speech Recognition
Generalized Cyclic Transformations in Speaker-Independent Speech Recognition Florian Müller 1, Eugene Belilovsky, and Alfred Mertins Institute for Signal Processing, University of Lübeck Ratzeburger Allee
More informationEUSIPCO
EUSIPCO 3 569736677 FULLY ISTRIBUTE SIGNAL ETECTION: APPLICATION TO COGNITIVE RAIO Franc Iutzeler Philippe Ciblat Telecom ParisTech, 46 rue Barrault 753 Paris, France email: firstnamelastname@telecom-paristechfr
More informationHierarchical Multi-Stream Posterior Based Speech Recognition System
Hierarchical Multi-Stream Posterior Based Speech Recognition System Hamed Ketabdar 1,2, Hervé Bourlard 1,2 and Samy Bengio 1 1 IDIAP Research Institute, Martigny, Switzerland 2 Ecole Polytechnique Fédérale
More informationAutomatic Speech Recognition (CS753)
Automatic Speech Recognition (CS753) Lecture 12: Acoustic Feature Extraction for ASR Instructor: Preethi Jyothi Feb 13, 2017 Speech Signal Analysis Generate discrete samples A frame Need to focus on short
More informationSPEECH RECOGNITION USING TIME DOMAIN FEATURES FROM PHASE SPACE RECONSTRUCTIONS
SPEECH RECOGNITION USING TIME DOMAIN FEATURES FROM PHASE SPACE RECONSTRUCTIONS by Jinjin Ye, B.S. A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL IN PARTIAL FULFILLMENT OF THE REQUIREMENTS for
More informationImproving the Multi-Stack Decoding Algorithm in a Segment-based Speech Recognizer
Improving the Multi-Stack Decoding Algorithm in a Segment-based Speech Recognizer Gábor Gosztolya, András Kocsor Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and University
More informationIterative Laplacian Score for Feature Selection
Iterative Laplacian Score for Feature Selection Linling Zhu, Linsong Miao, and Daoqiang Zhang College of Computer Science and echnology, Nanjing University of Aeronautics and Astronautics, Nanjing 2006,
More informationScore calibration for optimal biometric identification
Score calibration for optimal biometric identification (see also NIST IBPC 2010 online proceedings: http://biometrics.nist.gov/ibpc2010) AI/GI/CRV 2010, Ottawa Dmitry O. Gorodnichy Head of Video Surveillance
More information1 EM algorithm: updating the mixing proportions {π k } ik are the posterior probabilities at the qth iteration of EM.
Université du Sud Toulon - Var Master Informatique Probabilistic Learning and Data Analysis TD: Model-based clustering by Faicel CHAMROUKHI Solution The aim of this practical wor is to show how the Classification
More informationModifying Voice Activity Detection in Low SNR by correction factors
Modifying Voice Activity Detection in Low SNR by correction factors H. Farsi, M. A. Mozaffarian, H.Rahmani Department of Electrical Engineering University of Birjand P.O. Box: +98-9775-376 IRAN hfarsi@birjand.ac.ir
More informationNeural Networks and the Back-propagation Algorithm
Neural Networks and the Back-propagation Algorithm Francisco S. Melo In these notes, we provide a brief overview of the main concepts concerning neural networks and the back-propagation algorithm. We closely
More informationECE 661: Homework 10 Fall 2014
ECE 661: Homework 10 Fall 2014 This homework consists of the following two parts: (1) Face recognition with PCA and LDA for dimensionality reduction and the nearest-neighborhood rule for classification;
More informationGMM Vector Quantization on the Modeling of DHMM for Arabic Isolated Word Recognition System
GMM Vector Quantization on the Modeling of DHMM for Arabic Isolated Word Recognition System Snani Cherifa 1, Ramdani Messaoud 1, Zermi Narima 1, Bourouba Houcine 2 1 Laboratoire d Automatique et Signaux
More informationApplication of a GA/Bayesian Filter-Wrapper Feature Selection Method to Classification of Clinical Depression from Speech Data
Application of a GA/Bayesian Filter-Wrapper Feature Selection Method to Classification of Clinical Depression from Speech Data Juan Torres 1, Ashraf Saad 2, Elliot Moore 1 1 School of Electrical and Computer
More informationMultimodal Biometric Fusion Joint Typist (Keystroke) and Speaker Verification
Multimodal Biometric Fusion Joint Typist (Keystroke) and Speaker Verification Jugurta R. Montalvão Filho and Eduardo O. Freire Abstract Identity verification through fusion of features from keystroke dynamics
More informationISOLATED WORD RECOGNITION FOR ENGLISH LANGUAGE USING LPC,VQ AND HMM
ISOLATED WORD RECOGNITION FOR ENGLISH LANGUAGE USING LPC,VQ AND HMM Mayukh Bhaowal and Kunal Chawla (Students)Indian Institute of Information Technology, Allahabad, India Abstract: Key words: Speech recognition
More informationEngineering Part IIB: Module 4F11 Speech and Language Processing Lectures 4/5 : Speech Recognition Basics
Engineering Part IIB: Module 4F11 Speech and Language Processing Lectures 4/5 : Speech Recognition Basics Phil Woodland: pcw@eng.cam.ac.uk Lent 2013 Engineering Part IIB: Module 4F11 What is Speech Recognition?
More information