Speaker Recognition Using Artificial Neural Networks: RBFNNs vs. EBFNNs

Speaker Recognition Using Artificial Neural Networks: RBFNNs vs. EBFNNs

BALASKA Nawel, Member of the Systems & Control Research Group within the LRES Lab., University 20 Août 55 of Skikda, BP: 26, Skikda, 21000, Algeria. E-mail: nabalaska@yahoo.fr

AHMIDA Zahir, Director of the Systems & Control Research Group within the LRES Lab., University 20 Août 55 of Skikda, BP: 26, Skikda, 21000, Algeria. E-mail: zahirahmida@yahoo.fr

GOUTAS Ahcène, Director of the Signal & Image Processing Research Group within the LRES Lab., University 20 Août 55 of Skikda, BP: 26, Skikda, 21000, Algeria. E-mail: a.goutas@yahoo.fr

Abstract- This paper deals with the application of Radial Basis Function Neural Networks (RBFNNs) and Elliptical Basis Function Neural Networks (EBFNNs) to text-independent speaker recognition experiments. These include both closed-set and open-set speaker identification and speaker verification. The database used is a subset of the TIMIT database consisting of 60 speakers from different dialect regions. LP-derived Cepstral Coefficients (LPCC) are used as the speaker-specific features. Simulation results show that EBFNNs outperform RBFNNs in several speaker recognition experiments.

Keywords- Speaker recognition, speaker identification, speaker verification, Radial Basis Function Neural Networks (RBFNNs), Elliptical Basis Function Neural Networks (EBFNNs).

I. INTRODUCTION

In general, speaker recognition systems fall into two main categories: speaker identification systems and speaker verification systems. In speaker identification, the goal is to identify an unknown voice from a set of known voices, whereas the objective of speaker verification is to verify whether an unknown voice matches the voice of a speaker whose identity is being claimed [1], [2], [3]. Speaker identification systems are mainly used in criminal investigation, while speaker verification systems are used in security access control. A generic speaker recognition system is shown in Fig. 1.

Fig. 1. Speaker recognition system: Speech Signal → Feature Extraction → Classification → Speaker Identification or Speaker Verification.
In Fig. 1, the desired features are first extracted from the speech signal [4], [5]. The extracted features are then used as inputs to a classifier, which makes the final decision regarding identification or verification.

Speaker identification systems can be closed-set or open-set. Closed-set speaker identification refers to the case where the speaker is known a priori to be a member of a set of speakers. Open-set speaker identification includes the additional possibility that the speaker is not a member of that set. Thresholding is often used to decide whether or not a speaker belongs to the set in open-set speaker identification and/or speaker verification.

Another distinguishing feature of speaker recognition systems is whether they are text-dependent or text-independent. Text-dependent speaker recognition systems require that the speaker utter a specific phrase or a given password. Text-independent speaker recognition systems identify the speaker regardless of the utterance [1]. This paper focuses on the text-independent speaker identification and speaker verification tasks.

The organization of this paper is as follows. Section II reviews the RBFNN and EBFNN for text-independent speaker recognition. Section III describes the database used in this paper and the speech analysis. Section IV reviews the training procedure, Section V describes the recognition procedure, Section VI discusses the conducted experiments and, finally, Section VII gives the conclusions.

II. RBF AND EBF NEURAL NETWORKS

Radial Basis Function Neural Networks and Elliptical Basis Function Neural Networks can be viewed as feed-forward neural networks with a single hidden layer. An RBF or EBF network with D inputs, M hidden units and K outputs is shown in Fig. 2. The output layer forms a linear combiner which calculates the weighted sum of the outputs of the hidden units [6].

Fig. 2. RBF or EBF neural network structure: input layer (D inputs), hidden layer (basis functions \Phi_1, \dots, \Phi_M), output layer (f_1, \dots, f_K).

The k-th output of an RBF or EBF neural network has the form:

f_k(Y) = w_{k0} + \sum_{j=1}^{M} w_{kj} \, \Phi_j(Y), \quad k = 1, \dots, K   (1)

where the w_{kj} are the network weights. For an RBF network, the activation function is:

\Phi_j(Y) = \exp\left( -\frac{\lVert Y - c_j \rVert^2}{2 \sigma_j^2} \right), \quad j = 1, \dots, M   (2)

where \lVert \cdot \rVert denotes the Euclidean distance. For an EBF network, on the other hand:

\Phi_j(Y) = \exp\left( -\frac{1}{2 \gamma_j} (Y - c_j)' \, \Sigma_j^{-1} (Y - c_j) \right), \quad j = 1, \dots, M   (3)

In (2) and (3), the \Phi_j are the activation functions, Y = \{ y_t, \ t = 1, \dots, T \} is the input sequence of length T and dimension D, the c_j are the function centres, the \sigma_j are the function widths, the \Sigma_j are the covariance matrices, and the \gamma_j are smoothing parameters controlling the spread of the basis functions [6].

III. DATABASE AND SPEECH ANALYSIS

The database for the experiments reported in this paper is a subset of the DARPA TIMIT database [7]. This set represents 60 speakers from the different dialect regions, comprising 44 males and 16 females. These speakers were divided into three equal subsets: speaker set, anti-speaker set and impostor set.

The pre-processing of the TIMIT speech data consists of several steps. First, the speech data is processed by the application of a pre-emphasis filter H(z) = 1 - 0.95 z^{-1}. A 30 ms Hamming window is applied to the speech every 10 ms. A 12th-order linear predictive (LP) analysis is performed for each speech frame. The features consist of the 12 cepstral coefficients (LPCC) derived from the LP coefficients [4].

There are ten utterances for each speaker in the selected set. Five of the utterances (SX) are concatenated and used for training. The remaining five sentences (SA, SI) are used individually for testing. The mean duration of the training data is 12.22 s per speaker, and the mean duration of each test utterance is 2.79 s.

IV. TRAINING PROCEDURE

Each speaker in the speaker set was assigned a personalized RBFNN or EBFNN modeling the characteristics of his or her own voice. Each network was trained to recognize the data derived from two classes, the speaker class and the anti-speaker class [6]. Therefore, each network was composed of 12 inputs, a varied number of hidden nodes (M), and two outputs, each output representing one class. Only the first output was used in closed-set speaker identification.

For each RBF neural network, the K-means algorithm [8] was applied to the corresponding speaker and all anti-speakers separately to obtain the function centres [6]. Next, the P-nearest-neighbor algorithm with P set to 2 was applied to the resulting function centres to determine the function widths [6]. For the EBF neural networks, the function centres and covariance matrices were determined either by the EM algorithm, with diagonal or full covariance matrices, or by the K-means algorithm with sample covariances [6]. Finally, the weights of the output layer of the RBFNNs and EBFNNs were obtained using the Least Mean Squares (LMS) algorithm [8], [9]. Target values during training were [+1, 0] for speaker frames and [0, +1] for anti-speaker frames.

V. RECOGNITION PROCEDURE

A. Closed-set speaker identification

The identification test was done by comparing the outputs of the RBF or EBF networks of all 20 speakers for a particular utterance, giving 100 speaker identification tests in total. The network with the highest first output was considered to belong to the true speaker [10], [11]. The structure of the speaker identification system with S speakers is shown in Fig. 3.
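As a concrete illustration of the training procedure of Section IV, the following Python sketch builds one RBF network. This is an assumption-laden sketch, not the authors' Matlab code: for brevity it runs K-means on the pooled training frames rather than separately on the speaker and anti-speaker data, takes each width as the RMS distance from a centre to its P nearest neighbouring centres (one common reading of the P-nearest-neighbour heuristic), and solves the output weights in closed form by least squares instead of iterative LMS.

```python
import numpy as np

def kmeans(X, M, iters=20, seed=0):
    """Plain K-means: returns M centres for the data X of shape (N, D)."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), M, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centres[None]) ** 2).sum(-1), axis=1)
        for j in range(M):
            if np.any(labels == j):  # guard against empty clusters
                centres[j] = X[labels == j].mean(axis=0)
    return centres

def rbf_widths(centres, P=2):
    """Width of each basis = RMS distance to its P nearest neighbouring centres."""
    d2 = ((centres[:, None] - centres[None]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # exclude the centre itself
    nearest = np.sort(d2, axis=1)[:, :P]
    return np.sqrt(nearest.mean(axis=1))

def rbf_design(X, centres, sigma):
    """Phi[n, j] = exp(-||x_n - c_j||^2 / (2 sigma_j^2)), plus a bias column."""
    d2 = ((X[:, None] - centres[None]) ** 2).sum(-1)
    Phi = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.hstack([np.ones((len(X), 1)), Phi])  # bias column gives w_k0

def train_rbf(X, targets, M, P=2):
    """Centres by K-means, widths by P-nearest neighbours, weights by least squares."""
    centres = kmeans(X, M)
    sigma = rbf_widths(centres, P)
    Phi = rbf_design(X, centres, sigma)
    W, *_ = np.linalg.lstsq(Phi, targets, rcond=None)  # closed-form stand-in for LMS
    return centres, sigma, W

def rbf_forward(X, centres, sigma, W):
    """Network outputs f_k(Y) of eq. (1) for each input frame."""
    return rbf_design(X, centres, sigma) @ W
```

With the paper's setup, `targets` would hold rows [+1, 0] for speaker frames and [0, +1] for anti-speaker frames, and one such network would be trained per speaker.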

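The elliptical basis functions of eq. (3) replace the Euclidean distance of eq. (2) with a Mahalanobis distance under a per-centre covariance matrix. A minimal numpy sketch (a hypothetical helper, not from the paper; the centres and covariances would come from the EM algorithm, or from K-means plus sample covariances as in EBF-Sample):

```python
import numpy as np

def ebf_activations(X, centres, covs, gamma=1.0):
    """Phi[n, j] = exp(-(x_n - c_j)' Sigma_j^{-1} (x_n - c_j) / (2 gamma)).

    X: (N, D) frames; centres: (M, D); covs: (M, D, D) full covariance
    matrices (use np.diag of per-dimension variances for the diagonal case);
    gamma: smoothing parameter controlling the spread of the basis functions.
    """
    Phi = np.empty((len(X), len(centres)))
    for j, (c, S) in enumerate(zip(centres, covs)):
        diff = X - c
        # Mahalanobis distance via a linear solve, avoiding an explicit inverse
        maha = np.einsum('nd,nd->n', diff, np.linalg.solve(S, diff.T).T)
        Phi[:, j] = np.exp(-maha / (2.0 * gamma))
    return Phi
```

With identity covariances and gamma = 1 this reduces exactly to the RBF activation with unit width, which is a convenient sanity check.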
Fig. 3. Structure of the speaker identification system with S speakers: the input Y is fed to the S RBF/EBF networks, the first outputs f_1^1(Y), \dots, f_1^S(Y) are accumulated, and the maximum is selected to identify the speaker.

B. Speaker verification

As the ratio of training vectors between the speaker class and the anti-speaker class is about 1 to 20 (for each network, training vectors were derived from the corresponding speaker and 20 anti-speakers), the network will favor the anti-speaker during verification, always giving outputs close to one for the anti-speaker class and close to zero for the speaker class. [6] proposed to solve this problem by scaling the outputs during verification so that the new average outputs are approximately equal to 0.5 for both classes. In other words,

\tilde{f}_k(Y) = \frac{f_k(Y)}{2 P(C_k)}, \quad k = 1, 2.

A way to estimate the prior probability P(C_k) is to divide the number of patterns in class C_k by the total number of patterns in the training set.

During verification, a sequence Y = \{ y_t, \ t = 1, \dots, T \} corresponding to an utterance spoken by an unknown speaker was fed into the network. Then the scaled average outputs corresponding to the speaker and the anti-speaker were computed [6]:

z_k = \frac{1}{T} \sum_{t=1}^{T} \frac{\exp f_k(y_t)}{\exp f_1(y_t) + \exp f_2(y_t)}, \quad k = 1, 2   (4)

where T is the number of patterns in the test sequence Y. Verification decisions were based on the criterion:

z = z_1 - z_2, \quad \begin{cases} z > \zeta : & \text{accept the claimant} \\ z \le \zeta : & \text{reject the claimant} \end{cases}   (5)

where \zeta \in [-1, +1] is a threshold controlling the false rejection rate and the false acceptance rate. The false rejection rate is the rate of falsely rejecting a true speaker, while the false acceptance rate measures the rate of incorrectly accepting impostors.

VI. EXPERIMENTAL RESULTS

Several speaker recognition experiments were performed to evaluate the RBFNN and EBFNN classifiers. These experiments include closed-set speaker identification and verification, and open-set speaker identification and verification.
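The scoring rule of eqs. (4) and (5) averages a frame-level softmax over the test utterance and thresholds the difference of the two averaged outputs. A small sketch, assuming the raw outputs are first divided by 2 P(C_k) as in the scaling step above:

```python
import numpy as np

def verification_score(F, priors=(0.5, 0.5)):
    """z = z1 - z2 of eqs. (4)-(5) for raw outputs F of shape (T, 2).

    Each row of F holds [f1(y_t), f2(y_t)]; outputs are first scaled by
    1 / (2 P(C_k)), then softmaxed per frame and averaged over the utterance.
    """
    F = np.asarray(F, dtype=float) / (2.0 * np.asarray(priors))
    e = np.exp(F - F.max(axis=1, keepdims=True))  # numerically stable softmax
    z = (e / e.sum(axis=1, keepdims=True)).mean(axis=0)
    return z[0] - z[1]

def verify(F, zeta=0.0, priors=(0.5, 0.5)):
    """True = accept the claimant, False = reject, per the threshold zeta."""
    return verification_score(F, priors) > zeta
```

Subtracting the per-frame maximum before exponentiating leaves the softmax unchanged while avoiding overflow for large raw outputs.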
The Identification Rate (IR) is used for evaluating the performance of the closed-set speaker identification systems. The Equal Error Rate (EER) is used for evaluating the performance of the open-set speaker identification and speaker verification systems. The EER is defined as the point at which the two errors, the False Rejection Rate (FRR) and the False Acceptance Rate (FAR), are equal.

Note that all operating curves (FRR versus FAR; ROC: Receiver Operating Characteristics) presented in this section for speaker verification and open-set speaker identification represent the posterior performance of the classifiers, given the speaker and impostor scores. The EER is obtained by adjusting the threshold during verification to equalize the FAR and FRR. Though this adjustment is impractical in real systems, the EER indicates the potential of the networks. The threshold used is a posterior global threshold varying over [-1 : 0.01 : 1].

Each network is composed of M centres contributed by the corresponding speaker and the anti-speakers (M = speaker centres + anti-speaker centres). EBF-Diag and EBF-Full denote the experiments in which the parameters were obtained by the EM algorithm with diagonal and full covariance matrices, respectively. EBF-Sample denotes the case where the K-means algorithm and sample covariances were used to estimate the function centres and covariance matrices of the EBF networks.

Table 1 shows the number of speaker and impostor tests performed in each task.

TABLE I
NUMBER OF SPEAKER AND IMPOSTOR TESTS PER TASK

Task                              | # speaker tests | # impostor tests
Closed-set speaker identification | 100             | 0
Closed-set speaker verification   | 100             | 1900
Open-set speaker identification   | 100             | 100
Open-set speaker verification     | 100             | 2000
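The posterior EER described above can be computed by sweeping the global threshold over [-1 : 0.01 : 1] and picking the grid point where FRR and FAR are closest; reporting their average at that point is an assumption for the case where the two curves do not cross exactly on the grid.

```python
import numpy as np

def equal_error_rate(speaker_scores, impostor_scores):
    """Return (EER, threshold) from true-speaker and impostor scores.

    Uses the accept-if-score-above-threshold rule of eq. (5):
    FRR = fraction of speaker scores at or below the threshold,
    FAR = fraction of impostor scores above it.
    """
    s = np.asarray(speaker_scores, dtype=float)
    i = np.asarray(impostor_scores, dtype=float)
    grid = np.arange(-1.0, 1.0 + 1e-9, 0.01)          # [-1 : 0.01 : 1]
    frr = np.array([(s <= z).mean() for z in grid])   # false rejections
    far = np.array([(i > z).mean() for z in grid])    # false acceptances
    k = np.argmin(np.abs(frr - far))                  # closest FRR/FAR crossing
    return (frr[k] + far[k]) / 2.0, grid[k]
```

With well-separated score distributions the EER is zero and any threshold between the two populations is returned.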

A. Closed-set speaker identification

TABLE 2
IR FOR CLOSED-SET SPEAKER IDENTIFICATION

M          | RBF   | EBF-Sample | EBF-Diag | EBF-Full
8 (4+4)    | 83 %  | 98 %       | 91 %     | 97 %
16 (8+8)   | 96 %  | 99 %       | 98 %     | 100 %
32 (16+16) | 100 % | 100 %      | 100 %    | 100 %

Fig. 4. IR versus number of centres for closed-set speaker identification

Table 2 and Fig. 4 show the experimental results of closed-set speaker identification for the different network types (RBF or EBF), learning algorithms (EBF-Sample, EBF-Diag, EBF-Full) and network sizes (M = 8, 16, 32). These results reveal that the EBF network trained by the EM algorithm with full covariance matrices (EBF-Full) outperformed the other networks (IR = 100 % with 16 centres). We also note that the identification rate (IR) increases with the number of centres for all networks.

B. Closed-set speaker verification

TABLE 3
EER FOR CLOSED-SET SPEAKER VERIFICATION

M          | RBF    | EBF-Sample | EBF-Diag | EBF-Full
8 (4+4)    | 7.70 % | 1.35 %     | 3.11 %   | 1.39 %
16 (8+8)   | 2.50 % | 0.68 %     | 1.91 %   | 0.81 %
32 (16+16) | 1.95 % | 0.33 %     | 1.12 %   | 0.21 %

Fig. 5. FRR versus FAR for closed-set speaker verification, M = 32 (16+16)

C. Open-set speaker identification

TABLE 4
EER FOR OPEN-SET SPEAKER IDENTIFICATION

M          | RBF     | EBF-Sample | EBF-Diag | EBF-Full
8 (4+4)    | 26.00 % | 18.55 %    | 24.5 %   | 14.00 %
16 (8+8)   | 16.50 % | 10.00 %    | 18.14 %  | 7.00 %
32 (16+16) | 11.00 % | 7.00 %     | 10.00 %  | 5.75 %

Fig. 6. FRR versus FAR for open-set speaker identification, M = 32 (16+16)

D. Open-set speaker verification

TABLE 5
EER FOR OPEN-SET SPEAKER VERIFICATION

M          | RBF    | EBF-Sample | EBF-Diag | EBF-Full
8 (4+4)    | 7.00 % | 1.79 %     | 3.70 %   | 1.82 %
16 (8+8)   | 2.52 % | 1.16 %     | 2.91 %   | 1.00 %
32 (16+16) | 2.06 % | 0.58 %     | 1.43 %   | 0.52 %

Tables 3, 4 and 5 and figures 5, 6 and 7 summarize the equal error rate (EER) for the different network types (RBF or EBF), learning algorithms (EBF-Sample, EBF-Diag, EBF-Full) and network sizes (M = 8, 16, 32). For all network sizes, the EBF networks trained by the EM algorithm with full covariance matrices (EBF-Full) attain a lower EER (5.75 % in open-set speaker identification, 0.21 % in closed-set speaker verification, and 0.52 % in open-set speaker verification, with 32 centres) than the EBF networks trained by the EM algorithm with diagonal covariance matrices (EBF-Diag), the EBF networks with sample covariances (EBF-Sample) and the RBF networks. A comparison of the error rates of the EBF-Diag and EBF-Full networks reveals that diagonal covariance matrices are less capable of modeling speaker characteristics than full covariance matrices. These results demonstrate the capability of the EM algorithm and the advantage of using full covariance matrices in the basis functions. We also note that the EER decreases as the number of centres increases, for all networks.

We also compared the time spent by the different networks in the training phase (learning of the 20 networks) and the recognition phase, using 32 centres per network (see Table 6). The pre-processing time is not taken into account in Table 6; as an indication, the pre-processing of a 3.48 s signal takes 1 s.
TABLE 6
TRAINING AND RECOGNITION TIME WITH M = 32 (16+16) PER NETWORK

                        | RBF  | EBF-Sample | EBF-Diag | EBF-Full
Training time (min)     | 8.06 | 10.00      | 13.16    | 15.57
Identification time (s) | 0.33 | 1.15       | 1.10     | 1.15
Verification time (s)   | 0.06 | 0.11       | 0.06     | 0.11

Fig. 7. FRR versus FAR for open-set speaker verification, M = 32 (16+16)

The training time of the networks varies between 8.06 min and 15.57 min: the shortest training time is that of the RBF networks, and the longest is that of the EBF networks trained by the EM algorithm with full covariance matrices (EBF-Full). The identification time ranges from 0.33 s for the RBF networks to 1.15 s for the EBF networks with sample covariance matrices (EBF-Sample) and the EBF networks trained by the EM algorithm with full covariance matrices (EBF-Full). With regard to the verification time, the RBF networks and the EBF networks trained by the EM algorithm with diagonal covariance matrices (EBF-Diag) take 0.06 s, while the EBF-Full and EBF-Sample networks take 0.11 s. The execution times were measured with the Matlab commands tic and toc on a 2 GHz microprocessor.

VII. CONCLUSION

This paper has evaluated the use of RBFNNs and EBFNNs for text-independent speaker recognition. The performance of the EBFNNs with full covariance matrices is better than that of the RBFNNs, the EBFNNs with diagonal covariance matrices and the EBFNNs with sample covariance matrices. The results confirm the claim of [6] that using the EM algorithm to estimate the parameters of elliptical basis function networks achieves the best performance, and illustrate that the full covariance matrices of the EBF networks are capable of providing a better representation of the feature vectors.

ACKNOWLEDGMENT

This work was conducted in the Electronics Research Laboratory of Skikda (LRES) and supported in part by the Algerian Ministry of Higher Education and Scientific Research under the CNEPRU project code J0201620060017.

REFERENCES

[1] D. O'Shaughnessy, Speaker Recognition, IEEE ASSP Magazine, Vol. 3, No. 4, Part 1, pp. 4-17, October 1986.
[2] J. P. Campbell, Jr., Speaker Recognition: A Tutorial, Proceedings of the IEEE, Vol. 85, No. 9, September 1997.

[3] N. Balaska, Reconnaissance du locuteur par les méthodes statistiques et connexionnistes: étude comparative, Mémoire de Magistère, Université 20 Août 55 of Skikda, Feb. 2006.
[4] S. Young, G. Evermann, T. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, P. Woodland, The HTK Book (for HTK Version 3.2.1), Cambridge University Engineering Department, 2001-2002.
[5] D. A. Reynolds, Experimental Evaluation of Features for Robust Speaker Identification, IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 4, pp. 639-643, October 1994.
[6] M. W. Mak, S. Y. Kung, Estimation of Elliptical Basis Function Parameters by the EM Algorithm with Application to Speaker Verification, IEEE Transactions on Neural Networks, Vol. 11, No. 4, July 2000.
[7] http://wave.ldc.upenn.edu, The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus.
[8] D. R. Hush, B. G. Horne, Progress in Supervised Neural Networks, IEEE Signal Processing Magazine, Vol. 10, No. 1, pp. 8-39, January 1993.
[9] B. M. Wilamowski, Neural Network Architectures and Learning, International Conference on Industrial Technology, Vol. 1, pp. TU1-TU12, IEEE, 10-12 December 2003.
[10] S. E. Fredrickson, L. Tarassenko, Text-Independent Speaker Recognition Using Neural Network Techniques, Fourth International Conference on Artificial Neural Networks, Conference Publication No. 409, pp. 13-18, 26-28 June 1995.
[11] M. W. Mak, W. G. Allen, G. G. Sexton, Speaker Identification Using Radial Basis Functions, Third International Conference on Artificial Neural Networks, pp. 138-142, 25-27 May 1993.