Speaker Identification and Verification Using Different Model for Text-Dependent

International Journal of Alied Engineering Research ISSN 0973-4562 Volume 12, Number 8 (2017). 1633-1638 Research India Publications. htt://www.riublication.com Seaker Identification and Verification Using Different Model for Text-Deendent 1 Shrikant Uadhyay, 2 Sudhir Kumar Sharma & 3 Aditi Uadhyay Deartment of Electronics & Communication Engineering, 1 Cambridge Institute of Technology, Ranchi, Jharkhand, India. 2, 3 Jaiur National University, Jaiur, Rajasthan, India. Abstract Human voice is one of the medium through which everybody communicate and share their exeriences, thought, feeling etc. and make it meaningful by understanding their meaning and actions. Issue is not that erson is seaking in front of about 10-20m and getting clear voice to the next erson sitting in front. The erson stays at far location and tries to access their various need alications like assword, PIN number, account access etc. with the hel of voice then it will be quite difficult to do so. Seaker is the medium through which we can transfer your voice in such occasion so, seaker identification and verification is very imortant. Text-deendent samle is quite easy to determine and verified by the erson sitting to the remote location as all information has rior saved to the database of unknown user and doesn t require any additional information. Different roosed model have been analyzed in terms of efficiency, loss and error using different feature extraction techniques in this aer. We ut your effort and try to find out that how the model behaves using different feature extraction combinations. INTRODUCTION Within the ast decade, technological growth advances remote collaborative data rocessing over huge comuter network, hands-free sound cature, telebanking have increased the demand for imroved method of information security. For ersonal information including credit history, bank details, medical records, the ability to verify the identity of individuals attemting to access such data is challenging and critical. To date, low-cost strategy such as magnetic cards, ersonal identification numbers (ins), asswords have been widely used. More advanced security measures have also been develoed (e.g., retinal scanners, face recognition, as well as automatic finger rint detectors). The use of these rocedures has been limited by both cost and ease of use. In recent years, seaker recognition (recognizing a erson from his/her voices by a machine) and verification algorithms have also received considerable attention. In articular, seech rovides a convenient and natural form of inut conveys a significant amount of seaker deendent information, and it is inexensive to collect and analyze. Voice is a henomenon that is highly deendent on the seaker who roduced it. Many hysical asects of seech such as the timber, tone or intensity vary a lot from a seaker to another. The same haens with other linguistic asects as the single intonation and range of vocabulary or exressions a seaker normally uses. All these roerties make voice a very owerful biometric key to be used in security systems since the hysical characteristics of seech are easy to measure and comare in comarison to other biometric keys. In addition, the seech signal is quite well-known and has been deely studied for many years so many owerful algorithms can be found to deal with this kind of signal. Text-deendent task involves some form of re-determined or romted assword, in order to obtain the required text. It can be used for alications such as voice mode assword or signature verification. In such alications, there is a need to change the assword frequently and it can be done easily by changing the re-determined text. In text-deendent seaker verification, during enrolment hase a limited number of utterances of the fixed text are collected. Therefore, aroaches based on temlate matching are used for attern comarison instead of aroaches based on statistics or artificial neural networks, which needs a large amount of training data. Research in seech rocessing and communication, for the most art, was motivated by eole s desire to build mechanicals models to emulate human verbal communication. Research interest in seech rocessing today has done well beyond the notion of mimicking human vocal aaratus [1]. Seech is the rimary communication medium between eole. This communication rocess has a comlex structure consisting not only of the transmission of voice but also includes transmission of many seaker secific information. Therefore, from the last five decades eole have come forward to investigate various asects of seech such as mechanical realization of seech signal, human machine interaction and seech & seaker recognition. Seech is a biometric that combines hysiological and behavioral characteristics. It is articularly useful for remote-access transactions over telecommunication networks. Presently this task is the most challenging one to researchers. Peole can reliably identify familiar voices and about 2-3 seconds is enough to identify a voice, although erformance decreases for unfamiliar voices [2]. Even if duration of the utterance was increased, but layed backward (which distorts timing and articulatory cues), the accuracy decreases drastically. Widely 1633

International Journal of Alied Engineering Research ISSN 0973-4562 Volume 12, Number 8 (2017). 1633-1638 Research India Publications. htt://www.riublication.com varying erformance on this background task suggested that cues to voice recognition vary from voice to voice, so that voice atterns may consist of a set of acoustic cues from which listeners select a subset to use in identifying individual voices. IDENTIFICATION & VERIFICATION Seaker recognition by a machine involves three stages. They are: (1) Extraction of features to reresent the seaker information resent in the seech signal. (2) Modelling of seaker features. (3) Decision logic to imlement the identification or verification task. The rimary task in a seaker recognition system is to extract features caable of reresenting the seaker information resent in the seech signal. It is known to us that human beings use high-level features such as style of seech, seech dialect and verbal mannerisms (for examle, a articular kind of a laugh or use of articular idioms and words) to recognize seakers. Intuitively, it is clear that these features constitute imortant seaker information. Difficulty arises due to limitations of the existing feature extraction techniques [3]. Current seaker recognition systems follow segmental features such as the shae of the vocal tract to reresent the seaker-secific information. These features show significant variations across seakers, but they also show considerable variations from time to time for a single seaker. In addition to this, the characteristics of the recording equiment and transmission channel are also reflected in these features [3]. Once a roer set of feature vector is obtained, then in the next hase task in seaker recognition is to develo a model (rototye) for each seaker. The develoment of seaker modelling is called the training hase. The block diagram of training hase is shown in Figure 1. Figure 2. Testing in seaker recognition system In seaker identification a seech utterance from an unknown seaker is analyzed and comared with models of all known seakers. The unknown seaker is identified as the seaker whose model best matches the inut utterance. Figure 3 shows the basic structure of seaker identification system. Seaker identification can be a closed set identification or an oen set identification. In a closed set identification, it is assumed that the test utterance belongs to one of N enrolled seakers (N decisions). In the case of oen set identification, there is an additional decision to be made to determine whether the test utterance was uttered by one of the N enrolled seakers or not, that is, there are N+1 decision levels. Figure 1. Training in seaker recognition system Feature vectors reresenting the voice characteristics of the seaker are extracted and used for building the reference models. The erformance of a seaker recognition system deends rimarily on the effectiveness of the models in caturing the seaker-secific information, and hence this hase lays a major role in determining the erformance of a seaker recognition system. The final stage in the develoment of a seaker recognition system is the decision logic stage, where a decision to either accet or reject the claim of a seaker is taken based on the result of matching techniques used. The block diagram of decision logic and testing hase rocess is shown in the Figure 2. Figure 3. Seaker verification system FEATURE EXTRACTION METHODS Feature extraction involved in signal modeling that erforms temoral and sectral analysis. The need of feature extraction arises because the raw seech signal contains information to convey message to the observer or receiver and has a high dimensionality. Feature extraction algorithm derives a 1634

International Journal of Alied Engineering Research ISSN 0973-4562 Volume 12, Number 8 (2017). 1633-1638 Research India Publications. htt://www.riublication.com characteristics feature vector with lower hysical or satial roerties. A. Mel-Scale Cestrum Co-efficient (MFCC) MFCC technique is basically used to generate the fingerrints of the audio files. Let us consider each frame consist of N samles and let its adjacent frames be searated by M samles where M is less than N. Hamming window is used in which each frame is multilied. Mathematically, Hamming window equation is given by: W (n) = 0.54 0.46 cos( 2πn N 1 ) (1) Now, Fourier Transform (FT) is used to convert the signal from time domain to its frequency domain. Mathematically, it is given by: N 1 X k = x i e 2πik N 1 i=0 (2) M = 2595log 10 (1+ f 700 ) (3) In the next ste log Mel scale sectrum is converted to time domain using Discrete Cosine Transform (DCT). Mathematically, DCT is defined as follow: N 1 X k = α i=0 x i (2i + 1/2N) (4) The result of the conversion is known as MFCC and the set of co-efficient is called acoustic vectors. B. Linear Predictive Coding Analysis (LPC) It is a frame based analysis of the seech signal erformed to rovide observation vectors [3]. The relation between seech samle S (n) and excitation X(n) for auto regressive model (system assume all ole mode) is exlained mathematically as: S (n) = k=1 a k s (n-k) + G. X (n) (5) The system function is defined as: H (z) = S(Z) X(Z) A linear redictor of order with rediction co-efficient (α k) is defined as a system whose outut is defined as: (6) s (n) = k=1 α k S (n-k) (7) The system function is th order olynomial and it follows: P (z) = α k z k (8) The rediction error e (n) is defined as: e (n) = s(n) - s (n) = s (n) - k=1 α k S (n-k) (9) The transfer function of rediction error sequence is: A (z) = 1 - k=1 α k z k (10) Now, by comaring equation (5) and (10), if α k = α k then A (z) will be inverse filter for the system H (z) of equation (6): H (z) = G/ A (z) (11) The urose is to find out set of redictor coefficients that will minimize the mean squared error over a short segment of seech waveform. So, short-time average rediction error is defined as [4]. E (n) = m (e n (m)) 2 k 1 } = m {s n (m) α k s n (m k) (12) where, s n (m) is segment of seech in surrounding of n samles i.e. s n (m) = s (n + m) Now, the value of α k minimize E n are obtained by taking E n / α i = 0 & i = 0, 1, 2... thus getting the equation: m s n (m i)s n (m) = k 1 α k s n (m i)s n (m k) (13) If n (i, k) = m s n (m i)s n (m k) (14) Thus, equation (13) rewritten as: k=1 α k φ k (i, k) =φ k (i, 0), for i= 1, 2, 3... (15) The three ways available to solve above equation i.e. autocorrelation method, lattice method and covariance method. In seech recognition the autocorrelation is widely used because of its comutational efficiency and inherent stability [4]. Seech segment is windowed in autocorrelation method as discuss below: S n = S (m + n) + w (m) for 0 m N-1 (16) Where, w (m) is finite window length Then, we have φ n (i, k) = N+ 1 m=0 s n (m i)s n (m k) for 1 i, 0 k (17) φ n (i, k) = R n (i k) (18) Where, R n (k) = N 1 k m=0 s n (m) s n (m + k) R k (k) is autocorrelation function then equation (15) is simlified as [5] k 1 α k R n ( i k ) = R i (i) for 1 i (19) Thus using Durbin s recursive rocedure the resulting equation is solved as: E (i) = (1- k i 2 ) E (i-1) (20) Then from equation (19) to (22) are solved recursively for i = 1, 2... and this give final equation as: 1635

International Journal of Alied Engineering Research ISSN 0973-4562 Volume 12, Number 8 (2017). 1633-1638 Research India Publications. htt://www.riublication.com () α j = LPC coefficient = α j k i = PACOR coefficients A very essential LPC arameter set which is derived directly from LPC coefficients is LPC ceestral coefficients C m. The recursion used for this discussed as [6]: C 0 = ln G (21) structures known as recognizer outut voting error rate. This method treats the outut of multile ASR systems as indeendent knowledge sources, and then combines these oututs to roduce a comosite outut with a lower word error rate than an individual ASR outut. It roceeds in two stages as shown in Figure 5. m 1 C m = α m + ( k k=1 ) C m k a m k for 1 m (22) m 1 C m = ( k k=1 ) C m k a m k for m > (23) C. Linear Prediction Ceestral Coefficients (LPCC) The basic arameters for estimating a seech signal, LPCC lay a very dominant role. This method is that where one seech samle at the current time can be redicated as a linear combination of ast seech sequence or samle. LPCC algorithm in term of block diagram is shown in Figure 4. below: Figure 5. Proosed model (ROVER based) The second roosed model is based on combined combination known as CNC as shown in Figure 6. below aroach to seech recognition, the decoder oututs the string of word hyotheses corresonding to the ath with highest osterior robability given the acoustic and a language model, and reresented by either N-best lists or word lattice. Figure 4. LPCC algorithm rocess RELATED WORK This aer focus on the Hindi database text-deendent samle soken from SHUNAY (0) to NAU (9). This database is reared using GoldWave software using version 5.55. This samle is tested using MATLAB tool for their efficiency and error rate. The roosed model is tested using various combinations of feature extraction techniques like, and MFCC+LPC. The considered arameters for analysis are mention in Table 1. Table 1: Parameters used for analysis Parameters Values Total number of seakers 24 Single channel Samling rate Language PROPOSED MODEL 16KHz 16 bit (For less quantization error) Hindi (Phonetics is rich) The roosed model designed based on the aims to achieve better erformance by exloiting differences in the nature of the errors occurred at the level of different modeling Figure 6. Proosed model (CNC based) These two models have been used to calculate the efficiency and error rate of the seech signal soken from SHUNYA (0) to NAU (9). And with the hel of this model we exactly know that which model erforms better also that which combinations of feature extraction techniques will better for both the model. Earlier no such models have been develo 1636

International Journal of Alied Engineering Research ISSN 0973-4562 Volume 12, Number 8 (2017). 1633-1638 Research India Publications. htt://www.riublication.com using such combination. So, it is quite useful for seaker recognition. RESULTS The simulation has been done using MATLAB tool also taking hel with COOL and GoldWave software to analyze the database. Effeciency (%) 102 100 98 96 94 92 90 88 86 84 82 Dialects Figure 7. Efficiency for ROVER model MFCC+LPC Error rate (%) 60 50 40 30 20 10 0 Calculations: Figure 10. Error rate for CNC model GMMs (Gaussian Mixture Models) are trained searately on each seaker s enrolment data using the Exectation Maximization (EM) algorithm [255]. The udate equations that guarantee a monotonic increase in the model s likelihood value are: Mixture weight: Dialects T MFCC+LPC i = 1 (i x T t=1 t,λ) (24) Error rate (%) 60 50 40 30 20 10 0 Dialects MFCC+LPCC Means: Variance: μ i = σ i Recognition Rate = 2 = T t 1 T t 1 T t 1 (i x t,λ) x t, (i x t,λ) (i x t,λ) 2 xt T t 1 (i x t,λ) Successfully detected words Number of words in test set (25) - μ 2 i (26) (27) Effeciency (%) 100 98 96 94 92 90 88 86 84 82 80 Figure 8. Error rate for ROVER model MFCC+LPCC Exeriments were erformed with different tyes of seech, that is, for slow seech (less than 12 hones er second), for normal seech (12 to 18 hones er second) and for fast seech (more than 18 hones er second). Maximum accuracy is achieved for normal seech as shown in Figure 11. Figure 9. Efficiency for CNC model Figure11. Seech rate versus accuracy 1637

International Journal of Alied Engineering Research ISSN 0973-4562 Volume 12, Number 8 (2017). 1633-1638 Research India Publications. htt://www.riublication.com CONCLUSION & FUTURE SCOPE Efficiency rate of MFCC+LPC for ROVER model is 98.22% comared to and i.e. 93.99% and 92.10. Error rate of MFCC+LPC is quite low i.e. 1.22% comared to and i.e. 3.53% and 4.42%. On the other hand efficiency rate of for CNC model is 98.99% comared to and MFCC+LPCC i.e. 94.62 and 96.23. Error rate of here is 1.29% comared to and MFCC+LPCC i.e. 2.54 and 3.39. So, with the hel of above calculation and analysis we can choose ROVER model with the feature extraction combination of MFCC+LPC for getting good and efficient seech recognition. For CNC based model we can have good and efficient combination of which give good efficiency and less error rate comared to other two combinations and may be useful for accurate seaker recognition. The combination may be used for different tyes of model taking the acoustic condition in future. ACKNOWLEDGMENT We are extremely thankful to the Prof. (Dr.) Sudhir Kumar Sharma (HOD, Deuty Director, Jaiur National University), and Prof. (Dr.) Pawan Kumar (HOD, Cambridge Institute of Technology) for their technical suort and guidance. AUTHORS PROFILE: Mr. Shrikant Uadhyay, Research Scholar at Jaiur National University, Jaiur. He received his M.Tech degree from Dehradun Institute of Technology (University) in 2011. His current research area includes seech rocessing, image rocessing, neural network and the security challenges for seaker identification and verification in signal rocessing domain. Dr. Sudhir Kumar Sharma is Professor & Head at the Deartment of Electronics & Communication Engineering, School of Engineering and Technology, Jaiur National University, Jaiur, Rajasthan, India. Professor Sharma received his Ph.D. in Electronics from Delhi University in 2000. Professor Sharma has an extensive teaching exerience of 19 years. He has been keenly carrying out research activities since last 20 years rominently in the field of Otical Communication Signal Processing. Mrs. Aditi Uadhyay, Research Scholar. She received M.Tech degree from Dehradun Institute of Technology (University) in 2012. Her current research area includes seech rocessing, image rocessing using different state of HMM. Image enhancement by using different acoustic model. REFERENCES [1] D. G. Childers, R. V. Cox, R. Demori, B. H. Juang, J. J. Mariani, P. Price, S. Sagayama, M. M. Sondhi and R. Weischedel, The ast, resent and future of seech rocessing, IEEE Processing Magazine,. 24-48, May 1998. [2] D. Lancker, J. Kreiman and K. Emmorey, Familiar voice recognition: Pattern and arametersrecognition of backward voices, Phonetics, vol. 13, no. 1,. 19-38, 1985. [3] Abdelnaiem, LPC and MFCC erformance evaluation with artificial neural network for soken language identification, International Journal of Signal Processing, Image Processing and Pattern Recognition, vol. 6,. 55-66, June 2013. [4] L.R. Rabiner and R.W. Schafer, Digital rocessing of seech signals. Englewood cliffs, New Jersey: Prentice-Hall, 1978. [5] L.R. Rabiner and B.H. Juang, Fundamental of seech recognition, Englewood cliffs, New Jersey: Prentice- Hall, 1978. [6] Han Y, Wang G and Y Yang, Seech emotion recognition based on mfcc, Journal of Chong Qing. University of Posts and Telecommunication (Natural Science Ediion) vol.69,. 34-39, 2008. 1638