ISOLATED WORD RECOGNITION FOR ENGLISH LANGUAGE USING LPC, VQ AND HMM


Mayukh Bhaowal and Kunal Chawla (Students), Indian Institute of Information Technology, Allahabad, India

Abstract: Speech recognition has always been regarded as a fascinating field in human-computer interaction; it is one of the fundamental steps towards understanding human cognition and behaviour. This paper explicates the theory and implementation of ASR, a speaker-dependent real-time isolated word recognizer. The approach was first to obtain feature vectors using LPC, followed by vector quantization; the quantized vectors were then recognized by a suitable modelling technique, namely HMM. In the recognition phase the Baum-Welch algorithm was used. However, it was soon realized that unless some normalization or scaling was carried out, the results were highly inaccurate. The paper proposes certain significant modifications to the existing scaling algorithms, arrived at after extensive experimentation. The results suggest that optimal scaling computations significantly improve recognition, and the proposed modified algorithm offers new insight into speech recognition techniques.

Key words: Speech Recognition, LPC, VQ, HMM, Language Processing

1. INTRODUCTION

Contemporary ASR systems are composed of a feature preprocessing stage, which aims at extracting the linguistic message while suppressing non-linguistic sources of variability, and a classification stage (including language modelling), which identifies the feature vectors with linguistic classes. The ultimate goal is to estimate the sufficient statistics to discriminate among different phonetic units while minimizing the computational demands of the classifier.

Steps in ASR:
1. Convert audio/wave files to sequences of multi-dimensional feature vectors (e.g. DFT, PLP, LPC).
2. Quantize the feature vectors into sequences of symbols (e.g. VQ).
3. Train a model for each recognition object (i.e. word, phoneme) from the sequences of symbols (e.g. HMM).
4. Constrain the models using grammar information.

Paper organization: The Implementation section gives the details of the implementation and the theory used to achieve the results. The Results and Conclusion sections discuss the outcome of the tests conducted.

2. IMPLEMENTATION

2.1 Noise Elimination and Word Boundary Detection

We have to isolate the word utterance from the leading and trailing noise. This was done using an energy-threshold comparison method: whenever the energy in a frame of speech exceeds a certain threshold, that point is marked as the start of speech.

2.2 Pre-emphasis

The digitized (sampled) speech signal s(n) is put through a low-order digital system to spectrally flatten the signal. The first-order filter used had the transfer function

H(z) = 1 - a z^{-1},  where a = 0.9375.

2.3 Frame Blocking

In this step the pre-emphasised speech signal is blocked into frames of N samples, with adjacent frames separated by M samples. Frame blocking is done to reduce the mean-squared prediction error over a short segment of the speech waveform.

2.4 Windowing

Each frame of pre-emphasized speech is then multiplied by a Hamming window. Windows of length 256 samples were used; to obtain a smooth estimate, more windows are needed, so an overlap of 156 samples was also incorporated. The Hamming window used was [1]

w(n) = 0.54 - 0.46 \cos\big( 2\pi n / (N - 1) \big),  0 <= n <= N - 1.
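To make the front end concrete, the following sketch (illustrative only, not the authors' implementation) applies pre-emphasis, frame blocking and Hamming windowing in Python with the parameter values quoted above (a = 0.9375, frame length 256, overlap 156 samples, i.e. a shift of 100 samples):

# Illustrative front-end sketch: pre-emphasis, frame blocking and windowing.
import numpy as np

def preemphasize(signal: np.ndarray, a: float = 0.9375) -> np.ndarray:
    """First-order pre-emphasis filter H(z) = 1 - a z^-1."""
    return np.append(signal[0], signal[1:] - a * signal[:-1])

def frame_and_window(signal: np.ndarray, frame_len: int = 256,
                     overlap: int = 156) -> np.ndarray:
    """Block the signal into overlapping frames and apply a Hamming window."""
    shift = frame_len - overlap                      # 100 samples per step
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    window = np.hamming(frame_len)                   # 0.54 - 0.46 cos(2*pi*n/(N-1))
    frames = np.empty((n_frames, frame_len))
    for t in range(n_frames):
        frames[t] = signal[t * shift : t * shift + frame_len] * window
    return frames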

2.5 Autocorrelation Analysis

Each frame of the windowed signal is then autocorrelated; the highest autocorrelation lag, p, is the order of the LPC analysis.

2.6 LPC Analysis

The next processing step is the LPC analysis, which converts each frame of p+1 autocorrelations [1] into an LPC parameter set consisting of the LPC coefficients. The formal method of converting from autocorrelation coefficients to an LPC parameter set is known as Durbin's method.

2.7 Cepstral Coefficient Extraction

Cepstral coefficients are the coefficients of the Fourier-transform representation of the log-magnitude spectrum. They are more robust and reliable than the LPC coefficients, and they can be estimated directly from the LPC coefficients.

2.8 Parameter Weighting

Low-order cepstral coefficients are sensitive to the overall spectral slope, and high-order cepstral coefficients are sensitive to noise. It has therefore become a standard technique to weight the cepstral coefficients by a tapered window so as to minimize these sensitivities:

\hat{c}_m = w(m) \, c_m

Figure 1. Noise/Speech Detection
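A possible implementation of this LPC-cepstral analysis is sketched below (illustrative only, not the authors' code); it computes the frame autocorrelation, runs Durbin's recursion, converts the LPC set to cepstral coefficients, and applies a tapered weighting window. The raised-sine lifter w(m) = 1 + (Q/2) sin(pi m / Q) from [1] is assumed here, since the paper does not give w(m) explicitly.

# Illustrative LPC-cepstrum sketch following Rabiner & Juang [1].
import numpy as np

def lpc_durbin(frame: np.ndarray, p: int) -> np.ndarray:
    """LPC coefficients a_1..a_p via Durbin's (Levinson-Durbin) recursion."""
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(p + 1)])
    a = np.zeros(p + 1)
    e = r[0]
    for i in range(1, p + 1):
        k = (r[i] - np.dot(a[1:i], r[i-1:0:-1])) / e   # reflection coefficient
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i-1:0:-1]
        a = a_new
        e *= (1.0 - k * k)                             # prediction error update
    return a[1:]

def lpc_to_cepstrum(a: np.ndarray, n_ceps: int) -> np.ndarray:
    """Cepstral coefficients c_1..c_Q computed recursively from the LPC set."""
    p = len(a)
    c = np.zeros(n_ceps + 1)
    for m in range(1, n_ceps + 1):
        c[m] = (a[m-1] if m <= p else 0.0) + sum(
            (k / m) * c[k] * (a[m-k-1] if m - k <= p else 0.0) for k in range(1, m))
    return c[1:]

def weight_cepstrum(c: np.ndarray) -> np.ndarray:
    """Assumed tapered window w(m) = 1 + (Q/2) sin(pi*m/Q) applied to c_m."""
    q = len(c)
    m = np.arange(1, q + 1)
    return (1.0 + (q / 2.0) * np.sin(np.pi * m / q)) * c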

Figure 2. Original Signal, Filtered Signal and Pre-emphasized Signal

Figure 3. Frame-Blocked Signal, Windowed Signal and Autocorrelated Signal

3. VECTOR QUANTIZATION

The result of the feature extraction is a series of vectors characteristic of the time-varying spectral properties of the speech signal. These vectors are 24-dimensional and continuous. We can map them to discrete symbols by quantizing them; since we are quantizing vectors, this is vector quantization. VQ is potentially an extremely efficient representation of the spectral information in the speech signal.

3.1 Clustering Algorithms

Assume that we have a set of L training vectors and we need a codebook of size M. A procedure that performs the clustering is the K-means algorithm [3]:

1. Initialization: Arbitrarily choose M vectors (these can be chosen from the training set) as the initial set of code words in the codebook.
2. Nearest-Neighbour Search: For each training vector, find the code word in the current codebook that is closest in terms of spectral distance and assign the vector to the corresponding cell.
3. Centroid Update: Update the code word in each cell using the centroid of the training vectors assigned to that cell.

4. Iteration: Repeat steps 2 and 3 until the average distance (distortion) falls below a preset threshold.

The disadvantage of this method is that we need a very good initial estimate of the codebook vectors. It may happen that the random initial selection is clustered in one area of the vector space; if so, the final codebook will not be global, which can be a serious problem. An alternative procedure is the binary split algorithm.

3.2 The Binary Split Algorithm [3]

1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors.
2. Double the size of the codebook by splitting each current code word y_n according to the rule

   y_n^+ = y_n (1 + e)
   y_n^- = y_n (1 - e)

   where n varies from 1 to the current size of the codebook and e is a splitting parameter chosen in the range 0.001 <= e <= 0.05. We chose e = 0.001.
3. Use the K-means iterative algorithm to get the best set of centroids for the split codebook.
4. Iterate steps 2 and 3 until the required codebook size is reached.

Figure 4. Partitioned Vector Space
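The codebook design just described can be sketched as follows (illustrative only, not the authors' code); the K-means refinement here runs for a fixed number of passes rather than testing the distortion threshold, and e = 0.001 as chosen in the paper.

# Illustrative binary-split (LBG) codebook design [3] with K-means refinement.
import numpy as np

def kmeans_refine(train: np.ndarray, codebook: np.ndarray,
                  n_iter: int = 20) -> np.ndarray:
    """Nearest-neighbour search + centroid update, repeated n_iter times."""
    for _ in range(n_iter):
        # Assign each training vector to its closest code word (Euclidean).
        d = np.linalg.norm(train[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each code word to the centroid of its cell (if non-empty).
        for m in range(len(codebook)):
            cell = train[labels == m]
            if len(cell):
                codebook[m] = cell.mean(axis=0)
    return codebook

def binary_split_codebook(train: np.ndarray, size: int,
                          eps: float = 0.001) -> np.ndarray:
    """Grow a codebook from 1 vector to `size` vectors by repeated splitting."""
    codebook = train.mean(axis=0, keepdims=True)     # 1-vector codebook
    while len(codebook) < size:
        codebook = np.vstack([codebook * (1 + eps),  # y+ = y(1 + e)
                              codebook * (1 - eps)]) # y- = y(1 - e)
        codebook = kmeans_refine(train, codebook)
    return codebook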

4. HMM (HIDDEN MARKOV MODEL)

A hidden Markov model is defined as a pair of stochastic processes (X, Y). The X process is a first-order Markov chain and is not directly observable, while the Y process is a sequence of random variables taking values in the space of acoustic parameters, or observations. Two formal assumptions characterize HMMs as used in speech recognition. The first-order Markov hypothesis states that history has no influence on the chain's future evolution if the present is specified, and the output independence hypothesis states that neither chain evolution nor past observations influence the present observation if the last chain transition is specified.

Letting y ∈ Υ be a variable representing observations and i, j ∈ Χ be variables representing model states, the model can be represented by the following parameters:

A ≡ {a_{i,j} | i, j ∈ Χ}   transition probabilities
B ≡ {b_{i,j}(·) | i, j ∈ Χ}   output distributions
Π ≡ {π_i | i ∈ Χ}   initial probabilities

with the following definitions:

a_{i,j} \equiv p(X_t = j \mid X_{t-1} = i)
b_{i,j}(y) \equiv p(Y_t = y \mid X_{t-1} = i, X_t = j)
\pi_i \equiv p(X_0 = i)

In matrix form:

A  - transition probability matrix (N x N)
B  - observation symbol probability distribution matrix (N x M)
PI - initial state distribution vector (N x 1)

where N is the number of states in the HMM and M is the number of observation symbols.

ASR used a feed-forward (or Bakis) HMM topology for recognition. The way the general HMM is obtained from the models of individual utterances can be critical to the recognition level and to the speed of learning of the model. A hidden Markov model is a finite state machine having a fixed number of states [1].

Figure 5. Finite State Machine for HMM
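As an illustration of these parameters, the sketch below (not the authors' code) builds a discrete HMM parameter set lambda = (A, B, PI); the left-to-right (Bakis) initialization, with equal probability of staying in a state or moving one state ahead, is an assumption for the example.

# Illustrative discrete HMM parameter container with Bakis initialization.
import numpy as np

class DiscreteHMM:
    def __init__(self, n_states: int, n_symbols: int):
        N, M = n_states, n_symbols
        # Left-to-right topology: each state may stay or move one state ahead.
        self.A = np.zeros((N, N))
        for i in range(N):
            if i + 1 < N:
                self.A[i, i] = 0.5
                self.A[i, i + 1] = 0.5
            else:
                self.A[i, i] = 1.0                 # final state is absorbing
        # Uniform output distributions; the chain starts in state 0.
        self.B = np.full((N, M), 1.0 / M)
        self.pi = np.zeros(N)
        self.pi[0] = 1.0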

4.1 HMMs and Speech Recognition

In order to apply HMMs to speech recognition, we need to address three problems [1][2].

4.1.1 PROBLEM 1

Given the observation sequence O = {o_1, o_2, o_3, ..., o_T} and the model \lambda = (A, B, \pi), how do we efficiently compute P(O \mid \lambda), the probability of the observation sequence given the model? This problem is solved by calculating the forward and backward variables.

The forward procedure [1]:

1. Initialization: \alpha_1(i) = \pi_i b_i(o_1), \quad 1 \le i \le N
2. Recursion: \alpha_{t+1}(j) = \Big[ \sum_{i=1}^{N} \alpha_t(i) a_{ij} \Big] b_j(o_{t+1}), \quad 1 \le t \le T-1
3. Termination: P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)

The backward procedure [1]:

1. Initialization: \beta_T(i) = 1, \quad 1 \le i \le N
2. Recursion: \beta_t(i) = \sum_{j=1}^{N} a_{ij} b_j(o_{t+1}) \beta_{t+1}(j), \quad t = T-1, \dots, 1
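The two procedures translate directly into code; the sketch below (illustrative only, not the authors' implementation) gives the unscaled recursions for a discrete HMM, where A is N x N, B is N x M, pi has length N, and obs is a sequence of VQ symbol indices.

# Illustrative forward and backward procedures for a discrete HMM.
import numpy as np

def forward(A, B, pi, obs):
    """Return alpha (T x N) and P(O | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                       # initialization
    for t in range(1, T):                              # recursion
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                      # termination

def backward(A, B, obs):
    """Return beta (T x N)."""
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))                             # initialization
    for t in range(T - 2, -1, -1):                     # recursion
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])
    return beta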

4.1.2 PROBLEM 2

This is the problem of finding the optimal state sequence associated with a given observation sequence: we want to find the most likely state sequence for a given sequence of observations O = o_1, o_2, ..., o_T and a model \lambda = (A, B, \pi). The solution depends on how "most likely state sequence" is defined. One approach is to find the individually most likely state q_t at each time t and to concatenate all such q_t's, but this sometimes does not give a physically meaningful state sequence. We therefore use another method, which has no such problem: the Viterbi algorithm, in which the whole state sequence with the maximum likelihood is found. The Viterbi algorithm [1][2] finds the single best sequence q for the given observation sequence O:

1. Initialization: \delta_1(i) = \pi_i b_i(o_1), \quad \psi_1(i) = 0
2. Recursion: \delta_t(j) = \max_i \big[ \delta_{t-1}(i) a_{ij} \big] b_j(o_t), \quad \psi_t(j) = \arg\max_i \big[ \delta_{t-1}(i) a_{ij} \big]
3. Termination: P^* = \max_i \delta_T(i), \quad q_T^* = \arg\max_i \delta_T(i)
4. Backtracking: q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, \dots, 1

4.1.3 PROBLEM 3

This is the problem of parameter estimation, by far the toughest problem of HMMs. There is no way to analytically solve in closed form for the model parameter set that maximizes the probability of the observation sequence. The Baum-Welch algorithm is the solution to this problem [1]. In this algorithm the parameters are re-estimated as

\bar{\pi}_i = \gamma_1(i), \qquad
\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad
\bar{b}_j(k) = \frac{\sum_{t:\, o_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}

where \xi_t(i,j) = \alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j) / P(O \mid \lambda) and \gamma_t(i) = \alpha_t(i) \beta_t(i) / P(O \mid \lambda).
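The Viterbi recursion can be sketched as follows (illustrative only, not the authors' code); it is worked in the log domain, which also sidesteps the underflow problem addressed by scaling in the next section.

# Illustrative Viterbi decoding for a discrete HMM (log domain).
import numpy as np

def viterbi(A, B, pi, obs):
    """Return the single best state sequence for the observation sequence."""
    T, N = len(obs), len(pi)
    logA, logB = np.log(A + 1e-300), np.log(B + 1e-300)
    delta = np.log(pi + 1e-300) + logB[:, obs[0]]      # initialization
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):                              # recursion
        scores = delta[:, None] + logA                 # scores[i, j]
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta.argmax())]                       # termination
    for t in range(T - 1, 0, -1):                      # backtracking
        path.append(int(psi[t][path[-1]]))
    return path[::-1]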

4.2 Scaling Computation

Because \alpha_t(i) and \beta_t(i) are probabilities between 0 and 1 and a large number of multiplications is computed, the output probability decreases exponentially to 0 as the sequence length T increases. We therefore need to scale the forward and backward variables to avoid underflow in the computation. The new forward and backward variables are \hat{\alpha}_t(i) and \hat{\beta}_t(i), and we define a scale coefficient as

c_t = \frac{1}{\sum_{i=1}^{N} \alpha_t(i)}

Using these definitions, the scaled forward algorithm is:

1. Initialization: \hat{\alpha}_1(i) = c_1 \pi_i b_i(o_1)
2. Recursion: \hat{\alpha}_t(j) = c_t \Big[ \sum_{i=1}^{N} \hat{\alpha}_{t-1}(i) a_{ij} \Big] b_j(o_t)
3. Termination: P(O \mid \lambda) = 1 \big/ \prod_{t=1}^{T} c_t

Because this output probability is too small to represent directly (the product of the scale coefficients in the denominator grows very large), we use the log probability instead:

\log P(O \mid \lambda) = - \sum_{t=1}^{T} \log c_t

4.3 Scaled Backward Algorithm

1. Initialization: \hat{\beta}_T(i) = c_T
2. Recursion: \hat{\beta}_t(i) = c_t \sum_{j=1}^{N} a_{ij} b_j(o_{t+1}) \hat{\beta}_{t+1}(j)

The scaled Baum-Welch algorithm is modified accordingly: the re-estimation formulas of Problem 3 keep their form, with

\xi_t(i,j) = \hat{\alpha}_t(i) a_{ij} b_j(o_{t+1}) \hat{\beta}_{t+1}(j), \qquad
\gamma_t(i) = \hat{\alpha}_t(i) \hat{\beta}_t(i) / c_t

so that the scale coefficients cancel and no explicit division by P(O \mid \lambda) is required.
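The scaled recursions and one re-estimation pass can be sketched as below (illustrative only; this follows the standard scaled formulation of [1], not the authors' modified algorithm). A is N x N, B is N x M, pi has length N, and obs is a sequence of VQ symbol indices.

# Illustrative scaled forward/backward and one Baum-Welch re-estimation pass.
import numpy as np

def scaled_forward(A, B, pi, obs):
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    c = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    c[0] = 1.0 / alpha[0].sum()
    alpha[0] *= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
        c[t] = 1.0 / alpha[t].sum()
        alpha[t] *= c[t]
    log_prob = -np.sum(np.log(c))                      # log P(O | lambda)
    return alpha, c, log_prob

def scaled_backward(A, B, obs, c):
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = c[-1]
    for t in range(T - 2, -1, -1):
        beta[t] = c[t] * (A @ (B[:, obs[t+1]] * beta[t+1]))
    return beta

def baum_welch_step(A, B, pi, obs):
    """One re-estimation pass; returns updated (A, B, pi) and log P(O|lambda)."""
    alpha, c, log_prob = scaled_forward(A, B, pi, obs)
    beta = scaled_backward(A, B, obs, c)
    gamma = alpha * beta / c[:, None]                  # gamma_t(i)
    # xi_t(i, j) built from the scaled variables; the scale factors cancel.
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :])
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi, log_prob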

5. RESULTS AND DISCUSSION

The training set for the vector quantizer was obtained by recording utterances of a set of country names. The recordings were made by three male speakers. The recognition vocabulary consisted of the names India, Spain, Germany, Zambia and Mexico. The results obtained are shown in the table below.

Table 1. Recognition results

Word       Speaker 1   Speaker 2   Speaker 3
India        60%         64%         64.5%
Spain        97%         99.2%       98.3%
Germany      95.5%       91%         92%
Zambia       65%         70%         61.4%
Mexico       98%         99%         97.23%

Much of the error can be attributed to the presence of plosives at the beginning and end of some of the words. For example, India and Zambia are similar sounding: they share the same vowel part and differ only in their unvoiced beginning, hence the low accuracy of 65%. Words like Spain and Mexico, being distinct from the others, recorded very high percentages.

6. CONCLUSION

In this implementation the results were found to be satisfactory, considering the small amount of training data collected under different and varied conditions. The accuracy of the real-time system can be increased significantly by using an improved speech detection/noise elimination algorithm. Further improvement can also be achieved by a better VQ codebook design, with a training set of utterances from a large number of speakers varying in age and accent. The scaling computation, which is part of the original work in this paper, modifies the existing scaling algorithms to ameliorate the results further and to remove inaccuracies.

7. REFERENCES

[1] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice Hall (Signal Processing Series), 1993.
[2] R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification, John Wiley & Sons (Asia) Pte Ltd.
[3] Y. Linde, A. Buzo and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Communications, COM-28, January 1980.