Noise Robust Isolated Words Recognition Problem Solving Based on Simultaneous Perturbation Stochastic Approximation Algorithm
EngOpt International Conference on Engineering Optimization, Rio de Janeiro, Brazil, 01-05 June

Dmitry Shalymov
Department of Mathematics & Mechanics, Saint-Petersburg State University, Saint-Petersburg, Russia, shalydim@mail.ru

1. Abstract
This paper presents the use of the simultaneous perturbation stochastic approximation (SPSA) algorithm for solving the noise robust isolated words recognition problem. The noise robust speech recognition method based on mel-frequency cepstral coefficients (MFCC) is briefly described, the main features of the SPSA algorithm are shown, and the effectiveness of the proposed method is demonstrated.

2. Keywords: Speech Recognition, Stochastic Optimization, Artificial Intelligence

3. Introduction
Problems of speech recognition are still important today. Many modern methods used to solve them are computationally resource-intensive, while the capacity of such resources is often bounded; for many algorithms it is therefore impossible to use them in portable devices. This moves researchers to look for more effective methods. This paper presents the use of the simultaneous perturbation stochastic approximation (SPSA) algorithm for solving the noise robust isolated words recognition problem. Due to SPSA's simplicity and the small number of operations per iteration, the algorithm can be used as an alternative method for real-time speech recognition. The noise robust speech recognition method based on mel-frequency cepstral coefficients (MFCC) is briefly described. Every sound wave that enters the recognition system includes some noise. In the case of noisy measurements of the loss function, the SPSA algorithm keeps reliable estimates under almost arbitrary noise.
This is very important for the speech recognition problem, where the noise often represents phase or spectrum shifts of the signal, the external environment, recording device settings, etc. The SPSA algorithm is based on trial simultaneous perturbations, which provide appropriate estimates under almost arbitrary noise. The main characteristic of the SPSA algorithm is that only two measurements of the function are needed to approximate the loss function gradient, for any dimension of the unknown feature vector. Thanks to this characteristic it is convenient to use the SPSA algorithm in the speech recognition problem, where feature vectors of large dimension are used. It is also simple to apply this kind of algorithm in optimization problems with a large number of variables, which gives an opportunity to operate with many words at once. Moreover, its implementation is simple to understand and to embed in electronic devices.

4. Isolated words recognition problem
Digital processing of an acoustic signal supposes that the analog speech signal is transformed into digital form. As a result of A/D conversion, the continuous signal is converted into a sequence of discrete time intervals; each time interval is represented by one value (a signal measurement) that characterizes the signal at a point with a defined precision. The accuracy of this representation depends on the width of the range of obtained numbers, and hence on the capacity of the A/D converter. The process of extracting numeric values from the signal is called quantization; the fragmentation of the signal into time intervals is called sampling. Digital processing of the acoustic signal is shown in Fig. 1.

Fig. 1: Steps of acoustic signal processing

The analogue acoustic signal that comes from the microphone is subjected to quantization and sampling by the A/D conversion, and word acquisition occurs: the digital record of the pronounced word is obtained as a sequence of acoustic signal measurements {s_k}.
The acquired word is divided into a frame sequence {X_i} during digital processing. A frame X (of length N) is a sequence of acoustic signal measurements s_1, s_2, ..., s_N. The length of each frame is strictly fixed in time: for example, if N = 100 and the sampling rate is 8000 Hz, then the frame length is equal to 12.5 msec. Frames are often shifted relative to each other to prevent information losses at the frame borders.
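The frame arithmetic above can be checked with a short sketch (a minimal illustration in Python rather than the paper's Matlab; the function name and the 50-sample shift step are assumptions, not the paper's code):

```python
import numpy as np

def split_into_frames(signal, frame_len=100, shift=50):
    """Split a 1-D signal into (possibly overlapping) frames of fixed length."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frames.append(signal[start:start + frame_len])
    return np.array(frames)

# With N = 100 samples per frame at an 8000 Hz sampling rate,
# each frame covers 100 / 8000 s = 12.5 ms of speech.
sampling_rate = 8000
frame_len = 100
print(frame_len / sampling_rate * 1000)   # frame duration in ms -> 12.5

signal = np.zeros(8000)                   # one second of (silent) signal
frames = split_into_frames(signal, frame_len=frame_len, shift=50)
print(frames.shape)                       # (159, 100): 50% frame overlap
```

A shift step equal to half the frame length, as here, corresponds to the 50% overlap discussed below.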
The frame shift step is the number of signal measurements between the beginnings of two consecutive frames. A shift step less than N (the frame length) means that the frames overlap. Further, in a series of tasks such as speech recognition or personal identification, each frame is mapped to several data values that characterize the sound in the best way. Such data organize a feature vector (or attribute vector). From the mathematical point of view it could be a vector of the space R^M, a group of functions, or one function. The objective of the recognition system is to identify each word that comes in with one of the predefined classes. Unfortunately, a great number of various factors can reduce the accuracy of the recognition system: for example, the mood and state of the speaker, external environment noise, the rate of phrase pronunciation, etc. The recognition system is speaker independent if it recognizes words correctly regardless of the person who pronounces them. It is hard to implement such a system in practice because acoustic signals strongly depend on the loudness and timbre of the voice and on the mood and state of the speaker. To extract information from such signals, mel-scale filters are quite often used. These filters average spectral components of the signal in specific frequency ranges, so the signal becomes less dependent on the speaker. Such filters lie at the base of the MFCC (Mel-Frequency Cepstral Coefficients) method, which is used in the recognition system discussed in this paper.

5. Speech signal processing
5.1. Preliminary filtration
The speech signal should be passed through a low-frequency filter for spectral smoothing. The goal of this transformation is to reduce the influence of local distortions. Low-frequency filtration is often implemented at a low level. Nevertheless, there are various mathematical methods that are successfully used in speech recognition problems; in the considered system no such methods were used.
It is well known that the most informative frequencies of human speech are concentrated in the 100 Hz - 2 KHz interval. That is why, when solving the speech recognition problem, only the frequencies of this interval are kept in the signal spectrogram as early as the initial stage.

5.2. Cutting the signal into overlapping segments
To extract feature vectors of the same length it is necessary to cut the speech signal into equal frames, and after that to transform each frame. Usually frames are selected so that they overlap by half or by 2/3 of their length. Overlapping is used to reduce information loss at the borders of the frames. The feature vector for an observed region of the speech signal consists of the cepstral coefficients computed for each frame separately, so increasing the frame overlap increases the dimension of the feature vector for the entire region. The set of numbers extracted during spectral analysis of a speech signal interval is called the cepstral coefficients. Usually the length of the observed interval is selected so that it corresponds to an interval on the order of tens of milliseconds.

5.3. Window signal processing
The goal of this step is to reduce border effects that take place during the segmentation process. To neutralize undesirable border effects, the speech signal s(n) is usually multiplied by a window w(n): x(n) = s(n)*w(n). As w(n), the Hamming window function is often used:

    w(n) = 0.54 - 0.46*cos(2*pi*n/N),  0 <= n < N,
    w(n) = 0,  otherwise.

5.4. Feature vector extraction
Each input speech signal is represented as a feature vector that characterizes the signal. There are several ways to construct the feature vector; in the discussed model we use a classical approach based on cepstral coefficients. There are two possibilities to extract cepstral coefficients: one is based on Mel-Frequency Cepstral Coefficients (MFCC) [3], the other on Linear Predictive Cepstral Coefficients (LPCC) [4]. MFCC is the most widespread method. Let us examine its major steps.
1. The input signal is broken into frames, and for each frame the Hamming window is applied.
2. Pre-emphasis: preliminary phrase selection (accentuation), performed by filtering the speech signal with an FIR (finite impulse response) filter. This is done for spectral smoothing; it makes the signal less sensitive to the various noises that occur during signal processing.
3. Then the spectrogram is examined. The set of frequencies present in the spectrogram is divided into numbered intervals, with a strictly defined range of possible frequencies for each interval. Then the average signal intensity in each interval is calculated to build a special diagram whose abscissa consists of interval numbers and whose ordinate axis consists of averaged amplitude values. This process is called mel-scale filtration.
4. Feature vectors are extracted with methods based on human interpretation of sound: since the human ear perceives signal loudness on a logarithmic scale, this step compresses the signal amplitudes using the logarithm.
5. The final step is the application of the inverse Fourier transform to the spectrum. The result of this step is the extraction of the cepstral coefficients and the construction of the feature vector. The cepstral coefficients can be described as follows:

    c_n = sum_{k=1..K} (log S(k)) * e^{ikn},

where S(k) is the averaged spectrum value that characterizes the frequency interval with number k of the mel-scale filter, and K is the total number of intervals.
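The five steps above can be sketched end-to-end for a single frame (a simplified illustration in Python with NumPy; the rectangular averaging over mel-spaced intervals, the DCT as the final inverse transform, and all parameter values are assumptions made for the toy example, not the paper's exact realization):

```python
import numpy as np

def mfcc_frame(frame, sample_rate=8000, n_filters=24, n_ceps=12):
    """Toy MFCC-style feature extraction for a single frame."""
    n = len(frame)
    # Steps 1-2: pre-emphasis (simple first-order FIR filter) and Hamming window.
    emphasized = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    windowed = emphasized * np.hamming(n)
    # Step 3: power spectrum, then averaging over mel-spaced frequency intervals.
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2
    mel_max = 2595 * np.log10(1 + (sample_rate / 2) / 700)
    mel_edges = np.linspace(0, mel_max, n_filters + 1)
    hz_edges = 700 * (10 ** (mel_edges / 2595) - 1)
    bins = np.floor(hz_edges / (sample_rate / 2) * (len(spectrum) - 1)).astype(int)
    filter_out = np.array([
        spectrum[bins[i]:max(bins[i + 1], bins[i] + 1)].mean()
        for i in range(n_filters)
    ])
    # Step 4: logarithmic compression, mimicking loudness perception.
    log_energies = np.log(filter_out + 1e-10)
    # Step 5: inverse transform of the log-spectrum (here a DCT, a common choice).
    k = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), k + 0.5) / n_filters)
    return basis @ log_energies

features = mfcc_frame(np.random.randn(100))
print(features.shape)  # (12,)
```

Each 100-sample frame is thus reduced to a short vector of cepstral-style coefficients; concatenating the per-frame vectors yields the feature vector of a longer speech interval.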
6. Randomized algorithm of stochastic approximation
The exact solution of any problem can be found if there is a precise definition and mathematical description. But in reality the complexity of connections and relationships makes it impossible to give an exact mathematical description for many phenomena. The simplest theoretical approach is to choose a mathematical model which is close to the real process and which includes different noises (disturbances). These noises represent, on the one side, some roughness of the mathematical model and, on the other, the characteristics of outside uncontrolled perturbations of the system. It is well known to specialists in the theory of unknown parameter identification that if the noise is a deterministic unknown function, or the observation noise is a probabilistically dependent sequence, then obtaining consistent decisions fails. Some theorists then say that the observation sequence is degenerate (not rich), and the solutions of such problems are not studied. For the purpose of enriching the information in the observation channel, it is sometimes possible to include a new simultaneous perturbation with well-known probabilistic properties into the input channel of the system. Sometimes a measurable random process already present in the system can play the role of such a perturbation. In control systems it is natural to add the trial simultaneous perturbations (actions) through a control channel. One of the remarkable characteristics of this type of algorithm is convergence under almost arbitrary noise. A considerable restriction on using these algorithms is the assumption of weak correlation or independence between the measurement noise and the simultaneous perturbation added into the system, while there are no other assumptions about the measurement noise properties.
This restriction is natural when the noise is generated from an unknown but bounded deterministic function (some unmodeled dynamics). Let us suppose that there are l different words in our recognition system. Feature vectors of the speech signal are the input signals of the SPSA algorithm; each is represented as a point in a multidimensional Euclidean space. The SPSA algorithm determines the centers of l classes from the classifying sequence. Each class corresponds to one of the words, and the coordinates of the centers represent the feature vectors of the pattern words. A word is matched to a class by the distance between the feature vector of the signal and the center of the class. The algorithm considered below is used to define the pattern words (the class centers of the system). To recognize speech commands, the traditional method of comparison with patterns followed by minimal distance extraction is used. As initial class centers it is possible to take arbitrary l vectors of the space. In general, the selection of the words to be recognized is important: the more phonetic differences between words, the easier their recognition. But often the words to be recognized sound similar. That is why it is important to place the centers of the classes as far from each other as possible. From the mathematical point of view, the speech recognition problem can be reformulated as a problem of automatic image classification.

7. Automatic image classification problem
Suppose that the state space X is covered by a set of classes {X_1, ..., X_l} (the number of classes is bounded by l). The automatic image classification problem is to build a rule which, for each point x from X, gives the corresponding class from {X_1, ..., X_l}. If several points are assigned to the same class, they have a common feature, and this feature naturally generates the class.
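The minimal-distance rule described above can be sketched as follows (a toy illustration in Python; the Euclidean metric, the random example centers, and the 960-dimensional vectors mirror the experiment reported later in the paper but are otherwise assumptions):

```python
import numpy as np

def classify(feature_vector, centers):
    """Assign the input to the class whose center is nearest (Euclidean distance)."""
    distances = [np.linalg.norm(feature_vector - c) for c in centers]
    return int(np.argmin(distances))

# Four class centers in a 960-dimensional feature space, as in the experiment.
rng = np.random.default_rng(0)
centers = [rng.normal(size=960) for _ in range(4)]

x = centers[2] + 0.01 * rng.normal(size=960)  # a noisy copy of pattern word 2
print(classify(x, centers))  # 2
```

Once the SPSA algorithm has estimated good class centers, recognition reduces to this single nearest-center lookup per input word.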
Usually one can take as the common feature of a class the closeness to a specific center: for each point x the simple classification rule is to compare the distances from x to the centers of the classes. To formalize the classification rule, a family of penalty functions (cost functions) q(x, theta_i) with parameter vectors theta_i is considered. Let us suppose that a probability distribution P(dx) is given on X. The automatic classification problem is to find the set of parameter vectors Theta = (theta_1, ..., theta_l) that minimizes the mean risk functional; the state space is divided into classes according to the rule that x belongs to the class with the smallest penalty. Denoting by u_i the indicator functions of these classes, the mean risk functional can be rewritten as the integral of the penalties weighted by these indicators. This functional characterizes the performance of the classification: a partition is optimal if its parameter Theta minimizes the mean risk functional. Geometrically the automatic classification problem can be described as follows. Suppose X is a subset of the space R^M and the penalty functions are q(x, theta_i) = ||x - theta_i||^2. The points located closer to the center theta_i than to the other centers correspond to the class X_i. The mean risk functional can then be written as

    F(Theta) = Integral of min_{1<=i<=l} ||x - theta_i||^2 P(dx),

and the automatic classification problem becomes the problem of finding the set of centers which minimizes
the total dispersion. The value of F remains unchanged if the vectors within the bundle Theta are transposed. The usual way to minimize some function F is to find the bundle of centers for which the equation grad F(Theta) = 0 is satisfied. But in the considered case the function F is not differentiable everywhere; that is why the solution of the automatic classification problem may be not simple.

8. Trial perturbations and estimation algorithm
Assume that the probability distribution P(dx) is unknown, but we have a qualifying (training) sequence. In [1] a way was suggested to build an estimation sequence which converges to a good approximation of the bundle of centers. The proposed recursive algorithm for the classification of huge amounts of multidimensional data is based on former SPSA ideas, and the new algorithms perform well in a real-time environment. The SPSA algorithm uses simultaneous trial perturbations. The main features of SPSA algorithms are the following: the unknown function is measured not at the point of the previous estimate but at a slightly perturbed position, for all unknown vector components simultaneously, and there is an essential reduction of observations at each iteration in the multidimensional case. This means that the necessary number of iterations does not increase in comparison with the classical Kiefer-Wolfowitz procedure, while the number of observations decreases significantly. Suppose the penalty functions are not defined analytically, but their values can be measured with some noise. To build the estimation sequence of the bundle of centers we suggest using the SPSA algorithm. It is based on measurable, stochastically independent vectors called the trial simultaneous perturbation, which consist of independent random values. Let us fix an initial bundle of centers and choose two sequences {alpha_n} and {beta_n} tending to zero. At each step the algorithm perturbs the current estimate of a center by plus and minus beta_n times the trial perturbation vector, measures the noisy penalty at the two perturbed points, and moves the estimate along the perturbation direction proportionally to alpha_n and to the difference of the two measurements, after which a projector onto the admissible set is applied.
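The estimation procedure described above can be illustrated with a generic two-measurement SPSA minimizer (a minimal sketch, not the paper's exact classification algorithm: the projector is omitted, and the step-size schedules, the quadratic penalty, and the noise model are assumptions made for this toy example):

```python
import numpy as np

def spsa_minimize(loss, theta0, n_iter=2000, seed=0):
    """Two-measurement SPSA: at every iteration the loss is evaluated at just
    two randomly perturbed points, whatever the dimension of theta."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for n in range(1, n_iter + 1):
        alpha = 0.1 / n ** 0.602      # gain sequence (assumed schedule)
        beta = 0.1 / n ** 0.101       # perturbation size (assumed schedule)
        delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Bernoulli +/-1 trial perturbation
        y_plus = loss(theta + beta * delta)    # first noisy measurement
        y_minus = loss(theta - beta * delta)   # second noisy measurement
        theta = theta - alpha * (y_plus - y_minus) / (2 * beta) * delta
    return theta

# A noisy quadratic penalty centered at (1, -2); the measurement noise is
# bounded but otherwise arbitrary, which SPSA tolerates.
noise_rng = np.random.default_rng(1)
target = np.array([1.0, -2.0])
loss = lambda t: float(np.sum((t - target) ** 2)) + 0.1 * noise_rng.uniform(-1, 1)

theta = spsa_minimize(loss, np.zeros(2))
print(np.round(theta, 1))  # close to the true center (1, -2)
```

Note that the cost per iteration is just two loss evaluations regardless of the dimension of theta, which is the property that makes the method attractive for the 960-dimensional feature space used later in the paper.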
9. Practical application
As an experiment, a simplified model of SPSA algorithm application to the isolated words recognition problem was implemented. In Matlab 7.0 a speaker-dependent self-training system able to recognize four different words was created. The selection of the words to be recognized is important in general: it is easier to recognize words that have many phonetic differences. To provide convergence of the algorithm with the penalty function q(x, theta) = ||x - theta||, a special condition needs to be satisfied: the distance between different classes should be greater than the maximum radius of all classes. Hence it is desirable to have the centers of the classes as far from each other as possible. As initial centers of the classes we could take any points of the space R^M; in the considered recognition system, the feature vectors of the first four different words from the qualifying sequence were taken as initial centers. Let us consider the part of the speech signal that corresponds to a one-second time interval. It consists of several frames, each 25 msec long, so there are 40 frames in the one-second interval in all. During spectral processing, a feature vector of dimension 24 was extracted from each frame: the spectrum was broken into 24 ranges, the average spectrum value was computed for each range, and the bundle of averaged spectrum values organizes the feature vector. The dimension M of the phase space is defined as the sum of the dimensions of the feature vectors of all frames in the one-second speech signal interval. Frame overlapping was not used, so the phase space dimension is M = 40*24 = 960. For each class, more than one hundred samples were recorded to arrange the qualifying sequence. Recording was performed with an 8000 Hz sampling rate and 16-bit quantization. During speech signal processing, optimization methods concerned with the peculiarities of the microphone were also used.
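The dimension bookkeeping above can be checked directly (frame and filter counts are from the experiment; the concatenation layout is an assumption):

```python
import numpy as np

frame_ms, second_ms = 25, 1000
n_frames = second_ms // frame_ms   # 40 frames per second, no overlap
per_frame = 24                     # averaged spectrum values per frame

# Concatenate the 40 per-frame vectors into one point of the phase space.
frames = np.random.randn(n_frames, per_frame)
feature_vector = frames.reshape(-1)
print(n_frames, feature_vector.shape)  # 40 (960,)
```

With 50% overlap the frame count, and hence M, would roughly double, which is the trade-off mentioned in Section 5.2.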
The rate of convergence of the algorithm in practice depends on the selection of the sequences {alpha_n} and {beta_n}. An important role in the considered algorithm is played by the simultaneous trial perturbations. It is not necessary to take them as random +/-1 values; the main requirement is that the trial perturbations are finite and symmetrically distributed. Based on empirical trials, the sequence 3/n was taken as {alpha_n} and 1/n as {beta_n}, and the simultaneous trial perturbations were selected as +/-1/30. The convergence of the considered algorithm for one word is shown in Fig. 2, which demonstrates the distances between the input signals and the approximated class center; the class center is approximated while the SPSA algorithm runs. One hundred signals entered the system, and the feature vector of the pattern word corresponds to the class center at n = 100. During feature vector extraction some inaccuracies were permitted to simplify the system implementation; in particular, the averaged spectrum values were computed roughly during mel-scale filtration. In spite of this, 98% recognition accuracy was achieved. To improve the statistics, the cepstral coefficient extraction needs to be implemented in another way.
Fig. 2: SPSA algorithm convergence to one class center

10. Conclusions
This paper presented the application of the simultaneous perturbation stochastic approximation (SPSA) algorithm to the noise robust isolated words recognition problem. SPSA provides appropriate estimates under almost arbitrary noise. One of its important features is the ability to retain simplicity and efficiency as the space dimension grows; it also gives an opportunity to operate with many classes at once. The main steps of solving the isolated words recognition problem were described. To extract feature vectors, Mel-Frequency Cepstral Coefficients were used. The recognition system accuracy proved to be 98%. The performance of the system could be improved by improving the realization of the MFCC method.

11. References
1. Granichin O. N. and Izmakova O. A., A Randomized Stochastic Approximation Algorithm for Self-Learning. Avtomatika i Telemekhanika, No. 8, 2005.
2. Granichin O. N. and Polyak B. T., Randomized Algorithms of Estimation and Optimization Under Almost Arbitrary Noises. M.: Nauka, 2003.
3. Gold B. and Morgan N., Speech and Audio Signal Processing. John Wiley and Sons, Inc., 1999.
4. Rogina I., Automatic Speech Recognition. Carnegie Mellon University.
5. Fomin V. N., Recursive Estimation and Adaptive Filtration. M.: Nauka, 1984.
More information[Omer* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116
IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY TAJWEED UTOMATION SYSTEM USING HIDDEN MARKOUV MODEL AND NURAL NETWORK Safaa Omer Mohammed Nssr*, Hoida Ali Abdelgader SUDAN UNIVERSITY
More informationMel-Generalized Cepstral Representation of Speech A Unified Approach to Speech Spectral Estimation. Keiichi Tokuda
Mel-Generalized Cepstral Representation of Speech A Unified Approach to Speech Spectral Estimation Keiichi Tokuda Nagoya Institute of Technology Carnegie Mellon University Tamkang University March 13,
More informationAn Evolutionary Programming Based Algorithm for HMM training
An Evolutionary Programming Based Algorithm for HMM training Ewa Figielska,Wlodzimierz Kasprzak Institute of Control and Computation Engineering, Warsaw University of Technology ul. Nowowiejska 15/19,
More informationMachine Recognition of Sounds in Mixtures
Machine Recognition of Sounds in Mixtures Outline 1 2 3 4 Computational Auditory Scene Analysis Speech Recognition as Source Formation Sound Fragment Decoding Results & Conclusions Dan Ellis
More informationHidden Markov Model Based Robust Speech Recognition
Hidden Markov Model Based Robust Speech Recognition Vikas Mulik * Vikram Mane Imran Jamadar JCEM,K.M.Gad,E&Tc,&Shivaji University, ADCET,ASHTA,E&Tc&Shivaji university ADCET,ASHTA,Automobile&Shivaji Abstract
More informationSignal representations: Cepstrum
Signal representations: Cepstrum Source-filter separation for sound production For speech, source corresponds to excitation by a pulse train for voiced phonemes and to turbulence (noise) for unvoiced phonemes,
More informationCorrespondence. Pulse Doppler Radar Target Recognition using a Two-Stage SVM Procedure
Correspondence Pulse Doppler Radar Target Recognition using a Two-Stage SVM Procedure It is possible to detect and classify moving and stationary targets using ground surveillance pulse-doppler radars
More informationThe Noisy Channel Model. CS 294-5: Statistical Natural Language Processing. Speech Recognition Architecture. Digitizing Speech
CS 294-5: Statistical Natural Language Processing The Noisy Channel Model Speech Recognition II Lecture 21: 11/29/05 Search through space of all possible sentences. Pick the one that is most probable given
More informationDesigning Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction: Op-amps in Negative Feedback
EECS 16A Designing Information Devices and Systems I Fall 2018 Lecture Notes Note 18 18.1 Introduction: Op-amps in Negative Feedback In the last note, we saw that can use an op-amp as a comparator. However,
More informationEvaluation of the modified group delay feature for isolated word recognition
Evaluation of the modified group delay feature for isolated word recognition Author Alsteris, Leigh, Paliwal, Kuldip Published 25 Conference Title The 8th International Symposium on Signal Processing and
More informationSpeaker Identification Based On Discriminative Vector Quantization And Data Fusion
University of Central Florida Electronic Theses and Dissertations Doctoral Dissertation (Open Access) Speaker Identification Based On Discriminative Vector Quantization And Data Fusion 2005 Guangyu Zhou
More informationVoiced Speech. Unvoiced Speech
Digital Speech Processing Lecture 2 Homomorphic Speech Processing General Discrete-Time Model of Speech Production p [ n] = p[ n] h [ n] Voiced Speech L h [ n] = A g[ n] v[ n] r[ n] V V V p [ n ] = u [
More informationFrequency Domain Speech Analysis
Frequency Domain Speech Analysis Short Time Fourier Analysis Cepstral Analysis Windowed (short time) Fourier Transform Spectrogram of speech signals Filter bank implementation* (Real) cepstrum and complex
More informationEstimation of Cepstral Coefficients for Robust Speech Recognition
Estimation of Cepstral Coefficients for Robust Speech Recognition by Kevin M. Indrebo, B.S., M.S. A Dissertation submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment
More informationPHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS
PHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS Jinjin Ye jinjin.ye@mu.edu Michael T. Johnson mike.johnson@mu.edu Richard J. Povinelli richard.povinelli@mu.edu
More informationCourse content (will be adapted to the background knowledge of the class):
Biomedical Signal Processing and Signal Modeling Lucas C Parra, parra@ccny.cuny.edu Departamento the Fisica, UBA Synopsis This course introduces two fundamental concepts of signal processing: linear systems
More informationOptimal Speech Enhancement Under Signal Presence Uncertainty Using Log-Spectral Amplitude Estimator
1 Optimal Speech Enhancement Under Signal Presence Uncertainty Using Log-Spectral Amplitude Estimator Israel Cohen Lamar Signal Processing Ltd. P.O.Box 573, Yokneam Ilit 20692, Israel E-mail: icohen@lamar.co.il
More informationExemplar-based voice conversion using non-negative spectrogram deconvolution
Exemplar-based voice conversion using non-negative spectrogram deconvolution Zhizheng Wu 1, Tuomas Virtanen 2, Tomi Kinnunen 3, Eng Siong Chng 1, Haizhou Li 1,4 1 Nanyang Technological University, Singapore
More informationLast time: small acoustics
Last time: small acoustics Voice, many instruments, modeled by tubes Traveling waves in both directions yield standing waves Standing waves correspond to resonances Variations from the idealization give
More informationTimbral, Scale, Pitch modifications
Introduction Timbral, Scale, Pitch modifications M2 Mathématiques / Vision / Apprentissage Audio signal analysis, indexing and transformation Page 1 / 40 Page 2 / 40 Modification of playback speed Modifications
More informationNon-Negative Matrix Factorization And Its Application to Audio. Tuomas Virtanen Tampere University of Technology
Non-Negative Matrix Factorization And Its Application to Audio Tuomas Virtanen Tampere University of Technology tuomas.virtanen@tut.fi 2 Contents Introduction to audio signals Spectrogram representation
More informationIntroduction to Biomedical Engineering
Introduction to Biomedical Engineering Biosignal processing Kung-Bin Sung 6/11/2007 1 Outline Chapter 10: Biosignal processing Characteristics of biosignals Frequency domain representation and analysis
More informationDominant Feature Vectors Based Audio Similarity Measure
Dominant Feature Vectors Based Audio Similarity Measure Jing Gu 1, Lie Lu 2, Rui Cai 3, Hong-Jiang Zhang 2, and Jian Yang 1 1 Dept. of Electronic Engineering, Tsinghua Univ., Beijing, 100084, China 2 Microsoft
More informationOn the relationship between intra-oral pressure and speech sonority
On the relationship between intra-oral pressure and speech sonority Anne Cros, Didier Demolin, Ana Georgina Flesia, Antonio Galves Interspeech 2005 1 We address the question of the relationship between
More informationHarmonic Structure Transform for Speaker Recognition
Harmonic Structure Transform for Speaker Recognition Kornel Laskowski & Qin Jin Carnegie Mellon University, Pittsburgh PA, USA KTH Speech Music & Hearing, Stockholm, Sweden 29 August, 2011 Laskowski &
More informationSYMBOL RECOGNITION IN HANDWRITTEN MATHEMATI- CAL FORMULAS
SYMBOL RECOGNITION IN HANDWRITTEN MATHEMATI- CAL FORMULAS Hans-Jürgen Winkler ABSTRACT In this paper an efficient on-line recognition system for handwritten mathematical formulas is proposed. After formula
More informationMaximum Likelihood and Maximum A Posteriori Adaptation for Distributed Speaker Recognition Systems
Maximum Likelihood and Maximum A Posteriori Adaptation for Distributed Speaker Recognition Systems Chin-Hung Sit 1, Man-Wai Mak 1, and Sun-Yuan Kung 2 1 Center for Multimedia Signal Processing Dept. of
More informationSpectral and Textural Feature-Based System for Automatic Detection of Fricatives and Affricates
Spectral and Textural Feature-Based System for Automatic Detection of Fricatives and Affricates Dima Ruinskiy Niv Dadush Yizhar Lavner Department of Computer Science, Tel-Hai College, Israel Outline Phoneme
More informationA Model for Computer Identification of Micro-organisms
J. gen, Microbial. (1965), 39, 401405 Printed.in Great Britain 401 A Model for Computer Identification of Micro-organisms BY H. G. GYLLENBERG Department of Microbiology, Ulziversity of Helsinki, Finland
More informationSequential Monte Carlo methods for filtering of unobservable components of multidimensional diffusion Markov processes
Sequential Monte Carlo methods for filtering of unobservable components of multidimensional diffusion Markov processes Ellida M. Khazen * 13395 Coppermine Rd. Apartment 410 Herndon VA 20171 USA Abstract
More informationwhere =0,, 1, () is the sample at time index and is the imaginary number 1. Then, () is a vector of values at frequency index corresponding to the mag
Efficient Discrete Tchebichef on Spectrum Analysis of Speech Recognition Ferda Ernawan and Nur Azman Abu Abstract Speech recognition is still a growing field of importance. The growth in computing power
More informationChapter 3. Data Analysis
Chapter 3 Data Analysis The analysis of the measured track data is described in this chapter. First, information regarding source and content of the measured track data is discussed, followed by the evaluation
More informationEcho cancellation by deforming sound waves through inverse convolution R. Ay 1 ward DeywrfmzMf o/ D/g 0001, Gauteng, South Africa
Echo cancellation by deforming sound waves through inverse convolution R. Ay 1 ward DeywrfmzMf o/ D/g 0001, Gauteng, South Africa Abstract This study concerns the mathematical modelling of speech related
More informationImpulsive Noise Filtering In Biomedical Signals With Application of New Myriad Filter
BIOSIGAL 21 Impulsive oise Filtering In Biomedical Signals With Application of ew Myriad Filter Tomasz Pander 1 1 Division of Biomedical Electronics, Institute of Electronics, Silesian University of Technology,
More informationEVALUATING MISCLASSIFICATION PROBABILITY USING EMPIRICAL RISK 1. Victor Nedel ko
94 International Journal "Information Theories & Applications" Vol13 [Raudys, 001] Raudys S, Statistical and neural classifiers, Springer, 001 [Mirenkova, 00] S V Mirenkova (edel ko) A method for prediction
More informationCMPT 889: Lecture 3 Fundamentals of Digital Audio, Discrete-Time Signals
CMPT 889: Lecture 3 Fundamentals of Digital Audio, Discrete-Time Signals Tamara Smyth, tamaras@cs.sfu.ca School of Computing Science, Simon Fraser University October 6, 2005 1 Sound Sound waves are longitudinal
More informationHMM and IOHMM Modeling of EEG Rhythms for Asynchronous BCI Systems
HMM and IOHMM Modeling of EEG Rhythms for Asynchronous BCI Systems Silvia Chiappa and Samy Bengio {chiappa,bengio}@idiap.ch IDIAP, P.O. Box 592, CH-1920 Martigny, Switzerland Abstract. We compare the use
More informationConvolutional Associative Memory: FIR Filter Model of Synapse
Convolutional Associative Memory: FIR Filter Model of Synapse Rama Murthy Garimella 1, Sai Dileep Munugoti 2, Anil Rayala 1 1 International Institute of Information technology, Hyderabad, India. rammurthy@iiit.ac.in,
More informationODEON APPLICATION NOTE Calibration of Impulse Response Measurements
ODEON APPLICATION NOTE Calibration of Impulse Response Measurements Part 2 Free Field Method GK, CLC - May 2015 Scope In this application note we explain how to use the Free-field calibration tool in ODEON
More informationVerification of contribution separation technique for vehicle interior noise using only response signals
Verification of contribution separation technique for vehicle interior noise using only response signals Tomohiro HIRANO 1 ; Junji YOSHIDA 1 1 Osaka Institute of Technology, Japan ABSTRACT In this study,
More informationAdaptiveFilters. GJRE-F Classification : FOR Code:
Global Journal of Researches in Engineering: F Electrical and Electronics Engineering Volume 14 Issue 7 Version 1.0 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals
More informationImproving the Multi-Stack Decoding Algorithm in a Segment-based Speech Recognizer
Improving the Multi-Stack Decoding Algorithm in a Segment-based Speech Recognizer Gábor Gosztolya, András Kocsor Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and University
More informationText Independent Speaker Identification Using Imfcc Integrated With Ica
IOSR Journal of Electronics and Communication Engineering (IOSR-JECE) e-issn: 2278-2834,p- ISSN: 2278-8735. Volume 7, Issue 5 (Sep. - Oct. 2013), PP 22-27 ext Independent Speaker Identification Using Imfcc
More information2.161 Signal Processing: Continuous and Discrete Fall 2008
IT OpenCourseWare http://ocw.mit.edu 2.6 Signal Processing: Continuous and Discrete Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. ASSACHUSETTS
More informationGMM Vector Quantization on the Modeling of DHMM for Arabic Isolated Word Recognition System
GMM Vector Quantization on the Modeling of DHMM for Arabic Isolated Word Recognition System Snani Cherifa 1, Ramdani Messaoud 1, Zermi Narima 1, Bourouba Houcine 2 1 Laboratoire d Automatique et Signaux
More informationJorge Silva and Shrikanth Narayanan, Senior Member, IEEE. 1 is the probability measure induced by the probability density function
890 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 3, MAY 2006 Average Divergence Distance as a Statistical Discrimination Measure for Hidden Markov Models Jorge Silva and Shrikanth
More informationAnalysis of methods for speech signals quantization
INFOTEH-JAHORINA Vol. 14, March 2015. Analysis of methods for speech signals quantization Stefan Stojkov Mihajlo Pupin Institute, University of Belgrade Belgrade, Serbia e-mail: stefan.stojkov@pupin.rs
More informationMath 350: An exploration of HMMs through doodles.
Math 350: An exploration of HMMs through doodles. Joshua Little (407673) 19 December 2012 1 Background 1.1 Hidden Markov models. Markov chains (MCs) work well for modelling discrete-time processes, or
More informationUsing the Sound Recognition Techniques to Reduce the Electricity Consumption in Highways
Marsland Press Journal of American Science 2009:5(2) 1-12 Using the Sound Recognition Techniques to Reduce the Electricity Consumption in Highways 1 Khalid T. Al-Sarayreh, 2 Rafa E. Al-Qutaish, 3 Basil
More informationUNIT 1. SIGNALS AND SYSTEM
Page no: 1 UNIT 1. SIGNALS AND SYSTEM INTRODUCTION A SIGNAL is defined as any physical quantity that changes with time, distance, speed, position, pressure, temperature or some other quantity. A SIGNAL
More informationTest Sample and Size. Synonyms. Definition. Main Body Text. Michael E. Schuckers 1. Sample Size; Crew designs
Test Sample and Size Michael E. Schuckers 1 St. Lawrence University, Canton, NY 13617, USA schuckers@stlawu.edu Synonyms Sample Size; Crew designs Definition The testing and evaluation of biometrics is
More informationParametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012
Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood
More informationSINGLE CHANNEL SPEECH MUSIC SEPARATION USING NONNEGATIVE MATRIX FACTORIZATION AND SPECTRAL MASKS. Emad M. Grais and Hakan Erdogan
SINGLE CHANNEL SPEECH MUSIC SEPARATION USING NONNEGATIVE MATRIX FACTORIZATION AND SPECTRAL MASKS Emad M. Grais and Hakan Erdogan Faculty of Engineering and Natural Sciences, Sabanci University, Orhanli
More information