A Low-Cost Robust Front-end for Embedded ASR System
Lihui Guo 1, Xin He 2, Yue Lu 1, and Yaxin Zhang 2
1 Department of Computer Science and Technology, East China Normal University, Shanghai
2 Motorola China Research Center, Shanghai

Abstract. In this paper we propose a low-cost robust MFCC feature extraction algorithm that combines noise reduction and voice activity detection (VAD) for embedded automatic speech recognition (ASR) applications. To remedy the effect of additive noise, a magnitude spectrum subtraction method is used. A VAD distinguishes speech from nonspeech frames by applying an order statistics filter (OSF) to the subband spectral entropy. General RASTA filtering is applied to the log Mel filter-bank energy trajectories. Finally, after feature selection, a 26-dimensional feature vector is used in the ASR system. Experimental results show that the proposed front-end obtains 30.08% and 62.55% relative improvements on the Aurora2 and Aurora3 databases, and 29.47% on a Mandarin database, compared with the baseline ETSI standard MFCC front-end.

1 Introduction

Front-end feature extraction plays an important role in ASR systems. The ETSI standard Mel-frequency cepstral coefficient (MFCC) front-end is widely used in many ASR systems because it accurately represents the human auditory system and speech perception [1]. However, the characteristics of the speech signal are often distorted by background noise and the transmission channel, especially in mobile ASR applications, and the performance of an ASR system often degrades dramatically at low SNR levels. Noise reduction is a critical and difficult problem in ASR, and researchers have proposed many noise robustness methods over the past decades. When the environmental noise is additive, spectrum subtraction is an effective and computationally cheap speech enhancement technique.
In [2], a multi-band spectrum subtraction algorithm is implemented. Stahl [3] introduced quantile-based noise estimation for spectrum subtraction. Although many of these algorithms are effective, their high complexity makes them unsuitable for embedded ASR systems. In this paper we present a simple and effective spectrum subtraction algorithm for noise reduction. The nonspeech frames (noise-only or silence frames) in a speech signal contain only redundant and disturbing information for the ASR system. Although there are always silence and
short pause models in HMM acoustic model configurations, in practice it is still very helpful to distinguish speech from nonspeech and drop the nonspeech frames during decoding. For embedded systems this is crucial: computational complexity is minimized if only the speech frames are decoded, and the transmission bit rate in a distributed speech recognition (DSR) system is reduced if only speech frames are transferred. Moreover, insertion errors often occur when too many silence frames are passed to the decoder, so a VAD is necessary in a noise-robust front-end. In [4] a subband energy-based VAD algorithm is presented. Shen [5] introduced an entropy-based algorithm for endpoint detection under noisy conditions, and Xu [6] presented an improved entropy-based algorithm. In this paper, we introduce a VAD algorithm that applies an OSF to the subband spectral entropy. Although ETSI has published an advanced front-end (AFE) that substantially improves recognition performance under noisy conditions [7], it is not suitable for real-time embedded implementation because of its large computational complexity. Experimental results show that our proposed front-end uses only about one quarter of the computational MIPS of the AFE while achieving similar recognition accuracy.

This paper is organized as follows. In Section 2, we describe the proposed front-end in detail. Experimental results on the Aurora2, Aurora3 and Mandarin digits databases are presented in Section 3. Section 4 summarizes our conclusions.

2 Front-end Algorithm Description

The proposed front-end is a modified version of the ETSI standard MFCC front-end. Details of the basic processing blocks can be found in [8]. Fig. 1 shows the proposed front-end algorithm. In addition to the basic processing blocks, three enhancement stages (shaded blocks in Fig. 1) are added.
These stages are noise reduction with spectrum subtraction and RASTA filtering (Section 2.1), a subband OSF entropy-based VAD (Section 2.2), and a post-processing stage with dynamic-feature calculation, cepstral mean and variance normalization (CMVN) and feature selection (Section 2.3).

2.1 Noise Reduction

The input signal x(n) is divided into overlapped frames of length 25 ms (200 samples at the 8 kHz sampling rate) with a frame shift of 10 ms (80 samples). The magnitude spectrum of the signal is obtained by the Fast Fourier Transform (FFT). A spectrum subtraction is then applied to the magnitude spectrum X[l, m] by subtracting the noise estimate from the noisy spectrum. In the proposed front-end, spectrum subtraction is given by:

Y[l, m] = max(X[l, m] − N[m], α·X[l, m]),  0 ≤ m ≤ N_FFT/2   (1)
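The framing and magnitude spectrum computation above can be sketched as follows. This is a minimal NumPy sketch under our own naming; the pre-processing of the ETSI front-end (e.g. pre-emphasis and windowing) is omitted for brevity:

```python
import numpy as np

def frame_magnitude_spectrum(x, frame_len=200, frame_shift=80, nfft=256):
    """Split x into overlapped frames (25 ms length, 10 ms shift at 8 kHz)
    and return the per-frame magnitude spectrum |X[l, m]|, 0 <= m <= nfft/2."""
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    frames = np.stack([x[l * frame_shift:l * frame_shift + frame_len]
                       for l in range(n_frames)])
    # rfft keeps only the nfft/2 + 1 non-redundant bins of the 256-point FFT
    return np.abs(np.fft.rfft(frames, n=nfft, axis=1))

# 1 s of audio at 8 kHz -> 98 frames x 129 frequency bins
X = frame_magnitude_spectrum(np.random.randn(8000))
```

Each row of the result is the magnitude spectrum X[l, m] on which the spectrum subtraction of Eq. (1) operates.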
where N_FFT is the FFT length, Y[l, m] is the speech magnitude spectrum after spectrum subtraction, and X[l, m] is the magnitude spectrum of the noisy speech signal. N[m] is the average magnitude spectrum of the noise. For each utterance, the first 10 frames are assumed to be noise. This assumption is valid in practical applications, since before speaking a speaker always takes a short response time after hearing the beep tone. These 10 reference frames are used to calculate the average noise spectrum N[m]. In order to track nonstationary noise, N[m] is updated during nonspeech periods by:

N[m] = γ·N[m] + (1 − γ)·X[t, m]   (2)

where the t-th frame is classified as nonspeech by the VAD. The value γ = 0.97 performs well on the experimental databases. α ∈ (0, 1) is an attenuation constant that prevents Y[l, m] from becoming negative due to noise estimation error; α is fixed at 0.3 in our speech recognition experiments. Spectrum subtraction is effective for additive noise but not for convolutional noise. Convolutional noise becomes additive once it is subjected to a logarithm. In Fig. 1, f_ln[l, j] is produced after Mel filtering and the nonlinear transformation (natural logarithm).

[Fig. 1. The proposed robust front-end algorithm: preprocessing and FFT (length 256) produce X[l, m] (frequency index 0 ≤ m ≤ 128); spectrum subtraction with noise-estimate update yields Y[l, m]; Mel filtering and nonlinear transformation give f_ln[l, j] (filter-bank index 1 ≤ j ≤ 23), which feeds the entropy-based VAD, RASTA filtering and DCT (cepstral coefficients 0 ≤ i ≤ 12); after dynamic calculation, frame dropping on nonspeech frames, CMVN processing and feature selection, a 26-dimensional feature vector remains.]
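Equations (1) and (2) can be sketched together in NumPy as below. This is illustrative only: the function name is ours, and the per-frame VAD decision is assumed to be supplied as a boolean array:

```python
import numpy as np

ALPHA = 0.3   # attenuation constant alpha in Eq. (1)
GAMMA = 0.97  # noise-update smoothing factor gamma in Eq. (2)

def spectral_subtraction(X, is_speech, n_init=10):
    """Apply Eq. (1) frame by frame; the noise estimate N[m] starts as the
    mean of the first n_init frames and is updated on nonspeech frames
    per Eq. (2). X: (n_frames, n_bins) magnitudes; is_speech: per-frame
    boolean VAD decisions."""
    N = X[:n_init].mean(axis=0)
    Y = np.empty_like(X)
    for l in range(len(X)):
        Y[l] = np.maximum(X[l] - N, ALPHA * X[l])   # Eq. (1)
        if not is_speech[l]:
            N = GAMMA * N + (1.0 - GAMMA) * X[l]    # Eq. (2)
    return Y

# Constant-magnitude "noise": X - N is 0 everywhere, so the alpha floor wins
X = np.full((20, 129), 2.0)
Y = spectral_subtraction(X, is_speech=np.zeros(20, dtype=bool))
```

On this all-noise input the subtraction result is floored at α·X = 0.6 in every bin, which is exactly the negative-spectrum protection that α provides.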
RASTA filtering is applied to the temporal trajectories of the Mel filter-bank log-energies f_ln[l, j] with the following transfer function:

H_rasta(z) = (1 − z⁻¹) / (1 − 0.98 z⁻¹)   (3)

Many experiments have shown that the RASTA filter effectively mitigates the convolutional distortion caused by the transmission channel and microphone. From (1) and (3) it can be seen that the spectrum subtraction module and RASTA filtering add little computational load to the ASR system.
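Assuming the first-order high-pass reading of the filter, H(z) = (1 − z⁻¹)/(1 − 0.98 z⁻¹), the RASTA filtering of the log-energy trajectories can be sketched as below (the function name is ours):

```python
import numpy as np

def rasta_filter(f_ln):
    """Filter each log Mel filter-bank energy trajectory f_ln[l, j] along
    the frame axis with H(z) = (1 - z^-1) / (1 - 0.98 z^-1), i.e.
        y[l] = x[l] - x[l-1] + 0.98 * y[l-1].
    A constant (convolutional-channel) offset in the log domain decays
    geometrically, so it is effectively removed."""
    y = np.zeros_like(f_ln, dtype=float)
    prev_x = np.zeros(f_ln.shape[1])
    prev_y = np.zeros(f_ln.shape[1])
    for l in range(f_ln.shape[0]):
        prev_y = f_ln[l] - prev_x + 0.98 * prev_y
        prev_x = f_ln[l]
        y[l] = prev_y
    return y

# A constant offset of 1.0 over 200 frames shrinks to 0.98^199 (about 0.018)
out = rasta_filter(np.ones((200, 23)))
```

This illustrates why RASTA handles convolutional distortion: a fixed channel adds a constant to every log-energy trajectory, and the filter's zero at z = 1 rejects that constant component.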
2.2 Voice Activity Detection

Energy is the most effective and most widely used speech characteristic for speech/noise classification. Energy-based VAD algorithms can achieve good performance when the SNR level is tolerable, but many experiments have shown that they fail at low SNR levels. Owing to the characteristics of speech, the entropy of a speech signal differs from that of a noise signal, so spectral entropy-based algorithms are more effective, especially under white noise. However, the full-band spectral entropy computed by the traditional method exhibits pulses during nonspeech periods when the background noise is nonstationary. The proposed VAD algorithm distinguishes speech from nonspeech by applying an OSF to the subband spectral entropy. The OSF is a nonlinear filter widely used in signal processing; its definition can be found in [4]. First, we divide the magnitude spectrum Y[l, m] into K subbands. The probability of each frequency bin of the l-th frame in the k-th subband is:

P_k[l, i] = (Y[l, i] + M) / Σ_{m=m_k}^{m_{k+1}−1} (Y[l, m] + M),  m_k = (N_FFT / 2K)·k,  0 ≤ k ≤ K−1,  m_k ≤ i ≤ m_{k+1}−1   (4)

where M is a positive constant used to flatten the noise entropy curve [6]. The spectral entropies of the l-th frame in the K subbands are:

E_s[l, k] = − Σ_{i=m_k}^{m_{k+1}−1} P_k[l, i] log P_k[l, i],  0 ≤ k ≤ K−1   (5)

The proposed VAD algorithm employs an OSF to smooth the subband spectral entropy. The OSF operates on the 2N+1 subband spectral entropies {E_s[l−N, k], ..., E_s[l, k], ..., E_s[l+N, k]} around the frame being analyzed [4]. Again, the first N frames of each utterance are assumed to be nonspeech and are used to estimate the noise reference. Let E_{s(h)}[l, k] denote the h-th value of this set sorted in ascending order. The smoothed subband spectral entropy E_h[l, k] is given by:

E_h[l, k] = (1 − λ)·E_{s(h)}[l, k] + λ·E_{s(h+1)}[l, k],  0 ≤ k ≤ K−1   (6)

where h = ⌊λL⌋, L = 2N + 1, and 0 < λ < 1.
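A minimal NumPy sketch of Eqs. (4)-(6) follows. The function names, and the edge padding of the entropy track at utterance boundaries, are our own assumptions:

```python
import numpy as np

def subband_entropy(Y, K=4, M=1000.0):
    """Subband spectral entropies per frame, Eqs. (4)-(5).
    Y: (n_frames, n_bins) magnitude spectra; M flattens the noise entropy."""
    n_bins = Y.shape[1]
    edges = [(n_bins * k) // K for k in range(K + 1)]
    E = np.empty((Y.shape[0], K))
    for k in range(K):
        band = Y[:, edges[k]:edges[k + 1]] + M
        P = band / band.sum(axis=1, keepdims=True)      # Eq. (4)
        E[:, k] = -(P * np.log(P)).sum(axis=1)          # Eq. (5)
    return E

def osf_smooth(E, N=10, lam=0.9):
    """Order-statistics filtering of each subband entropy track, Eq. (6):
    blend the h-th and (h+1)-th ascending order statistics of the 2N+1
    entropies centred on each frame, with h = floor(lam * (2N + 1))."""
    L = 2 * N + 1
    h = int(lam * L)                                    # 18 for N=10, lam=0.9
    pad = np.pad(E, ((N, N), (0, 0)), mode="edge")      # boundary handling: assumption
    Eh = np.empty_like(E)
    for l in range(E.shape[0]):
        win = np.sort(pad[l:l + L], axis=0)             # ascending order statistics
        Eh[l] = (1 - lam) * win[h - 1] + lam * win[h]   # Eq. (6), h is 1-indexed
    return Eh

# A flat spectrum is uniform within every band, so each subband entropy
# equals log(band width); OSF smoothing leaves a constant track unchanged.
E = subband_entropy(np.ones((40, 128)))
Eh = osf_smooth(E)
```

The high choice of h (near the maximum of the window) makes the smoothed track follow the upper envelope of the entropy, which suppresses the isolated nonspeech pulses mentioned above.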
The entropy of the l-th frame is then measured by:

H_l = (1/K) Σ_{k=0}^{K−1} E_h[l, k]   (7)

The proposed VAD decision is threshold-based. If H_l is greater than the preset threshold, the frame is classified as speech (VADflag = speech),
otherwise it is classified as nonspeech (VADflag = nonspeech). The threshold T is defined as:

Avg = (1/K) Σ_{k=0}^{K−1} E_m[k],  T = β·Avg + θ   (8)

where β = 1.01 and θ = 0.1 proved to be suitable values, and E_m[k] is the median of the sequence {E_s[0, k], ..., E_s[N−1, k]}. Fig. 2 illustrates subband OSF filtering on a Mandarin utterance: (a) is the original speech waveform with an SNR of about 10 dB, (b) shows the full-band entropy, and (c) is the subband average entropy after OSF filtering. Comparing (b) and (c) shows that after OSF filtering the average subband entropy describes the speech/nonspeech divergence more precisely than the full-band entropy.

Fig. 2. Subband OSF processing of a Mandarin utterance: (a) original speech waveform; (b) full-band entropy; (c) subband average entropy after OSF filtering.

2.3 Post-Processing

Post-processing is the last stage of the proposed ASR front-end. The discrete cosine transform (DCT) is applied to the RASTA-filtered filter-bank log-energies
f_rasta[l, j], yielding 13 Mel cepstral coefficients c_i (0 ≤ i ≤ 12). A 39-dimensional feature vector is produced after dynamic-feature calculation. Finally, the 13 basic MFCC parameters are normalized by CMVN as in [9]. As Fig. 1 shows, only the speech frames are considered in post-processing; the nonspeech frames are dropped, which reduces the computational complexity of the ASR back-end. Because each feature component contributes differently to overall recognition accuracy, filtering out some of the less important components reduces MIPS without much impact on accuracy. It also reduces memory consumption, since the HMM models become smaller. Both properties benefit real-time implementation on embedded systems.

3 Experimental Results

We compared the proposed front-end with the ETSI standard MFCC front-end and the AFE in a recognition system with the same back-end decoder. For the recognition accuracy comparison, three speech databases are used: Aurora2, Aurora3 and a Mandarin digits database. We also evaluated the computational complexity of the three front-ends by extracting features from 1 second of speech on an Xscale processor. All three front-ends are implemented in fixed point [10]. For the Aurora2 database, two training modes are defined: clean-condition training uses only clean speech, while multicondition training uses noisy speech at different SNR levels (from 20 to −5 dB). Three test sets are defined (Set A, Set B and Set C), each with different noise conditions. Aurora3 is a set of multi-language Speechdat-Car databases recorded in cars under different driving conditions with close-talking and hands-free microphones. Three recognition experiments are defined with different training and testing configurations: well-matched, medium-mismatched and highly-mismatched (denoted WM, MM and HM respectively in the result tables).
Three languages (German, Spanish and Danish) are used in our speech recognition experiments. Details of the experimental framework for the Aurora databases can be found in [11]. Experiments were also carried out on a Mandarin corpus containing 3031 utterances spoken by 39 female and 40 male speakers, collected over the telephone network in Taiwan. We randomly selected 2122 utterances for training and 909 for testing. All databases are sampled at 8 kHz and quantized to 16 bits. In the experiments, we use the HTK speech recognition toolkit [12] to train the HMM models with the following acoustic model configuration: each model has 16 emitting states with 3 Gaussian components per state. A silence (sil) model and a short-pause (sp) model are also defined; the sil model has 3 states, the sp model has a single state, and both have 6 Gaussian components per state. The model configuration is fixed across all tasks. In the VAD algorithm, the parameter N is 10, i.e. the first 10 frames of each utterance are used to estimate the noise reference. We divide the magnitude
spectrum into K = 4 subbands; λ = 0.9 and M = 1000 were selected experimentally for the VAD algorithm. The experimental results are as follows. The average word accuracy on Aurora2 and Aurora3 is presented in Table 1 and Table 2 respectively, and detailed evaluation results on the Mandarin corpus are given in Table 3. The relative improvement compares the proposed front-end with the MFCC front-end. The definitions of sentence correctness, word correctness and word accuracy can be found in [12]. Table 4 lists the running cycles of the three front-ends on an Intel Xscale processor with the same processor configuration.

Table 1. Evaluation results on the Aurora2 database (word accuracy, %): columns AFE, MFCC and proposed, under multicondition and clean-condition training; rows per SNR level (clean down to 0 dB) and the 20-0 dB average.

Table 2. Evaluation results on the Aurora3 database (word accuracy, %): columns AFE, MFCC and proposed for German, Spanish and Danish; rows WM, MM, HM and Overall.

4 Conclusion

This paper has proposed a low-cost noise-robust front-end for embedded ASR applications, including an OSF entropy-based VAD that shows strong speech/nonspeech discrimination. Experimental results show that the proposed front-end yields 30.08% and 62.55% relative improvements
on the Aurora2 and Aurora3 databases respectively, and 29.47% on a Mandarin corpus, compared with the results of the ETSI standard MFCC front-end. Although the recognition accuracy of the proposed front-end is slightly lower than that of the AFE, its low computational cost is a great advantage for embedded applications. Speech databases in five languages (English, German, Spanish, Danish and Mandarin) were used for the evaluations, and remarkable recognition accuracy improvements over the ETSI standard front-end were obtained.

Table 3. Evaluation results on the Mandarin database: sentence correctness, word correctness and word accuracy for AFE, MFCC and the proposed front-end, with the relative improvement of the proposed front-end over MFCC.

Table 4. Computational complexity on the Xscale processor: running cycles (million) for AFE, MFCC and the proposed front-end.

References

1. Kotnik, B., Vlaj, D., Zdravko, Horvat, B.: Robust MFCC Feature Extraction Algorithm Using Effective Additive and Convolutional Noise Reduction Procedures. Proc. ICSLP, Denver, Colorado (2002)
2. Juneja, A., Deshmukh, O., Espy-Wilson, C.: A Multi-band Spectral Subtraction Method for Enhancing Speech Corrupted by Colored Noise. Proc. ICASSP (2002) IV-4164, vol. 4
3. Stahl, V., Fischer, A., Bippus, R.: Quantile Based Noise Estimation for Spectral Subtraction and Wiener Filtering. Proc. ICASSP (2002)
4. Ramirez, J., Segura, J.C., Benitez, C., de la Torre, A., Rubio, A.: An Effective Subband OSF-based VAD with Noise Reduction for Robust Speech Recognition. IEEE Transactions on Speech and Audio Processing, Nov. 2005
5. Shen, J.-l., Hung, J.-w., Lee, L.-s.: Robust Entropy-based Endpoint Detection for Speech Recognition in Noisy Environments. Proc. ICSLP, Sydney, Australia (1998)
6. Jia, C., Xu, B.: An Improved Entropy-based Endpoint Detection Algorithm.
Proc. ICASSP, Taipei (2002)
7. ETSI: ETSI ES 202 050, Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech recognition; Advanced front-end feature extraction algorithm; Compression algorithms. Nov.
8. ETSI: ETSI ES 201 108, Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithms. 2000
9. Segura, J.C., Benitez, M.C., de la Torre, A., Rubio, A.J., Ramirez, J.: Cepstral Domain Segmental Nonlinear Feature Transformations for Robust Speech Recognition. IEEE Signal Processing Letters (2004)
10. Delaney, B.W., Han, M., Simunic, T., Acquaviva, A.: A Low-power, Fixed-point Front-end Feature Extraction for a Distributed Speech Recognition System. Proc. ICASSP (2002)
11. Pearce, D., Hirsch, H.-G.: The Aurora Experimental Framework for the Performance Evaluation of Speech Recognition Systems under Noisy Conditions. Proc. ICSLP, Beijing, China, Oct. 2000
12. Young, S.: HTK Book, Version 2.1. Entropic Cambridge Research Laboratory
More informationThe Noisy Channel Model. Statistical NLP Spring Mel Freq. Cepstral Coefficients. Frame Extraction ... Lecture 10: Acoustic Models
Statistical NLP Spring 2009 The Noisy Channel Model Lecture 10: Acoustic Models Dan Klein UC Berkeley Search through space of all possible sentences. Pick the one that is most probable given the waveform.
More informationStatistical NLP Spring The Noisy Channel Model
Statistical NLP Spring 2009 Lecture 10: Acoustic Models Dan Klein UC Berkeley The Noisy Channel Model Search through space of all possible sentences. Pick the one that is most probable given the waveform.
More informationLecture 9: Speech Recognition. Recognizing Speech
EE E68: Speech & Audio Processing & Recognition Lecture 9: Speech Recognition 3 4 Recognizing Speech Feature Calculation Sequence Recognition Hidden Markov Models Dan Ellis http://www.ee.columbia.edu/~dpwe/e68/
More informationSpectral and Textural Feature-Based System for Automatic Detection of Fricatives and Affricates
Spectral and Textural Feature-Based System for Automatic Detection of Fricatives and Affricates Dima Ruinskiy Niv Dadush Yizhar Lavner Department of Computer Science, Tel-Hai College, Israel Outline Phoneme
More informationMachine Recognition of Sounds in Mixtures
Machine Recognition of Sounds in Mixtures Outline 1 2 3 4 Computational Auditory Scene Analysis Speech Recognition as Source Formation Sound Fragment Decoding Results & Conclusions Dan Ellis
More informationHarmonic Structure Transform for Speaker Recognition
Harmonic Structure Transform for Speaker Recognition Kornel Laskowski & Qin Jin Carnegie Mellon University, Pittsburgh PA, USA KTH Speech Music & Hearing, Stockholm, Sweden 29 August, 2011 Laskowski &
More informationThe Noisy Channel Model. CS 294-5: Statistical Natural Language Processing. Speech Recognition Architecture. Digitizing Speech
CS 294-5: Statistical Natural Language Processing The Noisy Channel Model Speech Recognition II Lecture 21: 11/29/05 Search through space of all possible sentences. Pick the one that is most probable given
More informationLecture 7: Feature Extraction
Lecture 7: Feature Extraction Kai Yu SpeechLab Department of Computer Science & Engineering Shanghai Jiao Tong University Autumn 2014 Kai Yu Lecture 7: Feature Extraction SJTU Speech Lab 1 / 28 Table of
More informationLecture 9: Speech Recognition
EE E682: Speech & Audio Processing & Recognition Lecture 9: Speech Recognition 1 2 3 4 Recognizing Speech Feature Calculation Sequence Recognition Hidden Markov Models Dan Ellis
More informationBIAS CORRECTION METHODS FOR ADAPTIVE RECURSIVE SMOOTHING WITH APPLICATIONS IN NOISE PSD ESTIMATION. Robert Rehr, Timo Gerkmann
BIAS CORRECTION METHODS FOR ADAPTIVE RECURSIVE SMOOTHING WITH APPLICATIONS IN NOISE PSD ESTIMATION Robert Rehr, Timo Gerkmann Speech Signal Processing Group, Department of Medical Physics and Acoustics
More informationGlobal SNR Estimation of Speech Signals using Entropy and Uncertainty Estimates from Dropout Networks
Interspeech 2018 2-6 September 2018, Hyderabad Global SNR Estimation of Speech Signals using Entropy and Uncertainty Estimates from Dropout Networks Rohith Aralikatti, Dilip Kumar Margam, Tanay Sharma,
More informationMinimum Mean-Square Error Estimation of Mel-Frequency Cepstral Features A Theoretically Consistent Approach
Minimum Mean-Square Error Estimation of Mel-Frequency Cepstral Features A Theoretically Consistent Approach Jesper Jensen Abstract In this work we consider the problem of feature enhancement for noise-robust
More informationA Generalized Subspace Approach for Enhancing Speech Corrupted by Colored Noise
334 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL 11, NO 4, JULY 2003 A Generalized Subspace Approach for Enhancing Speech Corrupted by Colored Noise Yi Hu, Student Member, IEEE, and Philipos C
More informationA TWO-LAYER NON-NEGATIVE MATRIX FACTORIZATION MODEL FOR VOCABULARY DISCOVERY. MengSun,HugoVanhamme
A TWO-LAYER NON-NEGATIVE MATRIX FACTORIZATION MODEL FOR VOCABULARY DISCOVERY MengSun,HugoVanhamme Department of Electrical Engineering-ESAT, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, Bus
More informationA Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement
A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement Simon Leglaive 1 Laurent Girin 1,2 Radu Horaud 1 1: Inria Grenoble Rhône-Alpes 2: Univ. Grenoble Alpes, Grenoble INP,
More informationCochlear modeling and its role in human speech recognition
Allen/IPAM February 1, 2005 p. 1/3 Cochlear modeling and its role in human speech recognition Miller Nicely confusions and the articulation index Jont Allen Univ. of IL, Beckman Inst., Urbana IL Allen/IPAM
More informationStress detection through emotional speech analysis
Stress detection through emotional speech analysis INMA MOHINO inmaculada.mohino@uah.edu.es ROBERTO GIL-PITA roberto.gil@uah.es LORENA ÁLVAREZ PÉREZ loreduna88@hotmail Abstract: Stress is a reaction or
More informationNOISE ROBUST RELATIVE TRANSFER FUNCTION ESTIMATION. M. Schwab, P. Noll, and T. Sikora. Technical University Berlin, Germany Communication System Group
NOISE ROBUST RELATIVE TRANSFER FUNCTION ESTIMATION M. Schwab, P. Noll, and T. Sikora Technical University Berlin, Germany Communication System Group Einsteinufer 17, 1557 Berlin (Germany) {schwab noll
More informationRobust Speaker Identification System Based on Wavelet Transform and Gaussian Mixture Model
JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 19, 267-282 (2003) Robust Speaer Identification System Based on Wavelet Transform and Gaussian Mixture Model Department of Electrical Engineering Tamang University
More informationOn the Influence of the Delta Coefficients in a HMM-based Speech Recognition System
On the Influence of the Delta Coefficients in a HMM-based Speech Recognition System Fabrice Lefèvre, Claude Montacié and Marie-José Caraty Laboratoire d'informatique de Paris VI 4, place Jussieu 755 PARIS
More informationGMM Vector Quantization on the Modeling of DHMM for Arabic Isolated Word Recognition System
GMM Vector Quantization on the Modeling of DHMM for Arabic Isolated Word Recognition System Snani Cherifa 1, Ramdani Messaoud 1, Zermi Narima 1, Bourouba Houcine 2 1 Laboratoire d Automatique et Signaux
More informationDetection of Overlapping Acoustic Events Based on NMF with Shared Basis Vectors
Detection of Overlapping Acoustic Events Based on NMF with Shared Basis Vectors Kazumasa Yamamoto Department of Computer Science Chubu University Kasugai, Aichi, Japan Email: yamamoto@cs.chubu.ac.jp Chikara
More informationModel-based unsupervised segmentation of birdcalls from field recordings
Model-based unsupervised segmentation of birdcalls from field recordings Anshul Thakur School of Computing and Electrical Engineering Indian Institute of Technology Mandi Himachal Pradesh, India Email:
More informationAutomatic Phoneme Recognition. Segmental Hidden Markov Models
Automatic Phoneme Recognition with Segmental Hidden Markov Models Areg G. Baghdasaryan Thesis submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment
More informationSignal Modeling Techniques In Speech Recognition
Picone: Signal Modeling... 1 Signal Modeling Techniques In Speech Recognition by, Joseph Picone Texas Instruments Systems and Information Sciences Laboratory Tsukuba Research and Development Center Tsukuba,
More informationSINGLE CHANNEL SPEECH MUSIC SEPARATION USING NONNEGATIVE MATRIX FACTORIZATION AND SPECTRAL MASKS. Emad M. Grais and Hakan Erdogan
SINGLE CHANNEL SPEECH MUSIC SEPARATION USING NONNEGATIVE MATRIX FACTORIZATION AND SPECTRAL MASKS Emad M. Grais and Hakan Erdogan Faculty of Engineering and Natural Sciences, Sabanci University, Orhanli
More informationThe effect of speaking rate and vowel context on the perception of consonants. in babble noise
The effect of speaking rate and vowel context on the perception of consonants in babble noise Anirudh Raju Department of Electrical Engineering, University of California, Los Angeles, California, USA anirudh90@ucla.edu
More informationSymmetric Distortion Measure for Speaker Recognition
ISCA Archive http://www.isca-speech.org/archive SPECOM 2004: 9 th Conference Speech and Computer St. Petersburg, Russia September 20-22, 2004 Symmetric Distortion Measure for Speaker Recognition Evgeny
More informationEigenvoice Speaker Adaptation via Composite Kernel PCA
Eigenvoice Speaker Adaptation via Composite Kernel PCA James T. Kwok, Brian Mak and Simon Ho Department of Computer Science Hong Kong University of Science and Technology Clear Water Bay, Hong Kong [jamesk,mak,csho]@cs.ust.hk
More informationFEATURE SELECTION USING FISHER S RATIO TECHNIQUE FOR AUTOMATIC SPEECH RECOGNITION
FEATURE SELECTION USING FISHER S RATIO TECHNIQUE FOR AUTOMATIC SPEECH RECOGNITION Sarika Hegde 1, K. K. Achary 2 and Surendra Shetty 3 1 Department of Computer Applications, NMAM.I.T., Nitte, Karkala Taluk,
More informationFACTORIAL HMMS FOR ACOUSTIC MODELING. Beth Logan and Pedro Moreno
ACTORIAL HMMS OR ACOUSTIC MODELING Beth Logan and Pedro Moreno Cambridge Research Laboratories Digital Equipment Corporation One Kendall Square, Building 700, 2nd loor Cambridge, Massachusetts 02139 United
More informationMel-Generalized Cepstral Representation of Speech A Unified Approach to Speech Spectral Estimation. Keiichi Tokuda
Mel-Generalized Cepstral Representation of Speech A Unified Approach to Speech Spectral Estimation Keiichi Tokuda Nagoya Institute of Technology Carnegie Mellon University Tamkang University March 13,
More informationIntraframe Prediction with Intraframe Update Step for Motion-Compensated Lifted Wavelet Video Coding
Intraframe Prediction with Intraframe Update Step for Motion-Compensated Lifted Wavelet Video Coding Aditya Mavlankar, Chuo-Ling Chang, and Bernd Girod Information Systems Laboratory, Department of Electrical
More informationNearly Perfect Detection of Continuous F 0 Contour and Frame Classification for TTS Synthesis. Thomas Ewender
Nearly Perfect Detection of Continuous F 0 Contour and Frame Classification for TTS Synthesis Thomas Ewender Outline Motivation Detection algorithm of continuous F 0 contour Frame classification algorithm
More informationR E S E A R C H R E P O R T Entropy-based multi-stream combination Hemant Misra a Hervé Bourlard a b Vivek Tyagi a IDIAP RR 02-24 IDIAP Dalle Molle Institute for Perceptual Artificial Intelligence ffl
More informationEnvironmental Sound Classification in Realistic Situations
Environmental Sound Classification in Realistic Situations K. Haddad, W. Song Brüel & Kjær Sound and Vibration Measurement A/S, Skodsborgvej 307, 2850 Nærum, Denmark. X. Valero La Salle, Universistat Ramon
More informationISCA Archive
ISCA Archive http://www.isca-speech.org/archive ODYSSEY04 - The Speaker and Language Recognition Workshop Toledo, Spain May 3 - June 3, 2004 Analysis of Multitarget Detection for Speaker and Language Recognition*
More informationVOICE ACTIVITY DETECTION IN PRESENCE OF TRANSIENT NOISE USING SPECTRAL CLUSTERING AND DIFFUSION KERNELS
2014 IEEE 28-th Convention of Electrical and Electronics Engineers in Israel VOICE ACTIVITY DETECTION IN PRESENCE OF TRANSIENT NOISE USING SPECTRAL CLUSTERING AND DIFFUSION KERNELS Oren Rosen, Saman Mousazadeh
More informationWhy DNN Works for Acoustic Modeling in Speech Recognition?
Why DNN Works for Acoustic Modeling in Speech Recognition? Prof. Hui Jiang Department of Computer Science and Engineering York University, Toronto, Ont. M3J 1P3, CANADA Joint work with Y. Bao, J. Pan,
More informationTime and frequency ltering of lter-bank energies for robust HMM speech recognition
Speech Communication 34 (2001) 93±114 www.elsevier.nl/locate/specom Time and frequency ltering of lter-bank energies for robust HMM speech recognition Climent Nadeu *,Dusan Macho, Javier Hernando TALP
More informationTowards Multi-Modal Driver s Stress Detection
Towards Multi-Modal Driver s Stress Detection Hynek Bořil, Pinar Boyraz, John H.L. Hansen Center for Robust Speech Systems, Erik Jonsson School of Engineering & Computer Science, University of Texas at
More informationAN INVERTIBLE DISCRETE AUDITORY TRANSFORM
COMM. MATH. SCI. Vol. 3, No. 1, pp. 47 56 c 25 International Press AN INVERTIBLE DISCRETE AUDITORY TRANSFORM JACK XIN AND YINGYONG QI Abstract. A discrete auditory transform (DAT) from sound signal to
More informationMonaural speech separation using source-adapted models
Monaural speech separation using source-adapted models Ron Weiss, Dan Ellis {ronw,dpwe}@ee.columbia.edu LabROSA Department of Electrical Enginering Columbia University 007 IEEE Workshop on Applications
More informationTime-Varying Autoregressions for Speaker Verification in Reverberant Conditions
INTERSPEECH 017 August 0 4, 017, Stockholm, Sweden Time-Varying Autoregressions for Speaker Verification in Reverberant Conditions Ville Vestman 1, Dhananjaya Gowda, Md Sahidullah 1, Paavo Alku 3, Tomi
More information