A Low-Cost Robust Front-end for Embedded ASR System
Lihui Guo 1, Xin He 2, Yue Lu 1, and Yaxin Zhang 2
1 Department of Computer Science and Technology, East China Normal University, Shanghai
2 Motorola China Research Center, Shanghai

Abstract. In this paper we propose a low-cost robust MFCC feature extraction algorithm that combines noise reduction and voice activity detection (VAD) for embedded automatic speech recognition (ASR) applications. To remedy the effect of additive noise, a magnitude spectrum subtraction method is used. A VAD distinguishes speech from nonspeech frames by applying an order statistics filter (OSF) to the subband spectral entropy. General RASTA filtering is applied to the log Mel filter-bank energy trajectories. Finally, after feature selection, a 26-dimensional feature vector is used in the ASR system. Experimental results show that the proposed front-end obtains 30.08% and 62.55% relative improvements on the Aurora2 and Aurora3 databases, and 29.47% on a Mandarin database, compared with the baseline ETSI standard MFCC front-end.

1 Introduction

Front-end feature extraction plays an important role in ASR systems. The ETSI standard Mel-frequency cepstral coefficient (MFCC) front-end is widely used in many ASR systems because it accurately represents the human auditory system and speech perception [1]. However, the characteristics of the speech signal are often distorted by background noise and the transmission channel, especially in mobile ASR applications, and the performance of an ASR system often degrades dramatically at low SNR levels. Noise reduction is a critical and difficult problem in ASR, and researchers have proposed many noise robustness methods over the past decades. When the environmental noise is additive, spectrum subtraction is an effective and computationally cheap speech enhancement technique.
In [2], a multi-band spectrum subtraction algorithm is implemented. Stahl [3] introduced quantile-based noise estimation for spectrum subtraction. Although many of these algorithms are effective, their high complexity makes them unsuitable for embedded ASR systems. In this paper we present a simple and effective spectrum subtraction algorithm for noise reduction. The nonspeech frames (noise-only or silence frames) in a speech signal contain only redundant and disturbing information for the ASR system. Although there are always silence and
short pause models in HMM acoustic model configurations, in practice it is still very helpful to distinguish speech from nonspeech and drop the nonspeech frames during decoding. For embedded systems this is crucial: computational complexity is minimized if only the speech frames are decoded, and the transmission bit rate in a distributed speech recognition (DSR) system is reduced if only speech frames are transferred. Moreover, insertion errors often occur when too many silence frames are passed to the decoder, so a VAD is necessary in a noise-robust front-end. In [4] a subband energy-based VAD algorithm is presented. Shen [5] introduced an entropy-based algorithm for endpoint detection under noisy conditions, and Xu [6] presented an improved entropy-based algorithm. In this paper, we introduce a VAD algorithm that applies an OSF to the subband spectral entropy. Although ETSI has published an advanced front-end (AFE) that substantially improves recognition performance under noisy conditions [7], it is not suitable for real-time embedded implementation because of its large computational complexity. Experimental results show that our proposed front-end uses only about one quarter of the computational MIPS of the AFE while achieving similar recognition accuracy.

This paper is organized as follows. In Section 2, we describe the proposed front-end in detail. Experimental results on the Aurora2, Aurora3 and Mandarin digits databases are presented in Section 3. Section 4 summarizes our conclusions.

2 Front-end Algorithm Description

The proposed front-end is a modified version of the ETSI standard MFCC front-end. Details of the basic processing blocks can be found in [8]. Fig. 1 shows the proposed front-end algorithm. In addition to the basic processing blocks, three enhancement stages (shaded blocks in Fig. 1) are added.
These stages are noise reduction with spectrum subtraction and RASTA filtering (Section 2.1), a subband OSF entropy-based VAD (Section 2.2), and a post-processing stage with dynamic-feature calculation, cepstral mean and variance normalization (CMVN) and feature selection (Section 2.3).

2.1 Noise Reduction

The input signal x(n) is divided into overlapped frames of length 25 ms (200 samples at the 8 kHz sampling rate) with a frame shift of 10 ms (80 samples). The magnitude spectrum of the signal is obtained by the Fast Fourier Transform (FFT). A spectrum subtraction is then applied to the magnitude spectrum X[l, m] by subtracting the noise estimate from the noisy spectrum. In the proposed front-end, spectrum subtraction is given by:

Y[l, m] = max(X[l, m] − N[m], α·X[l, m]),  0 ≤ m ≤ N_FFT/2   (1)
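The framing and magnitude spectrum computation above can be sketched as follows. This is a minimal NumPy sketch under our own naming; the pre-processing of the ETSI front-end (e.g. pre-emphasis and windowing) is omitted for brevity:

```python
import numpy as np

def frame_magnitude_spectrum(x, frame_len=200, frame_shift=80, nfft=256):
    """Split x into overlapped frames (25 ms length, 10 ms shift at 8 kHz)
    and return the per-frame magnitude spectrum |X[l, m]|, 0 <= m <= nfft/2."""
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    frames = np.stack([x[l * frame_shift:l * frame_shift + frame_len]
                       for l in range(n_frames)])
    # rfft keeps only the nfft/2 + 1 non-redundant bins of the 256-point FFT
    return np.abs(np.fft.rfft(frames, n=nfft, axis=1))

# 1 s of audio at 8 kHz -> 98 frames x 129 frequency bins
X = frame_magnitude_spectrum(np.random.randn(8000))
```

Each row of the result is the magnitude spectrum X[l, m] on which the spectrum subtraction of Eq. (1) operates.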
where N_FFT is the FFT length, Y[l, m] is the speech magnitude spectrum after spectrum subtraction, and X[l, m] is the magnitude spectrum of the noisy speech signal. N[m] is the average magnitude spectrum of the noise. For each utterance, the first 10 frames are assumed to be noise. This assumption is valid in practical applications, since before speaking a speaker always takes a short response time after hearing the beep tone. These 10 reference frames are used to calculate the average noise spectrum N[m]. In order to track nonstationary noise, N[m] is updated during nonspeech periods by:

N[m] = γ·N[m] + (1 − γ)·X[t, m]   (2)

where the t-th frame is classified as nonspeech by the VAD. The value γ = 0.97 performs well on the experimental databases. α ∈ (0, 1) is an attenuation constant that prevents Y[l, m] from becoming negative due to noise estimation error; α is fixed at 0.3 in our speech recognition experiments. Spectrum subtraction is effective for additive noise but not for convolutional noise. Convolutional noise becomes additive once it is subjected to a logarithm. In Fig. 1, f_ln[l, j] is produced after Mel filtering and the nonlinear transformation (natural logarithm).

[Fig. 1. The proposed robust front-end algorithm: preprocessing and FFT (length 256) produce X[l, m] (frequency index 0 ≤ m ≤ 128); spectrum subtraction with noise-estimate update yields Y[l, m]; Mel filtering and nonlinear transformation give f_ln[l, j] (filter-bank index 1 ≤ j ≤ 23), which feeds the entropy-based VAD, RASTA filtering and DCT (cepstral coefficients 0 ≤ i ≤ 12); after dynamic calculation, frame dropping on nonspeech frames, CMVN processing and feature selection, a 26-dimensional feature vector remains.]
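Equations (1) and (2) can be sketched together in NumPy as below. This is illustrative only: the function name is ours, and the per-frame VAD decision is assumed to be supplied as a boolean array:

```python
import numpy as np

ALPHA = 0.3   # attenuation constant alpha in Eq. (1)
GAMMA = 0.97  # noise-update smoothing factor gamma in Eq. (2)

def spectral_subtraction(X, is_speech, n_init=10):
    """Apply Eq. (1) frame by frame; the noise estimate N[m] starts as the
    mean of the first n_init frames and is updated on nonspeech frames
    per Eq. (2). X: (n_frames, n_bins) magnitudes; is_speech: per-frame
    boolean VAD decisions."""
    N = X[:n_init].mean(axis=0)
    Y = np.empty_like(X)
    for l in range(len(X)):
        Y[l] = np.maximum(X[l] - N, ALPHA * X[l])   # Eq. (1)
        if not is_speech[l]:
            N = GAMMA * N + (1.0 - GAMMA) * X[l]    # Eq. (2)
    return Y

# Constant-magnitude "noise": X - N is 0 everywhere, so the alpha floor wins
X = np.full((20, 129), 2.0)
Y = spectral_subtraction(X, is_speech=np.zeros(20, dtype=bool))
```

On this all-noise input the subtraction result is floored at α·X = 0.6 in every bin, which is exactly the negative-spectrum protection that α provides.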
RASTA filtering is applied to the temporal trajectories of the Mel filter-bank log-energies f_ln[l, j] with the following transfer function:

H_rasta(z) = (1 − z⁻¹) / (1 − 0.98 z⁻¹)   (3)

Many experiments have shown that the RASTA filter effectively mitigates the convolutional distortion caused by the transmission channel and microphone. From (1) and (3) it can be seen that the spectrum subtraction module and RASTA filtering add little computational load to the ASR system.
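Assuming the first-order high-pass reading of the filter, H(z) = (1 − z⁻¹)/(1 − 0.98 z⁻¹), the RASTA filtering of the log-energy trajectories can be sketched as below (the function name is ours):

```python
import numpy as np

def rasta_filter(f_ln):
    """Filter each log Mel filter-bank energy trajectory f_ln[l, j] along
    the frame axis with H(z) = (1 - z^-1) / (1 - 0.98 z^-1), i.e.
        y[l] = x[l] - x[l-1] + 0.98 * y[l-1].
    A constant (convolutional-channel) offset in the log domain decays
    geometrically, so it is effectively removed."""
    y = np.zeros_like(f_ln, dtype=float)
    prev_x = np.zeros(f_ln.shape[1])
    prev_y = np.zeros(f_ln.shape[1])
    for l in range(f_ln.shape[0]):
        prev_y = f_ln[l] - prev_x + 0.98 * prev_y
        prev_x = f_ln[l]
        y[l] = prev_y
    return y

# A constant offset of 1.0 over 200 frames shrinks to 0.98^199 (about 0.018)
out = rasta_filter(np.ones((200, 23)))
```

This illustrates why RASTA handles convolutional distortion: a fixed channel adds a constant to every log-energy trajectory, and the filter's zero at z = 1 rejects that constant component.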
2.2 Voice Activity Detection

Energy is the most effective and most widely used speech characteristic for speech/noise classification. Energy-based VAD algorithms can achieve good performance when the SNR level is tolerable, but many experiments have shown that they fail at low SNR levels. Owing to the characteristics of speech, the entropy of a speech signal differs from that of a noise signal, so spectral entropy-based algorithms are more effective, especially under white noise. However, the full-band spectral entropy computed by the traditional method exhibits pulses during nonspeech periods when the background noise is nonstationary. The proposed VAD algorithm distinguishes speech from nonspeech by applying an OSF to the subband spectral entropy. The OSF is a nonlinear filter widely used in signal processing; its definition can be found in [4]. First, we divide the magnitude spectrum Y[l, m] into K subbands. The probability of each frequency bin of the l-th frame in the k-th subband is:

P_k[l, i] = (Y[l, i] + M) / Σ_{m=m_k}^{m_{k+1}−1} (Y[l, m] + M),  m_k = (N_FFT / 2K)·k,  0 ≤ k ≤ K−1,  m_k ≤ i ≤ m_{k+1}−1   (4)

where M is a positive constant used to flatten the noise entropy curve [6]. The spectral entropies of the l-th frame in the K subbands are:

E_s[l, k] = − Σ_{i=m_k}^{m_{k+1}−1} P_k[l, i] log P_k[l, i],  0 ≤ k ≤ K−1   (5)

The proposed VAD algorithm employs an OSF to smooth the subband spectral entropy. The OSF operates on the 2N+1 subband spectral entropies {E_s[l−N, k], ..., E_s[l, k], ..., E_s[l+N, k]} around the frame being analyzed [4]. Again, the first N frames of each utterance are assumed to be nonspeech and are used to estimate the noise reference. Let E_{s(h)}[l, k] denote the h-th value of this set sorted in ascending order. The smoothed subband spectral entropy E_h[l, k] is given by:

E_h[l, k] = (1 − λ)·E_{s(h)}[l, k] + λ·E_{s(h+1)}[l, k],  0 ≤ k ≤ K−1   (6)

where h = ⌊λL⌋, L = 2N + 1, and 0 < λ < 1.
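A minimal NumPy sketch of Eqs. (4)-(6) follows. The function names, and the edge padding of the entropy track at utterance boundaries, are our own assumptions:

```python
import numpy as np

def subband_entropy(Y, K=4, M=1000.0):
    """Subband spectral entropies per frame, Eqs. (4)-(5).
    Y: (n_frames, n_bins) magnitude spectra; M flattens the noise entropy."""
    n_bins = Y.shape[1]
    edges = [(n_bins * k) // K for k in range(K + 1)]
    E = np.empty((Y.shape[0], K))
    for k in range(K):
        band = Y[:, edges[k]:edges[k + 1]] + M
        P = band / band.sum(axis=1, keepdims=True)      # Eq. (4)
        E[:, k] = -(P * np.log(P)).sum(axis=1)          # Eq. (5)
    return E

def osf_smooth(E, N=10, lam=0.9):
    """Order-statistics filtering of each subband entropy track, Eq. (6):
    blend the h-th and (h+1)-th ascending order statistics of the 2N+1
    entropies centred on each frame, with h = floor(lam * (2N + 1))."""
    L = 2 * N + 1
    h = int(lam * L)                                    # 18 for N=10, lam=0.9
    pad = np.pad(E, ((N, N), (0, 0)), mode="edge")      # boundary handling: assumption
    Eh = np.empty_like(E)
    for l in range(E.shape[0]):
        win = np.sort(pad[l:l + L], axis=0)             # ascending order statistics
        Eh[l] = (1 - lam) * win[h - 1] + lam * win[h]   # Eq. (6), h is 1-indexed
    return Eh

# A flat spectrum is uniform within every band, so each subband entropy
# equals log(band width); OSF smoothing leaves a constant track unchanged.
E = subband_entropy(np.ones((40, 128)))
Eh = osf_smooth(E)
```

The high choice of h (near the maximum of the window) makes the smoothed track follow the upper envelope of the entropy, which suppresses the isolated nonspeech pulses mentioned above.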
The entropy of the l-th frame is then measured by:

H_l = (1/K) Σ_{k=0}^{K−1} E_h[l, k]   (7)

The proposed VAD decision is threshold-based. If H_l is greater than the preset threshold, the frame is classified as speech (VADflag = speech),
otherwise it is classified as nonspeech (VADflag = nonspeech). The threshold T is defined as:

Avg = (1/K) Σ_{k=0}^{K−1} E_m[k],  T = β·Avg + θ   (8)

where β = 1.01 and θ = 0.1 proved to be suitable values, and E_m[k] is the median of the sequence {E_s[0, k], ..., E_s[N−1, k]}. Fig. 2 illustrates subband OSF filtering on a Mandarin utterance: (a) is the original speech waveform with an SNR of about 10 dB, (b) shows the full-band entropy, and (c) is the subband average entropy after OSF filtering. Comparing (b) and (c) shows that after OSF filtering the average subband entropy describes the speech/nonspeech divergence more precisely than the full-band entropy.

Fig. 2. Subband OSF processing of a Mandarin utterance: (a) original speech waveform; (b) full-band entropy; (c) subband average entropy after OSF filtering.

2.3 Post-Processing

Post-processing is the last stage of the proposed ASR front-end. The discrete cosine transform (DCT) is applied to the RASTA-filtered filter-bank log-energies
f_rasta[l, j], yielding 13 Mel cepstral coefficients c_i (0 ≤ i ≤ 12). A 39-dimensional feature vector is produced after dynamic-feature calculation. Finally, the 13 basic MFCC parameters are normalized by CMVN as in [9]. As Fig. 1 shows, only the speech frames are considered in post-processing; the nonspeech frames are dropped, which reduces the computational complexity of the ASR back-end. Because each feature component contributes differently to overall recognition accuracy, filtering out some of the less important components reduces MIPS without much impact on accuracy. It also reduces memory consumption, since the HMM models become smaller. Both properties benefit real-time implementation on embedded systems.

3 Experimental Results

We compared the proposed front-end with the ETSI standard MFCC front-end and the AFE in a recognition system with the same back-end decoder. For the recognition accuracy comparison, three speech databases are used: Aurora2, Aurora3 and a Mandarin digits database. We also evaluated the computational complexity of the three front-ends by extracting features from 1 second of speech on an Xscale processor. All three front-ends are implemented in fixed point [10]. For the Aurora2 database, two training modes are defined: clean-condition training uses only clean speech, while multicondition training uses noisy speech at different SNR levels (from 20 to −5 dB). Three test sets are defined (Set A, Set B and Set C), each with different noise conditions. Aurora3 is a set of multi-language Speechdat-Car databases recorded in cars under different driving conditions with close-talking and hands-free microphones. Three recognition experiments are defined with different training and testing configurations: well-matched, medium-mismatched and highly-mismatched (denoted WM, MM and HM respectively in the result tables).
Three languages (German, Spanish and Danish) are used in our speech recognition experiments. Details of the experimental framework for the Aurora databases can be found in [11]. Experiments were also carried out on a Mandarin corpus containing 3031 utterances spoken by 39 female and 40 male speakers, collected over the telephone network in Taiwan. We randomly selected 2122 utterances for training and 909 for testing. All databases are sampled at 8 kHz and quantized to 16 bits. In the experiments, we use the HTK speech recognition toolkit [12] to train the HMM models with the following acoustic model configuration: each model has 16 emitting states with 3 Gaussian components per state. A silence (sil) model and a short-pause (sp) model are also defined; the sil model has 3 states, the sp model has a single state, and both have 6 Gaussian components per state. The model configuration is fixed across all tasks. In the VAD algorithm, the parameter N is 10, i.e. the first 10 frames of each utterance are used to estimate the noise reference. We divide the magnitude
spectrum into K = 4 subbands; λ = 0.9 and M = 1000 were selected experimentally for the VAD algorithm. The experimental results are as follows. The average word accuracy on Aurora2 and Aurora3 is presented in Table 1 and Table 2 respectively, and detailed evaluation results on the Mandarin corpus are given in Table 3. The relative improvement compares the proposed front-end with the MFCC front-end. The definitions of sentence correctness, word correctness and word accuracy can be found in [12]. Table 4 lists the running cycles of the three front-ends on an Intel Xscale processor with the same processor configuration.

Table 1. Evaluation results on the Aurora2 database (word accuracy, %): columns AFE, MFCC and proposed, under multicondition and clean-condition training; rows per SNR level (clean down to 0 dB) and the 20-0 dB average.

Table 2. Evaluation results on the Aurora3 database (word accuracy, %): columns AFE, MFCC and proposed for German, Spanish and Danish; rows WM, MM, HM and Overall.

4 Conclusion

This paper has proposed a low-cost noise-robust front-end for embedded ASR applications, including an OSF entropy-based VAD that shows strong speech/nonspeech discrimination. Experimental results show that the proposed front-end yields 30.08% and 62.55% relative improvements
on the Aurora2 and Aurora3 databases respectively, and 29.47% on a Mandarin corpus, compared with the results of the ETSI standard MFCC front-end. Although the recognition accuracy of the proposed front-end is slightly lower than that of the AFE, its low computational cost is a great advantage for embedded applications. Speech databases in five languages (English, German, Spanish, Danish and Mandarin) were used for the evaluations, and remarkable recognition accuracy improvements over the ETSI standard front-end were obtained.

Table 3. Evaluation results on the Mandarin database: sentence correctness, word correctness and word accuracy for AFE, MFCC and the proposed front-end, with the relative improvement of the proposed front-end over MFCC.

Table 4. Computational complexity on the Xscale processor: running cycles (million) for AFE, MFCC and the proposed front-end.

References

1. Kotnik, B., Vlaj, D., Zdravko, Horvat, B.: Robust MFCC Feature Extraction Algorithm Using Effective Additive and Convolutional Noise Reduction Procedures. Proc. ICSLP, Denver, Colorado (2002)
2. Juneja, A., Deshmukh, O., Espy-Wilson, C.: A Multi-band Spectral Subtraction Method for Enhancing Speech Corrupted by Colored Noise. Proc. ICASSP (2002) IV-4164, vol. 4
3. Stahl, V., Fischer, A., Bippus, R.: Quantile Based Noise Estimation for Spectral Subtraction and Wiener Filtering. Proc. ICASSP (2002)
4. Ramirez, J., Segura, J.C., Benitez, C., de la Torre, A., Rubio, A.: An Effective Subband OSF-based VAD with Noise Reduction for Robust Speech Recognition. IEEE Transactions on Speech and Audio Processing, Nov. 2005
5. Shen, J.-l., Hung, J.-w., Lee, L.-s.: Robust Entropy-based Endpoint Detection for Speech Recognition in Noisy Environments. Proc. ICSLP, Sydney, Australia (1998)
6. Jia, C., Xu, B.: An Improved Entropy-based Endpoint Detection Algorithm.
Proc. ICASSP, Taipei (2002)
7. ETSI: ETSI ES 202 050, Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech recognition; Advanced front-end feature extraction algorithm; Compression algorithms. Nov.
8. ETSI: ETSI ES 201 108, Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithms. 2000
9. Segura, J.C., Benitez, M.C., de la Torre, A., Rubio, A.J., Ramirez, J.: Cepstral Domain Segmental Nonlinear Feature Transformations for Robust Speech Recognition. IEEE Signal Processing Letters (2004)
10. Delaney, B.W., Han, M., Simunic, T., Acquaviva, A.: A Low-power, Fixed-point Front-end Feature Extraction for a Distributed Speech Recognition System. Proc. ICASSP (2002)
11. Pearce, D., Hirsch, H.-G.: The Aurora Experimental Framework for the Performance Evaluation of Speech Recognition Systems under Noisy Conditions. Proc. ICSLP, Beijing, China, Oct. 2000
12. Young, S.: HTK Book, Version 2.1. Entropic Cambridge Research Laboratory
More informationThe Noisy Channel Model. Statistical NLP Spring Mel Freq. Cepstral Coefficients. Frame Extraction ... Lecture 10: Acoustic Models
Statistical NLP Spring 2009 The Noisy Channel Model Lecture 10: Acoustic Models Dan Klein UC Berkeley Search through space of all possible sentences. Pick the one that is most probable given the waveform.
More informationStatistical NLP Spring The Noisy Channel Model
Statistical NLP Spring 2009 Lecture 10: Acoustic Models Dan Klein UC Berkeley The Noisy Channel Model Search through space of all possible sentences. Pick the one that is most probable given the waveform.
More informationLecture 9: Speech Recognition. Recognizing Speech
EE E68: Speech & Audio Processing & Recognition Lecture 9: Speech Recognition 3 4 Recognizing Speech Feature Calculation Sequence Recognition Hidden Markov Models Dan Ellis http://www.ee.columbia.edu/~dpwe/e68/
More informationSpectral and Textural Feature-Based System for Automatic Detection of Fricatives and Affricates
Spectral and Textural Feature-Based System for Automatic Detection of Fricatives and Affricates Dima Ruinskiy Niv Dadush Yizhar Lavner Department of Computer Science, Tel-Hai College, Israel Outline Phoneme
More informationMachine Recognition of Sounds in Mixtures
Machine Recognition of Sounds in Mixtures Outline 1 2 3 4 Computational Auditory Scene Analysis Speech Recognition as Source Formation Sound Fragment Decoding Results & Conclusions Dan Ellis
More informationHarmonic Structure Transform for Speaker Recognition
Harmonic Structure Transform for Speaker Recognition Kornel Laskowski & Qin Jin Carnegie Mellon University, Pittsburgh PA, USA KTH Speech Music & Hearing, Stockholm, Sweden 29 August, 2011 Laskowski &
More informationThe Noisy Channel Model. CS 294-5: Statistical Natural Language Processing. Speech Recognition Architecture. Digitizing Speech
CS 294-5: Statistical Natural Language Processing The Noisy Channel Model Speech Recognition II Lecture 21: 11/29/05 Search through space of all possible sentences. Pick the one that is most probable given
More informationLecture 7: Feature Extraction
Lecture 7: Feature Extraction Kai Yu SpeechLab Department of Computer Science & Engineering Shanghai Jiao Tong University Autumn 2014 Kai Yu Lecture 7: Feature Extraction SJTU Speech Lab 1 / 28 Table of
More informationLecture 9: Speech Recognition
EE E682: Speech & Audio Processing & Recognition Lecture 9: Speech Recognition 1 2 3 4 Recognizing Speech Feature Calculation Sequence Recognition Hidden Markov Models Dan Ellis
More informationBIAS CORRECTION METHODS FOR ADAPTIVE RECURSIVE SMOOTHING WITH APPLICATIONS IN NOISE PSD ESTIMATION. Robert Rehr, Timo Gerkmann
BIAS CORRECTION METHODS FOR ADAPTIVE RECURSIVE SMOOTHING WITH APPLICATIONS IN NOISE PSD ESTIMATION Robert Rehr, Timo Gerkmann Speech Signal Processing Group, Department of Medical Physics and Acoustics
More informationGlobal SNR Estimation of Speech Signals using Entropy and Uncertainty Estimates from Dropout Networks
Interspeech 2018 2-6 September 2018, Hyderabad Global SNR Estimation of Speech Signals using Entropy and Uncertainty Estimates from Dropout Networks Rohith Aralikatti, Dilip Kumar Margam, Tanay Sharma,
More informationMinimum Mean-Square Error Estimation of Mel-Frequency Cepstral Features A Theoretically Consistent Approach
Minimum Mean-Square Error Estimation of Mel-Frequency Cepstral Features A Theoretically Consistent Approach Jesper Jensen Abstract In this work we consider the problem of feature enhancement for noise-robust
More informationA Generalized Subspace Approach for Enhancing Speech Corrupted by Colored Noise
334 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL 11, NO 4, JULY 2003 A Generalized Subspace Approach for Enhancing Speech Corrupted by Colored Noise Yi Hu, Student Member, IEEE, and Philipos C
More informationA TWO-LAYER NON-NEGATIVE MATRIX FACTORIZATION MODEL FOR VOCABULARY DISCOVERY. MengSun,HugoVanhamme
A TWO-LAYER NON-NEGATIVE MATRIX FACTORIZATION MODEL FOR VOCABULARY DISCOVERY MengSun,HugoVanhamme Department of Electrical Engineering-ESAT, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, Bus
More informationA Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement
A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement Simon Leglaive 1 Laurent Girin 1,2 Radu Horaud 1 1: Inria Grenoble Rhône-Alpes 2: Univ. Grenoble Alpes, Grenoble INP,
More informationCochlear modeling and its role in human speech recognition
Allen/IPAM February 1, 2005 p. 1/3 Cochlear modeling and its role in human speech recognition Miller Nicely confusions and the articulation index Jont Allen Univ. of IL, Beckman Inst., Urbana IL Allen/IPAM
More informationStress detection through emotional speech analysis
Stress detection through emotional speech analysis INMA MOHINO inmaculada.mohino@uah.edu.es ROBERTO GIL-PITA roberto.gil@uah.es LORENA ÁLVAREZ PÉREZ loreduna88@hotmail Abstract: Stress is a reaction or
More informationNOISE ROBUST RELATIVE TRANSFER FUNCTION ESTIMATION. M. Schwab, P. Noll, and T. Sikora. Technical University Berlin, Germany Communication System Group
NOISE ROBUST RELATIVE TRANSFER FUNCTION ESTIMATION M. Schwab, P. Noll, and T. Sikora Technical University Berlin, Germany Communication System Group Einsteinufer 17, 1557 Berlin (Germany) {schwab noll
More informationRobust Speaker Identification System Based on Wavelet Transform and Gaussian Mixture Model
JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 19, 267-282 (2003) Robust Speaer Identification System Based on Wavelet Transform and Gaussian Mixture Model Department of Electrical Engineering Tamang University
More informationOn the Influence of the Delta Coefficients in a HMM-based Speech Recognition System
On the Influence of the Delta Coefficients in a HMM-based Speech Recognition System Fabrice Lefèvre, Claude Montacié and Marie-José Caraty Laboratoire d'informatique de Paris VI 4, place Jussieu 755 PARIS
More informationGMM Vector Quantization on the Modeling of DHMM for Arabic Isolated Word Recognition System
GMM Vector Quantization on the Modeling of DHMM for Arabic Isolated Word Recognition System Snani Cherifa 1, Ramdani Messaoud 1, Zermi Narima 1, Bourouba Houcine 2 1 Laboratoire d Automatique et Signaux
More informationDetection of Overlapping Acoustic Events Based on NMF with Shared Basis Vectors
Detection of Overlapping Acoustic Events Based on NMF with Shared Basis Vectors Kazumasa Yamamoto Department of Computer Science Chubu University Kasugai, Aichi, Japan Email: yamamoto@cs.chubu.ac.jp Chikara
More informationModel-based unsupervised segmentation of birdcalls from field recordings
Model-based unsupervised segmentation of birdcalls from field recordings Anshul Thakur School of Computing and Electrical Engineering Indian Institute of Technology Mandi Himachal Pradesh, India Email:
More informationAutomatic Phoneme Recognition. Segmental Hidden Markov Models
Automatic Phoneme Recognition with Segmental Hidden Markov Models Areg G. Baghdasaryan Thesis submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment
More informationSignal Modeling Techniques In Speech Recognition
Picone: Signal Modeling... 1 Signal Modeling Techniques In Speech Recognition by, Joseph Picone Texas Instruments Systems and Information Sciences Laboratory Tsukuba Research and Development Center Tsukuba,
More informationSINGLE CHANNEL SPEECH MUSIC SEPARATION USING NONNEGATIVE MATRIX FACTORIZATION AND SPECTRAL MASKS. Emad M. Grais and Hakan Erdogan
SINGLE CHANNEL SPEECH MUSIC SEPARATION USING NONNEGATIVE MATRIX FACTORIZATION AND SPECTRAL MASKS Emad M. Grais and Hakan Erdogan Faculty of Engineering and Natural Sciences, Sabanci University, Orhanli
More informationThe effect of speaking rate and vowel context on the perception of consonants. in babble noise
The effect of speaking rate and vowel context on the perception of consonants in babble noise Anirudh Raju Department of Electrical Engineering, University of California, Los Angeles, California, USA anirudh90@ucla.edu
More informationSymmetric Distortion Measure for Speaker Recognition
ISCA Archive http://www.isca-speech.org/archive SPECOM 2004: 9 th Conference Speech and Computer St. Petersburg, Russia September 20-22, 2004 Symmetric Distortion Measure for Speaker Recognition Evgeny
More informationEigenvoice Speaker Adaptation via Composite Kernel PCA
Eigenvoice Speaker Adaptation via Composite Kernel PCA James T. Kwok, Brian Mak and Simon Ho Department of Computer Science Hong Kong University of Science and Technology Clear Water Bay, Hong Kong [jamesk,mak,csho]@cs.ust.hk
More informationFEATURE SELECTION USING FISHER S RATIO TECHNIQUE FOR AUTOMATIC SPEECH RECOGNITION
FEATURE SELECTION USING FISHER S RATIO TECHNIQUE FOR AUTOMATIC SPEECH RECOGNITION Sarika Hegde 1, K. K. Achary 2 and Surendra Shetty 3 1 Department of Computer Applications, NMAM.I.T., Nitte, Karkala Taluk,
More informationFACTORIAL HMMS FOR ACOUSTIC MODELING. Beth Logan and Pedro Moreno
ACTORIAL HMMS OR ACOUSTIC MODELING Beth Logan and Pedro Moreno Cambridge Research Laboratories Digital Equipment Corporation One Kendall Square, Building 700, 2nd loor Cambridge, Massachusetts 02139 United
More informationMel-Generalized Cepstral Representation of Speech A Unified Approach to Speech Spectral Estimation. Keiichi Tokuda
Mel-Generalized Cepstral Representation of Speech A Unified Approach to Speech Spectral Estimation Keiichi Tokuda Nagoya Institute of Technology Carnegie Mellon University Tamkang University March 13,
More informationIntraframe Prediction with Intraframe Update Step for Motion-Compensated Lifted Wavelet Video Coding
Intraframe Prediction with Intraframe Update Step for Motion-Compensated Lifted Wavelet Video Coding Aditya Mavlankar, Chuo-Ling Chang, and Bernd Girod Information Systems Laboratory, Department of Electrical
More informationNearly Perfect Detection of Continuous F 0 Contour and Frame Classification for TTS Synthesis. Thomas Ewender
Nearly Perfect Detection of Continuous F 0 Contour and Frame Classification for TTS Synthesis Thomas Ewender Outline Motivation Detection algorithm of continuous F 0 contour Frame classification algorithm
More informationR E S E A R C H R E P O R T Entropy-based multi-stream combination Hemant Misra a Hervé Bourlard a b Vivek Tyagi a IDIAP RR 02-24 IDIAP Dalle Molle Institute for Perceptual Artificial Intelligence ffl
More informationEnvironmental Sound Classification in Realistic Situations
Environmental Sound Classification in Realistic Situations K. Haddad, W. Song Brüel & Kjær Sound and Vibration Measurement A/S, Skodsborgvej 307, 2850 Nærum, Denmark. X. Valero La Salle, Universistat Ramon
More informationISCA Archive
ISCA Archive http://www.isca-speech.org/archive ODYSSEY04 - The Speaker and Language Recognition Workshop Toledo, Spain May 3 - June 3, 2004 Analysis of Multitarget Detection for Speaker and Language Recognition*
More informationVOICE ACTIVITY DETECTION IN PRESENCE OF TRANSIENT NOISE USING SPECTRAL CLUSTERING AND DIFFUSION KERNELS
2014 IEEE 28-th Convention of Electrical and Electronics Engineers in Israel VOICE ACTIVITY DETECTION IN PRESENCE OF TRANSIENT NOISE USING SPECTRAL CLUSTERING AND DIFFUSION KERNELS Oren Rosen, Saman Mousazadeh
More informationWhy DNN Works for Acoustic Modeling in Speech Recognition?
Why DNN Works for Acoustic Modeling in Speech Recognition? Prof. Hui Jiang Department of Computer Science and Engineering York University, Toronto, Ont. M3J 1P3, CANADA Joint work with Y. Bao, J. Pan,
More informationTime and frequency ltering of lter-bank energies for robust HMM speech recognition
Speech Communication 34 (2001) 93±114 www.elsevier.nl/locate/specom Time and frequency ltering of lter-bank energies for robust HMM speech recognition Climent Nadeu *,Dusan Macho, Javier Hernando TALP
More informationTowards Multi-Modal Driver s Stress Detection
Towards Multi-Modal Driver s Stress Detection Hynek Bořil, Pinar Boyraz, John H.L. Hansen Center for Robust Speech Systems, Erik Jonsson School of Engineering & Computer Science, University of Texas at
More informationAN INVERTIBLE DISCRETE AUDITORY TRANSFORM
COMM. MATH. SCI. Vol. 3, No. 1, pp. 47 56 c 25 International Press AN INVERTIBLE DISCRETE AUDITORY TRANSFORM JACK XIN AND YINGYONG QI Abstract. A discrete auditory transform (DAT) from sound signal to
More informationMonaural speech separation using source-adapted models
Monaural speech separation using source-adapted models Ron Weiss, Dan Ellis {ronw,dpwe}@ee.columbia.edu LabROSA Department of Electrical Enginering Columbia University 007 IEEE Workshop on Applications
More informationTime-Varying Autoregressions for Speaker Verification in Reverberant Conditions
INTERSPEECH 017 August 0 4, 017, Stockholm, Sweden Time-Varying Autoregressions for Speaker Verification in Reverberant Conditions Ville Vestman 1, Dhananjaya Gowda, Md Sahidullah 1, Paavo Alku 3, Tomi
More information