A Low-Cost Robust Front-end for Embedded ASR System


Lihui Guo 1, Xin He 2, Yue Lu 1, and Yaxin Zhang 2
1 Department of Computer Science and Technology, East China Normal University, Shanghai
2 Motorola China Research Center, Shanghai

Abstract. In this paper we propose a low-cost robust MFCC feature extraction algorithm that combines noise reduction and voice activity detection (VAD) for embedded automatic speech recognition (ASR) applications. To remedy the effect of additive noise, a magnitude spectrum subtraction method is used. A VAD then separates speech frames from noise frames by applying an order statistics filter (OSF) to the subband spectral entropy. A general RASTA filter is applied to the log Mel filter-bank energy trajectories. Finally, after feature selection, a 26-dimensional feature vector is passed to the ASR system. Experimental results show that the proposed front-end obtains 30.08% and 62.55% relative improvements on the Aurora2 and Aurora3 databases and 29.47% on a Mandarin database, compared with the baseline ETSI standard MFCC front-end.

1 Introduction

Front-end feature extraction plays an important role in ASR systems. The ETSI standard Mel-frequency cepstral coefficient front-end is widely used because of its accurate representation of the human auditory system and speech perception[1]. However, the characteristics of the speech signal are often distorted by background noise and the transmission channel, especially in mobile ASR applications, and the performance of an ASR system often degrades dramatically at low SNR levels.

Noise reduction is a critical and challenging problem in ASR. Researchers have proposed many noise-robustness methods over the past decades. When the environmental noise is additive, spectrum subtraction is an effective and computationally cheap speech enhancement technique. In [2], a multi-band spectrum subtraction algorithm is implemented. Stahl[3] introduced quantile-based noise estimation for spectrum subtraction. Although many of these algorithms are effective, their computational complexity makes them unsuitable for embedded ASR systems. In this paper we present a simple and effective spectrum subtraction algorithm for noise reduction.

Nonspeech frames (noise-only or silence frames) contribute only redundant and disturbing information to the ASR system.

Although silence and short-pause models are always included in HMM acoustic model configurations, in practice it is still very helpful to distinguish speech from nonspeech and to drop the nonspeech frames before decoding. This is crucial for embedded systems, where computational complexity is minimized if only speech frames are decoded; the transmission bit rate of a distributed speech recognition (DSR) system is likewise reduced if only speech frames are transferred. Moreover, insertion errors are often observed when too many silence frames are passed to the decoder. A VAD is therefore necessary in a noise-robust front-end. In [4] a subband energy-based VAD algorithm is presented. Shen[5] introduced an entropy-based algorithm for endpoint detection under noisy conditions, and Xu[6] presented an improved entropy-based algorithm. In this paper, we introduce a VAD algorithm that applies an OSF to the subband spectral entropy.

ETSI has published an advanced front-end (AFE) that substantially improves recognition performance in noisy conditions[7], but it is not suitable for real-time embedded implementation because of its large computational complexity. Experimental results show that our proposed front-end uses only about one quarter of the computational MIPS of the AFE while achieving similar recognition accuracy.

This paper is organized as follows. Section 2 describes the proposed front-end in detail. Experimental results on the Aurora2, Aurora3 and Mandarin digits databases are presented in Section 3. Section 4 summarizes the conclusions.

2 Front-end Algorithm Description

The proposed front-end is a modified version of the ETSI standard MFCC front-end; details of the basic processing blocks can be found in [8]. Fig. 1 shows the proposed front-end algorithm. In addition to the basic processing blocks, three enhancement stages (the shaded blocks in Fig. 1) are added: noise reduction with spectrum subtraction and RASTA filtering (Section 2.1), a subband OSF entropy-based VAD (Section 2.2), and a post-processing stage with dynamic feature calculation, cepstral mean and variance normalization (CMVN) and feature selection (Section 2.3).

Fig. 1. The proposed robust front-end algorithm. In the figure, n denotes the sample index, l the frame index, m the frequency index (0 ≤ m ≤ 128, FFT length 256), j the filter-bank index (1 ≤ j ≤ 23), and i the cepstral coefficient index (0 ≤ i ≤ 12); frames flagged as nonspeech by the VAD are dropped before CMVN processing and feature selection.

2.1 Noise Reduction

The input signal x(n) is divided into overlapping frames of 25 ms (200 samples at the 8 kHz sampling rate) with a frame shift of 10 ms (80 samples). The magnitude spectrum of each frame is obtained by the Fast Fourier Transform (FFT). A spectrum subtraction is then applied to the magnitude spectrum X[l, m] by subtracting the noise estimate from the noisy spectrum:

$$Y[l,m] = \max\bigl(X[l,m] - N[m],\ \alpha X[l,m]\bigr), \qquad 0 \le m \le N_{FFT}/2 \tag{1}$$

where $N_{FFT}$ is the FFT length, Y[l, m] is the speech magnitude spectrum after spectrum subtraction, X[l, m] is the magnitude spectrum of the noisy speech signal, and N[m] is the average magnitude spectrum of the noise. For each utterance, the first 10 frames are assumed to be noise. This assumption is valid in practical applications, where speakers always take a short response time after hearing the beep tone before speaking. These 10 reference frames are used to calculate the average noise spectrum N[m]. In order to track nonstationary noise, N[m] is updated during nonspeech periods by

$$N[m] = \gamma N[m] + (1-\gamma)\,X[t,m] \tag{2}$$

where the t-th frame is nonspeech according to the VAD decision. The value γ = 0.97 performs well on the experimental databases. α ∈ (0, 1) is an attenuation constant that prevents Y[l, m] from becoming negative due to noise estimation errors; α is fixed at 0.3 in our speech recognition experiments.

Spectrum subtraction is effective against additive noise but not against convolutional noise. Convolutional noise becomes additive once the signal is subjected to a logarithm. In Fig. 1, $f_{ln}[l,j]$ is produced after Mel filtering and nonlinear transformation (natural logarithm). RASTA filtering is applied to the temporal trajectories of the Mel filter-bank log-energies $f_{ln}[l,j]$ with the transfer function

$$H_{rasta}(z) = 0.1\,z^{4}\cdot\frac{2 + z^{-1} - z^{-3} - 2\,z^{-4}}{1 - 0.98\,z^{-1}} \tag{3}$$

Many experiments have shown that the RASTA filter deals effectively with the convolutional distortion caused by the transmission channel and the microphone. From (1) and (3), we can see that the spectrum subtraction module and the RASTA filter add little computational load to the ASR system.
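As a concrete illustration of the noise-reduction stage, the following is a minimal NumPy/SciPy sketch of Eqs. (1)-(3). The function names are ours, and the RASTA coefficients are those of the classical band-pass filter with a pole at 0.98, which we take Eq. (3) to denote:

```python
import numpy as np
from scipy.signal import lfilter

NFFT = 256    # FFT length (Fig. 1)
GAMMA = 0.97  # noise-update smoothing constant of Eq. (2)
ALPHA = 0.3   # spectral floor of Eq. (1)

def magnitude_spectrum(frame):
    """Magnitude spectrum of one windowed 25 ms frame, bins 0..NFFT/2."""
    return np.abs(np.fft.rfft(frame, NFFT))

def spectral_subtraction(X, N):
    """Eq. (1): Y[l,m] = max(X[l,m] - N[m], alpha * X[l,m])."""
    return np.maximum(X - N, ALPHA * X)

def update_noise(N, X_t):
    """Eq. (2): recursive update on a frame the VAD labelled nonspeech."""
    return GAMMA * N + (1.0 - GAMMA) * X_t

# Classical RASTA band-pass coefficients (assumed form of Eq. (3)).
RASTA_B = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # numerator
RASTA_A = np.array([1.0, -0.98])                       # pole at 0.98

def rasta_filter(f_ln):
    """Filter each log Mel energy trajectory f_ln[l, j] along time."""
    return lfilter(RASTA_B, RASTA_A, f_ln, axis=0)
```

In this sketch the noise estimate N[m] would be initialized as the mean of magnitude_spectrum over the first 10 frames and then refreshed with update_noise whenever the VAD of Section 2.2 flags a frame as nonspeech.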

2.2 Voice Activity Detection

Energy is the most effective and most widely used cue for speech/noise classification. Energy-based VAD algorithms achieve good performance when the SNR level is tolerable, but many experiments have shown that they fail at low SNR levels. Because of the characteristics of speech, the entropy of a speech signal differs from that of a noise signal, and spectral entropy-based algorithms are more effective, especially under white noise. However, the full-band spectral entropy computed by the traditional method exhibits pulses during nonspeech periods when the background noise is nonstationary.

The proposed VAD algorithm distinguishes speech from nonspeech by applying an OSF to the subband spectral entropy. The OSF is a nonlinear filter widely used in signal processing; its definition can be found in [4]. First, we divide the magnitude spectrum Y[l, m] into K subbands. The probability of each frequency bin of the l-th frame in the k-th subband is

$$P_k[l,i] = \frac{Y[l,i] + M}{\sum_{m=m_k}^{m_{k+1}-1} \bigl(Y[l,m] + M\bigr)}, \qquad m_k = \frac{N_{FFT}}{2K}\,k, \quad 0 \le k \le K-1, \quad m_k \le i \le m_{k+1}-1 \tag{4}$$

where M is a positive constant used to flatten the noise entropy curve[6]. The spectral entropies of the l-th frame in the K subbands are

$$E_s[l,k] = -\sum_{i=m_k}^{m_{k+1}-1} P_k[l,i]\,\log P_k[l,i], \qquad 0 \le k \le K-1 \tag{5}$$

The proposed VAD algorithm uses an OSF to smooth the subband spectral entropy. The OSF operates on the 2N+1 subband spectral entropies $\{E_s[l-N,k], \ldots, E_s[l,k], \ldots, E_s[l+N,k]\}$ around the frame being analyzed[4]. Again, the first N frames of each utterance are assumed to be nonspeech and are used to estimate the noise reference. Let $E_{s(h)}[l,k]$ be the h-th value of this set sorted in ascending order. The smoothed subband spectral entropy $E_h[l,k]$ is given by

$$E_h[l,k] = (1-\lambda)\,E_{s(h)}[l,k] + \lambda\,E_{s(h+1)}[l,k], \qquad 0 \le k \le K-1 \tag{6}$$

where $h = \lfloor \lambda L \rfloor$ with L = 2N+1 and 0 < λ < 1. The entropy of the l-th frame is then measured by

$$H_l = \frac{1}{K}\sum_{k=0}^{K-1} E_h[l,k] \tag{7}$$

The VAD decision is based on a threshold, as described below.
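A minimal sketch of Eqs. (4)-(7), assuming K = 4 subbands over the NFFT/2 + 1 magnitude bins; the helper names are ours:

```python
import numpy as np

K = 4             # number of subbands (Section 3)
M_CONST = 1000.0  # flattening constant M of Eq. (4)

def subband_entropies(Y):
    """Eqs. (4)-(5): spectral entropy of each of the K subbands of one
    frame's magnitude spectrum Y (length NFFT/2 + 1)."""
    E = np.empty(K)
    edges = np.linspace(0, len(Y), K + 1, dtype=int)  # subband boundaries
    for k in range(K):
        band = Y[edges[k]:edges[k + 1]] + M_CONST
        P = band / band.sum()              # Eq. (4)
        E[k] = -np.sum(P * np.log(P))      # Eq. (5)
    return E

def osf_smooth(E_window, lam=0.9):
    """Eq. (6): order-statistics filtering of a (2N+1, K) window of
    subband entropies centred on the frame under analysis."""
    L = E_window.shape[0]
    h = min(max(int(lam * L), 1), L - 1)   # rank h = floor(lambda * L)
    E_sorted = np.sort(E_window, axis=0)   # ascending order per subband
    return (1.0 - lam) * E_sorted[h - 1] + lam * E_sorted[h]

def frame_entropy(E_h):
    """Eq. (7): average of the K smoothed subband entropies."""
    return float(np.mean(E_h))
```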

If $H_l$ is greater than the preset threshold, the frame is classified as speech (VADflag = speech); otherwise it is classified as nonspeech (VADflag = nonspeech). The threshold T is defined as

$$\mathrm{Avg} = \frac{1}{K}\sum_{k=0}^{K-1} E_m[k], \qquad T = \beta\,\mathrm{Avg} + \theta \tag{8}$$

where $E_m[k]$ is the median value of the sequence $\{E_s[0,k], \ldots, E_s[N-1,k]\}$, and β = 1.01 and θ = 0.1 proved to be suitable values.

Fig. 2 illustrates subband OSF filtering on a Mandarin utterance with an SNR of about 10 dB: panel (a) shows the original speech waveform, (b) the full-band entropy, and (c) the average subband entropy after OSF filtering. Comparing (b) and (c) shows that, after OSF filtering, the average subband entropy describes the speech/nonspeech divergence more precisely than the full-band entropy.

Fig. 2. Subband OSF processing of a Mandarin utterance: (a) original speech waveform; (b) full-band entropy; (c) average subband entropy after OSF filtering.
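The threshold of Eq. (8) and the per-frame decision can be sketched as follows, reusing subband_entropies and osf_smooth from the previous listing; the function names are again ours:

```python
import numpy as np

BETA, THETA = 1.01, 0.1  # Eq. (8) constants from the text
N_INIT = 10              # leading frames assumed to be noise

def vad_threshold(E_noise):
    """Eq. (8): T = beta * Avg + theta. E_noise is the (N_INIT, K) array
    of subband entropies of the first N noise-only frames; E_m[k] is the
    median entropy of subband k, and Avg averages E_m over subbands."""
    E_m = np.median(E_noise, axis=0)
    return BETA * float(np.mean(E_m)) + THETA

def vad_flag(H_l, T):
    """Frame decision: speech iff the smoothed frame entropy exceeds T."""
    return "speech" if H_l > T else "nonspeech"
```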

2.3 Post-Processing

Post-processing is the last stage of the proposed front-end. The discrete cosine transform (DCT) is applied to the RASTA-filtered filter-bank log-energies $f_{rasta}[l,j]$, yielding the 13 mel cepstral coefficients $c_i$ (0 ≤ i ≤ 12). A 39-dimensional feature vector is produced after dynamic feature calculation, and finally the 13 basic MFCC parameters are normalized by CMVN as in [9]. As Fig. 1 shows, only the speech frames are considered in post-processing; the nonspeech frames are dropped, which reduces the computational complexity of the ASR backend. Because each feature component contributes differently to the overall recognition accuracy, filtering out the less important components reduces MIPS without much impact on accuracy. It also reduces memory consumption, since the HMM models become smaller. Both properties benefit real-time implementation on embedded systems.
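A minimal sketch of this post-processing stage, assuming an HTK-style regression window for the dynamic features; the paper does not list which 26 of the 39 components survive feature selection, so the selection mask is a hypothetical input:

```python
import numpy as np

def deltas(c, win=2):
    """HTK-style regression deltas over a +/-win frame window. np.roll
    wraps at the utterance edges; a real implementation would pad."""
    num = sum(t * (np.roll(c, -t, axis=0) - np.roll(c, t, axis=0))
              for t in range(1, win + 1))
    return num / (2.0 * sum(t * t for t in range(1, win + 1)))

def cmvn(c):
    """Per-utterance cepstral mean and variance normalization,
    applied to the 13 static coefficients as in [9]."""
    return (c - c.mean(axis=0)) / (c.std(axis=0) + 1e-8)

def postprocess(c13, keep):
    """c13: (frames, 13) static MFCCs of the speech frames only.
    keep: hypothetical length-39 boolean mask selecting the 26
    retained feature components."""
    d = deltas(c13)                      # delta features
    dd = deltas(d)                       # acceleration features
    static = cmvn(c13)                   # CMVN on the 13 basic MFCCs only
    full39 = np.hstack([static, d, dd])  # 39-dimensional vector
    return full39[:, keep]
```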

3 Experimental Results

We compared the proposed front-end with the ETSI standard MFCC front-end and the AFE in a recognition system with the same backend decoder. Three speech databases are used for the recognition accuracy comparison: Aurora2, Aurora3 and a Mandarin digits database. We also evaluate the computational complexity of the three front-ends by extracting features from 1 second of speech on an Xscale processor; all three front-ends are implemented in fixed point[10].

Aurora2 defines two training modes: clean-condition training uses clean speech only, while multicondition training uses noisy speech at different SNR levels (from 20 dB to -5 dB). Three test sets are defined (Set A, Set B and Set C), each with different noise conditions. Aurora3 is a set of multi-language SpeechDat-Car databases recorded in cars under different driving conditions with close-talking and hands-free microphones. Three recognition experiments with different training and testing configurations are defined: well-matched, medium-mismatched and highly-mismatched (denoted WM, MM and HM in the result tables). Three languages (German, Spanish and Danish) are used in our experiments. Details of the experimental framework for the Aurora databases can be found in [11].

Experiments were also carried out on a Mandarin corpus containing 3031 utterances spoken by 39 female and 40 male speakers, collected over the telephone network in Taiwan. We randomly selected 2122 utterances for training and 909 for testing. All databases are sampled at 8 kHz and quantized to 16 bits.

We use the HTK speech recognition toolkit[12] to train the HMM models with the following configuration: each word model is composed of 16 emitting states with 3 Gaussian components per state; a silence (sil) model with 3 states and a short-pause (sp) model with a single state are also defined, both with 6 Gaussian components per state. The model configuration is fixed across all tasks. In the VAD algorithm, the parameter N is 10, i.e. the first 10 frames of each utterance are used to estimate the noise reference; the magnitude spectrum is divided into 4 subbands, and λ = 0.9 and M = 1000 are selected experimentally.

The average word accuracies on Aurora2 and Aurora3 are presented in Table 1 and Table 2, respectively, and the evaluation results on the Mandarin corpus are given in Table 3. The relative improvement compares the proposed front-end with the MFCC front-end; the definitions of sentence correctness, word correctness and word accuracy can be found in [12]. Table 4 lists the running cycles of the three front-ends on an Intel Xscale processor with identical processor configuration.

Table 1. Evaluation results on the Aurora2 database: average word accuracy (%) of the AFE, MFCC and proposed front-ends under multicondition and clean-condition training, per SNR level (clean and 20-0 dB) together with the 20-0 dB average.

Table 2. Evaluation results on the Aurora3 database: word accuracy (%) of the AFE, MFCC and proposed front-ends for German, Spanish and Danish in the WM, MM and HM cases, with overall averages.

Table 3. Evaluation results on the Mandarin database: sentence correctness, word correctness and word accuracy (%) of the AFE, MFCC and proposed front-ends, together with the relative improvement of the proposed front-end over MFCC.

Table 4. Computational complexity on the Xscale processor: running cycles (million) for the AFE, MFCC and proposed front-ends.

4 Conclusion

This paper has proposed a low-cost noise-robust front-end for embedded ASR applications. An OSF entropy-based VAD is presented that shows strong speech/nonspeech discrimination ability. Experimental results show that the proposed front-end yields 30.08% and 62.55% relative improvements on the Aurora2 and Aurora3 databases respectively, and 29.47% on a Mandarin corpus, compared with the results of the ETSI standard MFCC front-end. Although its recognition accuracy is slightly lower than that of the AFE, its low computational cost is a great advantage in embedded applications. Speech databases in five languages (English, German, Spanish, Danish and Mandarin) were used for evaluation, and remarkable recognition accuracy improvements over the ETSI standard front-end were obtained.

References

1. Kotnik, B., Vlaj, D., Kačič, Z., Horvat, B.: Robust MFCC Feature Extraction Algorithm Using Effective Additive and Convolutional Noise Reduction Procedures. Proceedings of ICSLP, Denver, Colorado (2002)
2. Juneja, A., Deshmukh, O., Espy-Wilson, C.: A Multi-band Spectral Subtraction Method for Enhancing Speech Corrupted by Colored Noise. Proceedings of ICASSP (2002)
3. Stahl, V., Fischer, A., Bippus, R.: Quantile Based Noise Estimation for Spectral Subtraction and Wiener Filtering. Proceedings of ICASSP (2000)
4. Ramírez, J., Segura, J.C., Benítez, C., de la Torre, A., Rubio, A.: An Effective Subband OSF-based VAD with Noise Reduction for Robust Speech Recognition. IEEE Transactions on Speech and Audio Processing, Nov. 2005
5. Shen, J.-L., Hung, J.-W., Lee, L.-S.: Robust Entropy-based Endpoint Detection for Speech Recognition in Noisy Environments. Proceedings of ICSLP, Sydney, Australia (1998)
6. Jia, C., Xu, B.: An Improved Entropy-based Endpoint Detection Algorithm. Proceedings of ISCSLP, Taipei (2002)
7. ETSI: ETSI ES 202 050, Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Advanced Front-end Feature Extraction Algorithm; Compression Algorithms. Nov. 2003
8. ETSI: ETSI ES 201 108, Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Front-end Feature Extraction Algorithm; Compression Algorithms. 2000

9. Segura, J.C., Benítez, M.C., de la Torre, A., Rubio, A.J., Ramírez, J.: Cepstral Domain Segmental Nonlinear Feature Transformations for Robust Speech Recognition. IEEE Signal Processing Letters (2004)
10. Delaney, B.W., Han, M., Simunic, T., Acquaviva, A.: A Low-power, Fixed-point Front-end Feature Extraction for a Distributed Speech Recognition System. Proceedings of ICASSP (2002)
11. Pearce, D., Hirsch, H.-G.: The Aurora Experimental Framework for the Performance Evaluation of Speech Recognition Systems under Noisy Conditions. Proceedings of ICSLP, Beijing, China, Oct. 2000
12. Young, S.: The HTK Book (Version 2.1). Entropic Cambridge Research Laboratory
