Investigate more robust features for Speech Recognition using Deep Learning


DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

Investigate more robust features for Speech Recognition using Deep Learning

TIPHANIE DENIAUX

KTH ROYAL INSTITUTE OF TECHNOLOGY, SCHOOL OF ELECTRICAL ENGINEERING


Abstract

New electronic devices and their constant progress have brought up the challenge of improving speech recognition systems. Indeed, people tend to use more and more hands-free devices, which are likely to be used in noisy environments. Machine Learning techniques have evolved considerably over the last decade, and speech recognition systems using those techniques have appeared. The main challenge of Automatic Speech Recognition systems nowadays is improving robustness to noise and reverberation. Deep Learning methods have been used either to improve the speech representations or to define better probability distributions. The problem we face is the drop in the performance of ASR systems when the inputs are noisy. The general approach is to define novel speech features that are more robust, using Deep Neural Networks. To do so we went through different implementations, such as the incorporation of autoencoders in the MFCC block diagram or deep denoising autoencoders with different pre-training methods. The final solution is a system that builds more robust features from noisy MFCC. Our contribution is the demonstration that a denoising system using q DDAEs, one per cluster obtained by K-means clustering of the training data, is more efficient than a single denoising system applied to the whole data. The performance gained using such a system is 2 to 3% in terms of phone error rate, and it might be improved using more training data and better-tuned NN parameters.

Acknowledgment

First I would like to thank my supervisor Saikat Chatterjee for the opportunity he gave me to work on Deep Learning in speech recognition in the Communication Theory department at the School of Electrical Engineering, KTH. His help throughout this thesis has been a real support, and I really thank him for taking the time to evaluate our options with me and find solutions. I also thank Dr. Md Sahidullah for the help he gave me to start my work; his advice was valuable for the rest of my thesis. I thank my examiner Mikael Skoglund and my opponent Fanny for reading my thesis and calling it into question. I thank the Master students who allowed me to attend their presentations, because it provided me with experience to prepare my own. I thank my family members for their support, even if they sometimes could not understand what I was specifically working on. I thank Emeric for encouraging me and pushing me to do my best at any cost, and finally I thank all my Stockholm friends but also my French fellows for their friendship and support.

Contents

1 Introduction
  1.1 Motivation
  1.2 Structure of the report
2 Background
  2.1 Introduction to Speech Recognition
    Feature extraction block
    Pattern recognition block
  2.2 Deep learning in Speech recognition
    What is deep learning? [3]
    DNN-HMM systems
    NN for robust features
3 Developing tools and architecture
  3.1 Toolbox of Deep Learning
    General overview
    Neural networks
    Pre-training with Restricted Boltzmann Machines [11]
    Autoencoders
    Denoising auto-encoders
  3.2 Speech recognition tools
    Basic scheme of a speech recognition system
    HTK versus Kaldi
    General overview of Kaldi
    From speech to decoding with Kaldi
4 Problem statement
5 Experiments and implementation
  5.1 TIMIT database
  5.2 AE integrated in MFCC computational line
    Framing setting
    Basic MFCCs computation
    Integration of the AE in the basic scheme
  5.3 Denoising DNN with supervised pre-training
    5.3.1 Noisy signals
    5.3.2 Final solution: denoising deep autoencoder with supervised pre-training and VQ
    5.3.3 K-means theory
    NN setting
    Kaldi: conversion
    Kaldi setting
6 Results
  6.1 Effect of the VQ of the feature vectors on denoising
  6.2 Effect of the VQ of the feature vectors on speech recognition
  6.3 Discussion of the results
7 Conclusion and future work
A Kaldi formats
B TIMIT files
C Matlab files
  C.1 Database loading
  C.2 MFCC implementation
    C.2.1 MFCC
    C.2.2 MFCC + deltas + deltas-deltas
    C.2.3 FilterBanks
    C.2.4 Power spectrum
  C.3 RBM learning
    C.3.1 Training using [22]
  C.4 DDAE implementation
    C.4.1 Noisy signals
    C.4.2 Training of the NN
    C.4.3 Format conversion
    C.4.4 Kaldi script

Abbreviations

GM - Gaussian Model
GMM - Gaussian Mixture Model
NN - Neural Network
DNN - Deep Neural Network
DBN - Deep Belief Network
AE - AutoEncoder
SAE - Stacked AutoEncoders
DDAE - Deep Denoising AutoEncoder
RBM - Restricted Boltzmann Machine
MFCC - Mel-Frequency Cepstral Coefficients
SGM - Stochastic Gradient Method
CD - Contrastive Divergence
LM - Language Model
EM - Expectation Maximization
WER - Word Error Rate
PER - Phone Error Rate
FFT - Fast Fourier Transform
DFT - Discrete Fourier Transform
FBE - FilterBank Energies
SNR - Signal-to-Noise Ratio
VQ - Vector Quantization

1 Introduction

1.1 Motivation

The growing use of hands-free devices and voice-controlled systems has driven the development of high-performance speech recognition systems. Today's major challenge for Automatic Speech Recognition (ASR) systems is the presence of environmental noise and reverberation, which causes a drop in performance. Machine Learning has been a hot topic since the early 2000s and has been used to model the output distribution probabilities of the Hidden Markov Models used in speech recognition. Only quite recently has the use of Deep Learning reached the same performance as the standard Gaussian Mixture Models; novel pre-training methods are the new approaches that made this achievement possible. Other applications of deep learning in speech recognition include the discovery of bottleneck features [8] and the denoising of features [5]. The goal of this thesis is therefore to investigate how deep learning can be used in a complementary way to create more robust speech features.

1.2 Structure of the report

This report is organized as follows. Chapter 2 sums up the background theory of speech recognition and deep learning. Chapter 3 lists and explains the tools used for this thesis. In Chapter 4 the problem is formulated with its assumptions, and the path taken in the thesis is described. The implementations carried out to reach the final solution and the details of the experiments are explained in Chapter 5. Finally, in Chapter 6 the results are presented and discussed. This last chapter is followed by the conclusion. The parts of the code used to build the final solution are attached in the appendix.

2 Background

In this chapter we provide a brief discussion of the essential background in speech recognition and deep learning. Algorithmic details and parameterization are discussed further in the next chapters.

2.1 Introduction to Speech Recognition

Basically, an Automatic Speech Recognition system performs a recognition task on a provided speech signal. In the case of this thesis, the output of the ASR is a text version of the speech, and we operate at the phone level. A phone is a distinct speech sound, commonly associated with a vowel or a consonant. The English language has 42 phonemes, which are units of sound, and a phone is an acoustic realization associated with a phoneme.

Figure 2.1: ASR System

Feature extraction block

A feature is a mathematical representation of a speech signal. At this stage, the speech waveform is transformed into a parametric representation for easier analysis and processing in the subsequent pattern recognition stage. Indeed, raw waveform speech signals present large variations due to speaker variability or environment. Therefore, a domain other than the time domain needs to be used to represent speech, and the first to come to mind is the Fourier domain, because the frequency domain is relevant for speech. But while human hearing is a compromise between time and frequency resolution, a Fourier transform applied to a whole speech signal discards all timing information in the process. That is why we consider many short segments, called frames, of length between 5 and 30 milliseconds.

The state-of-the-art extraction technique is the Mel-Frequency Cepstral Coefficients (MFCC). They have been chosen for their computational simplicity, their low-dimensional encoding, and their success at the recognition stage.

Figure 2.2: Mel-Frequency Cepstral Coefficients block diagram.

Once the MFCC are computed, their first- and second-order temporal differences are concatenated and the final vector is given as input to the pattern recognition system.

Pattern recognition block

The recognition can be realized at several levels: phones, triphones or words. In this thesis we work only on phone recognition, as our goal is to demonstrate an improvement in recognition due to the development of more robust features. To deal with the temporal variability of speech, most current ASR systems use Hidden Markov Models combined with Gaussian Mixture Models to model the probability distributions over feature vectors. This model of probability distributions is referred to as the Acoustic Model.

2.2 Deep learning in Speech recognition

With the advances made in detection and classification using powerful machine learning techniques, the speech recognition community started to use deep neural networks in ASR systems [3]. Basically, deep learning finds its origins in the neurosciences and has been contributing to many different topics, as shown in Figure 2.3.

What is deep learning? [3]

Figure 2.3: Deep learning is at the center of many research areas

Deep learning is hierarchical learning. That is to say, there are many layers of non-linear information which represent different levels of abstraction, and the whole concept is built from the lower levels up to a high-level structure. A basic system is shown in Figure 2.4. Considering input nodes x_i, an activation function f and weights w_{ij}, the hidden nodes y_j are calculated as follows:

y_j = f\left( \sum_i w_{ij} x_i \right)

Figure 2.4: Layered network

Two types of learning exist: supervised learning and unsupervised learning. To simplify, we can say that in supervised learning the data is labeled or a target is available, so there is a simple cost to optimize that depends on the difference between the output and the target. In unsupervised learning the data is not labeled and the model learned is generative (in opposition to the discriminative model that corresponds to supervised learning). The systems recurrently used in speech recognition are hybrid deep networks that bring together supervised and unsupervised learning. Basically, the network is pre-trained in an unsupervised way to boost the effectiveness of the supervised training. This pre-training can be seen as a very efficient way to initialize the weights of the whole network. Then the backpropagation algorithm (detailed in the next chapters) fine-tunes the network, and this constitutes the supervised learning. The hybrid strategy is a response to an optimization issue that appears as the depth of the network increases, namely getting trapped in local optima.

DNN-HMM systems

From the early 21st century, a new form of acoustic model based on Deep Neural Networks was introduced [3][9]. Before that, speech recognition was dominated by GMM-HMM systems. The main improvements brought by DNNs are the ability to model data correlation (in this case, feature representation and classification are associated) and also the ability to model nonlinear data. Indeed, when it comes to modeling nonlinear data, GMMs become statistically inefficient because they would require a very large number of Gaussians. DNNs also improve the robustness of speech recognition when used in DNN-HMM systems, as demonstrated in [21]. Nevertheless, the successful performance of GMM-HMM systems has made it difficult for other methods to outperform them. Deep learning algorithms and parameters have had to be tuned a lot

before coming close to GMM-HMM performance and surpassing it.

NN for robust features

The principal motivation for working on better feature extraction systems is that the more relevant the feature, the better the results at the recognition stage. Papers [27], [19] and [20] explore, respectively, new methods to obtain bottleneck features, speaker-adaptive features and raw waveform features. Those works focus on DNN-HMM systems in which the novel features are used to learn better acoustic models. Article [14] investigates the performance of nonlinear features of spectrograms that are given as input to a DNN-HMM system for ASR. [18] demonstrates that robust stacked autoencoders are capable of learning robust representations from noisy data. Finally, [5] shows the efficiency of deep denoising autoencoders on noisy MFCC features. Thus, DNNs have been used for different purposes in speech recognition systems. For this thesis, we chose to work on the application of autoencoders and deep autoencoders in feature extraction in order to create robust features.

3 Developing tools and architecture

3.1 Toolbox of Deep Learning

General overview

The Matlab deep learning toolbox used in this thesis [16] provides different libraries for different types of neural networks (NN, CNN, DBN, SAE, CAE) as well as tests and a range of functions. In this section I describe the mathematics useful for my work and the mathematics behind this deep learning toolbox.

Neural networks

For supervised learning, a neural network is fed with input features and their labels (or targets in our case). Basically, the system learns nonlinear layers of information and optimizes the weights using the backpropagation algorithm to minimize the error between the output and the target. The nonlinearity is introduced in the system through an activation function f used to calculate the nodes of the hidden layers, i.e. the deep representations of the data. The system {weights W, biases} is initialized either randomly using a Gaussian distribution, or with W pre-trained using an unsupervised learning method and the biases set to 1.

The forward pass: nnff.m in [16]
Assume x is a vector of the input nodes with the biases, f is the activation function and W is the matrix of weights. For each j-th hidden layer the unit outputs are calculated as h_j = f(h_{j-1} W). For the output layer, the output function is used instead of the activation function f. The error and loss are also computed as follows: for a target y and an output z = F_{NN}(x, W), the error is e = y - z and the squared loss is l = \frac{1}{2} e^2.

The backward pass: nnbp.m in [16]
The aim is to find the weights W that minimize the training loss L = \sum_{(x,y) \in (X,Y)} l(y, F_{NN}(x, W)), where X and Y are respectively the matrices of input features and target features. The derivatives are calculated using the delta rule.

Derivative of the error with respect to the unit:

\frac{\partial e}{\partial h_j} = e_j    (3.1)

Derivative of the unit with respect to the net input (partial derivative):

\frac{\partial h_j}{\partial net_j} = h_j (1 - h_j)    (3.2)

Derivative of the net input with respect to a weight:

\frac{\partial net_j}{\partial w_{jk}} = h_k    (3.3)

And finally, for a hidden-to-output weight

\Delta w_{jk} = e_j \, h_j (1 - h_j) \, h_k    (3.4)

and for an input-to-hidden weight

\Delta w_{ki} = \left( \sum_j e_j \, h_j (1 - h_j) \, w_{jk} \right) h_k (1 - h_k) \, h_i    (3.5)

The dropout technique can be used at this stage. It was introduced by G. E. Hinton [12] and reduces overfitting and improves the training of neural networks. It basically consists in omitting a part of the features in each training case by setting some units to zero.

The gradients are then applied in nnapplygrads.m [16] according to the Stochastic Gradient Method, with tuning options for the learning rate (\mu) and the momentum (\alpha):

\Delta W = \mu \nabla W + \alpha \, vW    (3.6)
W = W + \Delta W    (3.7)

where vW is an accumulated memory of the previous gradients, \mu is the learning rate and \alpha is the momentum. The learning rate allows us to control how fast the weights are learned, i.e. only a fraction of the calculated gradient is taken for the update. If the learning rate is too high, the training loss can explode (we overstep); if the learning rate is too low, the training loss does not go down, or only very slowly, and training takes longer. Momentum can also be introduced; it accelerates the gradient descent by adding to \Delta W some of the previous weight adjustments.

All those steps are repeated for a pre-defined number of epochs. For each epoch the forward pass and backward pass are performed on all the training data. However, the data is separated into batches (subdivided amounts of data) to feed the backpropagation algorithm. That is to say, to complete one epoch, the number of iterations of the backpropagation algorithm for a database of N speech samples is N / batchsize.
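To make the forward and backward passes above concrete, the following is a minimal Matlab sketch of a one-hidden-layer network trained with the delta rule of Eqs. (3.1)-(3.5) and the momentum update of Eqs. (3.6)-(3.7). It is only an illustration under assumed toy data and layer sizes, not the toolbox code (nnff.m, nnbp.m, nnapplygrads.m); biases are omitted for brevity.

  % Minimal one-hidden-layer network trained with the delta rule and SGD with
  % momentum, in the spirit of Eqs. (3.1)-(3.7). Toy data and sizes only.
  sigm  = @(a) 1 ./ (1 + exp(-a));          % logistic sigmoid activation
  mu    = 0.5;                              % learning rate (placeholder)
  alpha = 0.5;                              % momentum (placeholder)
  X = rand(1000, 39);  Y = X;               % toy data, autoencoder-style target
  [N, d] = size(X);  nh = 100;              % hidden-layer size (assumed)
  W1 = 0.1*randn(d, nh);  W2 = 0.1*randn(nh, d);
  vW1 = zeros(size(W1)); vW2 = zeros(size(W2));
  batchsize = 250;  numepochs = 25;
  for epoch = 1:numepochs
    idx = randperm(N);
    for b = 1:floor(N/batchsize)
      sel = idx((b-1)*batchsize+1 : b*batchsize);
      x = X(sel, :);  y = Y(sel, :);
      h = sigm(x*W1);  z = h*W2;            % forward pass, linear output layer
      e = y - z;                            % error, squared loss l = 1/2 e^2
      d2 = -e;                              % delta at the (linear) output layer
      d1 = (d2*W2') .* h .* (1 - h);        % delta rule through the sigmoid, Eq. (3.2)
      g2 = h'*d2 / batchsize;  g1 = x'*d1 / batchsize;
      vW2 = mu*g2 + alpha*vW2;  W2 = W2 - vW2;   % momentum update, Eqs. (3.6)-(3.7)
      vW1 = mu*g1 + alpha*vW1;  W1 = W1 - vW1;
    end
  end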

Notice that when working on speech recognition, the most common activation function is the logistic sigmoid function (discussed further in the implementation chapter). We also mentioned the pre-training of NNs in the previous chapter: the weights need to be well initialized to avoid the system getting stuck in a local optimum. This is usually done using unsupervised learning, and more precisely Restricted Boltzmann Machines.

Pre-training with Restricted Boltzmann Machines [11]

RBMs are energy-based models used as generative models of many different types of data, including MFCC. They are used to compose Deep Belief Networks, which are a combination of several RBMs and a DNN. Indeed, RBMs are an efficient pre-training procedure for NNs. A Restricted Boltzmann Machine is a two-layer network in which stochastic visible units that represent observations are connected to stochastic binary hidden units. As for speech recognition the visible units are real-valued inputs, we use Gaussian-Bernoulli RBMs, i.e. the hidden units are binary but the input units are linear with Gaussian noise. In an RBM there are no visible-visible or hidden-hidden connections, which is why it is called a restricted system [19].

Bernoulli-Bernoulli RBM

The joint configuration of the visible (v) and hidden (h) units is given via an energy function; for a Bernoulli-Bernoulli RBM:

E(v, h) = - \sum_{i \in visible} a_i v_i - \sum_{j \in hidden} b_j h_j - \sum_{i,j} v_i h_j w_{ij}    (3.8)

where v_i, h_j are the binary states of visible unit i and hidden unit j, a_i, b_j are their bias terms and w_{ij} is the weight between them. The probability that the network assigns to a visible vector v is:

p(v) = \frac{\sum_h e^{-E(v,h)}}{\sum_{v,h} e^{-E(v,h)}}    (3.9)

And finally, the limited connections within an RBM make the conditional distributions p(v|h) and p(h|v) quite straightforward:

p(h_j = 1 | v) = \sigma\left(b_j + \sum_i v_i w_{ij}\right)    (3.10)

p(v_i = 1 | h) = \sigma\left(a_i + \sum_j h_j w_{ij}\right)    (3.11)

where \sigma is the logistic sigmoid function 1/(1 + exp(-x)). In theory, the update rule for the weights is

\Delta w_{ij} = \epsilon (\langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model})    (3.12)

where \langle \cdot \rangle_X denotes the expectation computed over the indicated distribution. In practice, however, obtaining \langle v_i h_j \rangle_{model} is difficult, which is why the Contrastive Divergence approximation to the gradient is used instead [10], and the update rule becomes:

\Delta w_{ij} = \epsilon (\langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{recon})    (3.13)

where \langle v_i h_j \rangle_{recon} is obtained by initializing the states of the visible units to a training vector, then updating the binary states of the hidden units with Eq. (3.10), and finally setting each v_i to 1 with the probability from Eq. (3.11).

Gaussian-Bernoulli RBM [25]

In the case of a Gaussian-Bernoulli RBM, the energy function becomes:

E(v, h) = \sum_{i \in visible} \frac{(v_i - a_i)^2}{2\sigma_i^2} - \sum_{j \in hidden} b_j h_j - \sum_{i,j} \frac{v_i}{\sigma_i} h_j w_{ij}    (3.14)

where \sigma_i is the standard deviation of the Gaussian noise for visible unit i. As it is difficult to learn the variance of the noise for each visible unit, in practice the data is normalized to zero mean and unit variance. The conditional probabilities for the hidden and visible units are

p(h_j = 1 | v) = \sigma\left(b_j + \sum_i \frac{v_i w_{ij}}{\sigma_i^2}\right)    (3.15)

p(v_i = v | h) = N\left(v;\ a_i + \sum_j h_j w_{ij},\ \sigma_i^2\right)    (3.16)

where N(\cdot; \mu, \sigma) denotes the Gaussian probability density function with mean \mu and standard deviation \sigma. As for the Bernoulli-Bernoulli RBM, CD learning is used to train the RBM's parameters, and the number of steps is usually set to 1 (CD1). The update rules become:

\Delta w_{ij} = \epsilon \left(\left\langle \tfrac{1}{\sigma_i^2} v_i h_j \right\rangle_{data} - \left\langle \tfrac{1}{\sigma_i^2} v_i h_j \right\rangle_{recon}\right)    (3.17)

\Delta a_i = \epsilon \left(\left\langle \tfrac{1}{\sigma_i^2} v_i \right\rangle_{data} - \left\langle \tfrac{1}{\sigma_i^2} v_i \right\rangle_{model}\right)    (3.18)

\Delta b_j = \epsilon (\langle h_j \rangle_{data} - \langle h_j \rangle_{model})    (3.19)

The toolbox [16] offers a package to build DBNs with RBMs trained with CD1. Unfortunately it only allows Bernoulli-Bernoulli RBMs, which is why I became interested in another deep learning toolbox [22], discussed further in Chapter 5.
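As an illustration of the CD1 update of Eq. (3.13), the following is a minimal Matlab sketch of one Contrastive Divergence step for a Bernoulli-Bernoulli RBM. The data, layer sizes and learning rate are placeholders, bias updates are omitted, and this is not the training code of [16] or [22].

  % Minimal CD-1 weight update for a Bernoulli-Bernoulli RBM,
  % following Eqs. (3.10)-(3.13). Bias updates omitted for brevity.
  sigm = @(a) 1 ./ (1 + exp(-a));
  nv = 23; nh = 50;                      % visible / hidden sizes (assumed)
  W = 0.01*randn(nv, nh); a = zeros(1, nv); b = zeros(1, nh);
  eps = 0.1;                             % learning rate (placeholder)
  v0 = double(rand(100, nv) > 0.5);      % toy batch of binary training vectors
  % positive phase: p(h=1|v), Eq. (3.10)
  ph0 = sigm(bsxfun(@plus, v0*W, b));
  h0  = double(ph0 > rand(size(ph0)));   % sample binary hidden states
  % reconstruction: p(v=1|h), Eq. (3.11), then hidden probabilities again
  pv1 = sigm(bsxfun(@plus, h0*W', a));
  ph1 = sigm(bsxfun(@plus, pv1*W, b));
  % CD-1 update, Eq. (3.13): <v h>_data - <v h>_recon
  dW = (v0'*ph0 - pv1'*ph1) / size(v0, 1);
  W  = W + eps*dW;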

Fig. 3.1 shows how RBMs are used to initialize the weights of a DNN and form a DBN. The example is given for a d-layer network [n_1 ... n_i ... n_d], where n_i is the number of nodes of layer i. The first, Gaussian-Bernoulli, RBM [n_1 n_2] is trained with feature vectors as inputs, and the second, Bernoulli-Bernoulli, RBM [n_2 n_3] is trained with the output of the first RBM as input. This is repeated for all the layers of the desired DBN, and then the weights are unfolded into a DNN that is fine-tuned with backpropagation.

Figure 3.1: Construction of a DBN. (figure extracted from [15])

Autoencoders

Autoencoders are a special type of DNN whose input dimension is the same as the output dimension. They are trained to encode the input into some high-level or compressed representation so that it can be reconstructed from that representation. Hence, the output target is the input [1]. An AE is composed of an encoder that encodes the input signal into a hidden layer, which is a nonlinear representation of the input:

h = f_\theta(x) = \sigma(W x + b)    (3.20)

with parameters \theta = \{W, b\} (W is the weight matrix and b an offset vector), the input vector x and the parameterized function \sigma. This deterministic mapping is then mapped back to a reconstructed vector y that has the same dimension as the input vector:

y = f_{\theta'}(h) = \sigma'(W' h + b')    (3.21)

with parameters \theta' = \{W', b'\} and parameterized function \sigma'. Notice that the parameterized functions of the encoder and the decoder can be different. In [24] they use an {affine + sigmoid} encoder together with either an affine decoder and squared-error loss, or an {affine + sigmoid} decoder and cross-entropy loss. Typically, the input of an AE is a feature vector and the output is the reconstruction of this feature. In between, one or more (deep autoencoder) hidden layers represent a transformation of the feature.

Usually, AEs and DAEs are trained using backpropagation and SGD. To avoid the backpropagation problems brought up previously, in the case of a deep autoencoder (more than one hidden layer) each layer can first be trained as an autoencoder. Using [16] we can build an autoencoder as a three-layer NN in which the input signal is also the target signal.

Denoising auto-encoders [24]

A denoising autoencoder is a variant of the autoencoder described above. It is trained to reconstruct a clean version from the corrupted signal given as input. The input signal x is first corrupted into x̂. Then x̂ is encoded, h = f_\theta(x̂) = \sigma(W x̂ + b), and decoded, y = f_{\theta'}(h) = \sigma'(W' h + b'), and the reconstruction loss is calculated between the clean version x and the reconstructed signal y. Hence, the system learns a mapping that denoises signals of the type used for training. Using the same method as for the AE, we can implement a denoising autoencoder with [16]; the inputZeroMaskedFraction parameter allows us to add noise at a certain ratio. All these tools provide us with the opportunity to build more robust features to be given as input to a pattern recognition system.
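As an illustration of Eqs. (3.20)-(3.21) and of the masking corruption just mentioned, here is a minimal Matlab sketch of one forward pass of a three-layer denoising autoencoder. Data, layer sizes and the masking ratio are placeholders; this is a sketch of the idea rather than the toolbox implementation, and training would reuse the backpropagation updates sketched earlier.

  % Minimal denoising-autoencoder forward pass, Eqs. (3.20)-(3.21):
  % corrupt the input, encode, decode, and measure the loss against the clean input.
  sigm = @(a) 1 ./ (1 + exp(-a));
  X = rand(500, 39);                       % clean feature vectors (toy data)
  [N, d] = size(X);  nh = 100;             % hidden size (assumed)
  W  = 0.1*randn(d, nh);  b  = zeros(1, nh);
  W2 = 0.1*randn(nh, d);  b2 = zeros(1, d);
  maskFraction = 0.5;                      % like inputZeroMaskedFraction in [16]
  Xc = X .* (rand(N, d) > maskFraction);   % corrupted input x_hat
  H  = sigm(bsxfun(@plus, Xc*W,  b));      % h = sigma(W x_hat + b)
  Y  = sigm(bsxfun(@plus, H*W2, b2));      % y = sigma(W' h + b')
  loss = 0.5 * mean(sum((X - Y).^2, 2));   % reconstruction loss vs clean x

Training would then update {W, b, W2, b2} so that this loss decreases over the training set.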

3.2 Speech recognition tools

Figure 3.2: Recall of the aim of a speech recognition system. (figure extracted from [26])

The goal of a recognition tool is to process a speech sequence into speech vectors and to deduce from these features the most likely corresponding sequence, as shown in Fig. 3.2. For the purpose of this thesis, I tested two different free, open-source toolkits that perform speech recognition: HTK [26] and Kaldi [17]. Both allow us to compute the state-of-the-art MFCC speech features and to perform speech recognition using GMM-HMM. In this section I first recall the theory behind a speech recognition system, and then I focus on the Kaldi toolkit.

Basic scheme of a speech recognition system

HMM-based acoustic flat model
A spoken word w is a sequence of phones K_w. Different sequences of phones may define the same word w due to different pronunciations. That is why we consider the likelihood p(y|w) over multiple pronunciations Q [7]:

p(y|w) = \sum_Q p(y|Q) \, p(Q|w)    (3.22)

where, for a particular pronunciation sequence Q and q^{(w_l)} a valid pronunciation for word w_l,

p(Q|w) = \prod_{l=1}^{L} P(q^{(w_l)} | w_l)    (3.23)

Each phone is represented by a density Hidden Markov Model with transition probability parameters {a_ij} and output distributions {b_j(·)}, as pictured in Fig. 3.3, where the states are x_i, i = 1, ..., 5.

Figure 3.3: HMM-based phone model. (figure extracted from [7])

A Markov model is a finite state machine which changes state once every time step, and each time a new speech vector is generated from the probability density {b_j(·)}. The HMM transition from the current state x_i to one of its connected states x_j is governed by the transition probability {a_ij} at each time step. In practice only the observation O is known; the state sequence X = x_1, ..., x_T is hidden [26].

The likelihood is the sum over all possible state sequences X of the joint probability P(O, X | M):

p(O | M) = \sum_X a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(o_t) \, a_{x(t)x(t+1)}    (3.24)

where x(0) and x(T+1) are constrained to be respectively the entry state and the exit state. The most probable model is the one achieving max_k p(O | M_k). Given a set of training examples, the parameters of the models can be determined using the re-estimation procedure detailed below.

Output distribution: GMM
First, we define the output distributions b_j(o_t) by a Gaussian Mixture Model:

b_j(o_t) = \sum_{m=1}^{M} c_{jm} \, N(o_t; \mu_{jm}, \Sigma_{jm})    (3.25)

where M is the number of Gaussians, c_{jm} is the weight of Gaussian m, and N(\cdot; \mu, \Sigma) is a multivariate Gaussian with mean vector \mu and covariance matrix \Sigma:

N(o; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \, e^{-\frac{1}{2}(o - \mu)' \Sigma^{-1} (o - \mu)}    (3.26)

with n the dimension of the observation o.

Re-estimation with the Baum-Welch algorithm
First, the parameters of the HMM are initialized. This is usually done by using the global mean and covariance of the training data in the output distributions and setting all transition probabilities to be equal. Then the parameters are re-estimated using the Baum-Welch algorithm:

\hat{\mu}_j = \frac{\sum_{t=1}^{T} L_j(t) \, o_t}{\sum_{t=1}^{T} L_j(t)}    (3.27)

\hat{\Sigma}_j = \frac{\sum_{t=1}^{T} L_j(t) (o_t - \mu_j)(o_t - \mu_j)'}{\sum_{t=1}^{T} L_j(t)}    (3.28)

where L_j(t) denotes the probability of being in state j at time t. L_j(t) is calculated using the Forward-Backward algorithm: the forward probability \alpha_j(t) for a model M with N states can be computed recursively as

\alpha_j(t) = P(o_1, ..., o_t, x(t) = j | M) = \left( \sum_{i=2}^{N-1} \alpha_i(t-1) \, a_{ij} \right) b_j(o_t)    (3.29)

In the same way, the backward probability can be computed:

\beta_i(t) = P(o_{t+1}, ..., o_T | x(t) = i, M) = \sum_{j=2}^{N-1} a_{ij} \, b_j(o_{t+1}) \, \beta_j(t+1)    (3.30)

Once those probabilities are computed, we can deduce L_j(t) = \alpha_j(t) \beta_j(t) / P(O | M). We update the Gaussian parameters with the new L_j(t) values and, according to the value of P(O | M), we re-iterate or stop the process.
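To illustrate the forward recursion of Eq. (3.29), here is a minimal Matlab sketch for a toy left-to-right HMM with one-dimensional Gaussian emissions. All parameter values are placeholders, entry and exit states are not modeled separately, and a practical implementation (as in HTK or Kaldi) would work with log-probabilities to avoid underflow on long utterances.

  % Minimal forward-algorithm sketch for Eq. (3.29): alpha(j,t) accumulates
  % P(o_1..o_t, x(t)=j | M). Toy 3-state HMM with 1-D Gaussian emissions.
  A  = [0.6 0.4 0.0;                     % transition probabilities a_ij
        0.0 0.7 0.3;
        0.0 0.0 1.0];
  mu = [-1 0 1];  sg = [1 1 1];          % emission means / standard deviations
  pi0 = [1 0 0];                         % start in the first state
  o  = [-0.9 -1.1 0.2 0.1 1.3];          % toy observation sequence
  T  = numel(o);  N = numel(mu);
  gauss = @(x, m, s) exp(-0.5*((x-m)./s).^2) ./ (s*sqrt(2*pi));
  alpha = zeros(N, T);
  alpha(:,1) = pi0' .* gauss(o(1), mu, sg)';
  for t = 2:T
    for j = 1:N
      alpha(j,t) = (alpha(:,t-1)' * A(:,j)) * gauss(o(t), mu(j), sg(j));
    end
  end
  likelihood = sum(alpha(:,T));          % p(O | M), as in Eq. (3.24)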

Decoding
Now that we have good estimates of the transition probabilities, we need to find the best path through the states that explains the speech; in theory this is done using the Viterbi algorithm and recursive calculations. In Kaldi the decoding is carried out using graphs and decision trees, and the method is explained further in the next paragraphs.

HTK versus Kaldi

HTK is a toolbox created to build HMM systems dedicated to speech recognition. After carrying out the tutorials of both toolkits, I became aware of their strengths and weaknesses. I chose to continue my work using Kaldi because it ran better on my operating system, appeared more flexible to me, and its code and architecture were easier for me to understand. Kaldi is written in C++ and also integrates code for DNNs, which makes it a more complete and modern speech recognition toolkit.

General overview of Kaldi

Figure 3.4: Overview of Kaldi tools. (figure extracted from [17])

External libraries
BLAS (Basic Linear Algebra Subroutines) and LAPACK (Linear Algebra PACKage) are numerical algebra libraries, and OpenFst is a library for constructing, combining, optimizing and searching weighted finite-state transducers (FSTs); among other things it allows the representation of a probabilistic model, and it is used in Kaldi for the finite-state framework [17].

Kaldi library
The modules of the Kaldi library contain command-line tools to be used for speech recognition. For instance, in the module feat we can find a command to compute the MFCC or another type of feature. In this thesis, we used the MFCC library only at the start with Kaldi. Given that the aim of the project is to create more robust features and to assess their performance using a standard pattern recognition system, Kaldi was used only for its GMM-HMM system. Notice also that I used the Kaldi tools dedicated to the TIMIT database (further detailed in Chapter 5).

From speech to decoding with Kaldi

Data, dictionary and language preparation
The data is first prepared with timit_data_prep.sh: the path to the TIMIT directory is provided, and the program evaluates the list of speakers, finds the list of audio and transcript files and converts them, creates mapping files between speakers and utterances (one utterance is one speech signal) as well as a gender mapping, and finally writes the STM files necessary for scoring (getting the error rates) with NIST's sclite tool [6]. timit_prepare_dict.sh creates the dictionary, which is a sorted list of the phones present in the training scripts. Finally, timit_format_data.sh does the language preparation, which consists in creating an N-gram language model. An N-gram language model provides the prior probability of a phone sequence k = k_1, ..., k_K:

P(k) = \prod_{n=1}^{K} P(k_n | k_{n-1}, ..., k_1)    (3.31)

In practice, to form an N-gram LM, the conditioning in Eq. (3.31) is truncated and the formula becomes:

P(k) = \prod_{n=1}^{K} P(k_n | k_{n-1}, ..., k_{n-N+1})    (3.32)

In our case, a phone bigram LM is computed using [4], and the data is converted into a canonical form and saved in binary-format .fst files. In prepare_lang.sh a directory is set up; the phones are organized into silence and non-silence categories, and the script allows us to remove the optional silence phone sil by requiring its probability to be 0. This is done to avoid the scoring of silence phones.

Feature extraction
In my work, feature extraction is done in Matlab and the vectors are converted into Kaldi format. This Kaldi format is a 2-file format that describes the data:

scp format: a text file in which each line has a key (utterance id) and an extended filename that tells Kaldi where to find the data. See Appendix A.2 for examples.

ark format: a binary file in which each utterance, identified by its key, has its object data. See Appendix A.1 for examples.

The conversion from Matlab to Kaldi format is detailed in Chapter 5.

Training with train_mono.sh
The script performs a flat-start, monophone training with delta-delta features (that is to say, for instance, MFCC + deltas + delta-deltas). In gmm-init-mono a flat-start monophone set is created in which each base phone is a monophone single-Gaussian HMM, with means and covariances equal to the mean and covariance of the training data. The shared-phones option allows common probability density functions for specified sets of phones; otherwise all phones are separate. compile-train-graphs compiles the training graphs. Then the statistics of the GMMs are accumulated and a first estimation is done. This process {accumulation of statistics + re-estimation of the Gaussian parameters using iterations of EM (cf. Baum-Welch)} is iterated a fixed number of times. A decision tree is created for each state in each phone, and they are exported to graphs that serve for the decoding. The final model is saved as final.mdl.

Decoding with decode.sh and scoring with score_basic.sh
Decoding is performed using the graphs saved at the end of training, which contain the language model, the dictionary and the HMM definition. The system checks that the feature vector dimensions at test time are the same as at training time, and gmm-latgen-faster is used to decode the testing data; the results are saved in an archive. For scoring, lattice-best-path uses the previous results saved in an archive to find the best path through the phone sequence. The resulting phone map is compared to the reference map, and the Phone Error Rate or Word Error Rate is computed with compute-wer.
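For reference, the scp/ark pair described above can also be produced without the toolbox of [23]: the sketch below writes Matlab feature matrices to Kaldi's text archive format, which could then be converted to a binary ark and an scp file with Kaldi's copy-feats tool. Keys, file names and data are placeholders, and this is an illustrative sketch of the format rather than the conversion code used in the thesis (C.4.3).

  % Minimal sketch: write feature matrices to a Kaldi text archive.
  % The text archive can then be converted to a binary ark/scp pair, e.g.
  %   copy-feats ark,t:feats.txt ark,scp:feats.ark,feats.scp
  feats = struct('key', {'fdab0_sa1', 'fdab0_sa2'}, ...
                 'mat', {rand(300, 39), rand(250, 39)});   % toy utterances
  fid = fopen('feats.txt', 'w');
  for u = 1:numel(feats)
    fprintf(fid, '%s  [\n', feats(u).key);                 % "key  ["
    M = feats(u).mat;
    for t = 1:size(M, 1)
      fprintf(fid, '  %.6f', M(t, :));                     % one frame per line
      if t < size(M, 1), fprintf(fid, '\n'); else fprintf(fid, ' ]\n'); end
    end
  end
  fclose(fid);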

4 Problem statement

Once I got my hands on those tools, I was able to understand how to work with DNNs and to think about how to improve the current MFCC. The main goal of this thesis is to consider a side path that had not been explored before. Indeed, most authors of the research papers cited so far have been using DNNs and speech recognition systems for years and, most importantly, they have more computational power at their disposal than I have. In the following paragraphs I explain the thought process I followed during this thesis.

The first and basic assumption we made is that the nonlinear mapping learned by a DNN might be more robust to noise than a standard linear mapping. The idea behind this supposition is that the network learns high-level representations of the data and so captures the main characteristics of speech. That is why we thought of introducing autoencoders into the standard construction of the MFCC. We began by trying the configuration of Fig. 4.1.

Figure 4.1: Integration of an AE in the MFCC block diagram.

The global idea was that we could eventually train NNs to replace blocks of the MFCC computation, see Fig. 4.2.

Figure 4.2: Replacement of MFCC blocks by NNs

After a standard training (weights randomly initialized from a Gaussian distribution) of the AE shown in Fig. 4.1, the SNR between the input feature x and the reconstruction x̂ was very low (no more than 2 dB). Recall,

SNR_{dB} = 10 \log_{10} \left( \frac{\|x\|^2}{\|x - \hat{x}\|^2} \right)    (4.1)

Moreover, the deep learning toolbox [16] only considers binary-binary RBMs, so we could not use it efficiently on the real-valued power spectrum. That is why I came to use [22], which permits the training of a Gaussian-Bernoulli RBM. Inspiration for the parameter settings used for unsupervised pre-training was taken from [11] and [13]. Even with such a pre-training we were unable to increase the SNR efficiently.

The main challenge is to produce features that are robust to noise, so I became interested in denoising autoencoders. Denoising autoencoders are trained to produce a clean reconstruction of a noisy input. The system I built was inspired by [5]: the MFCC are computed and a deep denoising autoencoder is used to clean the data. But, similarly to my first implementation, the problems with unsupervised learning were still present, so I decided to pre-train the first mapping of the denoising AE with another AE using [16]:

Figure 4.3: Deep denoising autoencoder with pre-training

On this working base, we thought of improving the performance of the deep denoising autoencoder by using a clustering method. The data would be divided using a standard clustering algorithm and each group would train a specific deep autoencoder. The assumption is that the less the data varies, the more efficient the mapping of the DDAE is. The implementation and experimental settings of those different attempts are detailed in the next chapter.
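As a side note on Eq. (4.1), the following Matlab sketch computes the SNR of a reconstruction and, conversely, scales a noise signal so that the resulting noisy signal reaches a target SNR, which is the operation needed later to build the 5, 10 and 15 dB noisy versions of the data. The signals and values are placeholders; this is not the appendix code (C.4.1).

  % SNR of a reconstruction per Eq. (4.1), and scaling of a noise signal to
  % reach a target SNR. Illustrative only.
  x      = randn(1, 16000);                % clean signal (placeholder)
  xhat   = x + 0.1*randn(size(x));         % some reconstruction of x
  snr_db = 10*log10(sum(x.^2) / sum((x - xhat).^2));    % Eq. (4.1)

  noise     = randn(size(x));              % e.g. white noise; babble/car in the thesis
  target_db = 5;                           % 5, 10 or 15 dB in the experiments
  scale     = sqrt(sum(x.^2) / (sum(noise.^2) * 10^(target_db/10)));
  x_noisy   = x + scale*noise;             % noisy signal at the target SNR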

5 Experiments and implementation

All the implementation was done using Matlab and Xcode. A global view of the implementation is shown in Fig. 5.1.

Figure 5.1: Global scheme of the system

First I describe the speech database used in the experiments, and then I detail the implementation.

5.1 TIMIT database

The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus provides the user with a speech database in the English language. It contains 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States. The database is organized as follows: 8 directories {dr1, ..., dr8} correspond to the different dialects. Within a directory drN, the data is divided into folders corresponding to the speakers, referenced with a {3-letter + digit} code, e.g. DAB0. For one speech signal, i.e. one sentence spoken by one particular speaker, the TIMIT corpus includes 4 different files:

.wav: the 16 kHz speech signal, named after the sentence code, e.g. SA1.
.txt: transcription file of the words the person said (the sentence).
.wrd: transcription file of the time-aligned words the person said.
.phn: transcription file of the time-aligned phones the person said.

Samples of the transcription files are shown in B.1, B.2 and B.3 respectively, and the code to load the data into Matlab is detailed in C.1. Not all testing files from TIMIT were used for the experiments: only 192 utterances were selected, following the basic run.sh Kaldi example script for TIMIT, in which 3 speakers are selected for each dialect region and 8 utterances are considered for each of them.

5.2 AE integrated in MFCC computational line

Framing setting
Frame length: 25 ms
Frame shift: 10 ms
These values are standard ones for speech recognition, and they are the ones used by default in Kaldi's feature extraction system.

Basic MFCCs computation
The MFCC implementation is based on [2] and the complete code is detailed in C.2.1 and C.2.2. We start by implementing the basic MFCC. As in Kaldi, the number of triangular filters is set to M = 23 and the feature dimension is set to Q = 12. The log-energy is added as the 13th coefficient of the MFCC, and finally the deltas and delta-deltas are computed and concatenated to the MFCC to form a 39-dimension feature vector. Recall, the deltas and delta-deltas are computed between frames; they are the first- and second-order frame-to-frame differences. The filterbanks are created as follows: first the start, center and end frequencies of each filter are calculated in the mel domain; then they are transformed back to the normal frequency domain and converted to the sample scale; finally, each filter is calculated in the normal frequency domain using standard line equations. Code for computing the filterbanks is available in C.2.3. Notice that the Hamming window of Fig. 5.2 is applied before calculating the power spectrum (C.2.4).

Figure 5.2: Plot of the Hamming window
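For orientation, the following is a minimal Matlab sketch of the framing, windowing, power spectrum, mel filterbank, log and DCT chain just described, using the settings above (25 ms frames, 10 ms shift, M = 23 filters, Q = 12 coefficients plus log-energy). It is a simplified illustration (no pre-emphasis, simplified triangular filters), not the implementation of [2] / Appendix C.2.

  % Minimal MFCC chain sketch: framing, Hamming window, power spectrum,
  % mel filterbank, log, DCT. Settings follow the text above.
  fs = 16000;  x = randn(1, fs);                 % 1 s of toy "speech"
  flen = round(0.025*fs);  fshift = round(0.010*fs);
  nfft = 512;  M = 23;  Q = 12;
  w = 0.54 - 0.46*cos(2*pi*(0:flen-1)'/(flen-1));        % Hamming window
  mel  = @(f) 2595*log10(1 + f/700);  imel = @(m) 700*(10.^(m/2595) - 1);
  edges = imel(linspace(mel(0), mel(fs/2), M+2));        % filter edge frequencies
  bins  = floor(edges/fs*nfft) + 1;                      % corresponding FFT bins
  H = zeros(M, nfft/2+1);                                % triangular filters
  for m = 1:M
    H(m, bins(m):bins(m+1))   = linspace(0, 1, bins(m+1)-bins(m)+1);
    H(m, bins(m+1):bins(m+2)) = linspace(1, 0, bins(m+2)-bins(m+1)+1);
  end
  D = cos(pi*(1:Q)'*((0:M-1)+0.5)/M);                    % DCT-II basis
  nframes = floor((numel(x)-flen)/fshift) + 1;
  mfcc = zeros(nframes, Q+1);
  for i = 1:nframes
    frame = x((i-1)*fshift + (1:flen))' .* w;            % framing + windowing
    P = abs(fft(frame, nfft)).^2;                        % power spectrum
    fbe = H * P(1:nfft/2+1);                             % filterbank energies
    mfcc(i, :) = [ (D*log(fbe))' , log(sum(P)) ];        % 12 MFCC + log-energy
  end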

Integration of the AE in the basic scheme
At first, when we integrated the AE into the basic MFCC scheme, we took the mid-layer nodes as the new feature, as shown in Fig. 5.3.

Figure 5.3: Integration of AE in the MFCC computations

Improvement attempts
The script used with the second toolbox is detailed in C.3.1. To understand the issues of the pre-training with RBMs, we tried different sizes of input features. In particular, I created a system in which each FBE of a frame was the input feature of a set of autoencoders. In that way, the input size of the NN was reduced and it included between 3 and around 10 samples. Even then the SNR stayed very low. Indeed it seemed that the parameters could not learn a correct mapping because the variance of the data was very large: for some vectors the reconstruction was quite good, but for others it was very bad. After this observation, we decided to concatenate the obtained features with the standard MFCC, but the recognition performance was still poorer than when only the MFCC were used, even when using a coefficient-to-coefficient concatenation to reduce the correlation of the resulting vector. Then we switched to using the output of the AE as the input to the rest of the MFCC block diagram, but there was no difference at all. We also moved the AE along the MFCC block diagram, but there was no difference in performance. I concluded that something in the pre-training and training was going wrong, but I could not put my finger on it. Finally I decided to bypass the issues created by the RBM pre-training and focused on denoising autoencoders.

5.3 Denoising DNN with supervised pre-training

Following [5], we created a deep denoising autoencoder. Our method differs in the pre-training, for which we use an AE to initialize the first weight matrix W_1 of the network, as shown in Fig. 4.3.

5.3.1 Noisy signals

The TIMIT database only provides us with clean speech. To realize the training and testing of our system, we added noise to the data. Three different types of noise were used: Gaussian white noise, car noise and babble noise. For the sake of reflecting real conditions, the noisy signals were built at 3 different SNR levels, 5 dB, 10 dB and 15 dB, for each type of noise. The code used to create the noisy signals is detailed in C.4.1.

5.3.2 Final solution: denoising deep autoencoder with supervised pre-training and VQ

Using the DDAE described above, we introduce vector quantization into the process. The training data is clustered using the K-means algorithm; the codebook is saved and used for denoising the testing data. Experiments were run for the numbers of centroids q = {2, 4, 8, 16}.

5.3.3 K-means theory

The K-means algorithm is quite simple. To start, random centroids are picked. Then the Euclidean distances between all observations and the centroids are computed, and the affiliation to a cluster is defined by the minimum distance to a centroid. The new centroids of the clusters are then re-calculated. These two steps are repeated until convergence; a minimal sketch of this procedure applied to feature vectors is given below.

Figure 5.4: Demonstration of the K-means algorithm
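The following Matlab sketch shows the two alternating K-means steps described above, applied to MFCC-like feature vectors. The data is a placeholder, a fixed iteration cap replaces a proper convergence test, and this is not the thesis implementation; the resulting codebook C is what would be saved to cluster the test data and to dispatch frames to the q DDAEs.

  % Minimal K-means sketch: assignment by Euclidean distance, centroid update.
  X = rand(5000, 39);                       % feature vectors (one per row)
  q = 4;                                    % number of centroids / clusters
  C = X(randperm(size(X,1), q), :);         % random initial centroids
  for it = 1:100                            % fixed cap instead of a convergence test
    % assignment step: nearest centroid in Euclidean distance
    d = zeros(size(X,1), q);
    for k = 1:q
      d(:,k) = sum(bsxfun(@minus, X, C(k,:)).^2, 2);
    end
    [~, labels] = min(d, [], 2);
    % update step: recompute each centroid as the mean of its cluster
    for k = 1:q
      if any(labels == k), C(k,:) = mean(X(labels == k, :), 1); end
    end
  end
  % 'C' is the codebook saved for the test data; 'labels' selects which of the
  % q DDAEs each training frame is used to train.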

NN setting

Concerning the parameters of the DDAE:

The activation function. The sigmoid function is the most common activation function used for NNs in speech applications. It maps [-∞, +∞] to ]0, 1[ and is defined by

S(t) = \frac{1}{1 + e^{-t}}    (5.1)

The pre-training autoencoder. The layers are of sizes [39 100], since the input is the noisy 39-dimension MFCC. The activation function equals the output function and is the sigmoid. A noise factor of 0.5 is added. The learning rate is set to 0.5 and the momentum is also set to 0.5. The optimization is run over 25 epochs with a batch size of 250.

The DDAE. The layers are of sizes [ ], since the input is the noisy 39-dimension MFCC and the output is its clean reconstruction. The activation function is the sigmoid (encoder) and the output function is affine (decoder). The learning rate is set to and the momentum to 0.5. The optimization is run over 50 epochs with a batch size of 250. The code for the training is detailed in C.4.2; a sketch of this setup using the interface of [16] is given at the end of this chapter.

Kaldi: conversion

After the computation of the denoised features in Matlab, the training and testing data are converted to Kaldi format, and the tool kaldi-to-matlab [23] is used in C.4.3 to create the .ark and .scp files mentioned earlier.

Kaldi setting

Concerning the parameters of Kaldi:
max_iter_inc=30 (last iteration on which to increase the number of Gaussians)
totgauss=1000 (number of target Gaussians)
num_iters=40 (number of training iterations)
realign_iters= (iterations at which to re-align)
boost_silence=1.0 (factor by which to boost silence likelihoods in alignment)
beam=6
careful=false (alignment option)

I do not divide the data for HMM optimization, so {train_nj=1, test_nj=1}. The reason why we chose to run a very simple recognition using a monophone system is that the aim here is to compare the performance of our features to the state-of-the-art MFCC. As long as we have the performance of the basic {MFCC + GMM-HMM} association in this system, we can compare it to the {our feature + GMM-HMM} association. Looking at the code in C.4.4, you can notice that I slightly modified the code usually provided by Kaldi for TIMIT recognition: I wanted to feed the system with my own features, and to do so I had to remove the automatic computation of deltas and delta-deltas in Kaldi.
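Putting the two trainings above together, here is a hedged sketch of how the pre-training AE and the DDAE could be set up with the nnsetup/nntrain interface of the toolbox [16]. The field names follow that toolbox, the DDAE hidden-layer size (100) and its learning rate are assumptions (the exact values are not legible in the source), and this is not the thesis code of C.4.2.

  % Hedged sketch of the AE pre-training + DDAE fine-tuning described above,
  % using the toolbox [16] (assumed to be on the Matlab path).
  noisyMFCC = rand(10000, 39);  cleanMFCC = rand(10000, 39);   % placeholders

  % Pre-training autoencoder, sigmoid encoder/decoder, masking noise 0.5
  ae = nnsetup([39 100 39]);
  ae.activation_function     = 'sigm';
  ae.output                  = 'sigm';
  ae.learningRate            = 0.5;
  ae.momentum                = 0.5;
  ae.inputZeroMaskedFraction = 0.5;        % "noise factor" of 0.5
  opts.numepochs = 25;  opts.batchsize = 250;
  ae = nntrain(ae, noisyMFCC, noisyMFCC, opts);   % input = target

  % DDAE: sigmoid encoder, affine (linear) decoder, W_1 taken from the AE
  ddae = nnsetup([39 100 39]);             % hidden size 100 is an assumption
  ddae.activation_function = 'sigm';
  ddae.output              = 'linear';
  ddae.momentum            = 0.5;
  % ddae.learningRate left at its default: the value used in the thesis is not legible
  ddae.W{1} = ae.W{1};                     % supervised pre-training of W_1
  opts.numepochs = 50;  opts.batchsize = 250;
  ddae = nntrain(ddae, noisyMFCC, cleanMFCC, opts);   % denoising target = clean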

6 Results

Using the command bash RESULTS test, Kaldi displays the word error rate (expressed as a percentage) of all tested systems in the terminal. In our case, we consider the phone error rate instead, defined by

PER = \frac{S + D + I}{N}    (6.1)

where S is the number of substitutions, D the number of deletions, I the number of insertions and N the total number of phones in the data.

6.1 Effect of the VQ of the feature vectors on denoising

We can start by looking at the statistics of the data VQ in Table 6.1.

Table 6.1: Data clustering statistics (weight of each cluster in the total data, for q = 2, 4, 8 and 16)

We can see that for q = {2, 4} the data is distributed quite equitably. However, when q increases, some clusters become really small. This can cause issues in the training of the neural network; indeed, at least one frame is needed to learn a parameter, and in our case the NN must learn a large number of parameters. For the sake of training, the amount of training data given to the NN was calculated to allow the learning of each parameter over 5 frames. When the number of clusters q was increased, the amount of training data was increased accordingly to keep the 5 frames per parameter.

For each clustering level, we plot the SNR of the output of the DDAE as a function of q to evaluate the impact of clustering on the efficiency of the DDAE. The results are shown in Fig. 6.1. We notice that for q = 4 the system must have encountered an issue. Apart from that, the general assumption that clustering the data improves the efficiency of a global DDAE system seems to be verified.

Figure 6.1: Evolution of the denoising level as a function of the clustering level q

6.2 Effect of the VQ of the feature vectors on speech recognition

Recall, the tests were computed over a range of 3 different noises added to the clean speech at different levels: 5 dB, 10 dB and 15 dB. Fig. 6.2 shows the reference PER to which we can compare the PER achieved by our new denoised feature vectors. Fig. 6.3 shows the results in terms of PER of our speech recognition system for q = {1, 2, 4, 8, 16}.

Figure 6.2: Results in terms of PER (%) for speech recognition: (a) clean training, (b) noisy training

Figure 6.3: Results in terms of PER (%) for speech recognition with denoised training: (a) q = 1, (b) q = 2, (c) q = 4, (d) q = 8, (e) q = 16

6.3 Discussion of the results

Speech recognition performance
In real conditions, speech recognition is always performed on noisy speech, because when hands-free devices are used there is always a certain level of environmental noise. The performance achieved by the system trained on clean data is then considered the top efficiency that can be reached with this particular system setting. When training the system on data reflecting real outdoor conditions, the PER rises by about 5% when testing on clean speech, but it drops by around 20% when testing on noisy speech; indeed, the performance of the clean-trained system on noisy test speech is very poor. Our DDAE solution based on VQ seems to reduce the PER by about 2-3% for all kinds of noise.

If we consider babble noise at a level of 5 dB, which challenges the recognition system particularly well because it must recognize speech among speech, the performance improves by 2.9%; see Fig. 6.4.

Figure 6.4: Evolution of the PER for a noisy babble signal at 5 dB

Efficiency of the DDAE
The DDAE efficiency shown in the previous section is mixed. The NN mapping certainly introduces a noise of its own, but it should also be able to learn a very efficient denoising mapping. Fig. 6.1 shows that increasing q increases the SNR. But it must be noted that the DDAE particularly helps the denoising of very noisy signals (with a 5 dB SNR), while it also damages the less noisy signals, particularly those with a 15 dB SNR.

7 Conclusion and future work

Issues in the implementation forced us to use an alternative way to initialize the weight matrix of the NN. It follows from that implementation that the denoising system proposed in this thesis is less reliable than the one presented in [5]. However, the contribution of this work is the use of a clustering algorithm so that the denoising system becomes better fitted to the data. Indeed, the division of the data allows each DDAE to learn a more specialized denoising mapping, which is more efficient for producing robust features. We must also point out the time-consuming aspect of our system: naturally, learning q different DDAEs does, in addition to requiring a large amount of data, take a lot of time to run. The results presented in the previous chapter demonstrate that the novel {MFCC + DDAE_q + GMM-HMM} system allows a 2-3% improvement in recognition. Nevertheless, it might be improved by exploring the following leads:

Explore the overfitting of the system by comparing the performance of the DDAE on its training data and on the testing data.

Increase the amount of training data for the neural nets. Indeed, in this work the amount was limited by the TIMIT corpus but also by the computational cost of learning from such large databases.

Concerning pre-training, the first step would be to extend the AE pre-training to each layer of the DDAE, as is done with RBM pre-training. A straightforward improvement is also to use RBM unsupervised learning to pre-train the DDAE.

Concatenate the denoised MFCC with the standard MFCC to try to compensate for the noise introduced by the DDAE, and maybe use the coefficient-to-coefficient trick to reduce the correlation of the output vector.

Appendices

A Kaldi formats

Figure A.1: ark-file produced from Matlab

Figure A.2: scp-file produced from Matlab

B TIMIT files

Figure B.1: orthographic transcription file (.txt)
Figure B.2: time-aligned word transcription file (.wrd)

Figure B.3: time-aligned phone transcription file (.phn)


Deep Learning Autoencoder Models Deep Learning Autoencoder Models Davide Bacciu Dipartimento di Informatica Università di Pisa Intelligent Systems for Pattern Recognition (ISPR) Generative Models Wrap-up Deep Learning Module Lecture Generative

More information

arxiv: v1 [cs.cl] 23 Sep 2013

arxiv: v1 [cs.cl] 23 Sep 2013 Feature Learning with Gaussian Restricted Boltzmann Machine for Robust Speech Recognition Xin Zheng 1,2, Zhiyong Wu 1,2,3, Helen Meng 1,3, Weifeng Li 1, Lianhong Cai 1,2 arxiv:1309.6176v1 [cs.cl] 23 Sep

More information

Robust Speaker Identification

Robust Speaker Identification Robust Speaker Identification by Smarajit Bose Interdisciplinary Statistical Research Unit Indian Statistical Institute, Kolkata Joint work with Amita Pal and Ayanendranath Basu Overview } } } } } } }

More information

Noise Robust Isolated Words Recognition Problem Solving Based on Simultaneous Perturbation Stochastic Approximation Algorithm

Noise Robust Isolated Words Recognition Problem Solving Based on Simultaneous Perturbation Stochastic Approximation Algorithm EngOpt 2008 - International Conference on Engineering Optimization Rio de Janeiro, Brazil, 0-05 June 2008. Noise Robust Isolated Words Recognition Problem Solving Based on Simultaneous Perturbation Stochastic

More information

Deep Neural Networks

Deep Neural Networks Deep Neural Networks DT2118 Speech and Speaker Recognition Giampiero Salvi KTH/CSC/TMH giampi@kth.se VT 2015 1 / 45 Outline State-to-Output Probability Model Artificial Neural Networks Perceptron Multi

More information

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

A graph contains a set of nodes (vertices) connected by links (edges or arcs) BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

More information

Lecture 16 Deep Neural Generative Models

Lecture 16 Deep Neural Generative Models Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

10. Hidden Markov Models (HMM) for Speech Processing. (some slides taken from Glass and Zue course)

10. Hidden Markov Models (HMM) for Speech Processing. (some slides taken from Glass and Zue course) 10. Hidden Markov Models (HMM) for Speech Processing (some slides taken from Glass and Zue course) Definition of an HMM The HMM are powerful statistical methods to characterize the observed samples of

More information

Automatic Speech Recognition (CS753)

Automatic Speech Recognition (CS753) Automatic Speech Recognition (CS753) Lecture 12: Acoustic Feature Extraction for ASR Instructor: Preethi Jyothi Feb 13, 2017 Speech Signal Analysis Generate discrete samples A frame Need to focus on short

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Deep Learning for Speech Recognition. Hung-yi Lee

Deep Learning for Speech Recognition. Hung-yi Lee Deep Learning for Speech Recognition Hung-yi Lee Outline Conventional Speech Recognition How to use Deep Learning in acoustic modeling? Why Deep Learning? Speaker Adaptation Multi-task Deep Learning New

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.5. Spring 2010 Instructor: Dr. Masoud Yaghini Outline How the Brain Works Artificial Neural Networks Simple Computing Elements Feed-Forward Networks Perceptrons (Single-layer,

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information

Segmental Recurrent Neural Networks for End-to-end Speech Recognition

Segmental Recurrent Neural Networks for End-to-end Speech Recognition Segmental Recurrent Neural Networks for End-to-end Speech Recognition Liang Lu, Lingpeng Kong, Chris Dyer, Noah Smith and Steve Renals TTI-Chicago, UoE, CMU and UW 9 September 2016 Background A new wave

More information

Knowledge Extraction from DBNs for Images

Knowledge Extraction from DBNs for Images Knowledge Extraction from DBNs for Images Son N. Tran and Artur d Avila Garcez Department of Computer Science City University London Contents 1 Introduction 2 Knowledge Extraction from DBNs 3 Experimental

More information

Greedy Layer-Wise Training of Deep Networks

Greedy Layer-Wise Training of Deep Networks Greedy Layer-Wise Training of Deep Networks Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle NIPS 2007 Presented by Ahmed Hefny Story so far Deep neural nets are more expressive: Can learn

More information

Engineering Part IIB: Module 4F11 Speech and Language Processing Lectures 4/5 : Speech Recognition Basics

Engineering Part IIB: Module 4F11 Speech and Language Processing Lectures 4/5 : Speech Recognition Basics Engineering Part IIB: Module 4F11 Speech and Language Processing Lectures 4/5 : Speech Recognition Basics Phil Woodland: pcw@eng.cam.ac.uk Lent 2013 Engineering Part IIB: Module 4F11 What is Speech Recognition?

More information

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others) Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward

More information

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?

More information

Singer Identification using MFCC and LPC and its comparison for ANN and Naïve Bayes Classifiers

Singer Identification using MFCC and LPC and its comparison for ANN and Naïve Bayes Classifiers Singer Identification using MFCC and LPC and its comparison for ANN and Naïve Bayes Classifiers Kumari Rambha Ranjan, Kartik Mahto, Dipti Kumari,S.S.Solanki Dept. of Electronics and Communication Birla

More information

The Origin of Deep Learning. Lili Mou Jan, 2015

The Origin of Deep Learning. Lili Mou Jan, 2015 The Origin of Deep Learning Lili Mou Jan, 2015 Acknowledgment Most of the materials come from G. E. Hinton s online course. Outline Introduction Preliminary Boltzmann Machines and RBMs Deep Belief Nets

More information

Signal Modeling Techniques in Speech Recognition. Hassan A. Kingravi

Signal Modeling Techniques in Speech Recognition. Hassan A. Kingravi Signal Modeling Techniques in Speech Recognition Hassan A. Kingravi Outline Introduction Spectral Shaping Spectral Analysis Parameter Transforms Statistical Modeling Discussion Conclusions 1: Introduction

More information

Machine Learning. Neural Networks

Machine Learning. Neural Networks Machine Learning Neural Networks Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 Biological Analogy Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 THE

More information

Artificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino

Artificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino Artificial Neural Networks Data Base and Data Mining Group of Politecnico di Torino Elena Baralis Politecnico di Torino Artificial Neural Networks Inspired to the structure of the human brain Neurons as

More information

Deep Belief Networks are compact universal approximators

Deep Belief Networks are compact universal approximators 1 Deep Belief Networks are compact universal approximators Nicolas Le Roux 1, Yoshua Bengio 2 1 Microsoft Research Cambridge 2 University of Montreal Keywords: Deep Belief Networks, Universal Approximation

More information

Statistical NLP for the Web

Statistical NLP for the Web Statistical NLP for the Web Neural Networks, Deep Belief Networks Sameer Maskey Week 8, October 24, 2012 *some slides from Andrew Rosenberg Announcements Please ask HW2 related questions in courseworks

More information

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others) Machine Learning Neural Networks (slides from Domingos, Pardo, others) Human Brain Neurons Input-Output Transformation Input Spikes Output Spike Spike (= a brief pulse) (Excitatory Post-Synaptic Potential)

More information

Automatic Speech Recognition (CS753)

Automatic Speech Recognition (CS753) Automatic Speech Recognition (CS753) Lecture 21: Speaker Adaptation Instructor: Preethi Jyothi Oct 23, 2017 Speaker variations Major cause of variability in speech is the differences between speakers Speaking

More information

Hidden Markov Modelling

Hidden Markov Modelling Hidden Markov Modelling Introduction Problem formulation Forward-Backward algorithm Viterbi search Baum-Welch parameter estimation Other considerations Multiple observation sequences Phone-based models

More information

Jorge Silva and Shrikanth Narayanan, Senior Member, IEEE. 1 is the probability measure induced by the probability density function

Jorge Silva and Shrikanth Narayanan, Senior Member, IEEE. 1 is the probability measure induced by the probability density function 890 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 3, MAY 2006 Average Divergence Distance as a Statistical Discrimination Measure for Hidden Markov Models Jorge Silva and Shrikanth

More information

An Introduction to Bioinformatics Algorithms Hidden Markov Models

An Introduction to Bioinformatics Algorithms   Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Upper Bound Kullback-Leibler Divergence for Hidden Markov Models with Application as Discrimination Measure for Speech Recognition

Upper Bound Kullback-Leibler Divergence for Hidden Markov Models with Application as Discrimination Measure for Speech Recognition Upper Bound Kullback-Leibler Divergence for Hidden Markov Models with Application as Discrimination Measure for Speech Recognition Jorge Silva and Shrikanth Narayanan Speech Analysis and Interpretation

More information

Hidden Markov Models and Gaussian Mixture Models

Hidden Markov Models and Gaussian Mixture Models Hidden Markov Models and Gaussian Mixture Models Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 4&5 23&27 January 2014 ASR Lectures 4&5 Hidden Markov Models and Gaussian

More information

An Evolutionary Programming Based Algorithm for HMM training

An Evolutionary Programming Based Algorithm for HMM training An Evolutionary Programming Based Algorithm for HMM training Ewa Figielska,Wlodzimierz Kasprzak Institute of Control and Computation Engineering, Warsaw University of Technology ul. Nowowiejska 15/19,

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Sound Recognition in Mixtures

Sound Recognition in Mixtures Sound Recognition in Mixtures Juhan Nam, Gautham J. Mysore 2, and Paris Smaragdis 2,3 Center for Computer Research in Music and Acoustics, Stanford University, 2 Advanced Technology Labs, Adobe Systems

More information

Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015

Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015 Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch COMP-599 Oct 1, 2015 Announcements Research skills workshop today 3pm-4:30pm Schulich Library room 313 Start thinking about

More information

Recurrent Neural Networks

Recurrent Neural Networks Recurrent Neural Networks Datamining Seminar Kaspar Märtens Karl-Oskar Masing Today's Topics Modeling sequences: a brief overview Training RNNs with back propagation A toy example of training an RNN Why

More information

A New OCR System Similar to ASR System

A New OCR System Similar to ASR System A ew OCR System Similar to ASR System Abstract Optical character recognition (OCR) system is created using the concepts of automatic speech recognition where the hidden Markov Model is widely used. Results

More information

Reformulating the HMM as a trajectory model by imposing explicit relationship between static and dynamic features

Reformulating the HMM as a trajectory model by imposing explicit relationship between static and dynamic features Reformulating the HMM as a trajectory model by imposing explicit relationship between static and dynamic features Heiga ZEN (Byung Ha CHUN) Nagoya Inst. of Tech., Japan Overview. Research backgrounds 2.

More information

The error-backpropagation algorithm is one of the most important and widely used (and some would say wildly used) learning techniques for neural

The error-backpropagation algorithm is one of the most important and widely used (and some would say wildly used) learning techniques for neural 1 2 The error-backpropagation algorithm is one of the most important and widely used (and some would say wildly used) learning techniques for neural networks. First we will look at the algorithm itself

More information

Statistical Sequence Recognition and Training: An Introduction to HMMs

Statistical Sequence Recognition and Training: An Introduction to HMMs Statistical Sequence Recognition and Training: An Introduction to HMMs EECS 225D Nikki Mirghafori nikki@icsi.berkeley.edu March 7, 2005 Credit: many of the HMM slides have been borrowed and adapted, with

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

Speech Recognition HMM

Speech Recognition HMM Speech Recognition HMM Jan Černocký, Valentina Hubeika {cernocky ihubeika}@fit.vutbr.cz FIT BUT Brno Speech Recognition HMM Jan Černocký, Valentina Hubeika, DCGM FIT BUT Brno 1/38 Agenda Recap variability

More information

CS 6501: Deep Learning for Computer Graphics. Basics of Neural Networks. Connelly Barnes

CS 6501: Deep Learning for Computer Graphics. Basics of Neural Networks. Connelly Barnes CS 6501: Deep Learning for Computer Graphics Basics of Neural Networks Connelly Barnes Overview Simple neural networks Perceptron Feedforward neural networks Multilayer perceptron and properties Autoencoders

More information

CS 136a Lecture 7 Speech Recognition Architecture: Training models with the Forward backward algorithm

CS 136a Lecture 7 Speech Recognition Architecture: Training models with the Forward backward algorithm + September13, 2016 Professor Meteer CS 136a Lecture 7 Speech Recognition Architecture: Training models with the Forward backward algorithm Thanks to Dan Jurafsky for these slides + ASR components n Feature

More information

Notes on Back Propagation in 4 Lines

Notes on Back Propagation in 4 Lines Notes on Back Propagation in 4 Lines Lili Mou moull12@sei.pku.edu.cn March, 2015 Congratulations! You are reading the clearest explanation of forward and backward propagation I have ever seen. In this

More information

Lecture 5 Neural models for NLP

Lecture 5 Neural models for NLP CS546: Machine Learning in NLP (Spring 2018) http://courses.engr.illinois.edu/cs546/ Lecture 5 Neural models for NLP Julia Hockenmaier juliahmr@illinois.edu 3324 Siebel Center Office hours: Tue/Thu 2pm-3pm

More information

CSC321 Lecture 20: Autoencoders

CSC321 Lecture 20: Autoencoders CSC321 Lecture 20: Autoencoders Roger Grosse Roger Grosse CSC321 Lecture 20: Autoencoders 1 / 16 Overview Latent variable models so far: mixture models Boltzmann machines Both of these involve discrete

More information

Deep Learning Srihari. Deep Belief Nets. Sargur N. Srihari

Deep Learning Srihari. Deep Belief Nets. Sargur N. Srihari Deep Belief Nets Sargur N. Srihari srihari@cedar.buffalo.edu Topics 1. Boltzmann machines 2. Restricted Boltzmann machines 3. Deep Belief Networks 4. Deep Boltzmann machines 5. Boltzmann machines for continuous

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

Comparing linear and non-linear transformation of speech

Comparing linear and non-linear transformation of speech Comparing linear and non-linear transformation of speech Larbi Mesbahi, Vincent Barreaud and Olivier Boeffard IRISA / ENSSAT - University of Rennes 1 6, rue de Kerampont, Lannion, France {lmesbahi, vincent.barreaud,

More information

Hidden Markov Models and Gaussian Mixture Models

Hidden Markov Models and Gaussian Mixture Models Hidden Markov Models and Gaussian Mixture Models Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 4&5 25&29 January 2018 ASR Lectures 4&5 Hidden Markov Models and Gaussian

More information

Neural Networks and Deep Learning

Neural Networks and Deep Learning Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

More information

Does the Wake-sleep Algorithm Produce Good Density Estimators?

Does the Wake-sleep Algorithm Produce Good Density Estimators? Does the Wake-sleep Algorithm Produce Good Density Estimators? Brendan J. Frey, Geoffrey E. Hinton Peter Dayan Department of Computer Science Department of Brain and Cognitive Sciences University of Toronto

More information

The effect of speaking rate and vowel context on the perception of consonants. in babble noise

The effect of speaking rate and vowel context on the perception of consonants. in babble noise The effect of speaking rate and vowel context on the perception of consonants in babble noise Anirudh Raju Department of Electrical Engineering, University of California, Los Angeles, California, USA anirudh90@ucla.edu

More information

Neural Networks biological neuron artificial neuron 1

Neural Networks biological neuron artificial neuron 1 Neural Networks biological neuron artificial neuron 1 A two-layer neural network Output layer (activation represents classification) Weighted connections Hidden layer ( internal representation ) Input

More information

Neural networks and optimization

Neural networks and optimization Neural networks and optimization Nicolas Le Roux Criteo 18/05/15 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 1 / 85 1 Introduction 2 Deep networks 3 Optimization 4 Convolutional

More information

Massachusetts Institute of Technology

Massachusetts Institute of Technology Massachusetts Institute of Technology 6.867 Machine Learning, Fall 2006 Problem Set 5 Due Date: Thursday, Nov 30, 12:00 noon You may submit your solutions in class or in the box. 1. Wilhelm and Klaus are

More information

CS Homework 3. October 15, 2009

CS Homework 3. October 15, 2009 CS 294 - Homework 3 October 15, 2009 If you have questions, contact Alexandre Bouchard (bouchard@cs.berkeley.edu) for part 1 and Alex Simma (asimma@eecs.berkeley.edu) for part 2. Also check the class website

More information

Estimation of Cepstral Coefficients for Robust Speech Recognition

Estimation of Cepstral Coefficients for Robust Speech Recognition Estimation of Cepstral Coefficients for Robust Speech Recognition by Kevin M. Indrebo, B.S., M.S. A Dissertation submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment

More information

Sparse Models for Speech Recognition

Sparse Models for Speech Recognition Sparse Models for Speech Recognition Weibin Zhang and Pascale Fung Human Language Technology Center Hong Kong University of Science and Technology Outline Introduction to speech recognition Motivations

More information

Gaussian Mixture Model Uncertainty Learning (GMMUL) Version 1.0 User Guide

Gaussian Mixture Model Uncertainty Learning (GMMUL) Version 1.0 User Guide Gaussian Mixture Model Uncertainty Learning (GMMUL) Version 1. User Guide Alexey Ozerov 1, Mathieu Lagrange and Emmanuel Vincent 1 1 INRIA, Centre de Rennes - Bretagne Atlantique Campus de Beaulieu, 3

More information

Neural Networks. William Cohen [pilfered from: Ziv; Geoff Hinton; Yoshua Bengio; Yann LeCun; Hongkak Lee - NIPs 2010 tutorial ]

Neural Networks. William Cohen [pilfered from: Ziv; Geoff Hinton; Yoshua Bengio; Yann LeCun; Hongkak Lee - NIPs 2010 tutorial ] Neural Networks William Cohen 10-601 [pilfered from: Ziv; Geoff Hinton; Yoshua Bengio; Yann LeCun; Hongkak Lee - NIPs 2010 tutorial ] WHAT ARE NEURAL NETWORKS? William s notation Logis;c regression + 1

More information

Neural Networks in Structured Prediction. November 17, 2015

Neural Networks in Structured Prediction. November 17, 2015 Neural Networks in Structured Prediction November 17, 2015 HWs and Paper Last homework is going to be posted soon Neural net NER tagging model This is a new structured model Paper - Thursday after Thanksgiving

More information

ISOLATED WORD RECOGNITION FOR ENGLISH LANGUAGE USING LPC,VQ AND HMM

ISOLATED WORD RECOGNITION FOR ENGLISH LANGUAGE USING LPC,VQ AND HMM ISOLATED WORD RECOGNITION FOR ENGLISH LANGUAGE USING LPC,VQ AND HMM Mayukh Bhaowal and Kunal Chawla (Students)Indian Institute of Information Technology, Allahabad, India Abstract: Key words: Speech recognition

More information

Artificial Neural Networks Examination, June 2005

Artificial Neural Networks Examination, June 2005 Artificial Neural Networks Examination, June 2005 Instructions There are SIXTY questions. (The pass mark is 30 out of 60). For each question, please select a maximum of ONE of the given answers (either

More information

Why DNN Works for Acoustic Modeling in Speech Recognition?

Why DNN Works for Acoustic Modeling in Speech Recognition? Why DNN Works for Acoustic Modeling in Speech Recognition? Prof. Hui Jiang Department of Computer Science and Engineering York University, Toronto, Ont. M3J 1P3, CANADA Joint work with Y. Bao, J. Pan,

More information

Course Structure. Psychology 452 Week 12: Deep Learning. Chapter 8 Discussion. Part I: Deep Learning: What and Why? Rufus. Rufus Processed By Fetch

Course Structure. Psychology 452 Week 12: Deep Learning. Chapter 8 Discussion. Part I: Deep Learning: What and Why? Rufus. Rufus Processed By Fetch Psychology 452 Week 12: Deep Learning What Is Deep Learning? Preliminary Ideas (that we already know!) The Restricted Boltzmann Machine (RBM) Many Layers of RBMs Pros and Cons of Deep Learning Course Structure

More information

Math 350: An exploration of HMMs through doodles.

Math 350: An exploration of HMMs through doodles. Math 350: An exploration of HMMs through doodles. Joshua Little (407673) 19 December 2012 1 Background 1.1 Hidden Markov models. Markov chains (MCs) work well for modelling discrete-time processes, or

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Neural Networks Varun Chandola x x 5 Input Outline Contents February 2, 207 Extending Perceptrons 2 Multi Layered Perceptrons 2 2. Generalizing to Multiple Labels.................

More information

SINGLE CHANNEL SPEECH MUSIC SEPARATION USING NONNEGATIVE MATRIX FACTORIZATION AND SPECTRAL MASKS. Emad M. Grais and Hakan Erdogan

SINGLE CHANNEL SPEECH MUSIC SEPARATION USING NONNEGATIVE MATRIX FACTORIZATION AND SPECTRAL MASKS. Emad M. Grais and Hakan Erdogan SINGLE CHANNEL SPEECH MUSIC SEPARATION USING NONNEGATIVE MATRIX FACTORIZATION AND SPECTRAL MASKS Emad M. Grais and Hakan Erdogan Faculty of Engineering and Natural Sciences, Sabanci University, Orhanli

More information