Investigate more robust features for Speech Recognition using Deep Learning


DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

Investigate more robust features for Speech Recognition using Deep Learning

TIPHANIE DENIAUX

KTH ROYAL INSTITUTE OF TECHNOLOGY, SCHOOL OF ELECTRICAL ENGINEERING


Abstract

New electronic devices and their constant progress have brought up the challenge of improving speech recognition systems. Indeed, people tend to use more and more hands-free devices, which are likely to be used in noisy environments. Machine Learning techniques have evolved considerably over the last decade, and speech recognition systems using those techniques have appeared. The main challenge of Automatic Speech Recognition systems nowadays is improving robustness to noise and reverberation. Deep Learning methods have been used either to improve the speech representations or to define better probability distributions. The problem we face is the drop in the performance of ASR systems when the inputs are noisy. The general approach is to define novel speech features that are more robust, using Deep Neural Networks. To do so we went through different implementations, such as the incorporation of autoencoders in the MFCC block diagram or deep denoising autoencoders with different pre-training methods. The final solution is a system that builds more robust features from noisy MFCC. Our contribution is the demonstration that a denoising system using q DDAEs, one per cluster obtained by K-means clustering of the training data, is more efficient than a single denoising system applied to the whole data. The performance gained using such a system is 2 to 3% in terms of phone error rate, and it might be improved using more training data and better-tuned NN parameters.

Acknowledgment

First I would like to thank my supervisor Saikat Chatterjee for the opportunity he gave me to work on Deep Learning in speech recognition in the Communication Theory department at the School of Electrical Engineering, KTH. His help throughout this thesis has been a real support, and I really thank him for taking the time to evaluate our options with me and find solutions. I also thank Dr. Md Sahidullah for the help he gave me to start my work; his advice was valuable for the rest of my thesis. I thank my examiner Mikael Skoglund and my opponent Fanny for reading my thesis and calling it into question. I thank the Master students who allowed me to attend their presentations, because it provided me with experience to prepare my own. I thank my family members for their support, even if they sometimes could not understand what I was specifically working on. I thank Emeric for encouraging me and pushing me to do my best at any cost, and finally I thank all my Stockholm friends but also my French fellows for their friendship and support.

Contents

1 Introduction
  1.1 Motivation
  1.2 Structure of the report
2 Background
  2.1 Introduction to Speech Recognition
    Feature extraction block
    Pattern recognition block
  2.2 Deep learning in Speech recognition
    What is deep learning? [3]
    DNN-HMM systems
    NN for robust features
3 Developing tools and architecture
  3.1 Toolbox of Deep Learning
    General overview
    Neural networks
    Pre-training with Restricted Boltzmann Machines [11]
    Autoencoders
    Denoising auto-encoders
  3.2 Speech recognition tools
    Basic scheme of a speech recognition system
    HTK versus Kaldi
    General overview of Kaldi
    From speech to decoding with Kaldi
4 Problem statement
5 Experiments and implementation
  5.1 TIMIT database
  5.2 AE integrated in MFCC computational line
    Framing setting
    Basic MFCCs computation
    Integration of the AE in the basic scheme
  5.3 Denoising DNN with supervised pre-training
    5.3.1 Noisy signals
    5.3.2 Final solution: denoising deep autoencoder with supervised pre-training and VQ
    5.3.3 K-means theory
    NN setting
    Kaldi: conversion
    Kaldi setting
6 Results
  6.1 Effect of the VQ of the feature vectors on denoising
  6.2 Effect of the VQ of the feature vectors on speech recognition
  6.3 Discussion of the results
7 Conclusion and future work
A Kaldi formats
B TIMIT files
C Matlab files
  C.1 Database loading
  C.2 MFCC implementation
    C.2.1 MFCC
    C.2.2 MFCC + deltas + deltas-deltas
    C.2.3 FilterBanks
    C.2.4 Power spectrum
  C.3 RBM learning
    C.3.1 Training using [22]
  C.4 DDAE implementation
    C.4.1 Noisy signals
    C.4.2 Training of the NN
    C.4.3 Format conversion
    C.4.4 Kaldi script

Abbreviations

GM - Gaussian Model
GMM - Gaussian Mixture Model
NN - Neural Network
DNN - Deep Neural Network
DBN - Deep Belief Network
AE - AutoEncoder
SAE - Stacked AutoEncoders
DDAE - Deep Denoising AutoEncoder
RBM - Restricted Boltzmann Machine
MFCC - Mel-Frequency Cepstral Coefficients
SGM - Stochastic Gradient Method
CD - Contrastive Divergence
LM - Language Model
EM - Expectation Maximization
WER - Word Error Rate
PER - Phone Error Rate
FFT - Fast Fourier Transform
DFT - Discrete Fourier Transform
FBE - FilterBank Energies
SNR - Signal-to-Noise Ratio
VQ - Vector Quantization

1 Introduction

1.1 Motivation

The growing use of hands-free devices and voice-controlled systems has driven the development of high-performance speech recognition systems. Today's major challenge for Automatic Speech Recognition (ASR) systems is the presence of environmental noise and reverberation, which causes a drop in performance. Machine Learning has been a hot topic since the early 2000s and has been used to model the output distribution probabilities of the Hidden Markov Models used in speech recognition. Only quite recently has the use of Deep Learning reached the same performance as the standard Gaussian Mixture Models; novel pre-training methods are the new approaches that made this achievement possible. Other applications of deep learning in speech recognition include the discovery of bottleneck features [8] and the denoising of features [5]. The goal of this thesis is therefore to investigate how deep learning can be used in a complementary way to create more robust speech features.

1.2 Structure of the report

This report is organized as follows. Chapter 2 sums up the background theory of speech recognition and deep learning. Chapter 3 lists and explains the tools used for this thesis. In Chapter 4 the problem is formulated with its assumptions, and the path taken in the thesis is described. The implementations carried out to reach the final solution and the details of the experiments are explained in Chapter 5. Finally, in Chapter 6 the results are presented and discussed. This last chapter is followed by the conclusion. The parts of the code used to build the final solution are attached in the appendix.

2 Background

In this chapter we provide a brief discussion of the essential background in speech recognition and deep learning. Algorithmic details and parameterization are discussed further in the next chapters.

2.1 Introduction to Speech Recognition

Basically, an Automatic Speech Recognition system performs a recognition task on a provided speech signal. In the case of this thesis, the output of the ASR is a text version of the speech, and we operate at the phone level. A phone is a distinct speech sound, commonly associated with a vowel or a consonant. The English language has 42 phonemes, which are units of sound, and a phone is an acoustic realization associated with a phoneme.

Figure 2.1: ASR System

Feature extraction block

A feature is a mathematical representation of a speech signal. At this stage, the speech waveform is transformed into a parametric representation for easier analysis and processing in the subsequent pattern recognition stage. Indeed, raw waveform speech signals present large variations due to speaker variability or environment. Therefore, a domain other than the time domain needs to be used to represent speech, and the first to come to mind is the Fourier domain, because the frequency domain is relevant for speech. But while human hearing is a compromise between time and frequency resolution, a Fourier transform applied to a whole speech signal discards all timing information in the process. That is why we consider many short segments, called frames, of length between 5 and 30 milliseconds.

The state-of-the-art extraction technique is the Mel-Frequency Cepstral Coefficients (MFCC). They have been chosen for their computational simplicity, their low-dimensional encoding, and their success at the recognition stage.

Figure 2.2: Mel-Frequency Cepstral Coefficients block diagram.

Once the MFCC are computed, their first- and second-order temporal differences are concatenated and the final vector is given as input to the pattern recognition system.

Pattern recognition block

The recognition can be realized at several levels: phones, triphones or words. In this thesis we work only on phone recognition, as our goal is to demonstrate an improvement in recognition due to the development of more robust features. To deal with the temporal variability of speech, most current ASR systems use Hidden Markov Models combined with Gaussian Mixture Models to model the probability distributions over feature vectors. This model of probability distributions is referred to as the Acoustic Model.

2.2 Deep learning in Speech recognition

With the advances made in detection and classification using powerful machine learning techniques, the speech recognition community started to use deep neural networks in ASR systems [3]. Basically, deep learning finds its origins in the neurosciences and has been contributing to many different topics, as shown in Figure 2.3.

What is deep learning? [3]

Figure 2.3: Deep learning is at the center of many research areas

Deep learning is hierarchical learning. That is to say, there are many layers of non-linear information which represent different levels of abstraction, and the whole concept is built from the lower levels up to a high-level structure. A basic system is shown in Figure 2.4. Considering input nodes x_i, an activation function f and weights w_{ij}, the hidden nodes y_j are calculated as follows:

y_j = f\left( \sum_i w_{ij} x_i \right)

Figure 2.4: Layered network

Two types of learning exist: supervised learning and unsupervised learning. To simplify, we can say that in supervised learning the data is labeled or a target is available, so there is a simple cost to optimize that depends on the difference between the output and the target. In unsupervised learning the data is not labeled and the model learned is generative (in opposition to the discriminative model that corresponds to supervised learning). The systems recurrently used in speech recognition are hybrid deep networks that bring together supervised and unsupervised learning. Basically, the network is pre-trained in an unsupervised way to boost the effectiveness of the supervised training. This pre-training can be seen as a very efficient way to initialize the weights of the whole network. Then the backpropagation algorithm (detailed in the next chapters) fine-tunes the network, and this constitutes the supervised learning. The hybrid strategy is a response to an optimization issue that appears as the depth of the network increases, namely getting trapped in local optima.

DNN-HMM systems

From the early 21st century, a new form of acoustic model based on Deep Neural Networks was introduced [3][9]. Before that, speech recognition was dominated by GMM-HMM systems. The main improvements brought by DNNs are the ability to model data correlation (in this case, feature representation and classification are associated) and also the ability to model nonlinear data. Indeed, when it comes to modeling nonlinear data, GMMs become statistically inefficient because they would require a very large number of Gaussians. DNNs also improve the robustness of speech recognition when used in DNN-HMM systems, as demonstrated in [21]. Nevertheless, the successful performance of GMM-HMM systems has made it difficult for other methods to outperform them. Deep learning algorithms and parameters have had to be tuned a lot

before coming close to GMM-HMM performance and surpassing it.

NN for robust features

The principal motivation for working on better feature extraction systems is that the more relevant the feature, the better the results at the recognition stage. Papers [27], [19] and [20] explore, respectively, new methods to obtain bottleneck features, speaker-adaptive features and raw waveform features. Those works focus on DNN-HMM systems in which the novel features are used to learn better acoustic models. Article [14] investigates the performance of nonlinear features of spectrograms that are given as input to a DNN-HMM system for ASR. [18] demonstrates that robust stacked autoencoders are capable of learning robust representations from noisy data. Finally, [5] shows the efficiency of deep denoising autoencoders on noisy MFCC features. Thus, DNNs have been used for different purposes in speech recognition systems. For this thesis, we chose to work on the application of autoencoders and deep autoencoders in feature extraction in order to create robust features.

3 Developing tools and architecture

3.1 Toolbox of Deep Learning

General overview

The Matlab deep learning toolbox used in this thesis [16] provides different libraries for different types of neural networks (NN, CNN, DBN, SAE, CAE) as well as tests and a range of functions. In this section I describe the mathematics useful for my work and the mathematics behind this deep learning toolbox.

Neural networks

For supervised learning, a neural network is fed with input features and their labels (or targets in our case). Basically, the system learns nonlinear layers of information and optimizes the weights using the backpropagation algorithm to minimize the error between the output and the target. The nonlinearity is introduced in the system through an activation function f used to calculate the nodes of the hidden layers, i.e. the deep representations of the data. The system {weights W, biases} is initialized either randomly using a Gaussian distribution, or with W pre-trained using an unsupervised learning method and the biases set to 1.

The forward pass: nnff.m in [16]
Assume x is a vector of the input nodes with the biases, f is the activation function and W is the matrix of weights. For each j-th hidden layer the unit outputs are calculated as h_j = f(h_{j-1} W). For the output layer, the output function is used instead of the activation function f. The error and loss are also computed as follows: for a target y and an output z = F_{NN}(x, W), the error is e = y - z and the squared loss is l = \frac{1}{2} e^2.

The backward pass: nnbp.m in [16]
The aim is to find the weights W that minimize the training loss L = \sum_{(x,y) \in (X,Y)} l(y, F_{NN}(x, W)), where X and Y are respectively the matrices of input features and target features. The derivatives are calculated using the delta rule.

Derivative of the error with respect to the unit:

\frac{\partial e}{\partial h_j} = e_j    (3.1)

Derivative of the unit with respect to the net input (partial derivative):

\frac{\partial h_j}{\partial net_j} = h_j (1 - h_j)    (3.2)

Derivative of the net input with respect to a weight:

\frac{\partial net_j}{\partial w_{jk}} = h_k    (3.3)

And finally, for a hidden-to-output weight

\Delta w_{jk} = e_j \, h_j (1 - h_j) \, h_k    (3.4)

and for an input-to-hidden weight

\Delta w_{ki} = \left( \sum_j e_j \, h_j (1 - h_j) \, w_{jk} \right) h_k (1 - h_k) \, h_i    (3.5)

The dropout technique can be used at this stage. It was introduced by G. E. Hinton [12] and reduces overfitting and improves the training of neural networks. It basically consists in omitting a part of the features in each training case by setting some units to zero.

The gradients are then applied in nnapplygrads.m [16] according to the Stochastic Gradient Method, with tuning options for the learning rate (\mu) and the momentum (\alpha):

\Delta W = \mu \nabla W + \alpha \, vW    (3.6)
W = W + \Delta W    (3.7)

where vW is an accumulated memory of the previous gradients, \mu is the learning rate and \alpha is the momentum. The learning rate allows us to control how fast the weights are learned, i.e. only a fraction of the calculated gradient is taken for the update. If the learning rate is too high, the training loss can explode (we overstep); if the learning rate is too low, the training loss does not go down, or only very slowly, and training takes longer. Momentum can also be introduced; it accelerates the gradient descent by adding to \Delta W some of the previous weight adjustments.

All those steps are repeated for a pre-defined number of epochs. For each epoch the forward pass and backward pass are performed on all the training data. However, the data is separated into batches (subdivided amounts of data) to feed the backpropagation algorithm. That is to say, to complete one epoch, the number of iterations of the backpropagation algorithm for a database of N speech samples is N / batchsize.
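To make the forward and backward passes above concrete, the following is a minimal Matlab sketch of a one-hidden-layer network trained with the delta rule of Eqs. (3.1)-(3.5) and the momentum update of Eqs. (3.6)-(3.7). It is only an illustration under assumed toy data and layer sizes, not the toolbox code (nnff.m, nnbp.m, nnapplygrads.m); biases are omitted for brevity.

  % Minimal one-hidden-layer network trained with the delta rule and SGD with
  % momentum, in the spirit of Eqs. (3.1)-(3.7). Toy data and sizes only.
  sigm  = @(a) 1 ./ (1 + exp(-a));          % logistic sigmoid activation
  mu    = 0.5;                              % learning rate (placeholder)
  alpha = 0.5;                              % momentum (placeholder)
  X = rand(1000, 39);  Y = X;               % toy data, autoencoder-style target
  [N, d] = size(X);  nh = 100;              % hidden-layer size (assumed)
  W1 = 0.1*randn(d, nh);  W2 = 0.1*randn(nh, d);
  vW1 = zeros(size(W1)); vW2 = zeros(size(W2));
  batchsize = 250;  numepochs = 25;
  for epoch = 1:numepochs
    idx = randperm(N);
    for b = 1:floor(N/batchsize)
      sel = idx((b-1)*batchsize+1 : b*batchsize);
      x = X(sel, :);  y = Y(sel, :);
      h = sigm(x*W1);  z = h*W2;            % forward pass, linear output layer
      e = y - z;                            % error, squared loss l = 1/2 e^2
      d2 = -e;                              % delta at the (linear) output layer
      d1 = (d2*W2') .* h .* (1 - h);        % delta rule through the sigmoid, Eq. (3.2)
      g2 = h'*d2 / batchsize;  g1 = x'*d1 / batchsize;
      vW2 = mu*g2 + alpha*vW2;  W2 = W2 - vW2;   % momentum update, Eqs. (3.6)-(3.7)
      vW1 = mu*g1 + alpha*vW1;  W1 = W1 - vW1;
    end
  end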

Notice that when working on speech recognition, the most common activation function is the logistic sigmoid function (discussed further in the implementation chapter). We also mentioned the pre-training of NNs in the previous chapter: the weights need to be well initialized to avoid the system getting stuck in a local optimum. This is usually done using unsupervised learning, and more precisely Restricted Boltzmann Machines.

Pre-training with Restricted Boltzmann Machines [11]

RBMs are energy-based models used as generative models of many different types of data, including MFCC. They are used to compose Deep Belief Networks, which are a combination of several RBMs and a DNN. Indeed, RBMs are an efficient pre-training procedure for NNs. A Restricted Boltzmann Machine is a two-layer network in which stochastic visible units that represent observations are connected to stochastic binary hidden units. As for speech recognition the visible units are real-valued inputs, we use Gaussian-Bernoulli RBMs, i.e. the hidden units are binary but the input units are linear with Gaussian noise. In an RBM there are no visible-visible or hidden-hidden connections, which is why it is called a restricted system [19].

Bernoulli-Bernoulli RBM

The joint configuration of the visible (v) and hidden (h) units is given via an energy function; for a Bernoulli-Bernoulli RBM:

E(v, h) = - \sum_{i \in visible} a_i v_i - \sum_{j \in hidden} b_j h_j - \sum_{i,j} v_i h_j w_{ij}    (3.8)

where v_i, h_j are the binary states of visible unit i and hidden unit j, a_i, b_j are their bias terms and w_{ij} is the weight between them. The probability that the network assigns to a visible vector v is:

p(v) = \frac{\sum_h e^{-E(v,h)}}{\sum_{v,h} e^{-E(v,h)}}    (3.9)

And finally, the limited connections within an RBM make the conditional distributions p(v|h) and p(h|v) quite straightforward:

p(h_j = 1 | v) = \sigma\left(b_j + \sum_i v_i w_{ij}\right)    (3.10)

p(v_i = 1 | h) = \sigma\left(a_i + \sum_j h_j w_{ij}\right)    (3.11)

where \sigma is the logistic sigmoid function 1/(1 + exp(-x)). In theory, the update rule for the weights is

\Delta w_{ij} = \epsilon (\langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model})    (3.12)

where \langle \cdot \rangle_X denotes the expectation computed over the indicated distribution. In practice, however, obtaining \langle v_i h_j \rangle_{model} is difficult, which is why the Contrastive Divergence approximation to the gradient is used instead [10], and the update rule becomes:

\Delta w_{ij} = \epsilon (\langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{recon})    (3.13)

where \langle v_i h_j \rangle_{recon} is obtained by initializing the states of the visible units to a training vector, then updating the binary states of the hidden units with Eq. (3.10), and finally setting each v_i to 1 with the probability from Eq. (3.11).

Gaussian-Bernoulli RBM [25]

In the case of a Gaussian-Bernoulli RBM, the energy function becomes:

E(v, h) = \sum_{i \in visible} \frac{(v_i - a_i)^2}{2\sigma_i^2} - \sum_{j \in hidden} b_j h_j - \sum_{i,j} \frac{v_i}{\sigma_i} h_j w_{ij}    (3.14)

where \sigma_i is the standard deviation of the Gaussian noise for visible unit i. As it is difficult to learn the variance of the noise for each visible unit, in practice the data is normalized to zero mean and unit variance. The conditional probabilities for the hidden and visible units are

p(h_j = 1 | v) = \sigma\left(b_j + \sum_i \frac{v_i w_{ij}}{\sigma_i^2}\right)    (3.15)

p(v_i = v | h) = N\left(v;\ a_i + \sum_j h_j w_{ij},\ \sigma_i^2\right)    (3.16)

where N(\cdot; \mu, \sigma) denotes the Gaussian probability density function with mean \mu and standard deviation \sigma. As for the Bernoulli-Bernoulli RBM, CD learning is used to train the RBM's parameters, and the number of steps is usually set to 1 (CD1). The update rules become:

\Delta w_{ij} = \epsilon \left(\left\langle \tfrac{1}{\sigma_i^2} v_i h_j \right\rangle_{data} - \left\langle \tfrac{1}{\sigma_i^2} v_i h_j \right\rangle_{recon}\right)    (3.17)

\Delta a_i = \epsilon \left(\left\langle \tfrac{1}{\sigma_i^2} v_i \right\rangle_{data} - \left\langle \tfrac{1}{\sigma_i^2} v_i \right\rangle_{model}\right)    (3.18)

\Delta b_j = \epsilon (\langle h_j \rangle_{data} - \langle h_j \rangle_{model})    (3.19)

The toolbox [16] offers a package to build DBNs with RBMs trained with CD1. Unfortunately it only allows Bernoulli-Bernoulli RBMs, which is why I became interested in another deep learning toolbox [22], discussed further in Chapter 5.
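As an illustration of the CD1 update of Eq. (3.13), the following is a minimal Matlab sketch of one Contrastive Divergence step for a Bernoulli-Bernoulli RBM. The data, layer sizes and learning rate are placeholders, bias updates are omitted, and this is not the training code of [16] or [22].

  % Minimal CD-1 weight update for a Bernoulli-Bernoulli RBM,
  % following Eqs. (3.10)-(3.13). Bias updates omitted for brevity.
  sigm = @(a) 1 ./ (1 + exp(-a));
  nv = 23; nh = 50;                      % visible / hidden sizes (assumed)
  W = 0.01*randn(nv, nh); a = zeros(1, nv); b = zeros(1, nh);
  eps = 0.1;                             % learning rate (placeholder)
  v0 = double(rand(100, nv) > 0.5);      % toy batch of binary training vectors
  % positive phase: p(h=1|v), Eq. (3.10)
  ph0 = sigm(bsxfun(@plus, v0*W, b));
  h0  = double(ph0 > rand(size(ph0)));   % sample binary hidden states
  % reconstruction: p(v=1|h), Eq. (3.11), then hidden probabilities again
  pv1 = sigm(bsxfun(@plus, h0*W', a));
  ph1 = sigm(bsxfun(@plus, pv1*W, b));
  % CD-1 update, Eq. (3.13): <v h>_data - <v h>_recon
  dW = (v0'*ph0 - pv1'*ph1) / size(v0, 1);
  W  = W + eps*dW;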

Fig. 3.1 shows how RBMs are used to initialize the weights of a DNN and form a DBN. The example is given for a d-layer network [n_1 ... n_i ... n_d], where n_i is the number of nodes of layer i. The first, Gaussian-Bernoulli, RBM [n_1 n_2] is trained with feature vectors as inputs, and the second, Bernoulli-Bernoulli, RBM [n_2 n_3] is trained with the output of the first RBM as input. This is repeated for all the layers of the desired DBN, and then the weights are unfolded into a DNN that is fine-tuned with backpropagation.

Figure 3.1: Construction of a DBN. (figure extracted from [15])

Autoencoders

Autoencoders are a special type of DNN whose input dimension is the same as the output dimension. They are trained to encode the input into some high-level or compressed representation so that it can be reconstructed from that representation. Hence, the output target is the input [1]. An AE is composed of an encoder that encodes the input signal into a hidden layer, which is a nonlinear representation of the input:

h = f_\theta(x) = \sigma(W x + b)    (3.20)

with parameters \theta = \{W, b\} (W is the weight matrix and b an offset vector), the input vector x and the parameterized function \sigma. This deterministic mapping is then mapped back to a reconstructed vector y that has the same dimension as the input vector:

y = f_{\theta'}(h) = \sigma'(W' h + b')    (3.21)

with parameters \theta' = \{W', b'\} and parameterized function \sigma'. Notice that the parameterized functions of the encoder and the decoder can be different. In [24] they use an {affine + sigmoid} encoder together with either an affine decoder and squared-error loss, or an {affine + sigmoid} decoder and cross-entropy loss. Typically, the input of an AE is a feature vector and the output is the reconstruction of this feature. In between, one or more (deep autoencoder) hidden layers represent a transformation of the feature.

Usually, AEs and DAEs are trained using backpropagation and SGD. To avoid the backpropagation problems brought up previously, in the case of a deep autoencoder (more than one hidden layer) each layer can first be trained as an autoencoder. Using [16] we can build an autoencoder as a three-layer NN in which the input signal is also the target signal.

Denoising auto-encoders [24]

A denoising autoencoder is a variant of the autoencoder described above. It is trained to reconstruct a clean version from the corrupted signal given as input. The input signal x is first corrupted into x̂. Then x̂ is encoded, h = f_\theta(x̂) = \sigma(W x̂ + b), and decoded, y = f_{\theta'}(h) = \sigma'(W' h + b'), and the reconstruction loss is calculated between the clean version x and the reconstructed signal y. Hence, the system learns a mapping that denoises signals of the type used for training. Using the same method as for the AE, we can implement a denoising autoencoder with [16]; the inputZeroMaskedFraction parameter allows us to add noise at a certain ratio. All these tools provide us with the opportunity to build more robust features to be given as input to a pattern recognition system.
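As an illustration of Eqs. (3.20)-(3.21) and of the masking corruption just mentioned, here is a minimal Matlab sketch of one forward pass of a three-layer denoising autoencoder. Data, layer sizes and the masking ratio are placeholders; this is a sketch of the idea rather than the toolbox implementation, and training would reuse the backpropagation updates sketched earlier.

  % Minimal denoising-autoencoder forward pass, Eqs. (3.20)-(3.21):
  % corrupt the input, encode, decode, and measure the loss against the clean input.
  sigm = @(a) 1 ./ (1 + exp(-a));
  X = rand(500, 39);                       % clean feature vectors (toy data)
  [N, d] = size(X);  nh = 100;             % hidden size (assumed)
  W  = 0.1*randn(d, nh);  b  = zeros(1, nh);
  W2 = 0.1*randn(nh, d);  b2 = zeros(1, d);
  maskFraction = 0.5;                      % like inputZeroMaskedFraction in [16]
  Xc = X .* (rand(N, d) > maskFraction);   % corrupted input x_hat
  H  = sigm(bsxfun(@plus, Xc*W,  b));      % h = sigma(W x_hat + b)
  Y  = sigm(bsxfun(@plus, H*W2, b2));      % y = sigma(W' h + b')
  loss = 0.5 * mean(sum((X - Y).^2, 2));   % reconstruction loss vs clean x

Training would then update {W, b, W2, b2} so that this loss decreases over the training set.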

3.2 Speech recognition tools

Figure 3.2: Recall of the aim of a speech recognition system. (figure extracted from [26])

The goal of a recognition tool is to process a speech sequence into speech vectors and to deduce from these features the most likely corresponding sequence, as shown in Fig. 3.2. For the purpose of this thesis, I tested two different free, open-source toolkits that perform speech recognition: HTK [26] and Kaldi [17]. Both allow us to compute the state-of-the-art MFCC speech features and to perform speech recognition using GMM-HMM. In this section I first recall the theory behind a speech recognition system, and then I focus on the Kaldi toolkit.

Basic scheme of a speech recognition system

HMM-based acoustic flat model
A spoken word w is a sequence of phones K_w. Different sequences of phones may define the same word w due to different pronunciations. That is why we consider the likelihood p(y|w) over multiple pronunciations Q [7]:

p(y|w) = \sum_Q p(y|Q) \, p(Q|w)    (3.22)

where, for a particular pronunciation sequence Q and q^{(w_l)} a valid pronunciation for word w_l,

p(Q|w) = \prod_{l=1}^{L} P(q^{(w_l)} | w_l)    (3.23)

Each phone is represented by a density Hidden Markov Model with transition probability parameters {a_ij} and output distributions {b_j(·)}, as pictured in Fig. 3.3, where the states are x_i, i = 1, ..., 5.

Figure 3.3: HMM-based phone model. (figure extracted from [7])

A Markov model is a finite state machine which changes state once every time step, and each time a new speech vector is generated from the probability density {b_j(·)}. The HMM transition from the current state x_i to one of its connected states x_j is governed by the transition probability {a_ij} at each time step. In practice only the observation O is known; the state sequence X = x_1, ..., x_T is hidden [26].

The likelihood is the sum over all possible state sequences X of the joint probability P(O, X | M):

p(O | M) = \sum_X a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(o_t) \, a_{x(t)x(t+1)}    (3.24)

where x(0) and x(T+1) are constrained to be respectively the entry state and the exit state. The most probable model is the one achieving max_k p(O | M_k). Given a set of training examples, the parameters of the models can be determined using the re-estimation procedure detailed below.

Output distribution: GMM
First, we define the output distributions b_j(o_t) by a Gaussian Mixture Model:

b_j(o_t) = \sum_{m=1}^{M} c_{jm} \, N(o_t; \mu_{jm}, \Sigma_{jm})    (3.25)

where M is the number of Gaussians, c_{jm} is the weight of Gaussian m, and N(\cdot; \mu, \Sigma) is a multivariate Gaussian with mean vector \mu and covariance matrix \Sigma:

N(o; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \, e^{-\frac{1}{2}(o - \mu)' \Sigma^{-1} (o - \mu)}    (3.26)

with n the dimension of the observation o.

Re-estimation with the Baum-Welch algorithm
First, the parameters of the HMM are initialized. This is usually done by using the global mean and covariance of the training data in the output distributions and setting all transition probabilities to be equal. Then the parameters are re-estimated using the Baum-Welch algorithm:

\hat{\mu}_j = \frac{\sum_{t=1}^{T} L_j(t) \, o_t}{\sum_{t=1}^{T} L_j(t)}    (3.27)

\hat{\Sigma}_j = \frac{\sum_{t=1}^{T} L_j(t) (o_t - \mu_j)(o_t - \mu_j)'}{\sum_{t=1}^{T} L_j(t)}    (3.28)

where L_j(t) denotes the probability of being in state j at time t. L_j(t) is calculated using the Forward-Backward algorithm: the forward probability \alpha_j(t) for a model M with N states can be computed recursively as

\alpha_j(t) = P(o_1, ..., o_t, x(t) = j | M) = \left( \sum_{i=2}^{N-1} \alpha_i(t-1) \, a_{ij} \right) b_j(o_t)    (3.29)

In the same way, the backward probability can be computed:

\beta_i(t) = P(o_{t+1}, ..., o_T | x(t) = i, M) = \sum_{j=2}^{N-1} a_{ij} \, b_j(o_{t+1}) \, \beta_j(t+1)    (3.30)

Once those probabilities are computed, we can deduce L_j(t) = \alpha_j(t) \beta_j(t) / P(O | M). We update the Gaussian parameters with the new L_j(t) values and, according to the value of P(O | M), we re-iterate or stop the process.
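To illustrate the forward recursion of Eq. (3.29), here is a minimal Matlab sketch for a toy left-to-right HMM with one-dimensional Gaussian emissions. All parameter values are placeholders, entry and exit states are not modeled separately, and a practical implementation (as in HTK or Kaldi) would work with log-probabilities to avoid underflow on long utterances.

  % Minimal forward-algorithm sketch for Eq. (3.29): alpha(j,t) accumulates
  % P(o_1..o_t, x(t)=j | M). Toy 3-state HMM with 1-D Gaussian emissions.
  A  = [0.6 0.4 0.0;                     % transition probabilities a_ij
        0.0 0.7 0.3;
        0.0 0.0 1.0];
  mu = [-1 0 1];  sg = [1 1 1];          % emission means / standard deviations
  pi0 = [1 0 0];                         % start in the first state
  o  = [-0.9 -1.1 0.2 0.1 1.3];          % toy observation sequence
  T  = numel(o);  N = numel(mu);
  gauss = @(x, m, s) exp(-0.5*((x-m)./s).^2) ./ (s*sqrt(2*pi));
  alpha = zeros(N, T);
  alpha(:,1) = pi0' .* gauss(o(1), mu, sg)';
  for t = 2:T
    for j = 1:N
      alpha(j,t) = (alpha(:,t-1)' * A(:,j)) * gauss(o(t), mu(j), sg(j));
    end
  end
  likelihood = sum(alpha(:,T));          % p(O | M), as in Eq. (3.24)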

Decoding
Now that we have good estimates of the transition probabilities, we need to find the best path through the states that explains the speech; in theory this is done using the Viterbi algorithm and recursive calculations. In Kaldi the decoding is carried out using graphs and decision trees, and the method is explained further in the next paragraphs.

HTK versus Kaldi

HTK is a toolbox created to build HMM systems dedicated to speech recognition. After carrying out the tutorials of both toolkits, I became aware of their strengths and weaknesses. I chose to continue my work using Kaldi because it ran better on my operating system, appeared more flexible to me, and its code and architecture were easier for me to understand. Kaldi is written in C++ and also integrates code for DNNs, which makes it a more complete and modern speech recognition toolkit.

General overview of Kaldi

Figure 3.4: Overview of Kaldi tools. (figure extracted from [17])

External libraries
BLAS (Basic Linear Algebra Subroutines) and LAPACK (Linear Algebra PACKage) are numerical algebra libraries, and OpenFst is a library for constructing, combining, optimizing and searching weighted finite-state transducers (FSTs); among other things it allows the representation of a probabilistic model, and it is used in Kaldi for the finite-state framework [17].

Kaldi library
The modules of the Kaldi library contain command-line tools to be used for speech recognition. For instance, in the module feat we can find a command to compute the MFCC or another type of feature. In this thesis, we used the MFCC library only at the start with Kaldi. Given that the aim of the project is to create more robust features and to assess their performance using a standard pattern recognition system, Kaldi was used only for its GMM-HMM system. Notice also that I used the Kaldi tools dedicated to the TIMIT database (further detailed in Chapter 5).

From speech to decoding with Kaldi

Data, dictionary and language preparation
The data is first prepared with timit_data_prep.sh: the path to the TIMIT directory is provided, and the program evaluates the list of speakers, finds the list of audio and transcript files and converts them, creates mapping files between speakers and utterances (one utterance is one speech signal) as well as a gender mapping, and finally writes the STM files necessary for scoring (getting the error rates) with NIST's sclite tool [6]. timit_prepare_dict.sh creates the dictionary, which is a sorted list of the phones present in the training scripts. Finally, timit_format_data.sh does the language preparation, which consists in creating an N-gram language model. An N-gram language model provides the prior probability of a phone sequence k = k_1, ..., k_K:

P(k) = \prod_{n=1}^{K} P(k_n | k_{n-1}, ..., k_1)    (3.31)

In practice, to form an N-gram LM, the conditioning in Eq. (3.31) is truncated and the formula becomes:

P(k) = \prod_{n=1}^{K} P(k_n | k_{n-1}, ..., k_{n-N+1})    (3.32)

In our case, a phone bigram LM is computed using [4], and the data is converted into a canonical form and saved in binary-format .fst files. In prepare_lang.sh a directory is set up; the phones are organized into silence and non-silence categories, and the script allows us to remove the optional silence phone sil by requiring its probability to be 0. This is done to avoid the scoring of silence phones.

Feature extraction
In my work, feature extraction is done in Matlab and the vectors are converted into Kaldi format. This Kaldi format is a 2-file format that describes the data:

scp format: a text file in which each line has a key (utterance id) and an extended filename that tells Kaldi where to find the data. See Appendix A.2 for examples.

ark format: a binary file in which each utterance, identified by its key, has its object data. See Appendix A.1 for examples.

The conversion from Matlab to Kaldi format is detailed in Chapter 5.

Training with train_mono.sh
The script performs a flat-start, monophone training with delta-delta features (that is to say, for instance, MFCC + deltas + delta-deltas). In gmm-init-mono a flat-start monophone set is created in which each base phone is a monophone single-Gaussian HMM, with means and covariances equal to the mean and covariance of the training data. The shared-phones option allows common probability density functions for specified sets of phones; otherwise all phones are separate. compile-train-graphs compiles the training graphs. Then the statistics of the GMMs are accumulated and a first estimation is done. This process {accumulation of statistics + re-estimation of the Gaussian parameters using iterations of EM (cf. Baum-Welch)} is iterated a fixed number of times. A decision tree is created for each state in each phone, and they are exported to graphs that serve for the decoding. The final model is saved as final.mdl.

Decoding with decode.sh and scoring with score_basic.sh
Decoding is performed using the graphs saved at the end of training, which contain the language model, the dictionary and the HMM definition. The system checks that the feature vector dimensions at test time are the same as at training time, and gmm-latgen-faster is used to decode the testing data; the results are saved in an archive. For scoring, lattice-best-path uses the previous results saved in an archive to find the best path through the phone sequence. The resulting phone map is compared to the reference map, and the Phone Error Rate or Word Error Rate is computed with compute-wer.
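For reference, the scp/ark pair described above can also be produced without the toolbox of [23]: the sketch below writes Matlab feature matrices to Kaldi's text archive format, which could then be converted to a binary ark and an scp file with Kaldi's copy-feats tool. Keys, file names and data are placeholders, and this is an illustrative sketch of the format rather than the conversion code used in the thesis (C.4.3).

  % Minimal sketch: write feature matrices to a Kaldi text archive.
  % The text archive can then be converted to a binary ark/scp pair, e.g.
  %   copy-feats ark,t:feats.txt ark,scp:feats.ark,feats.scp
  feats = struct('key', {'fdab0_sa1', 'fdab0_sa2'}, ...
                 'mat', {rand(300, 39), rand(250, 39)});   % toy utterances
  fid = fopen('feats.txt', 'w');
  for u = 1:numel(feats)
    fprintf(fid, '%s  [\n', feats(u).key);                 % "key  ["
    M = feats(u).mat;
    for t = 1:size(M, 1)
      fprintf(fid, '  %.6f', M(t, :));                     % one frame per line
      if t < size(M, 1), fprintf(fid, '\n'); else fprintf(fid, ' ]\n'); end
    end
  end
  fclose(fid);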

4 Problem statement

Once I got my hands on those tools, I was able to understand how to work with DNNs and to think about how to improve the current MFCC. The main goal of this thesis is to consider a side path that had not been explored before. Indeed, most authors of the research papers cited so far have been using DNNs and speech recognition systems for years and, most importantly, they have more computational power at their disposal than I have. In the following paragraphs I explain the thought process I followed during this thesis.

The first and basic assumption we made is that the nonlinear mapping learned by a DNN might be more robust to noise than a standard linear mapping. The idea behind this supposition is that the network learns high-level representations of the data and so captures the main characteristics of speech. That is why we thought of introducing autoencoders into the standard construction of the MFCC. We began by trying the configuration of Fig. 4.1.

Figure 4.1: Integration of an AE in the MFCC block diagram.

The global idea was that we could eventually train NNs to replace blocks of the MFCC computation, see Fig. 4.2.

Figure 4.2: Replacement of MFCC blocks by NNs

After a standard training (weights randomly initialized from a Gaussian distribution) of the AE shown in Fig. 4.1, the SNR between the input feature x and the reconstruction x̂ was very low (no more than 2 dB). Recall,

SNR_{dB} = 10 \log_{10} \left( \frac{\|x\|^2}{\|x - \hat{x}\|^2} \right)    (4.1)

Moreover, the deep learning toolbox [16] only considers binary-binary RBMs, so we could not use it efficiently on the real-valued power spectrum. That is why I came to use [22], which permits the training of a Gaussian-Bernoulli RBM. Inspiration for the parameter settings used for unsupervised pre-training was taken from [11] and [13]. Even with such a pre-training we were unable to increase the SNR efficiently.

The main challenge is to produce features that are robust to noise, so I became interested in denoising autoencoders. Denoising autoencoders are trained to produce a clean reconstruction of a noisy input. The system I built was inspired by [5]: the MFCC are computed and a deep denoising autoencoder is used to clean the data. But, similarly to my first implementation, the problems with unsupervised learning were still present, so I decided to pre-train the first mapping of the denoising AE with another AE using [16]:

Figure 4.3: Deep denoising autoencoder with pre-training

On this working base, we thought of improving the performance of the deep denoising autoencoder by using a clustering method. The data would be divided using a standard clustering algorithm and each group would train a specific deep autoencoder. The assumption is that the less the data varies, the more efficient the mapping of the DDAE is. The implementation and experimental settings of those different attempts are detailed in the next chapter.
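As a side note on Eq. (4.1), the following Matlab sketch computes the SNR of a reconstruction and, conversely, scales a noise signal so that the resulting noisy signal reaches a target SNR, which is the operation needed later to build the 5, 10 and 15 dB noisy versions of the data. The signals and values are placeholders; this is not the appendix code (C.4.1).

  % SNR of a reconstruction per Eq. (4.1), and scaling of a noise signal to
  % reach a target SNR. Illustrative only.
  x      = randn(1, 16000);                % clean signal (placeholder)
  xhat   = x + 0.1*randn(size(x));         % some reconstruction of x
  snr_db = 10*log10(sum(x.^2) / sum((x - xhat).^2));    % Eq. (4.1)

  noise     = randn(size(x));              % e.g. white noise; babble/car in the thesis
  target_db = 5;                           % 5, 10 or 15 dB in the experiments
  scale     = sqrt(sum(x.^2) / (sum(noise.^2) * 10^(target_db/10)));
  x_noisy   = x + scale*noise;             % noisy signal at the target SNR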

5 Experiments and implementation

All the implementation was done using Matlab and Xcode. A global view of the implementation is shown in Fig. 5.1.

Figure 5.1: Global scheme of the system

First I describe the speech database used in the experiments, and then I detail the implementation.

5.1 TIMIT database

The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus provides the user with a speech database in the English language. It contains 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States. The database is organized as follows: 8 directories {dr1, ..., dr8} correspond to the different dialects. Within a directory drN, the data is divided into folders corresponding to the speakers, referenced with a {3-letter + digit} code, e.g. DAB0. For one speech signal, i.e. one sentence spoken by one particular speaker, the TIMIT corpus includes 4 different files:

.wav: the 16 kHz speech signal, named after the sentence code, e.g. SA1.
.txt: transcription file of the words the person said (the sentence).
.wrd: transcription file of the time-aligned words the person said.
.phn: transcription file of the time-aligned phones the person said.

Samples of the transcription files are shown in B.1, B.2 and B.3 respectively, and the code to load the data into Matlab is detailed in C.1. Not all testing files from TIMIT were used for the experiments: only 192 utterances were selected, following the basic run.sh Kaldi example script for TIMIT, in which 3 speakers are selected for each dialect region and 8 utterances are considered for each of them.

5.2 AE integrated in MFCC computational line

Framing setting
Frame length: 25 ms
Frame shift: 10 ms
These values are standard ones for speech recognition, and they are the ones used by default in Kaldi's feature extraction system.

Basic MFCCs computation
The MFCC implementation is based on [2] and the complete code is detailed in C.2.1 and C.2.2. We start by implementing the basic MFCC. As in Kaldi, the number of triangular filters is set to M = 23 and the feature dimension is set to Q = 12. The log-energy is added as the 13th coefficient of the MFCC, and finally the deltas and delta-deltas are computed and concatenated to the MFCC to form a 39-dimension feature vector. Recall, the deltas and delta-deltas are computed between frames; they are the first- and second-order frame-to-frame differences. The filterbanks are created as follows: first the start, center and end frequencies of each filter are calculated in the mel domain; then they are transformed back to the normal frequency domain and converted to the sample scale; finally, each filter is calculated in the normal frequency domain using standard line equations. Code for computing the filterbanks is available in C.2.3. Notice that the Hamming window of Fig. 5.2 is applied before calculating the power spectrum (C.2.4).

Figure 5.2: Plot of the Hamming window
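For orientation, the following is a minimal Matlab sketch of the framing, windowing, power spectrum, mel filterbank, log and DCT chain just described, using the settings above (25 ms frames, 10 ms shift, M = 23 filters, Q = 12 coefficients plus log-energy). It is a simplified illustration (no pre-emphasis, simplified triangular filters), not the implementation of [2] / Appendix C.2.

  % Minimal MFCC chain sketch: framing, Hamming window, power spectrum,
  % mel filterbank, log, DCT. Settings follow the text above.
  fs = 16000;  x = randn(1, fs);                 % 1 s of toy "speech"
  flen = round(0.025*fs);  fshift = round(0.010*fs);
  nfft = 512;  M = 23;  Q = 12;
  w = 0.54 - 0.46*cos(2*pi*(0:flen-1)'/(flen-1));        % Hamming window
  mel  = @(f) 2595*log10(1 + f/700);  imel = @(m) 700*(10.^(m/2595) - 1);
  edges = imel(linspace(mel(0), mel(fs/2), M+2));        % filter edge frequencies
  bins  = floor(edges/fs*nfft) + 1;                      % corresponding FFT bins
  H = zeros(M, nfft/2+1);                                % triangular filters
  for m = 1:M
    H(m, bins(m):bins(m+1))   = linspace(0, 1, bins(m+1)-bins(m)+1);
    H(m, bins(m+1):bins(m+2)) = linspace(1, 0, bins(m+2)-bins(m+1)+1);
  end
  D = cos(pi*(1:Q)'*((0:M-1)+0.5)/M);                    % DCT-II basis
  nframes = floor((numel(x)-flen)/fshift) + 1;
  mfcc = zeros(nframes, Q+1);
  for i = 1:nframes
    frame = x((i-1)*fshift + (1:flen))' .* w;            % framing + windowing
    P = abs(fft(frame, nfft)).^2;                        % power spectrum
    fbe = H * P(1:nfft/2+1);                             % filterbank energies
    mfcc(i, :) = [ (D*log(fbe))' , log(sum(P)) ];        % 12 MFCC + log-energy
  end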

Integration of the AE in the basic scheme
At first, when we integrated the AE into the basic MFCC scheme, we took the mid-layer nodes as the new feature, as shown in Fig. 5.3.

Figure 5.3: Integration of AE in the MFCC computations

Improvement attempts
The script used with the second toolbox is detailed in C.3.1. To understand the issues of the pre-training with RBMs, we tried different sizes of input features. In particular, I created a system in which each FBE of a frame was the input feature of a set of autoencoders. In that way, the input size of the NN was reduced and it included between 3 and around 10 samples. Even then the SNR stayed very low. Indeed it seemed that the parameters could not learn a correct mapping because the variance of the data was very large: for some vectors the reconstruction was quite good, but for others it was very bad. After this observation, we decided to concatenate the obtained features with the standard MFCC, but the recognition performance was still poorer than when only the MFCC were used, even when using a coefficient-to-coefficient concatenation to reduce the correlation of the resulting vector. Then we switched to using the output of the AE as the input to the rest of the MFCC block diagram, but there was no difference at all. We also moved the AE along the MFCC block diagram, but there was no difference in performance. I concluded that something in the pre-training and training was going wrong, but I could not put my finger on it. Finally I decided to bypass the issues created by the RBM pre-training and focused on denoising autoencoders.

5.3 Denoising DNN with supervised pre-training

Following [5], we created a deep denoising autoencoder. Our method differs in the pre-training, for which we use an AE to initialize the first weight matrix W_1 of the network, as shown in Fig. 4.3.

5.3.1 Noisy signals

The TIMIT database only provides us with clean speech. To realize the training and testing of our system, we added noise to the data. Three different types of noise were used: Gaussian white noise, car noise and babble noise. For the sake of reflecting real conditions, the noisy signals were built at 3 different SNR levels, 5 dB, 10 dB and 15 dB, for each type of noise. The code used to create the noisy signals is detailed in C.4.1.

5.3.2 Final solution: denoising deep autoencoder with supervised pre-training and VQ

Using the DDAE described above, we introduce vector quantization into the process. The training data is clustered using the K-means algorithm; the codebook is saved and used for denoising the testing data. Experiments were run for the numbers of centroids q = {2, 4, 8, 16}.

5.3.3 K-means theory

The K-means algorithm is quite simple. To start, random centroids are picked. Then the Euclidean distances between all observations and the centroids are computed, and the affiliation to a cluster is defined by the minimum distance to a centroid. The new centroids of the clusters are then re-calculated. These two steps are repeated until convergence; a minimal sketch of this procedure applied to feature vectors is given below.

Figure 5.4: Demonstration of the K-means algorithm
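The following Matlab sketch shows the two alternating K-means steps described above, applied to MFCC-like feature vectors. The data is a placeholder, a fixed iteration cap replaces a proper convergence test, and this is not the thesis implementation; the resulting codebook C is what would be saved to cluster the test data and to dispatch frames to the q DDAEs.

  % Minimal K-means sketch: assignment by Euclidean distance, centroid update.
  X = rand(5000, 39);                       % feature vectors (one per row)
  q = 4;                                    % number of centroids / clusters
  C = X(randperm(size(X,1), q), :);         % random initial centroids
  for it = 1:100                            % fixed cap instead of a convergence test
    % assignment step: nearest centroid in Euclidean distance
    d = zeros(size(X,1), q);
    for k = 1:q
      d(:,k) = sum(bsxfun(@minus, X, C(k,:)).^2, 2);
    end
    [~, labels] = min(d, [], 2);
    % update step: recompute each centroid as the mean of its cluster
    for k = 1:q
      if any(labels == k), C(k,:) = mean(X(labels == k, :), 1); end
    end
  end
  % 'C' is the codebook saved for the test data; 'labels' selects which of the
  % q DDAEs each training frame is used to train.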

NN setting

Concerning the parameters of the DDAE:

The activation function. The sigmoid function is the most common activation function used for NNs in speech applications. It maps [-∞, +∞] to ]0, 1[ and is defined by

S(t) = \frac{1}{1 + e^{-t}}    (5.1)

The pre-training autoencoder. The layers are of sizes [39 100], since the input is the noisy 39-dimension MFCC. The activation function equals the output function and is the sigmoid. A noise factor of 0.5 is added. The learning rate is set to 0.5 and the momentum is also set to 0.5. The optimization is run over 25 epochs with a batch size of 250.

The DDAE. The layers are of sizes [ ], since the input is the noisy 39-dimension MFCC and the output is its clean reconstruction. The activation function is the sigmoid (encoder) and the output function is affine (decoder). The learning rate is set to and the momentum to 0.5. The optimization is run over 50 epochs with a batch size of 250. The code for the training is detailed in C.4.2; a sketch of this setup using the interface of [16] is given at the end of this chapter.

Kaldi: conversion

After the computation of the denoised features in Matlab, the training and testing data are converted to Kaldi format, and the tool kaldi-to-matlab [23] is used in C.4.3 to create the .ark and .scp files mentioned earlier.

Kaldi setting

Concerning the parameters of Kaldi:
max_iter_inc=30 (last iteration on which to increase the number of Gaussians)
totgauss=1000 (number of target Gaussians)
num_iters=40 (number of training iterations)
realign_iters= (iterations at which to re-align)
boost_silence=1.0 (factor by which to boost silence likelihoods in alignment)
beam=6
careful=false (alignment option)

I do not divide the data for HMM optimization, so {train_nj=1, test_nj=1}. The reason why we chose to run a very simple recognition using a monophone system is that the aim here is to compare the performance of our features to the state-of-the-art MFCC. As long as we have the performance of the basic {MFCC + GMM-HMM} association in this system, we can compare it to the {our feature + GMM-HMM} association. Looking at the code in C.4.4, you can notice that I slightly modified the code usually provided by Kaldi for TIMIT recognition: I wanted to feed the system with my own features, and to do so I had to remove the automatic computation of deltas and delta-deltas in Kaldi.
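Putting the two trainings above together, here is a hedged sketch of how the pre-training AE and the DDAE could be set up with the nnsetup/nntrain interface of the toolbox [16]. The field names follow that toolbox, the DDAE hidden-layer size (100) and its learning rate are assumptions (the exact values are not legible in the source), and this is not the thesis code of C.4.2.

  % Hedged sketch of the AE pre-training + DDAE fine-tuning described above,
  % using the toolbox [16] (assumed to be on the Matlab path).
  noisyMFCC = rand(10000, 39);  cleanMFCC = rand(10000, 39);   % placeholders

  % Pre-training autoencoder, sigmoid encoder/decoder, masking noise 0.5
  ae = nnsetup([39 100 39]);
  ae.activation_function     = 'sigm';
  ae.output                  = 'sigm';
  ae.learningRate            = 0.5;
  ae.momentum                = 0.5;
  ae.inputZeroMaskedFraction = 0.5;        % "noise factor" of 0.5
  opts.numepochs = 25;  opts.batchsize = 250;
  ae = nntrain(ae, noisyMFCC, noisyMFCC, opts);   % input = target

  % DDAE: sigmoid encoder, affine (linear) decoder, W_1 taken from the AE
  ddae = nnsetup([39 100 39]);             % hidden size 100 is an assumption
  ddae.activation_function = 'sigm';
  ddae.output              = 'linear';
  ddae.momentum            = 0.5;
  % ddae.learningRate left at its default: the value used in the thesis is not legible
  ddae.W{1} = ae.W{1};                     % supervised pre-training of W_1
  opts.numepochs = 50;  opts.batchsize = 250;
  ddae = nntrain(ddae, noisyMFCC, cleanMFCC, opts);   % denoising target = clean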

6 Results

Using the command bash RESULTS test, Kaldi displays the word error rate (expressed as a percentage) of all tested systems in the terminal. In our case, we consider the phone error rate instead, defined by

PER = \frac{S + D + I}{N}    (6.1)

where S is the number of substitutions, D the number of deletions, I the number of insertions and N the total number of phones in the data.

6.1 Effect of the VQ of the feature vectors on denoising

We can start by looking at the statistics of the data VQ in Table 6.1.

Table 6.1: Data clustering statistics (weight of each cluster in the total data, for q = 2, 4, 8 and 16)

We can see that for q = {2, 4} the data is distributed quite equitably. However, when q increases, some clusters become really small. This can cause issues in the training of the neural network; indeed, at least one frame is needed to learn a parameter, and in our case the NN must learn a large number of parameters. For the sake of training, the amount of training data given to the NN was calculated to allow the learning of each parameter over 5 frames. When the number of clusters q was increased, the amount of training data was increased accordingly to keep the 5 frames per parameter.

For each clustering level, we plot the SNR of the output of the DDAE as a function of q to evaluate the impact of clustering on the efficiency of the DDAE. The results are shown in Fig. 6.1. We notice that for q = 4 the system must have encountered an issue. Apart from that, the general assumption that clustering the data improves the efficiency of a global DDAE system seems to be verified.

Figure 6.1: Evolution of the denoising level as a function of the clustering level q

6.2 Effect of the VQ of the feature vectors on speech recognition

Recall, the tests were computed over a range of 3 different noises added to the clean speech at different levels: 5 dB, 10 dB and 15 dB. Fig. 6.2 shows the reference PER to which we can compare the PER achieved by our new denoised feature vectors. Fig. 6.3 shows the results in terms of PER of our speech recognition system for q = {1, 2, 4, 8, 16}.

Figure 6.2: Results in terms of PER (%) for speech recognition: (a) clean training, (b) noisy training

Figure 6.3: Results in terms of PER (%) for speech recognition with denoised training: (a) q = 1, (b) q = 2, (c) q = 4, (d) q = 8, (e) q = 16

6.3 Discussion of the results

Speech recognition performance
In real conditions, speech recognition is always performed on noisy speech, because when hands-free devices are used there is always a certain level of environmental noise. The performance achieved by the system trained on clean data is then considered the top efficiency that can be reached with this particular system setting. When training the system on data reflecting real outdoor conditions, the PER rises by about 5% when testing on clean speech, but it drops by around 20% when testing on noisy speech; indeed, the performance of the clean-trained system on noisy test speech is very poor. Our DDAE solution based on VQ seems to reduce the PER by about 2-3% for all kinds of noise.

If we consider babble noise at a level of 5 dB, which challenges the recognition system particularly well because it must recognize speech among speech, the performance improves by 2.9%; see Fig. 6.4.

Figure 6.4: Evolution of the PER for a noisy babble signal at 5 dB

Efficiency of the DDAE
The DDAE efficiency shown in the previous section is mixed. The NN mapping certainly introduces a noise of its own, but it should also be able to learn a very efficient denoising mapping. Fig. 6.1 shows that increasing q increases the SNR. But it must be noted that the DDAE particularly helps the denoising of very noisy signals (with a 5 dB SNR), while it also damages the less noisy signals, particularly those with a 15 dB SNR.

7 Conclusion and future work

Issues in the implementation forced us to use an alternative way to initialize the weight matrix of the NN. It follows from that implementation that the denoising system proposed in this thesis is less reliable than the one presented in [5]. However, the contribution of this work is the use of a clustering algorithm so that the denoising system becomes better fitted to the data. Indeed, the division of the data allows each DDAE to learn a more specialized denoising mapping, which is more efficient for producing robust features. We must also point out the time-consuming aspect of our system: naturally, learning q different DDAEs does, in addition to requiring a large amount of data, take a lot of time to run. The results presented in the previous chapter demonstrate that the novel {MFCC + DDAE_q + GMM-HMM} system allows a 2-3% improvement in recognition. Nevertheless, it might be improved by exploring the following leads:

Explore the overfitting of the system by comparing the performance of the DDAE on its training data and on the testing data.

Increase the amount of training data for the neural nets. Indeed, in this work the amount was limited by the TIMIT corpus but also by the computational cost of learning from such large databases.

Concerning pre-training, the first step would be to extend the AE pre-training to each layer of the DDAE, as is done with RBM pre-training. A straightforward improvement is also to use RBM unsupervised learning to pre-train the DDAE.

Concatenate the denoised MFCC with the standard MFCC to try to compensate for the noise introduced by the DDAE, and maybe use the coefficient-to-coefficient trick to reduce the correlation of the output vector.

Appendices

A Kaldi formats

Figure A.1: ark-file produced from Matlab

Figure A.2: scp-file produced from Matlab

B TIMIT files

Figure B.1: orthographic transcription file (.txt)
Figure B.2: time-aligned word transcription file (.wrd)

Figure B.3: time-aligned phone transcription file (.phn)


Deep Learning Autoencoder Models Deep Learning Autoencoder Models Davide Bacciu Dipartimento di Informatica Università di Pisa Intelligent Systems for Pattern Recognition (ISPR) Generative Models Wrap-up Deep Learning Module Lecture Generative

More information

arxiv: v1 [cs.cl] 23 Sep 2013

arxiv: v1 [cs.cl] 23 Sep 2013 Feature Learning with Gaussian Restricted Boltzmann Machine for Robust Speech Recognition Xin Zheng 1,2, Zhiyong Wu 1,2,3, Helen Meng 1,3, Weifeng Li 1, Lianhong Cai 1,2 arxiv:1309.6176v1 [cs.cl] 23 Sep

More information

Robust Speaker Identification

Robust Speaker Identification Robust Speaker Identification by Smarajit Bose Interdisciplinary Statistical Research Unit Indian Statistical Institute, Kolkata Joint work with Amita Pal and Ayanendranath Basu Overview } } } } } } }

More information

Noise Robust Isolated Words Recognition Problem Solving Based on Simultaneous Perturbation Stochastic Approximation Algorithm

Noise Robust Isolated Words Recognition Problem Solving Based on Simultaneous Perturbation Stochastic Approximation Algorithm EngOpt 2008 - International Conference on Engineering Optimization Rio de Janeiro, Brazil, 0-05 June 2008. Noise Robust Isolated Words Recognition Problem Solving Based on Simultaneous Perturbation Stochastic

More information

Deep Neural Networks

Deep Neural Networks Deep Neural Networks DT2118 Speech and Speaker Recognition Giampiero Salvi KTH/CSC/TMH giampi@kth.se VT 2015 1 / 45 Outline State-to-Output Probability Model Artificial Neural Networks Perceptron Multi

More information

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

A graph contains a set of nodes (vertices) connected by links (edges or arcs) BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

More information

Lecture 16 Deep Neural Generative Models

Lecture 16 Deep Neural Generative Models Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

10. Hidden Markov Models (HMM) for Speech Processing. (some slides taken from Glass and Zue course)

10. Hidden Markov Models (HMM) for Speech Processing. (some slides taken from Glass and Zue course) 10. Hidden Markov Models (HMM) for Speech Processing (some slides taken from Glass and Zue course) Definition of an HMM The HMM are powerful statistical methods to characterize the observed samples of

More information

Automatic Speech Recognition (CS753)

Automatic Speech Recognition (CS753) Automatic Speech Recognition (CS753) Lecture 12: Acoustic Feature Extraction for ASR Instructor: Preethi Jyothi Feb 13, 2017 Speech Signal Analysis Generate discrete samples A frame Need to focus on short

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Deep Learning for Speech Recognition. Hung-yi Lee

Deep Learning for Speech Recognition. Hung-yi Lee Deep Learning for Speech Recognition Hung-yi Lee Outline Conventional Speech Recognition How to use Deep Learning in acoustic modeling? Why Deep Learning? Speaker Adaptation Multi-task Deep Learning New

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.5. Spring 2010 Instructor: Dr. Masoud Yaghini Outline How the Brain Works Artificial Neural Networks Simple Computing Elements Feed-Forward Networks Perceptrons (Single-layer,

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information

Segmental Recurrent Neural Networks for End-to-end Speech Recognition

Segmental Recurrent Neural Networks for End-to-end Speech Recognition Segmental Recurrent Neural Networks for End-to-end Speech Recognition Liang Lu, Lingpeng Kong, Chris Dyer, Noah Smith and Steve Renals TTI-Chicago, UoE, CMU and UW 9 September 2016 Background A new wave

More information

Knowledge Extraction from DBNs for Images

Knowledge Extraction from DBNs for Images Knowledge Extraction from DBNs for Images Son N. Tran and Artur d Avila Garcez Department of Computer Science City University London Contents 1 Introduction 2 Knowledge Extraction from DBNs 3 Experimental

More information

Greedy Layer-Wise Training of Deep Networks

Greedy Layer-Wise Training of Deep Networks Greedy Layer-Wise Training of Deep Networks Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle NIPS 2007 Presented by Ahmed Hefny Story so far Deep neural nets are more expressive: Can learn

More information

Engineering Part IIB: Module 4F11 Speech and Language Processing Lectures 4/5 : Speech Recognition Basics

Engineering Part IIB: Module 4F11 Speech and Language Processing Lectures 4/5 : Speech Recognition Basics Engineering Part IIB: Module 4F11 Speech and Language Processing Lectures 4/5 : Speech Recognition Basics Phil Woodland: pcw@eng.cam.ac.uk Lent 2013 Engineering Part IIB: Module 4F11 What is Speech Recognition?

More information

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others) Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward

More information

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?

More information

Singer Identification using MFCC and LPC and its comparison for ANN and Naïve Bayes Classifiers

Singer Identification using MFCC and LPC and its comparison for ANN and Naïve Bayes Classifiers Singer Identification using MFCC and LPC and its comparison for ANN and Naïve Bayes Classifiers Kumari Rambha Ranjan, Kartik Mahto, Dipti Kumari,S.S.Solanki Dept. of Electronics and Communication Birla

More information

The Origin of Deep Learning. Lili Mou Jan, 2015

The Origin of Deep Learning. Lili Mou Jan, 2015 The Origin of Deep Learning Lili Mou Jan, 2015 Acknowledgment Most of the materials come from G. E. Hinton s online course. Outline Introduction Preliminary Boltzmann Machines and RBMs Deep Belief Nets

More information

Signal Modeling Techniques in Speech Recognition. Hassan A. Kingravi

Signal Modeling Techniques in Speech Recognition. Hassan A. Kingravi Signal Modeling Techniques in Speech Recognition Hassan A. Kingravi Outline Introduction Spectral Shaping Spectral Analysis Parameter Transforms Statistical Modeling Discussion Conclusions 1: Introduction

More information

Machine Learning. Neural Networks

Machine Learning. Neural Networks Machine Learning Neural Networks Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 Biological Analogy Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 THE

More information

Artificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino

Artificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino Artificial Neural Networks Data Base and Data Mining Group of Politecnico di Torino Elena Baralis Politecnico di Torino Artificial Neural Networks Inspired to the structure of the human brain Neurons as

More information

Deep Belief Networks are compact universal approximators

Deep Belief Networks are compact universal approximators 1 Deep Belief Networks are compact universal approximators Nicolas Le Roux 1, Yoshua Bengio 2 1 Microsoft Research Cambridge 2 University of Montreal Keywords: Deep Belief Networks, Universal Approximation

More information

Statistical NLP for the Web

Statistical NLP for the Web Statistical NLP for the Web Neural Networks, Deep Belief Networks Sameer Maskey Week 8, October 24, 2012 *some slides from Andrew Rosenberg Announcements Please ask HW2 related questions in courseworks

More information

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others) Machine Learning Neural Networks (slides from Domingos, Pardo, others) Human Brain Neurons Input-Output Transformation Input Spikes Output Spike Spike (= a brief pulse) (Excitatory Post-Synaptic Potential)

More information

Automatic Speech Recognition (CS753)

Automatic Speech Recognition (CS753) Automatic Speech Recognition (CS753) Lecture 21: Speaker Adaptation Instructor: Preethi Jyothi Oct 23, 2017 Speaker variations Major cause of variability in speech is the differences between speakers Speaking

More information

Hidden Markov Modelling

Hidden Markov Modelling Hidden Markov Modelling Introduction Problem formulation Forward-Backward algorithm Viterbi search Baum-Welch parameter estimation Other considerations Multiple observation sequences Phone-based models

More information

Jorge Silva and Shrikanth Narayanan, Senior Member, IEEE. 1 is the probability measure induced by the probability density function

Jorge Silva and Shrikanth Narayanan, Senior Member, IEEE. 1 is the probability measure induced by the probability density function 890 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 3, MAY 2006 Average Divergence Distance as a Statistical Discrimination Measure for Hidden Markov Models Jorge Silva and Shrikanth

More information

An Introduction to Bioinformatics Algorithms Hidden Markov Models

An Introduction to Bioinformatics Algorithms   Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Upper Bound Kullback-Leibler Divergence for Hidden Markov Models with Application as Discrimination Measure for Speech Recognition

Upper Bound Kullback-Leibler Divergence for Hidden Markov Models with Application as Discrimination Measure for Speech Recognition Upper Bound Kullback-Leibler Divergence for Hidden Markov Models with Application as Discrimination Measure for Speech Recognition Jorge Silva and Shrikanth Narayanan Speech Analysis and Interpretation

More information

Hidden Markov Models and Gaussian Mixture Models

Hidden Markov Models and Gaussian Mixture Models Hidden Markov Models and Gaussian Mixture Models Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 4&5 23&27 January 2014 ASR Lectures 4&5 Hidden Markov Models and Gaussian

More information

An Evolutionary Programming Based Algorithm for HMM training

An Evolutionary Programming Based Algorithm for HMM training An Evolutionary Programming Based Algorithm for HMM training Ewa Figielska,Wlodzimierz Kasprzak Institute of Control and Computation Engineering, Warsaw University of Technology ul. Nowowiejska 15/19,

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Sound Recognition in Mixtures

Sound Recognition in Mixtures Sound Recognition in Mixtures Juhan Nam, Gautham J. Mysore 2, and Paris Smaragdis 2,3 Center for Computer Research in Music and Acoustics, Stanford University, 2 Advanced Technology Labs, Adobe Systems

More information

Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015

Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015 Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch COMP-599 Oct 1, 2015 Announcements Research skills workshop today 3pm-4:30pm Schulich Library room 313 Start thinking about

More information

Recurrent Neural Networks

Recurrent Neural Networks Recurrent Neural Networks Datamining Seminar Kaspar Märtens Karl-Oskar Masing Today's Topics Modeling sequences: a brief overview Training RNNs with back propagation A toy example of training an RNN Why

More information

A New OCR System Similar to ASR System

A New OCR System Similar to ASR System A ew OCR System Similar to ASR System Abstract Optical character recognition (OCR) system is created using the concepts of automatic speech recognition where the hidden Markov Model is widely used. Results

More information

Reformulating the HMM as a trajectory model by imposing explicit relationship between static and dynamic features

Reformulating the HMM as a trajectory model by imposing explicit relationship between static and dynamic features Reformulating the HMM as a trajectory model by imposing explicit relationship between static and dynamic features Heiga ZEN (Byung Ha CHUN) Nagoya Inst. of Tech., Japan Overview. Research backgrounds 2.

More information

The error-backpropagation algorithm is one of the most important and widely used (and some would say wildly used) learning techniques for neural

The error-backpropagation algorithm is one of the most important and widely used (and some would say wildly used) learning techniques for neural 1 2 The error-backpropagation algorithm is one of the most important and widely used (and some would say wildly used) learning techniques for neural networks. First we will look at the algorithm itself

More information

Statistical Sequence Recognition and Training: An Introduction to HMMs

Statistical Sequence Recognition and Training: An Introduction to HMMs Statistical Sequence Recognition and Training: An Introduction to HMMs EECS 225D Nikki Mirghafori nikki@icsi.berkeley.edu March 7, 2005 Credit: many of the HMM slides have been borrowed and adapted, with

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

Speech Recognition HMM

Speech Recognition HMM Speech Recognition HMM Jan Černocký, Valentina Hubeika {cernocky ihubeika}@fit.vutbr.cz FIT BUT Brno Speech Recognition HMM Jan Černocký, Valentina Hubeika, DCGM FIT BUT Brno 1/38 Agenda Recap variability

More information

CS 6501: Deep Learning for Computer Graphics. Basics of Neural Networks. Connelly Barnes

CS 6501: Deep Learning for Computer Graphics. Basics of Neural Networks. Connelly Barnes CS 6501: Deep Learning for Computer Graphics Basics of Neural Networks Connelly Barnes Overview Simple neural networks Perceptron Feedforward neural networks Multilayer perceptron and properties Autoencoders

More information

CS 136a Lecture 7 Speech Recognition Architecture: Training models with the Forward backward algorithm

CS 136a Lecture 7 Speech Recognition Architecture: Training models with the Forward backward algorithm + September13, 2016 Professor Meteer CS 136a Lecture 7 Speech Recognition Architecture: Training models with the Forward backward algorithm Thanks to Dan Jurafsky for these slides + ASR components n Feature

More information

Notes on Back Propagation in 4 Lines

Notes on Back Propagation in 4 Lines Notes on Back Propagation in 4 Lines Lili Mou moull12@sei.pku.edu.cn March, 2015 Congratulations! You are reading the clearest explanation of forward and backward propagation I have ever seen. In this

More information

Lecture 5 Neural models for NLP

Lecture 5 Neural models for NLP CS546: Machine Learning in NLP (Spring 2018) http://courses.engr.illinois.edu/cs546/ Lecture 5 Neural models for NLP Julia Hockenmaier juliahmr@illinois.edu 3324 Siebel Center Office hours: Tue/Thu 2pm-3pm

More information

CSC321 Lecture 20: Autoencoders

CSC321 Lecture 20: Autoencoders CSC321 Lecture 20: Autoencoders Roger Grosse Roger Grosse CSC321 Lecture 20: Autoencoders 1 / 16 Overview Latent variable models so far: mixture models Boltzmann machines Both of these involve discrete

More information

Deep Learning Srihari. Deep Belief Nets. Sargur N. Srihari

Deep Learning Srihari. Deep Belief Nets. Sargur N. Srihari Deep Belief Nets Sargur N. Srihari srihari@cedar.buffalo.edu Topics 1. Boltzmann machines 2. Restricted Boltzmann machines 3. Deep Belief Networks 4. Deep Boltzmann machines 5. Boltzmann machines for continuous

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

Comparing linear and non-linear transformation of speech

Comparing linear and non-linear transformation of speech Comparing linear and non-linear transformation of speech Larbi Mesbahi, Vincent Barreaud and Olivier Boeffard IRISA / ENSSAT - University of Rennes 1 6, rue de Kerampont, Lannion, France {lmesbahi, vincent.barreaud,

More information

Hidden Markov Models and Gaussian Mixture Models

Hidden Markov Models and Gaussian Mixture Models Hidden Markov Models and Gaussian Mixture Models Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 4&5 25&29 January 2018 ASR Lectures 4&5 Hidden Markov Models and Gaussian

More information

Neural Networks and Deep Learning

Neural Networks and Deep Learning Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

More information

Does the Wake-sleep Algorithm Produce Good Density Estimators?

Does the Wake-sleep Algorithm Produce Good Density Estimators? Does the Wake-sleep Algorithm Produce Good Density Estimators? Brendan J. Frey, Geoffrey E. Hinton Peter Dayan Department of Computer Science Department of Brain and Cognitive Sciences University of Toronto

More information

The effect of speaking rate and vowel context on the perception of consonants. in babble noise

The effect of speaking rate and vowel context on the perception of consonants. in babble noise The effect of speaking rate and vowel context on the perception of consonants in babble noise Anirudh Raju Department of Electrical Engineering, University of California, Los Angeles, California, USA anirudh90@ucla.edu

More information

Neural Networks biological neuron artificial neuron 1

Neural Networks biological neuron artificial neuron 1 Neural Networks biological neuron artificial neuron 1 A two-layer neural network Output layer (activation represents classification) Weighted connections Hidden layer ( internal representation ) Input

More information

Neural networks and optimization

Neural networks and optimization Neural networks and optimization Nicolas Le Roux Criteo 18/05/15 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 1 / 85 1 Introduction 2 Deep networks 3 Optimization 4 Convolutional

More information

Massachusetts Institute of Technology

Massachusetts Institute of Technology Massachusetts Institute of Technology 6.867 Machine Learning, Fall 2006 Problem Set 5 Due Date: Thursday, Nov 30, 12:00 noon You may submit your solutions in class or in the box. 1. Wilhelm and Klaus are

More information

CS Homework 3. October 15, 2009

CS Homework 3. October 15, 2009 CS 294 - Homework 3 October 15, 2009 If you have questions, contact Alexandre Bouchard (bouchard@cs.berkeley.edu) for part 1 and Alex Simma (asimma@eecs.berkeley.edu) for part 2. Also check the class website

More information

Estimation of Cepstral Coefficients for Robust Speech Recognition

Estimation of Cepstral Coefficients for Robust Speech Recognition Estimation of Cepstral Coefficients for Robust Speech Recognition by Kevin M. Indrebo, B.S., M.S. A Dissertation submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment

More information

Sparse Models for Speech Recognition

Sparse Models for Speech Recognition Sparse Models for Speech Recognition Weibin Zhang and Pascale Fung Human Language Technology Center Hong Kong University of Science and Technology Outline Introduction to speech recognition Motivations

More information

Gaussian Mixture Model Uncertainty Learning (GMMUL) Version 1.0 User Guide

Gaussian Mixture Model Uncertainty Learning (GMMUL) Version 1.0 User Guide Gaussian Mixture Model Uncertainty Learning (GMMUL) Version 1. User Guide Alexey Ozerov 1, Mathieu Lagrange and Emmanuel Vincent 1 1 INRIA, Centre de Rennes - Bretagne Atlantique Campus de Beaulieu, 3

More information

Neural Networks. William Cohen [pilfered from: Ziv; Geoff Hinton; Yoshua Bengio; Yann LeCun; Hongkak Lee - NIPs 2010 tutorial ]

Neural Networks. William Cohen [pilfered from: Ziv; Geoff Hinton; Yoshua Bengio; Yann LeCun; Hongkak Lee - NIPs 2010 tutorial ] Neural Networks William Cohen 10-601 [pilfered from: Ziv; Geoff Hinton; Yoshua Bengio; Yann LeCun; Hongkak Lee - NIPs 2010 tutorial ] WHAT ARE NEURAL NETWORKS? William s notation Logis;c regression + 1

More information

Neural Networks in Structured Prediction. November 17, 2015

Neural Networks in Structured Prediction. November 17, 2015 Neural Networks in Structured Prediction November 17, 2015 HWs and Paper Last homework is going to be posted soon Neural net NER tagging model This is a new structured model Paper - Thursday after Thanksgiving

More information

ISOLATED WORD RECOGNITION FOR ENGLISH LANGUAGE USING LPC,VQ AND HMM

ISOLATED WORD RECOGNITION FOR ENGLISH LANGUAGE USING LPC,VQ AND HMM ISOLATED WORD RECOGNITION FOR ENGLISH LANGUAGE USING LPC,VQ AND HMM Mayukh Bhaowal and Kunal Chawla (Students)Indian Institute of Information Technology, Allahabad, India Abstract: Key words: Speech recognition

More information

Artificial Neural Networks Examination, June 2005

Artificial Neural Networks Examination, June 2005 Artificial Neural Networks Examination, June 2005 Instructions There are SIXTY questions. (The pass mark is 30 out of 60). For each question, please select a maximum of ONE of the given answers (either

More information

Why DNN Works for Acoustic Modeling in Speech Recognition?

Why DNN Works for Acoustic Modeling in Speech Recognition? Why DNN Works for Acoustic Modeling in Speech Recognition? Prof. Hui Jiang Department of Computer Science and Engineering York University, Toronto, Ont. M3J 1P3, CANADA Joint work with Y. Bao, J. Pan,

More information

Course Structure. Psychology 452 Week 12: Deep Learning. Chapter 8 Discussion. Part I: Deep Learning: What and Why? Rufus. Rufus Processed By Fetch

Course Structure. Psychology 452 Week 12: Deep Learning. Chapter 8 Discussion. Part I: Deep Learning: What and Why? Rufus. Rufus Processed By Fetch Psychology 452 Week 12: Deep Learning What Is Deep Learning? Preliminary Ideas (that we already know!) The Restricted Boltzmann Machine (RBM) Many Layers of RBMs Pros and Cons of Deep Learning Course Structure

More information

Math 350: An exploration of HMMs through doodles.

Math 350: An exploration of HMMs through doodles. Math 350: An exploration of HMMs through doodles. Joshua Little (407673) 19 December 2012 1 Background 1.1 Hidden Markov models. Markov chains (MCs) work well for modelling discrete-time processes, or

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Neural Networks Varun Chandola x x 5 Input Outline Contents February 2, 207 Extending Perceptrons 2 Multi Layered Perceptrons 2 2. Generalizing to Multiple Labels.................

More information

SINGLE CHANNEL SPEECH MUSIC SEPARATION USING NONNEGATIVE MATRIX FACTORIZATION AND SPECTRAL MASKS. Emad M. Grais and Hakan Erdogan

SINGLE CHANNEL SPEECH MUSIC SEPARATION USING NONNEGATIVE MATRIX FACTORIZATION AND SPECTRAL MASKS. Emad M. Grais and Hakan Erdogan SINGLE CHANNEL SPEECH MUSIC SEPARATION USING NONNEGATIVE MATRIX FACTORIZATION AND SPECTRAL MASKS Emad M. Grais and Hakan Erdogan Faculty of Engineering and Natural Sciences, Sabanci University, Orhanli

More information