ON THE USE OF MLP-DISTANCE TO ESTIMATE POSTERIOR PROBABILITIES BY KNN FOR SPEECH RECOGNITION


Zaragoza, 8-10 November 2006

Ana I. García Moral, Carmen Peláez Moreno
EPS-Universidad Carlos III
Departamento de Teoría de la Señal y Comunicaciones
Avda. de la Universidad 30, 28912 Leganés, Madrid

Hervé Bourlard
IDIAP Research Institute
CP 592, rue du Simplon 4, 1920 Martigny, Switzerland

ABSTRACT

In this work we estimate the a posteriori probabilities needed for speech recognition with the K-Nearest-Neighbors (KNN) rule, using a Multi-Layer Perceptron (MLP) to obtain the distances between neighbors. The work therefore has two parts: on the one hand, we estimate a posteriori probabilities by KNN with the aim of using them in a speech recognition system [1][2]; on the other hand, we propose a new distance measure, the MLP-distance.

1. INTRODUCTION

Typical Automatic Speech Recognition (ASR) systems use features obtained from the short-term spectrum, such as Mel-Frequency Cepstral Coefficients (MFCC) or Perceptual Linear Prediction (PLP). Phoneme posterior probabilities can also be used as features, being more stable and robust [1][2].

There are several types of non-parametric methods of interest in pattern recognition. Some of them estimate the density functions p(x | ω_j) - the class-conditional probability density function (the probability density of x given that the state of nature is ω_j) - from sample patterns. Other alternatives directly estimate the a posteriori probabilities P(ω_j | x) - the probability that the state of nature is ω_j given that the feature value x has been measured. This is closely related to non-parametric design procedures, such as the nearest-neighbor rule, which bypass explicit probability estimation and go directly to decision functions. Besides, there are also non-parametric procedures for transforming the feature space in the hope that parametric methods can be employed in the transformed space.

The KNN classifier is a very simple non-parametric method for classification. Despite the simplicity of the algorithm, it performs very well and is an important benchmark method. The KNN classifier, as described in [3], requires a distance metric d, a positive integer k, and the reference templates X_n of n labeled patterns. Euclidean or Mahalanobis distances have typically been used as the local distance between vectors. In this work we investigate the use of the MLP-distance as a measure of local similarity between two vectors, since the a posteriori probability vector can be seen as a distribution over the phoneme space.

This work must be considered as an experiment to evaluate the effectiveness of the KNN algorithm for estimating a posteriori probabilities but, above all, as a preliminary experiment to evaluate the potential usefulness of an MLP as a distance measure between vectors. We therefore train an MLP with only two input vectors and one output, predicting whether or not (0/1 output) the two input vectors belong to the same class. Using the softmax output of the MLP, we obtain a kind of non-linear distance between input vectors, which can be used to replace the Euclidean distance usually employed in KNN.

Figure 1 shows a block diagram of the system, which can be divided into three main blocks. The first, feature extraction (for obtaining a posteriori probabilities), whose details we do not discuss here, can be replaced by another front end or even omitted, using for example PLP features directly.
The second and third blocks, the proposed MLP_d (an MLP used as a distance measure) and the KNN a posteriori probability estimation, are described in detail in Sections 3 and 2, respectively.

Figure 1. Block diagram of the system: PLP features (39-dimensional) are fed to an MLP to obtain a posteriori probabilities (posteriors, 27-dimensional) for the train and test sets; a random selection (per phoneme) of n train posteriors is used by MLP_d to compute MLP-distances, which feed the KNN stage that produces the new test posteriors.

The document is organized as follows: Section 2 describes the KNN rule and its application as a posterior probability estimator, Section 3 introduces the proposed MLP-distance, Section 4 presents experiments and results and, finally, Section 5 gives conclusions and some ideas for future work.

2. K_N-NEAREST-NEIGHBORS ESTIMATION

To estimate p(x) from n training vectors or prototypes, we can center a cell about x and let it grow until it captures k_n samples, where k_n is some specified function of n. These samples are the k_n nearest neighbors of x. If the density is high near x, the cell will be relatively small, which leads to good resolution. If the density is low, it is true that the cell will grow large, but it will stop soon after it enters regions of higher density. In either case, if we take

    p_n(x) = (k_n / n) / V_n                                    (1)

we want k_n to go to infinity as n goes to infinity, since this assures us that k_n/n will be a good estimate of the probability that a point falls in the cell of volume V_n. However, we also want k_n to grow sufficiently slowly that the size of the cell needed to capture k_n training samples shrinks to zero. Thus, it is clear from (1) that the ratio k_n/n must go to zero. Although we shall not supply a proof, it can be shown that the conditions lim_{n→∞} k_n = ∞ and lim_{n→∞} k_n/n = 0 are necessary and sufficient for p_n(x) to converge to p(x) in probability at all points where p(x) is continuous. If we take k_n = √n and assume that p_n(x) is a reasonably good approximation to p(x), we see from (1) that V_n ≈ 1/(√n p(x)).
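To make equation (1) concrete, the following is a minimal numpy sketch of k_n-nearest-neighbor density estimation, under the assumptions of a Euclidean distance and synthetic data; the function name and constants are illustrative, not part of the paper's system.

import numpy as np
from math import gamma, pi

def knn_density(x, prototypes, k):
    """k_n-nearest-neighbor density estimate p_n(x) = (k/n) / V_n, as in eq. (1)."""
    n, d = prototypes.shape
    # Distance from x to every prototype (Euclidean here; the paper studies
    # replacing this local distance with an MLP-based one).
    dists = np.linalg.norm(prototypes - x, axis=1)
    r = np.sort(dists)[k - 1]                 # radius that reaches the k-th neighbor
    # Volume of a d-dimensional ball of radius r.
    V = (pi ** (d / 2) / gamma(d / 2 + 1)) * r ** d
    return (k / n) / V

# Toy usage with synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))                # n = 1000 prototypes in 2-D
k = int(np.sqrt(len(X)))                      # k_n = sqrt(n), as in Section 2
print(knn_density(np.zeros(2), X, k))         # should be close to the N(0, I) density at 0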

2.1. Estimation of a posteriori probabilities

The technique discussed in the previous section can be used to estimate the a posteriori probabilities P(ω_j | x) from a set of n labeled samples, by using the samples to estimate the densities involved. Suppose that we place a cell of volume V around x and capture k samples, k_i of which turn out to be labeled ω_i. Then the obvious estimate for the joint probability p(x, ω_i) is

    p_n(x, ω_i) = (k_i / n) / V                                 (2)

Thus, a reasonable estimate for P(ω_i | x) is

    P(ω_i | x) = p_n(x, ω_i) / Σ_{j=1}^{c} p_n(x, ω_j) = k_i / k    (3)

That is, the estimate of the a posteriori probability that ω_i is the state of nature is merely the fraction of the samples within the cell that are labeled ω_i. For minimum error rate, we select the category most frequently represented within the cell. If there are enough samples and if the cell is sufficiently small, it can be shown that this yields performance approaching the best possible.

When it comes to choosing the size of the cell, we can use the k_n-nearest-neighbor approach: V_n is expanded until some specified number of samples is captured, such as k_n = √n. As n goes to infinity, an infinite number of samples will fall within the infinitely small cell. The fact that the cell volume can become arbitrarily small and yet contain an arbitrarily large number of samples allows us to learn the unknown probabilities with virtual certainty and thus eventually obtain optimum performance. Interestingly enough, comparable performance can be obtained if we base our decision solely on the label of the single nearest neighbor of x [3].

3. MLP-DISTANCE

In the KNN algorithm described in Section 2, k_n is some specified function of n. Usually, we calculate the Euclidean distance between the vector we want to classify and the n prototypes (or training vectors), with the aim of selecting only the k prototypes closest to it (those with the smallest distance). Once we have selected these k nearest vectors, we only need to vote according to their targets, obtaining in this way the final posterior probabilities.

Now, instead of using the Euclidean distance, we propose to implement a very simple MLP which decides whether two input vectors belong to the same class or not. The softmax output of the MLP gives a number between 0 and 1: if the two inputs belong to the same class the MLP-distance will be close to 0, and close to 1 if they belong to different classes. Thus, we use 0.5 as the threshold.
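The sketch below illustrates how such an MLP-distance could be plugged into the KNN posterior estimation of equation (3). It is a minimal numpy sketch under several assumptions not given in the paper: the tanh hidden activation, the weight shapes, the helper names and the single logistic output (equivalent to a two-class softmax) are illustrative, and the weights W1, b1, W2, b2 are assumed to come from training on labeled pairs as described later in Section 4.3.

import numpy as np

def mlp_distance(a, b, W1, b1, W2, b2):
    """Pairwise MLP 'distance': probability that a and b belong to DIFFERENT
    classes (close to 0 for same class, close to 1 otherwise).
    Shapes assumed: W1 (54, H), b1 (H,), W2 (H,), b2 scalar."""
    h = np.tanh(np.concatenate([a, b]) @ W1 + b1)     # hidden layer on the 54-dim pair
    # Single logistic output; equivalent to a 2-class softmax.
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))

def knn_posteriors(x, prototypes, labels, k, dist_fn, n_classes):
    """Estimate P(class | x) = k_i / k over the k nearest prototypes, eq. (3)."""
    d = np.array([dist_fn(x, p) for p in prototypes])
    nearest = np.argsort(d)[:k]                       # k smallest distances
    counts = np.bincount(labels[nearest], minlength=n_classes)
    return counts / k

# Illustrative usage with hypothetical trained parameters and 27-dimensional
# posterior vectors as features:
# dist = lambda a, b: mlp_distance(a, b, W1, b1, W2, b2)
# new_post = knn_posteriors(test_vec, train_posteriors, train_labels, 3, dist, 27)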

4. EXPERIMENTS AND RESULTS

This work must be considered as a first experiment to evaluate the effectiveness of the MLP-distance as a local distance between vectors. We also want to evaluate the a posteriori probabilities estimated by KNN using this distance.

4.1. Database

In the following sections, results are reported on the Numbers95 [4] connected digit task. Numbers95 contains digit strings spoken in US English over a telephone channel and is a small-vocabulary database. There are 30 word types in the database, modelled by 27 context-independent phones. The training set consists of 3330 sentences (2996 sentences were used for training and 334 for cross-validation) and 2250 sentences were used for testing (see Table 1). Training and test were performed on clean speech only. The lexicon has 12 different words (zero to nine plus oh and silence) with a single pronunciation per word.

SET               sentences   PLP frames   Posterior frames
Train             2996        449.877      425.909
Cross-Validation  334         48.720       46.048
Test              2250        336.433      318.433

Table 1. Feature data sets.

4.2. Features

Short-term spectral features, such as MFCC or PLP, are traditionally used in ASR. They can be modeled by a mixture of Gaussians (the typical function used to estimate the emission distributions of a standard HMM system), and this is the reason why they have been applied so successfully. Nevertheless, spectral features not only contain lexical information, but also knowledge about the speaker or environmental noise. This extra information causes unnecessary variability in the feature vector, which may decrease the performance of the ASR system.

Thus, we can use a transformation of the traditional acoustic vectors as features for ASR, i.e. a posteriori probabilities. A multi-layer perceptron can be trained to estimate the phone posterior probabilities from the spectral features. In this case, the MLP performs a non-linear transformation. Because of this discriminant projection, posteriors are known to be more stable [1] and more robust to noise (chapter 6 of [5]). Moreover, the databases for training the MLP and for testing do not have to be the same, so it is possible to train the MLP on a general-purpose database and use this posterior estimator to obtain features for more specific tasks [6]. Also, phone posterior probabilities can be seen as phone detectors [7], which makes them a very suitable set of features for speech recognition systems, since words are formed by phones. Despite their good properties, posterior features cannot be easily modeled by a mixture of Gaussians.

In the so-called Tandem approach [1], posteriors are used as input features for a standard HMM/GMM system. However, a PCA transform of the logarithm of the a posteriori probabilities has to be applied first, to make them more Gaussian-like and to decorrelate the feature vector.
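The following is a minimal numpy sketch of this Tandem post-processing (log followed by PCA decorrelation), assuming matrices of frame posteriors; the epsilon value, the choice of fitting the PCA on the training posteriors and the function names are illustrative assumptions, not the paper's implementation.

import numpy as np

def fit_log_pca(train_post, eps=1e-10):
    """Fit a PCA decorrelation on log-posteriors (Tandem post-processing)."""
    logp = np.log(train_post + eps)            # log makes posteriors more Gaussian-like
    mean = logp.mean(axis=0)
    cov = np.cov(logp, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)       # eigenvectors of the covariance
    order = np.argsort(eigval)[::-1]           # sort components by decreasing variance
    return mean, eigvec[:, order]

def apply_log_pca(post, mean, components, eps=1e-10):
    """Project log-posteriors onto the decorrelating PCA basis."""
    return (np.log(post + eps) - mean) @ components

# Usage with hypothetical arrays train_post, test_post of shape (frames, 27):
# mean, comps = fit_log_pca(train_post)
# tandem_feats = apply_log_pca(test_post, mean, comps)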
We work with 39-dimensional acoustic vectors: 13 static features (PLP) extracted using a window length of 32 ms and a window shift of 12.5 ms, plus their delta and acceleration features. Posterior features were obtained using an MLP trained on Numbers95 (see Section 4.1). This MLP has 351 input nodes, corresponding to the concatenation of 9 frames of 39-dimensional acoustic vectors (this is the reason why we lose 8 frames per sentence when we calculate the a posteriori probabilities, as can be seen in Table 1), one hidden layer with 18 units, and 27 output units, each of them corresponding to a different monophone. After training the MLP, the train and cross-validation sets are also passed through the MLP (forward pass) to generate their a posteriori probabilities. A brief analysis of all the a posteriori probability sets is shown in Figure 2.

Figure 2. Analysis of the posterior sets: a) phoneme distribution in the Numbers95 training, cross-validation and test sets, and b) frame error rate per phoneme of the a posteriori probabilities (Training set FER 13.23%, Cross-Validation set FER 19.7%, Test set FER 19.67%).
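A minimal sketch of the 9-frame context stacking described above, which produces the 351-dimensional MLP input and explains the loss of 8 frames per sentence; the symmetric window (4 frames of context on each side) and the function name are illustrative assumptions.

import numpy as np

def stack_context(features, context=9):
    """Concatenate `context` consecutive frames around each centre frame.

    features: (n_frames, 39) PLP vectors for one sentence.
    Returns (n_frames - context + 1, context * 39); 8 frames are lost per
    sentence when context = 9, since 4 frames at each end have no full window."""
    n, dim = features.shape
    half = context // 2
    stacked = [features[i - half:i + half + 1].reshape(-1)
               for i in range(half, n - half)]
    return np.asarray(stacked)

# e.g. a 100-frame sentence of 39-dim PLPs yields 92 stacked vectors of size 351
print(stack_context(np.zeros((100, 39))).shape)   # -> (92, 351)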

4.3. Training MLP_d

The first design problem we face is deciding which n posterior probabilities from the training set should be used for training MLP_d (the MLP we use to calculate distances between posterior probability vectors). To train MLP_d we would have to confront every posterior vector (27-dimensional) with itself and with the rest of the posterior vectors, so we would have N^2 training frames, N being the number of a posteriori probabilities in the training set (in this case 425.909, see Table 1). This quantity of data was too large for our purposes, so we randomly selected a subset of approximately 120 frames per phoneme (more than 3 thousand posterior vectors, which are transformed into more than 9 million training frames). We also selected subsets for the test and cross-validation sets in the same way (see Table 2).

SET               frames
Train             3273
Cross-Validation  299
Test              6662

Table 2. MLP_d data sets.
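To make this construction of the MLP_d training data more concrete, here is a minimal numpy sketch of the per-phoneme random selection and same/different pair labeling described above; the helper names and the per-phoneme count in the usage comment are illustrative assumptions, not the paper's exact procedure.

import numpy as np

def select_per_phoneme(posteriors, labels, per_class, rng):
    """Randomly keep `per_class` posterior vectors for each phoneme label."""
    keep = []
    for ph in np.unique(labels):
        idx = np.flatnonzero(labels == ph)
        keep.extend(rng.choice(idx, size=min(per_class, len(idx)), replace=False))
    keep = np.array(keep)
    return posteriors[keep], labels[keep]

def make_pairs(posteriors, labels):
    """All ordered pairs (about N^2 of them) with a 0/1 same/different target."""
    n = len(posteriors)
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    inputs = np.hstack([posteriors[i.ravel()], posteriors[j.ravel()]])  # 54-dim inputs
    targets = (labels[i.ravel()] != labels[j.ravel()]).astype(int)      # 1 = different class
    return inputs, targets

# Usage with hypothetical arrays train_post (frames, 27) and train_lab (frames,):
# rng = np.random.default_rng(0)
# sub_post, sub_lab = select_per_phoneme(train_post, train_lab, 120, rng)
# X_pairs, y_pairs = make_pairs(sub_post, sub_lab)    # ~N^2 training examples for MLP_d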
The second design problem was to select the correct hidden size for MLP_d. Table 3 shows the Frame Error Rates (FER) obtained for different hidden sizes. Our final decision was 1 hidden units. Although, at first sight, the use of only 5 hidden units could be considered the best choice, the better results obtained in the next step (the KNN estimation described in Section 4.4) justified our decision. Analyzing the results shown in Table 3, we can see that the MLP-distance is a very good method to compute local distances between vectors.

Hidden size   FER (%)
5             1.85
1             1.98
5             5.92

Table 3. Frame Error Rates (%) in the MLP_d test (for a subset of 6662 frames).

4.4. KNN

We began the posterior estimation via KNN by selecting n vectors from the training set (the n prototypes). We use as prototypes the 3273 training vectors used for MLP_d training (see Table 2). Once we have calculated the distances for all test vectors (between each of them and all n training ones) using MLP_d, we try to fix k_n. As this process is very slow, we used only 5 utterances from the test set to carry out this selection. Figure 3 shows the evolution of the mean test frame error rate as k_n varies from 1 to 200. We can notice that better results are obtained considering only the 3 nearest training frames (with results for k_n = 1 and k_n = 3 very close to it).

Figure 3. Variation of the mean test (5 utterances) frame error rate with k_n.

Experiment                         Training frames   FER (%)
Baseline (posteriors before KNN)   449.877           19.67
3-NN (posteriors after KNN)        3.273             21.82

Table 4. Frame Error Rate comparison.

Once the best k_n was selected, we applied KNN to obtain new posteriors as explained in Sections 2 and 3. Table 4 shows the FER obtained before and after KNN (more precisely, 3-NN: 3 nearest neighbors), and Figure 4 shows the confusion matrix of the KNN-posteriors. We thus have an increase of approximately 2% in FER, but using only 0.77% (see Tables 2 and 1) of the training samples, which is a promising result.

5. CONCLUSIONS AND FURTHER WORK

As mentioned before, this is a preliminary experiment to evaluate the effectiveness of both the MLP as a distance measure (MLP-distance) and KNN as a posterior estimator. For the MLP-distance, we obtained very good results using a posteriori probabilities as features; however, we must still carry out experiments with other features (e.g. PLPs) and with other distance measures, such as the Kullback-Leibler (KL) divergence [8] or the classical Euclidean distance.

On the other hand, we also obtained very good results using KNN as a posterior estimator, mainly considering the quantity of data (less than 1% of the total training frames) used in the KNN estimation, as can be seen in Table 4. But, as already said in previous sections, this is only a preliminary experiment. Further work includes the following:

- Use PLP features to train MLP_d and also to obtain posteriors with KNN, and compare the results with those obtained with posterior features.
- Analyze other distances, such as KL or Euclidean, and compare them with the MLP-distance.
- Increase the number of prototypes n in KNN.
- Include these a posteriori probabilities in a complete Tandem system to obtain word error rates (WER).

Figure 4. Confusion matrix of the KNN-posteriors (frame accuracy rate per phoneme).

6. REFERENCES

[1] H. Hermansky, D. Ellis, and S. Sharma, "Tandem Connectionist Feature Extraction for Conventional HMM Systems," Proceedings of ICASSP, 2000.

[2] Q. Zhu, "On Using MLP Features in LVCSR," Proceedings of ICSLP, 2004.

[3] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Wiley, 2nd edition, 2001.

[4] J. Cole, M. Noel, T. Lander, and T. Durham, "New Telephone Speech Corpora at CSLU," Proceedings of the European Conference on Speech Communication and Technology, vol. 1, pp. 821-824, 1995.

[5] S. Ikbal, Nonlinear Feature Transformations for Noise Robust Speech Recognition, Ph.D. thesis, Ecole Polytechnique Fédérale de Lausanne, 2004.

[6] H. Hermansky and S. Sivadas, "On the Use of Task Independent Training Data in Tandem Feature Extraction," Proceedings of ICASSP, vol. 1, 2004.

[7] P. Niyogi and M. M. Sondhi, "Detecting Stop Consonants in Continuous Speech," The Journal of the Acoustical Society of America, vol. 111, no. 2, pp. 1063-1076, 2002.

[8] G. Aradilla, J. Vepa, and H. Bourlard, "Using Posterior-Based Features in Template Matching for Speech Recognition," Proceedings of ICSLP, June 2006.