PHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS


Jinjin Ye (jinjin.ye@mu.edu), Michael T. Johnson (mike.johnson@mu.edu), Richard J. Povinelli (richard.povinelli@mu.edu)

ABSTRACT

Although isolated phoneme classification using features from time-domain phase space reconstruction has been investigated recently, the best representation of the feature vectors for discriminability across phoneme classes is still an open question. This paper applies Principal Component Analysis (PCA) to feature vectors from the reconstructed phase space. The PCA projection orthogonalizes the basis of the feature space, and a Bayes classifier uses the transformed feature vectors to classify phoneme exemplars. The results show that classification accuracy with the PCA method surpasses the accuracy obtained with the original features in most cases. PCA projection was implemented in three ways over the reconstructed phase space, on both speaker-dependent and speaker-independent data. Models are trained and tested using data drawn from the TIMIT database.

1. INTRODUCTION

State-of-the-art speech recognition systems typically use cepstral coefficient features, obtained via a frame-based spectral analysis of the speech signal. Such frequency-domain approaches do not necessarily preserve the nonlinear information present in speech. The reconstructed phase space (Abarbanel, 1996; Kantz and Schreiber, 2000) can capture nonlinear information not preserved by traditional speech analysis techniques, which could lead to improved speech recognition. (This paper is based on work supported by NSF under Grant No. IIS-0113508.)

The classical techniques for the phoneme classification task are Hidden Markov Models (HMMs) (Lee and Hon, 1989; Young, 1992), often based on Gaussian Mixture Model (GMM) observation probabilities. The most common features are Mel Frequency Cepstral Coefficients (MFCCs).
As an alternative to these traditional techniques, a nonlinear dynamical systems method known as phase space reconstruction can be applied to the study of speech. The reconstructed phase space is simply a plot of the time-lagged vectors of the signal, used to represent its nonlinear structure. Geometric structures called trajectories or attractors occur in this processing space. A reconstructed phase space is topologically equivalent to the original system if the embedding dimension is large enough (Sauer et al., 1991), so the full dynamics of the system can be recovered through phase space reconstruction. Previous results (Ye et al., 2002) showed that a Bayes classifier using features extracted from phoneme reconstructed phase spaces can be effective in classifying phonemes.

To truly represent the underlying dynamical systems that produce speech signals, a high-dimensional phase space reconstruction is usually required. Considering the computational cost of the phase space method and its training data requirements, a lower-dimensional reconstruction is usually desired in practice. Principal Component Analysis (PCA) is a transformation that can be used to reduce the feature dimension: the original feature space is transformed to another feature space with a different orthogonal basis, and the eigenspaces that retain the most significant amount of information are kept. Previous work on transformations of frequency-domain features (Kwon et al., 2002) and phase space features

(Broomhead and King, 1986) can also be found in the literature. This paper shows that PCA projection generally improves the accuracy of isolated phoneme classification using features from the reconstructed phase space. The work uses a nonparametric distribution model of the phoneme reconstructed phase spaces as input to a Bayes classifier, which is trained and tested on both speaker-dependent and speaker-independent corpora from the TIMIT database.

2. METHOD

2.1. Phase Space Reconstruction

Phase space reconstruction techniques are founded on the underlying principles of dynamical systems theory (Sauer et al., 1991; Takens, 1980) and have been applied to a variety of time series analysis and nonlinear signal processing applications (Abarbanel, 1996; Kantz and Schreiber, 2000). Given a time series

    x = x_n,  n = 1 ... N,    (1)

where n is a time index and N is the number of observations, the vectors in the reconstructed phase space are formed according to Takens' delay method (Takens, 1980):

    x_n = [ x_n  x_{n+τ}  ...  x_{n+(d-1)τ} ],    (2)

where τ is the time delay and d is the embedding dimension. This reconstructed phase space is in essence no more than a multi-dimensional plot of the signal against delayed versions of itself. Figure 1 provides an illustrative phoneme reconstructed phase space with trajectory information, and Figure 2 provides one with density information.

[Figure 1: Reconstructed phase space of the vowel phoneme /aa/, illustrating its trajectory.]

The choice of time lag for the reconstructed phase space is empirical but guided by key measures such as mutual information and autocorrelation (Abarbanel, 1996; Kantz and Schreiber, 2000). Based on these measures, a time lag of six is selected for all experiments. The embedding dimensions before and after PCA projection are 15 and 3, respectively.
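As a concrete illustration of equation (2), the delay embedding can be written in a few lines of NumPy. This is a minimal sketch: the lag τ = 6 and dimension d = 15 mirror the settings quoted above, while the function and variable names are our own.

```python
import numpy as np

def delay_embed(x, d, tau):
    """Equation (2): stack the delay vectors
    [x_n, x_{n+tau}, ..., x_{n+(d-1)*tau}] as rows."""
    n_vec = len(x) - (d - 1) * tau      # number of complete delay vectors
    cols = [x[i * tau : i * tau + n_vec] for i in range(d)]
    return np.column_stack(cols)

# A short synthetic signal standing in for a phoneme waveform
x = np.sin(0.1 * np.arange(200))
X = delay_embed(x, d=15, tau=6)
print(X.shape)                          # (116, 15), since 200 - 14*6 = 116
```

Each row of X is one point x_n of equation (2); zero-meaning and radius normalization, as described in the text, would then be applied before further processing.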
In practice, the attractor is zero-meaned in the phase space, and the amplitude variation is normalized from phoneme to phoneme using the standard deviation of the radius.

[Figure 2: Reconstructed phase space of the vowel phoneme /aa/, illustrating its density.]

2.2. Principal Component Analysis

To perform PCA over the phase space features, a trajectory matrix is compiled from the vectors created by the time delay embedding method:

    X = [ x_1           x_{1+τ}        ...  x_{1+(d-1)τ}
          x_2           x_{2+τ}        ...  x_{2+(d-1)τ}
          ...
          x_{N-(d-1)τ}  x_{N-(d-2)τ}   ...  x_N          ].    (3)

A scatter matrix is formed,

    S = X^T X,    (4)

and an eigendecomposition is performed such that

    S = Φ Λ Φ^T,    (5)

where the eigenvalues of Λ are reordered in non-increasing order along the diagonal. The largest eigenvalues are selected, and Φ̃ is the matrix containing the corresponding columns of Φ. Then

    Y = X Φ̃    (6)

is the PCA-projected trajectory matrix.

Three types of projection were implemented and applied to each set of experiments, which we denote PCA projection, individual projection, and class-based projection. The implementations differ mainly in how the transformations are computed and applied to the data set. The PCA projection learns one scatter matrix from all the training data and applies the resulting transformation to the trajectory matrix of each phoneme. The individual projection learns and applies a transformation to the trajectory matrix of each phoneme on an example-by-example basis. The class-based projection involves two steps: in the training phase, it learns a scatter matrix and applies a transformation for each phoneme class; assuming there are C phoneme classes, in the test phase it applies the C different transformations to the trajectory matrix of each phoneme exemplar, and the projected trajectory matrices are used to compute probabilities under the corresponding class models for the Bayes classifier.

2.3. Nonparametric Distribution Model of the Reconstructed Phase Space

A statistical characterization of the reconstructed phase space, related to the natural measure or natural distribution of the attractor (Abarbanel, 1996; Kantz and Schreiber, 2000), is estimated by dividing the reconstructed phase space into 100 histogram bins, as illustrated in Figure 2. This is done by dividing each dimension into ten partitions such that each partition contains approximately 10% of all training data points.
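The projection of equations (4)-(6) and the decile partitioning of Section 2.3 can be sketched as follows. This is a minimal sketch with our own naming: `pca_project` keeps the k eigenvectors with the largest eigenvalues, and `decile_edges` returns the nine interior bin edges per dimension, so that each of the ten partitions holds roughly 10% of the training points.

```python
import numpy as np

def pca_project(X, k):
    """Equations (4)-(6): scatter matrix S = X^T X, eigendecomposition
    S = Phi Lambda Phi^T with eigenvalues sorted non-increasing, then
    projection of X onto the k leading eigenvectors."""
    S = X.T @ X                                # scatter matrix, eq. (4)
    eigvals, Phi = np.linalg.eigh(S)           # symmetric eigendecomposition, eq. (5)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalues first
    return X @ Phi[:, order[:k]]               # projected trajectory matrix, eq. (6)

def decile_edges(Y):
    """Section 2.3: per-dimension bin edges placing roughly 10% of the
    training points in each of ten partitions."""
    return [np.quantile(Y[:, j], np.linspace(0.1, 0.9, 9))
            for j in range(Y.shape[1])]

# Example: reduce a random 15-dimensional trajectory matrix to 3 dimensions
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 15))
Y = pca_project(X, k=3)
print(Y.shape)                                 # (500, 3)
edges = decile_edges(Y)                        # three arrays of nine edges each
```

Histogram counts over these partitions then supply the class-conditional probabilities used by the Bayes classifier of equation (7). An SVD of X would give the same projection without forming S explicitly, and is numerically preferable for ill-conditioned data.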
The intercepts of the bins are determined using all the training data. A typical phoneme reconstructed phase space is shown in Figure 1 with the corresponding intercepts, demonstrating the structure of the embedded signal. Figure 2 gives a portrait of the reconstructed phase space based on the natural distribution.

2.4. The Bayes Classifier

The estimates of the natural distribution are used as input to a Bayes classifier. This classifier computes the conditional probability of each class given the attribute values and selects the class with the highest conditional probability. If an instance is described by n attributes a_i (i = 1 ... n), then the class c to which the instance is assigned, from the set of possible classes C, according to a Maximum Likelihood (ML) classifier is:

    c = argmax_{c_j ∈ C}  p(c_j) Π_{i=1}^{n} p(a_i | c_j).    (7)

The conditional probabilities in this formula are obtained from the estimates of the natural distribution computed on the training data. This Bayes classifier minimizes the probability of classification error under the assumption that the sequence of points is independent.

3. EXPERIMENTS AND RESULTS

The TIMIT corpus was used to train and evaluate the phoneme classification task. The three types of projection were implemented as described before:

- PCA projection
- Individual projection
- Class-based projection

The embedding dimensions before and after PCA projection are 15 and 3, respectively, for all experiments. The speaker-dependent experiment used data from one male speaker, with 417 phoneme exemplars over the standard 48 phonemes (Lee and Hon, 1989). Classification results with the three types of projection were obtained from the speaker-dependent experiments, giving a comparison between the different implementations described above. The speaker-independent test used training data from six male speakers and testing data from three different male speakers, with experiments run separately on three types of phonemes: a total of 7 fricatives, 7 vowels, and 5 nasals were selected. Classification results with the three types of projection were also obtained from the speaker-independent experiments, giving an idea of how projection over the phase space affects classification accuracy on different types of phonemes.

Table 1 shows the results of the speaker-dependent experiments on a total of 48 phonemes with and without projection. Tables 2, 3, and 4 show the results of the speaker-independent experiments on the 7 fricative phonemes, 7 vowel phonemes, and 5 nasal phonemes, respectively, with and without projection.

Table 1: Phoneme classification results of speaker-dependent experiments on a total of 48 phonemes.

    Without projection   PCA       Individual   Class-based
    24.33%               28.47%    25.30%       11.19%

Table 2: Phoneme classification results of fricatives.

    Without projection   PCA       Individual   Class-based
    39.07%               42.38%    33.77%       29.14%

Table 3: Phoneme classification results of vowels.

    Without projection   PCA       Individual   Class-based
    40.54%               43.24%    29.68%       8.78%

Table 4: Phoneme classification results of nasals.

    Without projection   PCA       Individual   Class-based
    55.21%               48.96%    47.92%       48.96%

Table 1 gives a comparison between the different projection implementations on the full set of 48 phonemes.
In this case, the PCA projection method works best and the class-based projection method works worst. The PCA projection method also works best for the fricative and vowel classification tasks, while the class-based projection method gives the lowest classification accuracies for these two tasks. With the class-based projection method, some phonemes tend to be classified as one particular phoneme in both the fricative and vowel experiments; this confusion in the reconstructed phase space under the distribution model can be observed by examining the confusion matrices for each case.

4. CONCLUSIONS

This paper presented a novel approach to the isolated phoneme classification task using features from the reconstructed phase space. In order to find feature transformations over the reconstructed phase space that provide better discriminability in terms of classification

accuracy, principal component analysis over the reconstructed phase space was investigated. PCA has the potential to reduce the feature dimensionality while preserving the most significant information, corresponding to the largest eigenvalues. Three projection implementations were tested. The experimental results showed that, overall, the PCA projection method yielded the best classification accuracy and the class-based projection method the worst. Future work includes investigating feature transformations, such as multiple discriminant analysis and nonlinear component analysis, that may extract features more useful for classification.

REFERENCES

Abarbanel, H.D.I., 1996. Analysis of Observed Chaotic Data. Springer, New York, xiv, 272 pp.

Broomhead, D.S. and King, G., 1986. Extracting qualitative dynamics from experimental data. Physica D: 217-236.

Kantz, H. and Schreiber, T., 2000. Nonlinear Time Series Analysis. Cambridge Nonlinear Science Series 7. Cambridge University Press, New York, NY, 304 pp.

Kwon, O.-W., Lee, T.-W. and Chan, K., 2002. Application of variational Bayesian PCA for speech feature extraction. ICASSP, 1: 825-828.

Lee, K.-F. and Hon, H.-W., 1989. Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(11): 1641-1648.

Sauer, T., Yorke, J.A. and Casdagli, M., 1991. Embedology. Journal of Statistical Physics, 65(3/4): 579-616.

Takens, F., 1980. Detecting strange attractors in turbulence. Dynamical Systems and Turbulence, 898: 366-381.

Ye, J., Povinelli, R.J. and Johnson, M.T., 2002. Phoneme classification using naive Bayes classifier in reconstructed phase space. 10th Digital Signal Processing Workshop.

Young, S., 1992. The general use of tying in phoneme-based HMM speech recognition. ICASSP, 1: 569-572.