Reformulating the HMM as a trajectory model by imposing explicit relationship between static and dynamic features


Reformulating the HMM as a trajectory model by imposing explicit relationship between static and dynamic features Heiga ZEN (Byung Ha CHUN) Nagoya Inst. of Tech., Japan

Overview. 1. Research backgrounds 2. Reformulating the HMM as a trajectory model 3. Deriving its training algorithm 4. Discussing relationships to other techniques (EMLLT, PoE, & HMM-based speech synthesis) 5. Evaluations in both speech recognition & synthesis 6. Conclusions & future plans


Typical ASR framework. Feature vector: MFCC (or MF-PLP) plus their delta and delta-delta coefficients. Acoustic model: context-dependent HMMs. Language model: word N-gram. Limitations of the HMM: (1) piece-wise constant statistics within an HMM state, (2) the conditional independence assumption, (3) weak duration modeling.

Alternative acoustic models. (1) Piece-wise constant statistics within an HMM state: polynomial regression HMM, hidden dynamical model, vocal tract resonance model, etc. (2) Conditional independence assumption: partly hidden Markov model, stochastic segment model, switching linear dynamical system, conditional HMM, dynamic Bayesian network, frame-correlated HMM, etc. (3) Weak duration modeling: hidden semi-Markov model.

Dynamic features [Furui; 1986]. Augment the dimensionality of the observation vectors by appending their time derivatives. Recognition accuracy improves substantially; it is a simple method to capture time dependencies. However, it is an ad hoc rather than an essential solution: it allows inconsistent statistics between the static and dynamic features when the model is used generatively (an HMM over static & delta features ignores their relationship).
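As a concrete illustration of how the dynamic features are derived from the statics, here is a minimal sketch. It assumes the common difference windows $\Delta c_t = \frac{1}{2}(c_{t+1} - c_{t-1})$ and $\Delta^2 c_t = c_{t-1} - 2c_t + c_{t+1}$ with replicated edge frames; the talk's exact window coefficients may differ.

```python
import numpy as np

def add_dynamic_features(c):
    """Append delta and delta-delta features to a static sequence
    c of shape (T, M), giving observations o of shape (T, 3M)."""
    p = np.pad(c, ((1, 1), (0, 0)), mode="edge")   # replicate edge frames
    delta = (p[2:] - p[:-2]) / 2.0                 # 1st-order derivative
    delta2 = p[:-2] - 2.0 * p[1:-1] + p[2:]        # 2nd-order derivative
    return np.hstack([c, delta, delta2])

# Example: 5 frames of 3-dimensional statics -> (5, 9) observations
print(add_dynamic_features(np.random.randn(5, 3)).shape)
```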

Overview. 1. Research backgrounds 2. Reformulating the HMM as a trajectory model 3. Deriving its training algorithm 4. Discussing relationships to other techniques (EMLLT, PoE, & HMM-based speech synthesis) 5. Evaluations in both speech recognition & synthesis 6. Conclusions & future plans

Reformulating the HMM as a trajectory model (1). Output probability of $o$ from a standard HMM $\Lambda$: $P(o \mid \Lambda) = \sum_q P(o \mid q, \Lambda)\, P(q \mid \Lambda)$, where $o = [o_1^\top, \ldots, o_T^\top]^\top$ is the observation vector sequence ($KT \times 1$), $o_t$ the observation vector at time $t$ ($K \times 1$), $q = \{q_1, \ldots, q_T\}$ the Gaussian component sequence, $q_t$ the Gaussian component at time $t$, and $K$ the dimensionality of the observation vector.

Reformulating the HMM as a trajectory model (2). Output probability of $o$ from $\Lambda$ given $q$ is Gaussian: $P(o \mid q, \Lambda) = \mathcal{N}(o;\, \mu_q, \Sigma_q)$, where $\mu_q = [\mu_{q_1}^\top, \ldots, \mu_{q_T}^\top]^\top$ is the mean vector of $q$ ($KT \times 1$), $\Sigma_q = \mathrm{diag}(\Sigma_{q_1}, \ldots, \Sigma_{q_T})$ the covariance matrix of $q$ ($KT \times KT$), $\mu_{q_t}$ the mean vector of component $q_t$, and $\Sigma_{q_t}$ its covariance matrix.

Observation vector = static & dynamic features: $o_t = [c_t^\top, \Delta c_t^\top, \Delta^2 c_t^\top]^\top$, where $c_t$ is the static feature vector ($M \times 1$), $\Delta c_t$ its 1st-order time derivative, and $\Delta^2 c_t$ its 2nd-order time derivative ($K = 3M$). The dynamic features are calculated from the neighbouring static features ($c_{t-2}, \ldots, c_{t+2}$), e.g. $\Delta c_t = \frac{1}{2}(c_{t+1} - c_{t-1})$ and $\Delta^2 c_t = c_{t-1} - 2c_t + c_{t+1}$.

Relationship between $o$ and $c$ in matrix form: $o = Wc$, where $W$ is the $3MT \times MT$ window matrix projecting the static feature vector sequence $c = [c_1^\top, \ldots, c_T^\top]^\top$ ($MT \times 1$) into the augmented space. Each block row of $W$ holds the static, delta, and delta-delta window coefficients (e.g. $(0, 1, 0)$, $(-\frac{1}{2}, 0, \frac{1}{2})$, and $(1, -2, 1)$).
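A sketch of $W$ as an explicit matrix, under the same assumed windows and edge convention as the earlier sketch (the talk's coefficients may differ); each frame contributes one block of static, delta, and delta-delta rows:

```python
import numpy as np

def window_matrix(T, M):
    """Build the 3MT x MT window matrix W with o = W c, where c is the
    stacked static sequence and each frame's block of o is
    [c_t, delta c_t, delta^2 c_t]."""
    W = np.zeros((3 * M * T, M * T))
    I = np.eye(M)
    for t in range(T):
        prev, nxt = max(t - 1, 0), min(t + 1, T - 1)  # replicated edges
        r = 3 * M * t
        W[r:r + M, M * t:M * (t + 1)] = I                        # static
        W[r + M:r + 2 * M, M * prev:M * (prev + 1)] += -0.5 * I  # delta
        W[r + M:r + 2 * M, M * nxt:M * (nxt + 1)] += 0.5 * I
        W[r + 2 * M:r + 3 * M, M * prev:M * (prev + 1)] += I     # delta-delta
        W[r + 2 * M:r + 3 * M, M * t:M * (t + 1)] += -2.0 * I
        W[r + 2 * M:r + 3 * M, M * nxt:M * (nxt + 1)] += I
    return W

# Sanity check: frame 2's delta block equals (c[3] - c[1]) / 2
T, M = 5, 2
c = np.random.randn(T, M)
o = window_matrix(T, M) @ c.reshape(-1)
print(np.allclose(o[3 * M * 2 + M:3 * M * 2 + 2 * M], (c[3] - c[1]) / 2))
```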

Reformulating the HMM as a trajectory model (3). The current framework, $P(o \mid \Lambda)$ over the augmented vectors, is improper in the sense of statistical modeling: it allows inconsistent static and dynamic feature vectors when used as a generative model. The statistical model should be defined as a function of $c$, since the original observation is $c$, not the augmented variable $o$.

Reformulating the HMM as a trajectory model (4). $\mathcal{N}(Wc;\, \mu_q, \Sigma_q)$ should be normalized over $c$ to yield a valid PDF: $P(c \mid q, \Lambda) = \frac{1}{Z_q}\, \mathcal{N}(Wc;\, \mu_q, \Sigma_q)$, where $Z_q$ is the normalization constant.

Reformulating the HMM as a trajectory model (5). The normalized distribution is again Gaussian, but a different one: $P(c \mid q, \Lambda) = \mathcal{N}(c;\, \bar{c}_q, P_q)$, with mean vector $\bar{c}_q$ ($MT \times 1$) and covariance matrix $P_q$ ($MT \times MT$), given by $P_q^{-1} = R_q = W^\top \Sigma_q^{-1} W$, $r_q = W^\top \Sigma_q^{-1} \mu_q$, and $\bar{c}_q = P_q\, r_q$.

Reformulating the HMM as a trajectory model (6). We may define a new statistical model by $P(c \mid \Lambda) = \sum_q P(c \mid q, \Lambda)\, P(q \mid \Lambda)$, referred to as the "trajectory-HMM". The mean vector $\bar{c}_q$ is given as a smooth trajectory, so the statistics vary within a state. The covariance matrix $P_q$ is full, so the state output probabilities have inter-frame dependencies.
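The following sketch shows how global normalization turns step-wise HMM statistics into a smooth trajectory with a full covariance. The per-frame means and variances are made-up toy values, not numbers from the talk; `window_matrix` is the function from the sketch above.

```python
import numpy as np

# Two "states" over T = 6 frames: static means 0 for the first three
# frames, 1 for the last three; delta/delta-delta means zero; unit
# variances throughout (toy values).
T, M = 6, 1
W = window_matrix(T, M)            # from the earlier sketch
mu = np.zeros(3 * M * T)
mu[0::3] = [0, 0, 0, 1, 1, 1]      # per-frame static means
Sigma_inv = np.eye(3 * M * T)      # diagonal precision of the HMM

R = W.T @ Sigma_inv @ W            # band-diagonal precision R_q
r = W.T @ Sigma_inv @ mu           # r_q
cbar = np.linalg.solve(R, r)       # mean trajectory: smooth, no step edge
P = np.linalg.inv(R)               # full covariance: inter-frame terms
print(np.round(cbar, 3))
print(np.round(P[0], 3))           # first row has off-diagonal mass
```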

Examples: the 1st mel-cepstral coefficient over time for the phoneme sequence sil a i d a sil. [Figure: natural speech vs. the trajectory-HMM mean trajectory over frames 5-55, alongside the inter-frame covariance matrix, with variance shaded from large to small.] The mean sequence varies within a state, and inter-frame correlation is captured by $P_q$. Moreover, both the mean and the covariance vary according to durations and neighbouring models, making it possible to capture coarticulation effects.

Overview. 1. Research backgrounds 2. Reformulating the HMM as a trajectory model 3. Deriving its training algorithm 4. Discussing relationships to other techniques (EMLLT, PoE, & HMM-based speech synthesis) 5. Evaluations in both speech recognition & synthesis 6. Conclusions & future plans

Estimating trajectory-HMM parameters. The auxiliary function of the EM algorithm (hidden variable: $q$) requires computing the posterior over all component sequences, which is prohibitive; exact inference is intractable. Viterbi approximation: alternately search for the best $q$ and optimize $\Lambda$.

Optimizing model parameters (1). Log likelihood of the trajectory-HMM: $\log P(c \mid q, \Lambda) = -\frac{1}{2}\left( MT \log 2\pi - \log |R_q| + (c - \bar{c}_q)^\top R_q\, (c - \bar{c}_q) \right) + \log P(q \mid \Lambda)$, where $R_q = W^\top \Sigma_q^{-1} W$ is band-diagonal and $\Sigma_q$ is diagonal.
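In code, the per-sequence Gaussian term can be evaluated from the precision matrix directly, with $\log|R_q|$ taken from its Cholesky factor. This is a sketch; a practical implementation would exploit the band structure rather than the dense factorization used here.

```python
import numpy as np

def trajectory_loglik(c, cbar, R):
    """log N(c; cbar, R^{-1}) for an MT-dim static sequence c, mean
    trajectory cbar, and (band) precision matrix R = W^T Sigma^-1 W."""
    d = c - cbar
    L = np.linalg.cholesky(R)                    # R = L L^T
    logdet_R = 2.0 * np.sum(np.log(np.diag(L)))  # log|R|
    n = c.shape[0]
    return -0.5 * (n * np.log(2.0 * np.pi) - logdet_R + d @ R @ d)
```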

Optimizing trajectory-HMM parameters (2). Introduce a Gaussian component sequence matrix $S_q$ ($3MT \times 3MN$, where $N$ is the number of Gaussians in the model set): a 0/1 block matrix mapping the stacked model-set parameters to the sequence level, e.g. $\mu_q = S_q\, m$ for the stacked mean vector $m$ ($3MN \times 1$), and analogously for the stacked (diagonal) covariances.

Optimizing mean vectors. By setting $\partial \log P(c \mid q, \Lambda) / \partial m = 0$, we obtain a set of linear equations whose coefficient matrix is symmetric and positive definite; the solution of this set of linear equations is the $m$ which maximizes the model likelihood.

Optimizing covariance matrices. The covariances can be optimized using gradient methods (e.g., steepest ascent, quasi-Newton).

Searching the best Gaussian component sequence. Computing $P(c \mid q, \Lambda)$ for all $q$ is intractable, because the inter- and intra-frame covariance matrix $P_q$ is full, so the standard Viterbi algorithm cannot be applied to find the best $q$. Instead, an approximate Viterbi algorithm is used to find a better $q$: $P(c \mid q, \Lambda)$ can be computed with a time-recursive Viterbi algorithm with delayed decision, making it possible to search for a sub-optimal $q$.

Time-recursive computation of $P(c \mid q, \Lambda)$ (1). Since $\Sigma_q$ is diagonal, $R_q$ and $r_q$ can be computed easily; $P_q$ and $\bar{c}_q$ are full, so computing them directly is difficult. However, they can be calculated time-recursively.

Time-recursive computation of $P(c \mid q, \Lambda)$ (2). $R_q$ is a band, symmetric, positive definite matrix, so it can be factorized by Cholesky decomposition, $R_q = U_q^\top U_q$, where $U_q$ is the Cholesky factor (upper triangular). The $t$-th diagonal element of $U_q$ depends only on the Gaussian components from nearby frames up to $t+L$ ($L$ = window length), so $U_q$ can be computed time-recursively.
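SciPy's banded Cholesky routines realize exactly this factorization-plus-substitution pattern in O(T) for a fixed bandwidth; a sketch (the frame-synchronous, time-recursive update of the factor itself is not shown):

```python
import numpy as np
from scipy.linalg import cholesky_banded, cho_solve_banded

def solve_spd_band(R, rhs, half_bw):
    """Solve R x = rhs for a symmetric positive-definite band matrix R
    with `half_bw` super-diagonals, via banded Cholesky factorization
    followed by forward and backward substitution."""
    n = R.shape[0]
    ab = np.zeros((half_bw + 1, n))            # upper banded storage
    for k in range(half_bw + 1):
        ab[half_bw - k, k:] = np.diag(R, k=k)  # k-th super-diagonal
    cb = cholesky_banded(ab)                   # upper Cholesky factor
    return cho_solve_banded((cb, False), rhs)

# e.g. the mean trajectory from the earlier sketch (half-bandwidth 2
# for M = 1 with +/-1-frame windows):
# cbar = solve_spd_band(R, r, half_bw=2)
```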

Time-recursive computation of $P(c \mid q, \Lambda)$ (3). $\bar{c}_q$ is obtained from $U_q$ by forward substitution followed by backward substitution, and these can also be computed time-recursively.

Time-recursive computation of $P(c \mid q, \Lambda)$ (4). As a result, $P(c \mid q, \Lambda)$ can be computed in a time-recursive manner.

Viterbi algorithm with delayed decision. Ex.) Approximate Viterbi algorithm with a 2-frame delayed decision: the Gaussian components of the preceding 2 frames and the succeeding frame are considered. The state sequence from 1 to t-3 has already been determined; to compute the likelihood at t, the statistics at t+1 are required (1-frame look-ahead).

Viterbi algorithm with delayed decision (cont.). Compute the likelihoods of all possible Gaussian sequences staying in state s at time t.

Viterbi algorithm with delayed decision: J-frame delayed decision. Determine the state at t-2 while at t, selecting the more likely path; this incorporates the effect of the state determination on neighbouring frames. Since coarticulation effects span roughly 100-200 ms, J = 10-20 is sufficient for a 10-ms frame shift.

Overview. 1. Research backgrounds 2. Reformulating the HMM as a trajectory model 3. Deriving its training algorithm 4. Discussing relationships to other techniques (EMLLT, PoE, & HMM-based speech synthesis) 5. Evaluations in both speech recognition & synthesis 6. Conclusions & future plans

Relationship to EMLLT (1). Covariance matrix modeling in ASR: diagonal covariances are unable to capture intra-frame correlation; full covariances greatly increase the number of model parameters. Structured inverse covariance (precision) matrix modeling: Semi-Tied Covariance matrices (STC), Extended Maximum Likelihood Linear Transformation (EMLLT), Subspace for Precision And Mean (SPAM) model.

Model | Basis type          | Basis order
STC   | rank-1 symmetric    | equal to dimensionality
EMLLT | rank-1 symmetric    | more than dimensionality
SPAM  | full-rank symmetric | more than dimensionality

Relationship to EMLLT (2). The inverse covariance (precision) matrix of the trajectory-HMM, $R_q = W^\top \Sigma_q^{-1} W$, is a sum of rank-1 symmetric matrices, with more basis matrices than the dimensionality: an EMLLT-style model that captures inter-frame correlation.
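This decomposition is easy to verify numerically: each row of $W$ contributes one rank-1 symmetric term to the precision (reusing `W`, `Sigma_inv`, and `R` from the earlier sketch):

```python
# R = W^T Sigma^{-1} W as a sum of rank-1 symmetric basis matrices,
# one per row of W (far more terms than the MT dimensionality).
rank1_sum = sum(Sigma_inv[i, i] * np.outer(W[i], W[i])
                for i in range(W.shape[0]))
print(np.allclose(rank1_sum, R))   # True
```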

Relationship to Product of Experts (PoE). EMLLT can be viewed as a PoE [Sim & Gales; '04]; an HMM over static & delta features can be viewed as a PoE [Williams; '05]. The trajectory-HMM can likewise be viewed as a PoE: the augmented observations are modeled by Gaussian experts, and the product of Gaussians is normalized to yield a valid PDF.

Relationship to HMM-based speech synthesis. Current speech synthesis paradigms: unit selection and concatenation (high quality, but sometimes discontinuous; obtaining various voice qualities requires a large amount of speech data) versus speech synthesis from HMMs themselves (HTS) (buzzy, but smooth & stable; voice quality can be changed, e.g., by adaptation, interpolation, or eigenvoices).

Speech parameter generation from HMM (1). Synthesize speech by maximizing its output probability: for a given HMM $\Lambda$ and Gaussian component sequence $q$, determine the observation vector sequence which maximizes its output probability from $q$. Without constraints, the solution simply becomes the sequence of mean vectors.

Speech parameter generation from HMM (2). Recall the relationship in matrix form, $o = Wc$: $W$ is the $3MT \times MT$ window matrix projecting the static feature vector sequence $c$ ($MT \times 1$) into the augmented space, as before.

Speech parameter generation from HMM (3). For a given HMM $\Lambda$ and Gaussian component sequence $q$, determine the observation vector sequence which maximizes its output probability from $q$ under the constraint $o = Wc$.

Speech parameter generation from HMM (3, cont.). By setting $\partial \log P(Wc \mid q, \Lambda) / \partial c = 0$, we obtain $W^\top \Sigma_q^{-1} W\, c = W^\top \Sigma_q^{-1} \mu_q$: a sequence of speech parameter vectors determined from the statistics of both the static and dynamic features. [Figure: generated static and delta trajectories with per-frame means and variances for the segment /sil/ /a/ /i/ /sil/.]

Relationship between HTS and trajectory-HMM (1). The mean vector $\bar{c}_q$ of the trajectory-HMM and the speech parameter vector sequence which maximizes its output probability from $q$ are completely the same.
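The identity is easy to confirm numerically: the parameter-generation solution coincides with the trajectory-HMM mean, since both solve the same linear system (reusing `W`, `Sigma_inv`, `mu`, and `cbar` from the earlier sketches):

```python
# MLPG solution of W^T S^-1 W c = W^T S^-1 mu vs. trajectory-HMM mean
c_gen = np.linalg.solve(W.T @ Sigma_inv @ W, W.T @ Sigma_inv @ mu)
print(np.allclose(c_gen, cbar))    # True: the same trajectory
```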

Relationship between HTS and trajectory-HMM (2). When does $P(c \mid q, \Lambda)$ take its maximum value? At $c = \bar{c}_q$. Estimating the trajectory-HMM under the ML criterion therefore minimizes the mean square error between the training data $c$ and the generated trajectory $\bar{c}_q$; estimating parameters by maximizing $P(c \mid q, \Lambda)$ may thus improve HMM-based speech synthesis.

Overview. 1. Research backgrounds 2. Reformulating the HMM as a trajectory model 3. Deriving the training algorithm 4. Discussing relationships to other techniques (EMLLT, PoE, HMM-based speech synthesis) 5. Evaluation in both speech recognition & synthesis 6. Conclusions & future plans

Speech recognition experiment.
Database: ATR Japanese continuous speech database, b-set, speaker MHT
Training data: 450 utterances
Test data: 53 utterances
Feature vector: 0th-18th order mel-cepstral coefficients and their delta and delta-delta
Model structure: 3-state left-to-right monophone models with single-Gaussian state output distributions
Training procedure: after training standard HMMs (baseline), trajectory-HMMs are re-estimated using the standard HMMs as their initial models

[Figure: average log likelihood per frame vs. number of Viterbi-training iterations (1-10), for delayed-decision delays J up to 7.] The approximate Viterbi algorithm with larger J found a more likely q, and iterative training improved the model likelihood.

Examples of trajectories (data & mean). [Figure, built up over three slides: the 1st mel-cepstral coefficient c(1) over 0-1.8 s for the phoneme sequence sil j i b u N n o j i ts u ry o k u w a pau, showing one of the training data, the HMM mean sequence, the mean trajectory generated from the HMM, and the mean trajectory from the trajectory-HMM.]

Phoneme recognition experiment. 100-best lists were generated using the baseline HMMs, and each hypothesis was re-aligned. Without the reference hypothesis (baseline: 19.7%): [Figure: phoneme error rate vs. number of Viterbi-training iterations (1-10), for delays J up to 7.] The trajectory-HMM reached 18.0% (a 9% error reduction).

Phoneme recognition experiment. With the reference hypothesis included (baseline: 15.9%): [Figure: phoneme error rate vs. number of Viterbi-training iterations (1-10), for delays J up to 7.] The trajectory-HMM reached 9.0% (a 43% error reduction).

Speech synthesis experiment.
Spectrum model: single Gaussian; F0 model: multi-space probability distribution
Training data: CMU ARCTIC database, speaker AWB, first 1096 utterances
Test data: remaining 42 utterances
Sampling rate: 16 kHz
Window: 25-ms Blackman window
Frame rate: 5 ms
Spectral analysis: 24th-order mel-cepstral analysis
Dynamic features: calculated from neighbouring frames
Feature vector: c(0)-c(24), log F0, and their delta and delta-delta
Topology: 5-state left-to-right HMM with no skip

Subjective listening test. Test type: paired comparison test. Subjects: 8 graduate students. Test sentences: 20 sentences chosen at random. Preference scores: Baum-Welch 42.5%, Viterbi 38.4%, trajectory-HMM 69.1% (error bars: 95% confidence intervals).

Overview. 1. Research backgrounds 2. Reformulating the HMM as a trajectory model 3. Deriving the training algorithm 4. Discussing relationships to other techniques (EMLLT, PoE, HMM-based speech synthesis) 5. Evaluation in both speech recognition & synthesis 6. Conclusions & future plans

Conclusions. Reformulated the HMM as a trajectory model; derived a Viterbi-type training algorithm; evaluated it in both speech recognition and synthesis, where significant improvements over the HMM were achieved. Future plans: designing & implementing a Viterbi decoder; large-scale evaluation (speaker-independent, LVCSR); EM-type training (variational EM or Monte Carlo EM).