Reformulating the HMM as a trajectory model by imposing an explicit relationship between static and dynamic features
Heiga ZEN (Byung Ha CHUN)
Nagoya Inst. of Tech., Japan
Overview
1. Research background
2. Reformulating the HMM as a trajectory model
3. Deriving its training algorithm
4. Discussing relationships to other techniques (EMLLT, PoE, & HMM-based speech synthesis)
5. Evaluations in both speech recognition & synthesis
6. Conclusions & future plans
Typical ASR framework
- Feature vector: MFCC (MF-PLP), their delta and delta-delta
- Acoustic model: context-dependent HMMs
- Language model: word N-gram
Limitations of the HMM:
(1) Piece-wise constant statistics within an HMM state
(2) Conditional independence assumption
(3) Weak duration modeling
Alternative acoustic models
(1) Piece-wise constant statistics within an HMM state
→ Polynomial regression HMM, hidden dynamical model, vocal tract resonance model, etc.
(2) Conditional independence assumption
→ Partly hidden Markov model, stochastic segment model, switching linear dynamical system, conditional HMM, dynamic Bayesian network, frame-correlated HMM, etc.
(3) Weak duration modeling
→ Hidden semi-Markov model
Dynamic features [Furui; 1986]
- Augment the dimensionality of observation vectors by adding their time derivatives
- Recognition accuracy improves substantially
- A simple method to capture time dependencies
- An ad hoc rather than essential solution: it allows inconsistent statistics between static and dynamic features when used as a generative model (an HMM with static & delta features ignores their relationship)
Overview
1. Research background
2. Reformulating the HMM as a trajectory model
3. Deriving its training algorithm
4. Discussing relationships to other techniques (EMLLT, PoE, & HMM-based speech synthesis)
5. Evaluations in both speech recognition & synthesis
6. Conclusions & future plans
Reformulating the HMM as a trajectory model (1)
Output probability of o from a standard HMM Λ:
  P(o | Λ) = Σ_q P(o | q, Λ) P(q | Λ)
- o = [o_1ᵀ, …, o_Tᵀ]ᵀ : observation vector sequence (KT × 1)
- o_t : observation vector at time t (K × 1)
- q = {q_1, …, q_T} : Gaussian component sequence
- q_t : Gaussian component at time t
- K : dimensionality of the observation vector
Reformulating the HMM as a trajectory model (2)
Output probability of o from Λ according to q is a Gaussian distribution:
  P(o | q, Λ) = N(o ; μ_q, Σ_q)
- μ_q = [μ_{q_1}ᵀ, …, μ_{q_T}ᵀ]ᵀ : mean vector of q (KT × 1)
- Σ_q = diag(Σ_{q_1}, …, Σ_{q_T}) : covariance matrix of q (KT × KT)
- μ_{q_t} : mean vector of q_t
- Σ_{q_t} : covariance matrix of q_t
Observation vector = static & dynamic features:
  o_t = [c_tᵀ, Δc_tᵀ, Δ²c_tᵀ]ᵀ
- c_t : static feature (M × 1)
- Δc_t : 1st-order time derivative
- Δ²c_t : 2nd-order time derivative
(K = 3M)
Dynamic features are calculated from the static features, e.g.
  Δc_t = (c_{t+1} − c_{t−1}) / 2
  Δ²c_t = c_{t−1} − 2c_t + c_{t+1}
Relationship between o and c in a matrix form:
  o = W c
- W : window matrix projecting c into the augmented space o (3MT × MT)
- c = [c_1ᵀ, …, c_Tᵀ]ᵀ : static feature vector sequence (MT × 1)
Ex.) each frame contributes rows built from the window coefficients, e.g. (0, 1, 0), (−1/2, 0, 1/2), (1, −2, 1)
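As an illustration, the relationship o = Wc can be sketched in a few lines of numpy. This is a minimal sketch, assuming the common window coefficients Δc_t = (c_{t+1} − c_{t−1})/2 and Δ²c_t = c_{t−1} − 2c_t + c_{t+1} (the deck's exact coefficients are partly garbled in this transcript), with window positions clamped at the utterance edges and M = 1:

```python
import numpy as np

# Assumed window coefficients for static, delta, and delta-delta features.
WINDOWS = [
    np.array([0.0, 1.0, 0.0]),    # static: c_t
    np.array([-0.5, 0.0, 0.5]),   # delta: (c_{t+1} - c_{t-1}) / 2
    np.array([1.0, -2.0, 1.0]),   # delta-delta: c_{t-1} - 2 c_t + c_{t+1}
]

def window_matrix(T):
    """Build the 3T x T matrix W such that o = W c (M = 1)."""
    W = np.zeros((3 * T, T))
    for t in range(T):
        for d, win in enumerate(WINDOWS):
            for tau, coef in zip((-1, 0, 1), win):
                u = min(max(t + tau, 0), T - 1)  # clamp at the edges
                W[3 * t + d, u] += coef
    return W

T = 4
W = window_matrix(T)
c = np.array([0.0, 1.0, 2.0, 3.0])   # toy static-feature sequence
o = W @ c                            # stacked [c_t, delta, delta-delta] per frame
```

For the linear ramp above, every interior frame gets delta 1 and delta-delta 0, as expected.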
Reformulating the HMM as a trajectory model (3)
The current framework is improper in the sense of statistical modeling:
- It allows inconsistent static and dynamic feature vectors when used as a generative model
- A statistical model should be defined as a function of c: the original observation is c, not the augmented variable o
Reformulating the HMM as a trajectory model (4)
P(o | q, Λ) should be normalized over c to yield a valid PDF:
  P(c | q, Λ) = (1 / Z_q) N(Wc ; μ_q, Σ_q)
- Z_q : normalization constant
where
  Z_q = ∫ N(Wc ; μ_q, Σ_q) dc
Reformulating the HMM as a trajectory model (5)
The normalized distribution is a different Gaussian, now over c:
  P(c | q, Λ) = N(c ; c̄_q, P_q)
- c̄_q : mean vector (MT × 1), c̄_q = P_q Wᵀ Σ_q⁻¹ μ_q
- P_q : covariance matrix (MT × MT), P_q⁻¹ = R_q = Wᵀ Σ_q⁻¹ W
[Figure: the per-frame Gaussians over o_1 … o_T are mapped to a single Gaussian over c_1 … c_T with a full covariance matrix P (elements P_11 … P_TT).]
Reformulating the HMM as a trajectory model (6)
We may define a new statistical model by
  P(c | Λ) = Σ_q P(c | q, Λ) P(q | Λ)
referred to as the "trajectory-HMM".
- The mean vector c̄_q is given as a smooth trajectory → variable statistics within a state
- The covariance matrix P_q is full → dependency between state output probabilities
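The construction of N(c; c̄_q, P_q) from per-frame statistics can be checked numerically. A minimal numpy sketch with toy values (the windows and edge clamping are the same assumptions as before; μ and σ² are random stand-ins for real model statistics):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5

# Window matrix W with o = Wc (assumed standard windows, edges clamped, M = 1).
WINDOWS = [np.array([0.0, 1.0, 0.0]),
           np.array([-0.5, 0.0, 0.5]),
           np.array([1.0, -2.0, 1.0])]
W = np.zeros((3 * T, T))
for t in range(T):
    for d, win in enumerate(WINDOWS):
        for tau, coef in zip((-1, 0, 1), win):
            W[3 * t + d, min(max(t + tau, 0), T - 1)] += coef

# Toy per-frame statistics of the o-space Gaussians (diagonal covariance).
mu = rng.normal(size=3 * T)                 # stacked means
sigma2 = rng.uniform(0.5, 2.0, size=3 * T)  # stacked variances

# Normalized Gaussian over c:
#   R = W' S^-1 W,  r = W' S^-1 mu,  cbar = R^-1 r,  P = R^-1
R = W.T @ np.diag(1.0 / sigma2) @ W
r = W.T @ (mu / sigma2)
cbar = np.linalg.solve(R, r)
P = np.linalg.inv(R)
```

R is symmetric positive definite and banded, while its inverse P is full: every pair of frames is correlated.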
[Figure: 1st Mel-cepstrum over time (frames 5–55) for "sil a i d a sil": natural speech vs. mean trajectory with variance shading (large/small), and the inter-frame covariance matrix.]
- The mean sequence varies within a state
- Inter-frame correlation is captured by P_q
[Figure: the same utterance "sil a i d a sil" with different durations: natural speech vs. mean trajectory, variance shading, and the inter-frame covariance matrix.]
- Both mean & covariance vary according to durations & neighboring models
- Possible to capture coarticulation effects
Overview
1. Research background
2. Reformulating the HMM as a trajectory model
3. Deriving its training algorithm
4. Discussing relationships to other techniques (EMLLT, PoE, & HMM-based speech synthesis)
5. Evaluations in both speech recognition & synthesis
6. Conclusions & future plans
Estimating trajectory-HMM parameters
- Auxiliary function of the EM algorithm (hidden variable: q):
  Q(Λ, Λ') = Σ_q P(q | c, Λ) log P(c, q | Λ')
- Computing P(q | c, Λ) is prohibitive: exact inference is intractable
→ Viterbi approximation: alternate between searching the best q and optimizing Λ
Optimizing model parameters (1)
Log likelihood of the trajectory-HMM:
  log P(c | q, Λ) = −(1/2) { MT log 2π − log |R_q| + (c − c̄_q)ᵀ R_q (c − c̄_q) }
where R_q = Wᵀ Σ_q⁻¹ W is band-diagonal and Σ_q is diagonal.
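This expression can be sanity-checked against a dense Gaussian. A short sketch (same assumed windows and toy statistics as elsewhere in these notes; scipy supplies the reference log-density):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
T = 5

# Window matrix (assumed standard windows, edges clamped, M = 1).
WINDOWS = [np.array([0.0, 1.0, 0.0]),
           np.array([-0.5, 0.0, 0.5]),
           np.array([1.0, -2.0, 1.0])]
W = np.zeros((3 * T, T))
for t in range(T):
    for d, win in enumerate(WINDOWS):
        for tau, coef in zip((-1, 0, 1), win):
            W[3 * t + d, min(max(t + tau, 0), T - 1)] += coef

mu = rng.normal(size=3 * T)                 # toy stacked means
sigma2 = rng.uniform(0.5, 2.0, size=3 * T)  # toy stacked variances

R = W.T @ np.diag(1.0 / sigma2) @ W         # band-diagonal precision
r = W.T @ (mu / sigma2)
cbar = np.linalg.solve(R, r)

c = rng.normal(size=T)                      # an "observed" static sequence
# log P(c|q) = -1/2 { T log 2pi - log|R| + (c - cbar)' R (c - cbar) }
dev = c - cbar
loglik = -0.5 * (T * np.log(2 * np.pi)
                 - np.linalg.slogdet(R)[1]
                 + dev @ R @ dev)

# Reference: dense multivariate Gaussian with covariance P = R^-1.
ref = multivariate_normal(mean=cbar, cov=np.linalg.inv(R)).logpdf(c)
```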
Optimizing trajectory-HMM parameters (2)
Introduce the Gaussian component sequence matrix S_q (3MT × 3MN):
  μ_q = S_q m
- m = [μ_1ᵀ, …, μ_Nᵀ]ᵀ : stacked mean parameters of all Gaussians (3MN × 1)
- N : #Gaussians in the model set
- S_q : 0/1 matrix mapping the model-set parameters to the sequence-level μ_q & Σ_q
Optimizing mean vectors
By setting
  ∂ log P(c | q, Λ) / ∂m = 0
we obtain a set of linear equations in m whose coefficient matrix is symmetric and positive definite (and sparse, since Σ⁻¹ is diagonal).
→ The solution of this set of linear equations is the m which maximizes the model likelihood.
Optimizing covariance matrices
The covariance (precision) parameters can be optimized using gradient methods (e.g., steepest ascent, quasi-Newton).
Searching the best Gaussian component sequence
- Computing P(c | q, Λ) for every q is intractable (because the inter- & intra-frame covariance matrix P_q is full)
- Unable to apply the standard Viterbi algorithm to find the best q
Instead, use an approximate Viterbi algorithm to find a better q:
- P(c | q, Λ) can be computed time-recursively
→ Viterbi algorithm with delayed decision
→ Possible to search a sub-optimal q
Time-recursive computation of P(c | q, Λ) (1)
- Σ_q is diagonal → its terms can be computed frame by frame
- P_q is full → computing log |R_q| & c̄_q directly is difficult
- However, they can be calculated time-recursively
Time-recursive computation of P(c | q, Λ) (2)
- R_q is a banded, symmetric, and positive definite matrix
- It can be factorized by Cholesky decomposition: R_q = U_qᵀ U_q
- U_q : Cholesky factor (upper triangular, banded)
- u_tt : t-th diagonal element of U_q; it depends only on the Gaussian components around t (within the window length L)
→ log |R_q| = 2 Σ_t log u_tt can be computed time-recursively
Time-recursive computation of P(c | q, Λ) (3)
c̄_q is obtained from R_q c̄_q = r_q, where r_q = Wᵀ Σ_q⁻¹ μ_q, via the Cholesky factor:
  U_qᵀ g_q = r_q  (forward substitution)
  U_q c̄_q = g_q  (backward substitution)
→ c̄_q can also be computed time-recursively
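Because R_q is banded, the factorization and both substitutions are cheap per frame. A sketch using scipy's banded Cholesky routines (same assumed windows and toy statistics; `ab` is LAPACK upper-banded storage):

```python
import numpy as np
from scipy.linalg import cholesky_banded, cho_solve_banded

rng = np.random.default_rng(2)
T = 6

# Window matrix (assumed standard windows, edges clamped, M = 1).
WINDOWS = [np.array([0.0, 1.0, 0.0]),
           np.array([-0.5, 0.0, 0.5]),
           np.array([1.0, -2.0, 1.0])]
W = np.zeros((3 * T, T))
for t in range(T):
    for d, win in enumerate(WINDOWS):
        for tau, coef in zip((-1, 0, 1), win):
            W[3 * t + d, min(max(t + tau, 0), T - 1)] += coef

sigma2 = rng.uniform(0.5, 2.0, size=3 * T)
mu = rng.normal(size=3 * T)
R = W.T @ np.diag(1.0 / sigma2) @ W   # bandwidth 2 for a +/-1-frame window
r = W.T @ (mu / sigma2)

# Pack R into upper-banded storage: row (2 - k) holds the k-th superdiagonal.
ab = np.zeros((3, T))
for k in range(3):
    ab[2 - k, k:] = np.diagonal(R, offset=k)

U = cholesky_banded(ab)                 # R = U' U, U upper triangular, banded
cbar = cho_solve_banded((U, False), r)  # forward then backward substitution
logdetR = 2.0 * np.sum(np.log(U[2]))    # |R| from the diagonal of U
```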
Time-recursive computation of P(c | q, Λ) (4)
→ log P(c | q, Λ) can therefore be computed in a time-recursive manner
Viterbi algorithm with delayed decision
Ex.) Approximate Viterbi algorithm with 2-frame delayed decision (Gaussian components: 2 preceding frames, 1 succeeding frame)
- The state sequence from 1 to t−3 has been determined; to compute the likelihood at t, statistics at t+1 are required → 1-frame look-ahead
- Compute the likelihoods of all possible Gaussian sequences staying in state s at time t
- With J-frame delayed decision, determine the state at t−2 while at t, selecting the more likely path → incorporates the effect of state determination on neighbouring frames
- Coarticulation effects span roughly 100–200 ms → J = 10–20 is sufficient for a 10-ms frame shift
Overview
1. Research background
2. Reformulating the HMM as a trajectory model
3. Deriving its training algorithm
4. Discussing relationships to other techniques (EMLLT, PoE, & HMM-based speech synthesis)
5. Evaluations in both speech recognition & synthesis
6. Conclusions & future plans
Relationship to EMLLT (1)
Covariance matrix modeling in ASR:
- diagonal → unable to capture intra-frame correlation
- full → increasing model parameters
Structured inverse covariance (precision) matrix modeling:
- Semi-Tied Covariance matrices (STC)
- Extended Maximum Likelihood Linear Transformation (EMLLT)
- Subspace for Precision And Mean (SPAM) model

Model | Basis type          | Basis order
STC   | rank-1 symmetric    | equal to dimensionality
EMLLT | rank-1 symmetric    | more than dimensionality
SPAM  | full-rank symmetric | more than dimensionality
Relationship to EMLLT (2)
Inverse covariance (precision) matrix of the trajectory-HMM:
  P_q⁻¹ = R_q = Wᵀ Σ_q⁻¹ W
- The precision matrix is a sum of rank-1 symmetric matrices
- #bases (3MT) is more than the dimensionality (MT)
→ EMLLT extended to capture inter-frame correlation
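This decomposition is easy to verify: each row w_k of W contributes one rank-1 term (1/σ_k²) w_k w_kᵀ, giving 3MT basis matrices for an MT-dimensional precision. A numpy check (assumed windows as before, toy variances):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 5

# Window matrix (assumed standard windows, edges clamped, M = 1).
WINDOWS = [np.array([0.0, 1.0, 0.0]),
           np.array([-0.5, 0.0, 0.5]),
           np.array([1.0, -2.0, 1.0])]
W = np.zeros((3 * T, T))
for t in range(T):
    for d, win in enumerate(WINDOWS):
        for tau, coef in zip((-1, 0, 1), win):
            W[3 * t + d, min(max(t + tau, 0), T - 1)] += coef

sigma2 = rng.uniform(0.5, 2.0, size=3 * T)  # toy diagonal variances

# Precision matrix as a sum of 3T rank-1 symmetric matrices.
R_sum = sum((1.0 / sigma2[k]) * np.outer(W[k], W[k]) for k in range(3 * T))
R = W.T @ np.diag(1.0 / sigma2) @ W
```

Here 3T rank-1 bases span a T-dimensional precision, i.e. more bases than dimensions, as in EMLLT.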
Relationship to Product of Experts (PoE)
- EMLLT as PoE [Sim & Gales; '04]
- HMM + static & delta features as PoE [Williams; '05]
- The trajectory-HMM can also be viewed as a PoE
PoE representation of the trajectory-HMM:
- Augmented observations are modeled by Gaussian experts
- The product of Gaussians is normalized to yield a valid PDF
Relationship to HMM-based speech synthesis
Current speech synthesis paradigms:
- Unit selection and concatenation
  - High quality, but sometimes discontinuous
  - Obtaining various voice qualities requires a large amount of speech data
- Speech synthesis from HMMs themselves (HTS)
  - Buzzy, but smooth & stable
  - Voice quality can be changed (e.g., adaptation, interpolation, eigenvoice)
Speech parameter generation from HMM (1)
Synthesizing speech by maximizing its output probability:
For a given HMM Λ and Gaussian component sequence q, determine the observation vector sequence o which maximizes its output probability:
  ô = argmax_o P(o | q, Λ)
→ ô simply becomes the sequence of mean vectors μ_q
Speech parameter generation from HMM (2)
Relationship between o and c in a matrix form:
  o = W c
- W : window matrix projecting c into the augmented space o (3MT × MT)
- c = [c_1ᵀ, …, c_Tᵀ]ᵀ : static feature vector sequence (MT × 1)
Ex.) each frame contributes rows built from the window coefficients, e.g. (0, 1, 0), (−1/2, 0, 1/2), (1, −2, 1)
Speech parameter generation from HMM (3)
Synthesizing speech by maximizing its output probability:
For a given HMM Λ and Gaussian component sequence q, determine the static feature sequence c which maximizes the output probability under the constraint o = Wc:
  ĉ = argmax_c N(Wc ; μ_q, Σ_q)
Speech parameter generation from HMM (4)
By setting ∂ log N(Wc ; μ_q, Σ_q) / ∂c = 0, we obtain:
  Wᵀ Σ_q⁻¹ W ĉ = Wᵀ Σ_q⁻¹ μ_q
→ A sequence of speech parameter vectors can be determined based on statistics of both static and dynamic features.
[Figure: generated trajectory for /sil/ /a/ /i/ /sil/, showing static & delta means and variances.]
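A toy generation example (hypothetical two-state statistics, assumed windows as before): piecewise-constant static means combined with tightly weighted zero-mean dynamic constraints yield a smooth generated trajectory rather than a step.

```python
import numpy as np

T = 10

# Window matrix (assumed standard windows, edges clamped, M = 1).
WINDOWS = [np.array([0.0, 1.0, 0.0]),
           np.array([-0.5, 0.0, 0.5]),
           np.array([1.0, -2.0, 1.0])]
W = np.zeros((3 * T, T))
for t in range(T):
    for d, win in enumerate(WINDOWS):
        for tau, coef in zip((-1, 0, 1), win):
            W[3 * t + d, min(max(t + tau, 0), T - 1)] += coef

# Hypothetical statistics: static mean steps 0 -> 1 at t = 5; dynamic
# means are 0; dynamic variances are small (strong smoothness constraint).
mu = np.zeros(3 * T)
mu[0::3] = np.repeat([0.0, 1.0], T // 2)
sigma2 = np.ones(3 * T)
sigma2[1::3] = 0.1
sigma2[2::3] = 0.1

# Solve W' S^-1 W c = W' S^-1 mu for the generated static trajectory.
R = W.T @ np.diag(1.0 / sigma2) @ W
r = W.T @ (mu / sigma2)
c_hat = np.linalg.solve(R, r)
```

c_hat ramps smoothly across the state boundary instead of jumping from 0 to 1.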
Relationship between HTS and trajectory-HMM (1)
- The mean vector of the trajectory-HMM, c̄_q
- The speech parameter vector sequence which maximizes the output probability from q, ĉ
→ c̄_q and ĉ are completely the same (both solve Wᵀ Σ_q⁻¹ W x = Wᵀ Σ_q⁻¹ μ_q)
Relationship between HTS and trajectory-HMM (2)
When does P(c | q, Λ) take its maximum value? → at c = c̄_q
- Estimating the trajectory-HMM based on the ML criterion ≈ minimizing the mean square error between c and c̄_q
→ Estimating parameters by maximizing P(c | q, Λ) may improve HMM-based speech synthesis
Overview
1. Research background
2. Reformulating the HMM as a trajectory model
3. Deriving its training algorithm
4. Discussing relationships to other techniques (EMLLT, PoE, & HMM-based speech synthesis)
5. Evaluations in both speech recognition & synthesis
6. Conclusions & future plans
Speech recognition experiment
Database: ATR Japanese continuous speech database (b-set), speaker MHT
Training data: 450 utterances
Test data: 53 utterances
Feature vector: 0–8th order Mel-cepstral coefficients and their delta and delta-delta
Model structure: 3-state left-to-right monophone models with single-Gaussian state output distributions
Training procedure: after training standard HMMs (baseline), trajectory-HMMs are re-estimated using the standard HMMs as their initial models
[Figure: average log likelihood per frame vs. #iterations of Viterbi training, for delays J = 2–7.]
- The approximate Viterbi algorithm with larger J found a more likely q
- Iterative training improved the model likelihood
Examples of trajectories (data & mean)
[Figure: c(1) over time (0–1.8 s) for "sil j i b u N n o j i ts u ry o k u w a pau": one of the training data, the HMM mean sequence, the mean trajectory from the HMM, and the mean trajectory from the trajectory-HMM.]
Phoneme recognition experiment
- 100-best lists were generated using HMMs (baseline)
- Each hypothesis was re-aligned and rescored
Without reference hypothesis (baseline: 19.7%):
[Figure: phoneme error rate (%) vs. #iterations of Viterbi training, for delays J = 2–7; best 18.0% (9% error reduction).]
Phoneme recognition experiment
With reference hypothesis included (baseline: 15.9%):
[Figure: phoneme error rate (%) vs. #iterations of Viterbi training, for delays J = 2–7; best 9.0% (43% error reduction).]
Speech synthesis experiment
Spectrum: single Gaussian; F0: multi-space probability distribution
Training data: CMU ARCTIC database, speaker AWB, first 1096 utterances
Test data: remaining 42 utterances
Sampling rate: 16 kHz
Window: 25-ms Blackman window
Frame rate: 5 ms
Spectral analysis: 24th-order Mel-cepstral analysis
Dynamic features: calculated from neighbouring frames
Feature vector: c(0)–c(24), log F0, and their delta and delta-delta
Topology: 5-state left-to-right HMM with no skip
Subjective listening test
Test type: paired comparison test
Subjects: 8 graduate students
Test sentences: 20 test sentences chosen at random
[Figure: preference scores with 95% confidence intervals — Baum-Welch 42.5%, Viterbi 38.4%, trajectory-HMM 69.1%.]
Overview
1. Research background
2. Reformulating the HMM as a trajectory model
3. Deriving its training algorithm
4. Discussing relationships to other techniques (EMLLT, PoE, & HMM-based speech synthesis)
5. Evaluations in both speech recognition & synthesis
6. Conclusions & future plans
Conclusions
- Reformulated the HMM as a trajectory model
- Derived a Viterbi-type training algorithm
- Evaluations in both speech recognition and synthesis
  - Significant improvements over the HMM were achieved
Future plans
- Designing & implementing a trajectory-HMM Viterbi decoder
- Large-scale evaluation (speaker-independent, LVCSR)
- EM-type training (variational or Monte Carlo EM)