Sparse Models for Speech Recognition
Weibin Zhang and Pascale Fung
Human Language Technology Center, Hong Kong University of Science and Technology
Outline
- Introduction to speech recognition
- Motivations for sparse models
- Maximum likelihood training of sparse models
- ML training of sparse banded models
- Discriminative training of sparse models
- Conclusions
Speech Recognition & its Applications
1. Automatic Speech Recognition (ASR): converting a speech waveform into text automatically
2. Applications: office/business systems, manufacturing, telecommunications, mobile telephony, home automation, navigation
History of ASR: a Technical Point of View (figure)
ASR Research: Overview
- Statistical approaches lead in all areas.
- There is still a big gap between human and machine performance.
- However, useful systems have been built that are changing the way we interact with the world.
- "...within five years people will discard their keyboards and interact with computers using touch-screens and voice controls..." (Bill Gates, Feb 2008)
Statistical speech recognition system (figure)
Statistical speech recognition system
- Language model: P("recognize speech") >> P("wreck a nice beach")
- Dictionary: wreck -> "r eh k"; beach -> "b iy ch"
- Acoustic model: P(O | "recognize speech")
Acoustic modeling
- Left-to-right hidden Markov models (HMMs)
- GMM-HMM based acoustic models (see the sketch below):
  $p(o_t \mid s_j) = b_j(o_t) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(o_t; \mu_{jm}, \Sigma_{jm})$
  $\Theta = \{a_{ij}, b_j(o_t)\} = \{a_{ij}, c_{jm}, \mu_{jm}, \Sigma_{jm}\}$
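To make the emission density concrete, here is a minimal Python sketch of $b_j(o_t)$ for one state; the weights, means, and covariances below are hypothetical toy values, not parameters from the talk.

```python
import numpy as np
from scipy.stats import multivariate_normal

def emission_prob(o_t, weights, means, covs):
    """b_j(o_t) = sum_m c_jm * N(o_t; mu_jm, Sigma_jm)."""
    return sum(c * multivariate_normal.pdf(o_t, mean=mu, cov=cov)
               for c, mu, cov in zip(weights, means, covs))

# Toy 2-component GMM over 3-dimensional features (illustrative values only).
rng = np.random.default_rng(0)
weights = np.array([0.6, 0.4])
means = [rng.normal(size=3), rng.normal(size=3)]
covs = [np.eye(3), 2.0 * np.eye(3)]   # diagonal covariances in this toy case
print(emission_prob(rng.normal(size=3), weights, means, covs))
```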
Evaluation of ASR system
- Word error rate (WER) = 1 - accuracy, i.e., (substitutions + deletions + insertions) / number of reference words (a sketch follows)
- Real time factor (RTF): processing time divided by utterance duration
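A minimal sketch of WER as word-level edit distance over a reference/hypothesis pair; the strings are illustrative. Note that WER can exceed 100% when the hypothesis contains many insertions.

```python
def wer(ref: str, hyp: str) -> float:
    """WER = (substitutions + deletions + insertions) / #reference words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("recognize speech", "wreck a nice beach"))  # 2.0 (2 subs + 2 ins over 2 words)
```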
Covariance modeling (figure)
Covariance modeling (figure: behavior with less training data)
ML training of sparse models
- Maximum likelihood (ML) training: $\Theta^* = \arg\max_{\Theta} \log p(O \mid \Theta)$
- The proposed new objective function:
  $L(\Theta) = \log p(O \mid \Theta) - \rho \sum_{i=2}^{S-1} \sum_{m=1}^{M} \|C_{im}\|_1$
Maximizing the auxiliary function
- The precision matrices can be updated using
  $C_{im}^* = \arg\max_{C_{im} \succ 0} \{\log\det C_{im} - \operatorname{trace}(S_{im} C_{im}) - \lambda \|C_{im}\|_1\}$
  where $\lambda = 2\rho / \gamma_{im}$ and $S_{im}$ is the sample covariance matrix
- Solvable by convex optimization or other more efficient methods (e.g., the graphical lasso); a sketch follows
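A minimal sketch of the per-Gaussian precision update via the graphical lasso, which the slide names as one efficient solver. `S_im`, `gamma_im`, and `rho` are hypothetical stand-ins for the accumulated EM statistics; note that scikit-learn's `graphical_lasso` penalizes only the off-diagonal entries of the precision, a slight variation on the plain $\|C_{im}\|_1$ penalty above.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))            # toy frames assigned to one Gaussian
S_im = np.cov(X, rowvar=False)           # sample covariance S_im
gamma_im, rho = 500.0, 5.0               # occupancy count and penalty weight
lam = 2.0 * rho / gamma_im               # lambda = 2 * rho / gamma_im

# Maximizes log det C - trace(S C) - alpha * ||C||_1 (off-diagonal penalty).
cov_est, C_im = graphical_lasso(S_im, alpha=lam)
print("non-zero precision entries:", np.count_nonzero(np.abs(C_im) > 1e-8))
```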
Experiments on the WSJ data
Experimental setup:
- Training, development and testing data sets
- Standard bigram language model
- Feature vector: 39-dimensional MFCCs
- 39 phonemes for English ($39^3$ possible triphones)
- 2843 tied HMM states
Tuning results on the dev. data set (figure)
WER on the testing data set

Model type | #Gaussians | WER (%) | Rel. improv. | Significant?
Full       | 2          | 10.5    | -7.1%        | No
Diagonal   | 10         | 9.84    | ----         | ----
Sparse     | 4          | 8.77    | 10.9%        | Yes

Our result of 8.77% WER is comparable to the 8.6% WER reported in (Ko & Mak, 2011), which used a similar testing configuration but 70 hours of training data.
Sparse banded models
(figure: sparse models become sparse banded models after reordering the features)
Training of sparse banded models
- Weighted lasso: $f(C_{im}) = \|H \circ C_{im}\|_1$, where a (near-)infinite weight $H(k, l)$ forces $C_{im}(k, l) = 0$ outside the band (see the sketch below)
- (figure: spectrum from diagonal to sparse banded to full covariance)
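A minimal sketch of one way to realize the banding constraint: a weight matrix `H` with (near-)infinite weights outside a band of half-width `b`, so the weighted lasso drives those precision entries to zero. The weight value and dimensions are assumptions for illustration.

```python
import numpy as np

def band_weight_matrix(dim: int, b: int, big: float = 1e12) -> np.ndarray:
    """H(k, l) is huge when |k - l| > b, forcing C_im(k, l) to zero there."""
    idx = np.arange(dim)
    outside = np.abs(idx[:, None] - idx[None, :]) > b
    H = np.zeros((dim, dim))
    H[outside] = big
    return H

print(band_weight_matrix(dim=6, b=1).astype(bool).astype(int))
# b = 0 recovers a diagonal model; b = dim - 1 recovers a full model.
```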
Importance of the feature order
- $o \sim \mathcal{N}(\mu, \Sigma)$; $C = \Sigma^{-1}$; $C_{ij} = 0 \Leftrightarrow o_i$ and $o_j$ are conditionally independent (CI), given the other variables
- Rearrange the feature order so that $o_i$ and $o_j$ are CI if $|i - j| > b$
- Three orders are investigated (a reordering sketch follows):
  - HTK order: $m_1 \ldots m_{13}\ \Delta m_1 \ldots \Delta m_{13}\ \Delta\Delta m_1 \ldots \Delta\Delta m_{13}$
  - Knowledge-based order: $m_1\ \Delta m_1\ \Delta\Delta m_1\ \ldots\ m_{13}\ \Delta m_{13}\ \Delta\Delta m_{13}$
  - Data-driven order: $m_1\ \Delta\Delta m_1\ \Delta m_6\ \Delta m_{10}\ \ldots$
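As a concrete illustration of feature reordering, this sketch permutes a 39-dimensional vector from the HTK order to the knowledge-based order via an index permutation; the stand-in feature values are arbitrary.

```python
import numpy as np

# Index permutation: HTK order (m, dm, ddm blocks) -> knowledge-based order
# (m_i, dm_i, ddm_i interleaved): [0, 13, 26, 1, 14, 27, ...]
perm = np.arange(39).reshape(3, 13).T.ravel()

o_htk = np.arange(39, dtype=float)        # stand-in 39-dim feature vector
o_knowledge = o_htk[perm]
print(o_knowledge[:6])                    # [ 0. 13. 26.  1. 14. 27.]
```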
Results on the development data (figure)
Results on the test data

Model type | #Gaussians | WER (%) | Rel. improv. | Significant?
Full       | 2          | 10.5    | -7.1%        | No
Diagonal   | 10         | 9.84    | ----         | ----
Sparse     | 4          | 8.77    | 10.9%        | Yes
Band8      | 4          | 8.91    | 9.5%         | Yes
Decoding time
Sparse banded models are the fastest because of: 1) smaller search beamwidths; 2) fewer model parameters.
Discriminative training
- MMI objective function: $\Theta^* = \arg\max_{\Theta} \sum_r \log P(w_r \mid O_r, \Theta)$
- New objective function:
  $L(\Theta) = \sum_r \log P(w_r \mid O_r, \Theta) - \rho \sum_{i=2}^{S-1} \sum_{m=1}^{M} \|C_{im}\|_1$
- A valid weak-sense auxiliary function is
  $Q(\Theta; \Theta') = Q^{n}(\Theta; \Theta') - Q^{d}(\Theta; \Theta') + Q^{s}(\Theta; \Theta') + Q^{I}(\Theta; \Theta') - \rho \sum_{i=2}^{S-1} \sum_{m=1}^{M} \|C_{im}\|_1$
  where $Q^{n} - Q^{d}$ has the same form as in ML training, $Q^{s}$ (smoothing) ensures stability, $Q^{I}$ (I-smoothing) improves generalization, and the last term is the regularization.
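A minimal sketch of how the regularized MMI precision update could look, assuming standard Extended Baum-Welch (EBW) combined statistics; the slide only states that the numerator/denominator part matches ML training, so the accumulator names (`gamma_num`, `S_num`, `D`, ...) and the reuse of the graphical lasso here are assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def mmi_precision_update(gamma_num, gamma_den, S_num, S_den,
                         mu_new, mu_old, Sigma_old, D, rho):
    """Hypothetical EBW-style update: combine numerator/denominator
    statistics, smooth toward the old model with constant D, then solve the
    same L1-penalized problem as in ML training. D must be large enough that
    S below stays positive definite (standard EBW practice)."""
    gamma = gamma_num - gamma_den + D
    S = ((S_num - S_den + D * (Sigma_old + np.outer(mu_old, mu_old))) / gamma
         - np.outer(mu_new, mu_new))
    lam = 2.0 * rho / gamma
    _, C_new = graphical_lasso(S, alpha=lam)
    return C_new
```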
Results on the WSJ testing data

Model type     | #Gaussians | ML training WER (%) | MMI WER (%)
Full           | 2          | 11.68               | 9.18
Diagonal       | 10         | 9.84                | 9.04
Diagonal + STC | 10         | 9.26                | 8.66
Sparse         | 4          | 8.55                | 8.05
Summary
- Sparse models effectively address the problems that conventional diagonal and full covariance models face: computational cost, incorrect modeling assumptions, and over-fitting when training data is insufficient.
- We derived the overall training process under the HMM framework for both maximum likelihood and discriminative training.
- The proposed sparse models subsume the traditional diagonal and full covariance models as special cases.
Thank you!