Sparse Models for Speech Recognition

1 Sparse Models for Speech Recognition. Weibin Zhang and Pascale Fung, Human Language Technology Center, Hong Kong University of Science and Technology

2 Outline: Introduction to speech recognition; Motivations for sparse models; Maximum likelihood training of sparse models; ML training of sparse banded models; Discriminative training of sparse models; Conclusions. ECE/HKUST Weibin Zhang 2

3 Speech Recognition & its Applications. 1. Automatic Speech Recognition (ASR): converts a speech waveform into text automatically. 2. Applications: office/business systems, manufacturing, telecommunications, mobile telephony, home automation, navigation.

4 History of ASR: Technical Point of View

5 ASR Research: Overview. Statistical approaches lead in all areas. There is still a big gap between human and machine performance; however, useful systems have been built which are changing the way we interact with the world. "...within five years people will discard their keyboards and interact with computers using touch-screens and voice controls..." (Bill Gates, Feb 2008)

6 Statistical speech recognition system

7 Statistical speech recognition system. Language model: $P(\text{"recognize speech"}) \gg P(\text{"wreck a nice beach"})$. Dictionary: wreck → /r ɛ k/; beach → /b iː tʃ/. Acoustic model: $P(O \mid \text{"recognize speech"})$.
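The three components on this slide are usually combined through the Bayes decision rule, which the slide itself does not spell out. A sketch in the slide's notation, where $W$ (a candidate word sequence) and $\hat{W}$ (the recognized one) are symbols assumed here for illustration:

```latex
\hat{W} = \arg\max_{W} P(W \mid O)
        = \arg\max_{W} \frac{P(O \mid W)\, P(W)}{P(O)}
        = \arg\max_{W} \underbrace{P(O \mid W)}_{\text{acoustic model}}\;
                       \underbrace{P(W)}_{\text{language model}}
```

$P(O)$ does not depend on $W$, so it drops out of the maximization. The dictionary connects the two models by expanding each word of $W$ into the phoneme sequence that the acoustic model scores.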

8 Statistical speech recognition system

9 Acoustic modeling. Left-to-right hidden Markov models (HMMs). GMM-HMM based acoustic models: $p(o_t \mid s_j) = \sum_{m} c_{jm}\, \mathcal{N}(o_t; \mu_{jm}, \Sigma_{jm})$, with parameters $\Theta = \{a_{ij}, b_j(o_t)\} = \{a_{ij}, c_{jm}, \mu_{jm}, \Sigma_{jm}\}$.
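The state output probability above can be evaluated in the log domain. The following is a minimal sketch, assuming diagonal covariances (the conventional baseline, not the sparse models proposed in this talk); the function names and the toy parameters are hypothetical.

```python
import math

def log_gaussian_diag(o, mean, var):
    """Log density of N(o; mean, diag(var)) for a single frame o."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(o, mean, var)
    )

def log_state_likelihood(o, weights, means, variances):
    """log b_j(o_t) = log sum_m c_jm N(o_t; mu_jm, Sigma_jm), via log-sum-exp."""
    terms = [
        math.log(c) + log_gaussian_diag(o, mu, var)
        for c, mu, var in zip(weights, means, variances)
    ]
    top = max(terms)  # subtract the largest term for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Hypothetical state with two identical 2-dimensional components.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [0.0, 0.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
log_b = log_state_likelihood([0.0, 0.0], weights, means, variances)
```

Because both toy components are the same standard Gaussian, `log_b` equals the log density of a single standard 2-D Gaussian at its mean, which is a convenient sanity check.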

10 Evaluation of ASR system. Word error rate (WER) = 1 − accuracy. Real time factor (RTF).
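Both metrics can be computed in a few lines. This sketch assumes the usual definitions, which the slide does not give: WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, and RTF is processing time divided by audio duration. The function names are hypothetical and the reference is assumed to be non-empty.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,         # insertion: extra word in hypothesis
                prev[j - 1] + (r != h),  # substitution, or match at zero cost
            ))
        prev = curr
    return prev[-1] / len(ref)

def real_time_factor(processing_seconds, audio_seconds):
    """RTF below 1 means decoding runs faster than real time."""
    return processing_seconds / audio_seconds

wer = word_error_rate("recognize speech", "wreck a nice beach")
```

In this example the two reference words are both substituted and two words are inserted, so the WER is 4 / 2 = 2.0. Under this definition WER can exceed 1, in which case the corresponding accuracy is negative.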

11 Covariance modeling

12 Covariance modeling: less training data

13 ML training of sparse models. Maximum likelihood (ML) training: $\hat{\Theta} = \arg\max_{\Theta} \{\log p(o \mid \Theta)\}$. The proposed new objective function: $L(\Theta) = \log p(o \mid \Theta) - \rho \sum_{i=2}^{S-1} \sum_{m=1}^{M} \|C_{im}\|_1$.

14 Maximizing the auxiliary function. The precision matrices can be updated using $\hat{C}_{im} = \arg\max_{C_{im} \succ 0} \{\log\det C_{im} - \mathrm{trace}(S_{im} C_{im}) - \lambda \|C_{im}\|_1\}$, where $\lambda = 2\rho / \gamma_{im}$ and $S_{im}$ is the sample covariance matrix. This can be solved by convex optimization or by other more efficient methods (e.g. graphical lasso).
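To see how the $\ell_1$ term in this update favors sparse precision matrices, the objective can be evaluated for two candidate matrices. This is only an illustrative sketch, not a solver: it assumes $\|C\|_1$ sums the absolute values of all entries, including the diagonal, and the 2×2 matrices and function names are hypothetical.

```python
import math

def cholesky(a):
    """Lower-triangular L with L L^T = a; raises ValueError if a is not positive definite."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low

def penalized_objective(c, s, lam):
    """log det C - trace(S C) - lam * ||C||_1 for a symmetric positive definite C."""
    n = len(c)
    log_det = 2.0 * sum(math.log(row[i]) for i, row in enumerate(cholesky(c)))
    trace_sc = sum(s[i][k] * c[k][i] for i in range(n) for k in range(n))
    l1_norm = sum(abs(x) for row in c for x in row)
    return log_det - trace_sc - lam * l1_norm

# Hypothetical sample covariance and two candidate precision matrices.
s = [[1.0, 0.2], [0.2, 1.0]]
det = 1.0 * 1.0 - 0.2 * 0.2
dense = [[1.0 / det, -0.2 / det], [-0.2 / det, 1.0 / det]]  # exact inverse of s
sparse = [[1.0, 0.0], [0.0, 1.0]]                            # off-diagonals set to zero
```

With `lam = 0.0` the dense candidate (the unpenalized ML estimate) scores higher; with `lam = 0.5` the sparse candidate scores higher. A graphical lasso solver searches over all positive definite matrices rather than comparing two fixed candidates.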

15 Experiments on the WSJ data. Experimental setup: training, development and testing data sets; standard bigram language model; feature vector: 39-dimensional MFCC; 39 phonemes for English ($39^3$ triphones); 2843 tied HMM states.

16 Tuning results on the dev. data set

17 WER on the testing data set. Our result of 8.77% WER is comparable to the 8.6% WER reported in (Ko & Mak, 2011) using a similar testing configuration, but using 70 hours of training data. Table columns: Model type, #Gaussians, WER, Rel. improv., Significant?. Rows: Full (Significant? No), Diagonal, Sparse (Significant? Yes). [numeric entries missing]

18 Sparse banded models. Diagram: sparse models → feature reorder → sparse banded models.

19 Training of sparse banded models. Weighted lasso: $f(C_{im}) = \|H \circ C_{im}\|_1$, where $H(k,l) = \infty \Rightarrow C_{im}(k,l) = 0$. Figure labels: Diagonal, Sparse banded, Full.

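20 Importance of the feature order. $O \sim \mathcal{N}(\mu, \Sigma)$; $C = \Sigma^{-1}$; $C_{ij} = 0 \Leftrightarrow o_i$ and $o_j$ are conditionally independent (CI), given the other variables. Rearrange the feature order so that $o_i$ and $o_j$ are CI if $|i - j| > b$. Three orders are investigated. HTK order: $m_1 \dots m_{13},\ \Delta m_1 \dots \Delta m_{13},\ \Delta\Delta m_1 \dots \Delta\Delta m_{13}$. Knowledge-based order: $m_1,\ \Delta m_1,\ \Delta\Delta m_1,\ \dots,\ m_{13},\ \Delta m_{13},\ \Delta\Delta m_{13}$. Data-driven order: $m_1,\ \Delta\Delta m_1,\ \Delta m_6,\ \dots,\ \Delta m_{10}$.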
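The band constraint on this slide ($C_{ij} = 0$ when $|i - j| > b$) and the knowledge-based reordering can be sketched as follows. The helper names are hypothetical, indices are 0-based, the HTK layout is taken to be 13 static, 13 delta and 13 delta-delta coefficients as listed above, and the bandwidth `b = 3` is an arbitrary illustration rather than a value from the talk.

```python
def band_mask(dim, bandwidth):
    """mask[i][j] is True where |i - j| <= bandwidth, i.e. where C_ij may be non-zero."""
    return [[abs(i - j) <= bandwidth for j in range(dim)] for i in range(dim)]

def free_parameters(dim, bandwidth):
    """Free entries of a symmetric banded matrix: the diagonal plus each upper off-diagonal."""
    return sum(dim - k for k in range(min(bandwidth, dim - 1) + 1))

def knowledge_based_order(n_static=13):
    """Permutation from HTK order to (m_k, delta m_k, delta-delta m_k) grouped per coefficient."""
    return [block * n_static + k for k in range(n_static) for block in range(3)]

diag_count = free_parameters(39, 0)    # diagonal precision matrix
band_count = free_parameters(39, 3)    # hypothetical bandwidth b = 3
full_count = free_parameters(39, 38)   # full precision matrix
order = knowledge_based_order()        # starts 0, 13, 26, 1, 14, 27, ...
```

For 39-dimensional features this gives 39 free parameters per Gaussian for a diagonal matrix, 150 for the banded matrix with `b = 3`, and 780 for a full matrix, which illustrates why banded models sit between the two in cost.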

21 Results on the development data

22 Results on the test data. Table columns: Model type, #Gaussians, WER, Rel. improv., Significant?. Rows: Full (Significant? No), Diagonal, Sparse (Significant? Yes), Band (Significant? Yes). [numeric entries missing]

23 Decoding time. Sparse banded models are the fastest because they have: 1) smaller search beamwidths; 2) fewer model parameters.

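24 Discriminative training. MMI objective function: $\hat{\Theta} = \arg\max_{\Theta} \{\log P(w_r \mid O, \Theta)\}$. New objective function: $L(\Theta) = \log P(w_r \mid O, \Theta) - \rho \sum_{i=2}^{S-1} \sum_{m=1}^{M} \|C_{im}\|_1$. A valid weak-sense auxiliary function is $Q(\Theta; \Theta') = Q^{n}(\Theta; \Theta') - Q^{d}(\Theta; \Theta') + Q^{s}(\Theta; \Theta') + Q^{I}(\Theta; \Theta') - \rho \sum_{i=2}^{S-1} \sum_{m=1}^{M} \|C_{im}\|_1$. Annotations on the terms: $Q^{n}$ and $Q^{d}$: same as ML training; $Q^{s}$: ensure stability; $Q^{I}$: improve generalization; the last term: regularization term.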
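The MMI criterion above can be expanded with Bayes' rule, which shows where the numerator and denominator terms $Q^{n}$ and $Q^{d}$ come from. This is the standard textbook form, written here as a sketch; the summation variable $w$ (ranging over competing word sequences) is a symbol assumed for illustration, and practical systems typically also apply an acoustic scaling factor that is omitted here.

```latex
\log P(w_r \mid O, \Theta)
  = \log \frac{p(O \mid w_r, \Theta)\, P(w_r)}
              {\sum_{w} p(O \mid w, \Theta)\, P(w)}
  = \underbrace{\log \big[ p(O \mid w_r, \Theta)\, P(w_r) \big]}_{\text{numerator (reference)}}
  - \underbrace{\log \sum_{w} p(O \mid w, \Theta)\, P(w)}_{\text{denominator (all hypotheses)}}
```

Raising this quantity increases the likelihood of the reference transcription $w_r$ relative to all competing hypotheses, rather than in isolation as in ML training.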

25 Results on the WSJ testing data. Table columns: Model type, #Gaussians, ML training, MMI. Rows: Full, Diagonal, Diagonal+STC, Sparse. [numeric entries missing]

26 Summary. Sparse models are effective in dealing with the problems that conventional diagonal and full covariance models face: computation, incorrect model assumptions and over-fitting when training data is insufficient. We derive the overall training process under the HMM framework using both maximum likelihood training and discriminative training. The proposed sparse models subsume the traditional diagonal and full covariance models as special cases.

27 Thank you!
