Chapter 3: Basics of Language Modeling

Section 3.1. Language Modeling in Automatic Speech Recognition (ASR)
All graphs in this section are from the book by Schukat-Talamazzini unless indicated otherwise: http://www.minet.uni-jena.de/fakultaet/schukat/mypub/schukattalamazzini95:asg.pdf

Complexity of Speech Recognition Applications (figure from Schukat-Talamazzini: Spracherkennung)

What makes Speech Recognition difficult (figure from Schukat-Talamazzini: Spracherkennung)

Milestones in ASR (figure from Schukat-Talamazzini: Spracherkennung)

Introduction to ASR (source: Steve Young, The HTK Book)

Basic Components of an ASR System
Speech input → Feature Extraction → stream of feature vectors A → Search → recognised word sequence Ŵ
The search combines the acoustic model P(A | W) and the language model P(W):
Ŵ = argmax_W P(A | W) · P(W)
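
The search equation can be sketched in a few lines of Python; `candidates`, `acoustic_logprob`, and `lm_logprob` are hypothetical placeholders for illustration only (a real recogniser searches a lattice of hypotheses rather than scoring a flat list):

```python
def decode(candidates, acoustic_logprob, lm_logprob):
    """Pick the word sequence W maximizing P(A|W) * P(W), i.e. the sum of
    log-probabilities; acoustic_logprob(W) and lm_logprob(W) stand in for
    the acoustic model and the language model respectively (assumed interface)."""
    return max(candidates, key=lambda W: acoustic_logprob(W) + lm_logprob(W))
```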

The Markov Assumption
Cutting the history to M-1 words: P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-M+1}, ..., w_{i-1})
Special cases:
Trigram: P(w_i | w_1, ..., w_{i-1}) ≈ P(w_i | w_{i-2}, w_{i-1})
Bigram: P(w_i | w_1, ..., w_{i-1}) ≈ P(w_i | w_{i-1})
Unigram: P(w_i | w_1, ..., w_{i-1}) ≈ P(w_i)
Zerogram (uniform distribution): P(w_i | w_1, ..., w_{i-1}) ≈ 1/|W|
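
As a minimal sketch of the bigram special case (not from the slides), the conditional probabilities can be estimated from counts, P(w_i | w_{i-1}) = N(w_{i-1}, w_i) / N(w_{i-1}); no smoothing is applied, so unseen bigrams get probability zero:

```python
from collections import Counter

def train_bigram(tokens):
    """Maximum-likelihood bigram model estimated from a token list."""
    history_counts = Counter(tokens[:-1])             # N(w_{i-1})
    bigram_counts = Counter(zip(tokens, tokens[1:]))  # N(w_{i-1}, w_i)
    return {(h, w): c / history_counts[h] for (h, w), c in bigram_counts.items()}

# Toy corpus, also used in the "Alternate Formulation" example below
model = train_bigram("to be or not to be".split())
print(model[("to", "be")])  # 1.0 -- in this corpus "to" is always followed by "be"
```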

Section 3.2. A Measure for LM Quality: Perplexity

Motivation
Speech recognition can be computationally expensive.
For research and development we need a simple evaluation metric that allows fast turn-around times in experiments.

Test your Language Model: What's in your hometown newspaper ???

Test your Language Model: What's in your hometown newspaper today

Test your Language Model: It's raining cats and ???

Test your Language Model: It's raining cats and dogs

Test your Language Model: President Bill ???

Test your Language Model: President Bill Gates

Definition of Perplexity
Let w_1, w_2, ..., w_N be an independent test corpus not used during training.
Perplexity: PP = P(w_1, w_2, ..., w_N)^{-1/N}
(the normalized probability of the test corpus)
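
A minimal sketch of this definition, evaluated in log space to avoid numerical underflow; `cond_prob(history, word)` is an assumed interface returning P(w_i | h_i) for whatever language model is being evaluated:

```python
import math

def perplexity(test_tokens, cond_prob):
    """PP = P(w_1 ... w_N)^(-1/N), computed as exp(-1/N * sum log P(w_i | h_i))."""
    N = len(test_tokens)
    log_prob = sum(
        math.log(cond_prob(tuple(test_tokens[:i]), w))  # h_i = all preceding words
        for i, w in enumerate(test_tokens)
    )
    return math.exp(-log_prob / N)
```

With a zerogram model, `cond_prob = lambda h, w: 1.0 / V`, this returns exactly V, matching the interpretation on the next slide.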

Example: Zerogram Language Model
Zerogram (uniform distribution): P(w_i | w_1, ..., w_{i-1}) = 1/|W|
PP = P(w_1, w_2, ..., w_N)^{-1/N} = ( ∏_{i=1}^{N} 1/|W| )^{-1/N} = |W|
Interpretation: perplexity is the average de-facto size of the vocabulary.

Alternate Formulation
Assumption: use an M-gram language model with history h_i = w_{i-M+1} ... w_{i-1}
Idea: collapse all identical histories

Alternate Formulation
Example: M = 2, corpus: "to be or not to be"
h_2 = "to", w_2 = "be" and h_6 = "to", w_6 = "be"
→ calculate P(be | to) only once and scale it by 2

Alternate Formulation
PP = P(w_1, w_2, ..., w_N)^{-1/N}
   = ( ∏_{i=1}^{N} P(w_i | h_i) )^{-1/N}          (use Bayes decomposition)
   = exp( -1/N ∑_{i=1}^{N} log P(w_i | h_i) )      (get rid of the product)

Alternate Formulation
exp( -1/N ∑_{i=1}^{N} log P(w_i | h_i) )
   = exp( -1/N ∑_{w,h} N(w,h) log P(w | h) )
   = exp( -∑_{w,h} f(w,h) log P(w | h) )
N(w,h): absolute frequency of the sequence h,w in the test corpus
f(w,h) = N(w,h)/N: relative frequency of the sequence h,w in the test corpus

Alternate Formulation: Final Result
PP = P(w_1 ... w_N)^{-1/N} = exp( -∑_{w,h} f(w,h) log P(w | h) )
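
A sketch of this collapsed-histories formulation under the same assumed `cond_prob` interface as above: each distinct (history, word) pair is scored once and weighted by its relative frequency f(w,h), so in the "to be or not to be" example P(be | to) is looked up once with weight 2/6:

```python
import math
from collections import Counter

def perplexity_collapsed(test_tokens, cond_prob, M=2):
    """PP = exp( -sum_{w,h} f(w,h) * log P(w|h) ) for an M-gram model."""
    N = len(test_tokens)
    # N(w,h): absolute frequency of each (history, word) event in the test corpus
    event_counts = Counter(
        (tuple(test_tokens[max(0, i - M + 1):i]), w)
        for i, w in enumerate(test_tokens)
    )
    return math.exp(
        -sum(n / N * math.log(cond_prob(h, w)) for (h, w), n in event_counts.items())
    )
```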

Alternate Formulation: Zerogram Example
Zerogram (uniform distribution): P(w_i | w_1, ..., w_{i-1}) = 1/|W|
PP = exp( -∑_{w,h} f(w,h) log P(w | h) )
   = exp( -∑_{w,h} f(w,h) log(1/|W|) )
   = exp( log|W| · ∑_{w,h} f(w,h) )
   = exp( log|W| · 1 ) = |W|
Identical result.

Perplexity and Error Rate
Perplexity is correlated with the word error rate (power-law relation).
(Figure from Klakow and Peters, Speech Communication)

Quality Measure: Mean Rank
Definition: Let w follow h in the test text.
Sort all words after a given history h according to P(w_i | h).
Determine the position (rank) of the correct word w.
Average over all events in the test text.
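
A minimal sketch of the mean rank measure (same assumed `cond_prob` interface; `vocab` is the word list to sort, and ranking the full vocabulary for every event makes this O(|V| log |V|) per position):

```python
def mean_rank(test_tokens, cond_prob, vocab, M=2):
    """Average rank of the correct word when the vocabulary is sorted by
    P(w | h) after each history in the test text (rank 1 = best)."""
    ranks = []
    for i, w in enumerate(test_tokens):
        h = tuple(test_tokens[max(0, i - M + 1):i])
        ordered = sorted(vocab, key=lambda v: cond_prob(h, v), reverse=True)
        ranks.append(ordered.index(w) + 1)
    return sum(ranks) / len(ranks)
```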

Alternative: Average Rank (figure from Witten, Bell: Text Compression)

Quality Measure: Mean Rank

Measure        Perplexity   Mean Rank
Correlation    0.955        0.957

Quality Measure: Worst Score
Which fraction of the test text has a score worse than a cutoff S_c?
Correlation: 0.919
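
The slide only states the idea; the sketch below assumes that "score" means the conditional probability P(w_i | h_i) and that "worse than S_c" means falling below that cutoff (both are assumptions, using the same hypothetical `cond_prob` interface as before):

```python
def worst_score_fraction(test_tokens, cond_prob, cutoff, M=2):
    """Fraction of test events whose score P(w_i | h_i) is worse than the cutoff S_c."""
    scores = [
        cond_prob(tuple(test_tokens[max(0, i - M + 1):i]), w)
        for i, w in enumerate(test_tokens)
    ]
    return sum(s < cutoff for s in scores) / len(scores)
```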

Summary
Speech recognition: the basic task and the language model as one of its components
Perplexity: measures the quality of a language model