Chapter 3: Basics of Language Modelling


Motivation
Language models are used in:
- Speech Recognition
- Machine Translation
- Natural Language Generation
- Query completion
For research and development: we need a simple evaluation metric for fast turnaround times in experiments.

Test your Language Model: What's in your hometown newspaper ???

Test your Language Model: What's in your hometown newspaper today

Test your Language Model: It's raining cats and ???

Test your Language Model: It's raining cats and dogs

Test your Language Model: President Bill ???

Test your Language Model: President Bill Gates

Definition: Language Model
A language model either:
- predicts the next word w given a sequence of predecessor words h (the history): P(w_i | h_i) = P(w_i | w_1, ..., w_{i-1}), where i is the position in the text, or
- predicts the probability of a whole sequence of words: P(w_1, w_2, ..., w_N) = P(W_1^N).
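
To make the two readings concrete, here is a minimal sketch (not from the slides) tying them together via the chain rule; cond_prob is a made-up placeholder model that simply returns a uniform probability.

```python
VOCAB = ["to", "be", "or", "not"]

def cond_prob(w, h):
    """Hypothetical next-word model P(w | h); uniform over VOCAB for simplicity."""
    return 1.0 / len(VOCAB)

def sequence_prob(words):
    """P(w_1, ..., w_N) via the chain rule: product over P(w_i | w_1 ... w_{i-1})."""
    p = 1.0
    for i, w in enumerate(words):
        p *= cond_prob(w, tuple(words[:i]))  # history = all preceding words
    return p

print(sequence_prob("to be or not to be".split()))  # (1/4)**6, about 0.000244
```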

Language Model in Speech Recognition (ASR)
Speech input → Feature Extraction → stream of feature vectors A → Search: Ŵ = argmax_W [P(A|W) · P(W)], combining the Acoustic Model P(A|W) and the Language Model P(W) → recognised word sequence Ŵ.

Language Model in Machine Translation (MT)
Source sentence W_F → Search: Ŵ_E = argmax_{W_E} [P(W_F|W_E) · P(W_E)], combining the Lexicon (translation) Model P(W_F|W_E) and the Language Model P(W_E) → translated word sequence Ŵ_E.
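
Both architectures share the same decision rule. The following sketch, with entirely invented candidate hypotheses and probabilities, shows how a search step might rescore hypotheses by combining the channel model and the language model in log space.

```python
import math

# Toy illustration of W^ = argmax_W [P(A|W) * P(W)]: each candidate carries a
# channel-model score (acoustic or translation model) and a language-model score.
candidates = {
    "president bill clinton": (1e-4, 1e-6),  # (channel prob, LM prob), made up
    "president bill gates":   (2e-4, 1e-8),
}

def score(channel_prob, lm_prob):
    # Combine the two knowledge sources in log space.
    return math.log(channel_prob) + math.log(lm_prob)

best = max(candidates, key=lambda w: score(*candidates[w]))
print(best)  # "president bill clinton": here the LM outweighs the channel model
```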

Sentence Completion
Directly uses P(w | h); needs good search algorithms.
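
As a rough illustration of direct use of P(w | h), here is a greedy completion sketch over an invented bigram table; a real system would use a proper search (e.g. beam search) rather than this greedy argmax.

```python
# Invented bigram table P(next word | previous word), for illustration only.
bigram = {
    "raining": {"cats": 0.4, "hard": 0.3, "again": 0.3},
    "cats":    {"and": 0.8, ".": 0.2},
    "and":     {"dogs": 0.5, "cats": 0.1, ".": 0.4},
}

def complete(history, steps=3):
    words = history.split()
    for _ in range(steps):
        nxt = bigram.get(words[-1])
        if not nxt:
            break
        words.append(max(nxt, key=nxt.get))  # greedy argmax_w P(w | h)
    return " ".join(words)

print(complete("it is raining"))  # it is raining cats and dogs
```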

The Markov Assumption: truncate the history
Cutting the history to M-1 words: P(w_i | w_1, w_2, ..., w_{i-1}) ≈ P(w_i | w_{i-M+1}, ..., w_{i-1})
Special cases:
- Zerogram (uniform distribution): P(w_i | w_1, w_2, ..., w_{i-1}) = 1/|W|
- Unigram: P(w_i | w_1, w_2, ..., w_{i-1}) = P(w_i)
- Bigram: P(w_i | w_1, w_2, ..., w_{i-1}) = P(w_i | w_{i-1})
- Trigram: P(w_i | w_1, w_2, ..., w_{i-1}) = P(w_i | w_{i-2}, w_{i-1})
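
For the bigram case, a minimal maximum-likelihood sketch could look like this; the toy corpus, whitespace tokenisation, and the absence of any smoothing are assumptions of the sketch, not part of the slides.

```python
from collections import Counter

# Maximum-likelihood bigram (M = 2): P(w_i | w_{i-1}) = N(w_{i-1}, w_i) / N(w_{i-1}).
corpus = "to be or not to be that is the question".split()

history_counts = Counter(corpus[:-1])                  # N(w_{i-1})
bigram_counts = Counter(zip(corpus[:-1], corpus[1:]))  # N(w_{i-1}, w_i)

def bigram_prob(w, h):
    return bigram_counts[(h, w)] / history_counts[h]

print(bigram_prob("be", "to"))  # 1.0: in this corpus "to" is always followed by "be"
print(bigram_prob("or", "be"))  # 0.5
```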

How would you measure the quality of an LM?

Definition of Perplexity
Let w_1, w_2, ..., w_N be an independent test corpus not used during training.
Perplexity: PP = P(w_1, w_2, ..., w_N)^(-1/N)
Interpretation: length-normalised (inverse) probability of the test corpus.

Example: Zerogram Language Model
Zerogram (uniform distribution): P(w_i | w_1, w_2, ..., w_{i-1}) = 1/|W|
PP = P(w_1, w_2, ..., w_N)^(-1/N) = (prod_{i=1}^{N} 1/|W|)^(-1/N) = |W|
Interpretation: perplexity is the average de-facto size of the vocabulary.
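
The zerogram result is easy to verify numerically. A small sketch, in which the test corpus and the vocabulary size are arbitrary placeholders:

```python
import math

# Perplexity PP = P(w_1, ..., w_N)^(-1/N), computed in log space for numerical
# stability. Under the zerogram P(w|h) = 1/|W| it should come out as |W|.
test_corpus = "the cat sat on the mat".split()
VOCAB_SIZE = 10  # assumed vocabulary size |W|, made up for this check

def zerogram_prob(w, h):
    return 1.0 / VOCAB_SIZE

def perplexity(words, cond_prob):
    log_p = sum(math.log(cond_prob(w, tuple(words[:i]))) for i, w in enumerate(words))
    return math.exp(-log_p / len(words))

print(perplexity(test_corpus, zerogram_prob))  # 10.0 (= |W|), up to floating point
```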

Alternate Formulation
Assumption: use an M-gram language model with history h_i = w_{i-M+1} ... w_{i-1}
Idea: collapse all identical histories.

Alternate Formulation
Example: M = 2, corpus: "to be or not to be"
h_2 = "to", w_2 = "be" and h_6 = "to", w_6 = "be" → calculate P(be | to) only once and scale it by 2.
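
One way to picture the collapsing step, as a sketch using the same toy corpus as the slide:

```python
from collections import Counter

# Collapse identical (history, word) events for M = 2: each distinct event
# (h, w) is evaluated once and weighted by its count N(w, h).
corpus = "to be or not to be".split()
events = list(zip(corpus[:-1], corpus[1:]))  # (h, w) pairs for positions 2..N
counts = Counter(events)

print(counts[("to", "be")])  # 2: P(be | to) is needed only once, scaled by 2
print(counts)
```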

Alternate Formulation
PP = P(w_1, w_2, ..., w_N)^(-1/N)
   = (prod_{i=1}^{N} P(w_i | h_i))^(-1/N)               (chain-rule / Bayes decomposition)
   = exp( -(1/N) · sum_{i=1}^{N} log P(w_i | h_i) )     (rewrite to get rid of the product)

Alternate Formulation
exp( -(1/N) · sum_{i=1}^{N} log P(w_i | h_i) )
 = exp( -(1/N) · sum_{w,h} N(w,h) · log P(w | h) )   with N(w,h): absolute frequency of the sequence h,w in the test corpus
 = exp( -sum_{w,h} f(w,h) · log P(w | h) )           with f(w,h) = N(w,h)/N: relative frequency of the sequence h,w in the test corpus

Alternate Formulation: Final Result
PP = P(w_1 ... w_N)^(-1/N) = exp( -sum_{w,h} f(w,h) · log P(w | h) )
Perplexity can thus be calculated from conditional probabilities estimated on the training corpus and relative frequencies taken from the test corpus.
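
A quick numerical check that the two formulations agree; the bigram probabilities and the test corpus below are placeholders invented for the sketch, and the first word (which has no bigram history) is ignored for simplicity.

```python
import math
from collections import Counter

# P is keyed by (h, w): P[("to", "be")] stands for P(be | to). Values are made up.
P = {("to", "be"): 0.9, ("be", "or"): 0.2, ("or", "not"): 0.5, ("not", "to"): 0.4}
test = "to be or not to be".split()
events = list(zip(test[:-1], test[1:]))  # (h, w) events
N = len(events)

# Direct form: product over positions, PP = (prod_i P(w_i|h_i))^(-1/N).
pp_direct = math.exp(-sum(math.log(P[e]) for e in events) / N)

# Alternate form: relative frequencies f(w,h) = N(w,h)/N over distinct events.
f = {e: c / N for e, c in Counter(events).items()}
pp_alt = math.exp(-sum(f[e] * math.log(P[e]) for e in f))

print(pp_direct, pp_alt)  # identical values
```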

Alternate Formulation: Zerogram Example
Zerogram (uniform distribution): P(w_i | w_1, w_2, ..., w_{i-1}) = 1/|W|
PP = exp( -sum_{w,h} f(w,h) · log P(w | h) ) = exp( -sum_{w,h} f(w,h) · log(1/|W|) ) = exp( log |W| · sum_{w,h} f(w,h) ) = exp( log |W| · 1 ) = |W|
Identical result.

Perplexity and Error Rate
Perplexity is correlated with word error rate; the relation follows a power law (Klakow and Peters, Speech Communication).

Quality Measure: Mean Rank
Definition: let w follow h in the test text.
- Sort all words after a given history h according to P(w | h)
- Determine the position (rank) of the correct word w
- Average over all events in the test text
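
A compact sketch of the mean-rank computation, over invented bigram probabilities and test events:

```python
# For each test event (h, w), sort the vocabulary by P(. | h) in descending
# order and record the 1-based position of the correct word w.
P = {
    "to": {"be": 0.7, "go": 0.2, "the": 0.1},  # made-up P(w | h) values
    "be": {"or": 0.5, "the": 0.3, "go": 0.2},
}
test_events = [("to", "be"), ("be", "the"), ("to", "the")]

def rank(h, w):
    ordered = sorted(P[h], key=P[h].get, reverse=True)
    return ordered.index(w) + 1  # position of the correct word

mean_rank = sum(rank(h, w) for h, w in test_events) / len(test_events)
print(mean_rank)  # (1 + 2 + 3) / 3 = 2.0
```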

Alternative: Average Rank
Task: guess the next word.
Metric: measure how many attempts it takes.
Source: Witten and Bell, Text Compression

Quality Measure: Mean Rank
Correlation with ASR error rate:
  Perplexity: 0.955
  Mean Rank:  0.957
Mean rank is equally good at predicting ASR performance.

Summary
- Language models predict the next word
- Applications: Speech Recognition, Machine Translation, Natural Language Generation, Information Retrieval, Sentence Completion
- Perplexity measures the quality of a language model