N-grams

Now:
- Simple (unsmoothed) N-grams
- Smoothing
  - Add-one smoothing
  - Backoff
  - Deleted interpolation

Reading:
- Jurafsky & Martin Ch. 6, up to and including 6.6
Word-prediction Applications

Augmentative communication systems
- Helping disabled people communicate
- Spelling is too slow
- Menus are limited

Context-sensitive spelling error correction

Speech recognition language modelling

For example:
- They are leaving in about fifteen minuets to go to her house.
- The study was conducted mainly be John Black.
- The design an construction of the system will take more than a year.
- Hopefully, all with continue smoothly in my absence.
- Can they lave him my messages?
- I need to notified the bank of [this problem.]
- He is trying to fine out.
Simple N-grams

If the vocabulary has V words, every word has an equal probability of occurring, and w_i has an equal probability of following any other word w_j:
- What is P(w_i)?
  - Answer: 1/V
- What is P(w_j | w_i)?
  - Answer: 1/V
- What is P(w_i w_j), i.e. w_i followed by w_j?
  - Answer: P(w_i) P(w_j | w_i)
  - Answer: 1/V^2

But this is far too simple.

Use a training corpus for counting, and go further than P(w_i)
- e.g. P(the) ≈ 0.07
P(word sequence)

P(w_1 w_2 ... w_{n-1} w_n) = P(w_1^n)

P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1^2) ... P(w_n | w_1^{n-1})

P(w_1^n) = ∏_{k=1}^{n} P(w_k | w_1^{k-1})

Unigram
- Assume P(w_n | w_1^{n-1}) = P(w_n)
- So P(w_1^n) = ∏_{k=1}^{n} P(w_k)

Bigram
- Assume P(w_n | w_1^{n-1}) = P(w_n | w_{n-1})
- So P(w_1^n) = ∏_{k=1}^{n} P(w_k | w_{k-1})

Trigram
- Assume P(w_n | w_1^{n-1}) = P(w_n | w_{n-2}^{n-1})
- So P(w_1^n) = ∏_{k=1}^{n} P(w_k | w_{k-2}^{k-1})

N-gram
- Assume P(w_n | w_1^{n-1}) = P(w_n | w_{n-N+1}^{n-1})
- So P(w_1^n) = ∏_{k=1}^{n} P(w_k | w_{k-N+1}^{k-1})
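To make the bigram assumption concrete, here is a minimal Python sketch of scoring a sentence as a product of conditional probabilities. The function and variable names are illustrative, not from the slides, and no smoothing is applied yet.

```python
# Minimal sketch: P(w_1 ... w_n) approximated by the product of P(w_k | w_{k-1}).
# `bigram_prob` maps (previous_word, word) -> P(word | previous_word).

def bigram_sentence_prob(words, bigram_prob, start="<s>"):
    """Score a sentence under the bigram assumption."""
    prob = 1.0
    prev = start
    for w in words:
        prob *= bigram_prob.get((prev, w), 0.0)  # an unseen bigram gives 0 (no smoothing yet)
        prev = w
    return prob
```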
Bigram Example

Berkeley Restaurant Project
- Speech-based restaurant consultant
- Limited domain

Examples
- I'm looking for Chinese food.
- Is Cafe Venezia open for lunch?

Bigram probabilities for the word eat:

eat on       .16    eat Thai      .03
eat some     .06    eat breakfast .03
eat lunch    .06    eat in        .02
eat dinner   .05    eat Chinese   .02
eat at       .04    eat Mexican   .02
eat a        .04    eat tomorrow  .01
eat Indian   .04    eat dessert   .007
eat today    .03    eat British   .001
P(I want to eat British food)

<s> I     .25    I want   .32    want to   .65    to eat   .26    British food       .60
<s> I'd   .06    I would  .29    want a    .05    to have  .14    British restaurant .15
<s> Tell  .04    I don't  .08    want some .04    to spend .09    British cuisine    .01
<s> I'm   .02    I have   .04    want thai .01    to be    .02    British lunch      .01

<s> means "start of sentence"

P(British | eat) = 0.001

P(I want to eat British food)
  = P(I | <s>) P(want | I) P(to | want) P(eat | to) P(British | eat) P(food | British)
  = 0.25 × 0.32 × 0.65 × 0.26 × 0.001 × 0.60
  = 0.0000081
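The worked example can be reproduced directly in code. This is a sketch: only the six bigram probabilities used above are filled in.

```python
# Reproducing the worked example with the bigram probabilities quoted above.
bigram_prob = {
    ("<s>", "I"): 0.25, ("I", "want"): 0.32, ("want", "to"): 0.65,
    ("to", "eat"): 0.26, ("eat", "British"): 0.001, ("British", "food"): 0.60,
}

p = 1.0
prev = "<s>"
for w in "I want to eat British food".split():
    p *= bigram_prob[(prev, w)]
    prev = w

print(p)  # ~0.0000081
```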
Training

Count

Normalize
- So that probabilities lie between 0 and 1

Bigrams:

P(w_n | w_{n-1}) = C(w_{n-1} w_n) / Σ_w C(w_{n-1} w) = C(w_{n-1} w_n) / C(w_{n-1})
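A minimal sketch of the count-and-normalise step, assuming sentences arrive as whitespace-separated strings. The function name and the toy corpus are illustrative, not the Berkeley Restaurant Project data.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count bigrams, then normalise by the count of the first word:
    P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})."""
    unigram = Counter()
    bigram = Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split()
        unigram.update(words)
        bigram.update(zip(words, words[1:]))
    return {(prev, w): c / unigram[prev] for (prev, w), c in bigram.items()}

# Illustrative toy corpus.
probs = train_bigrams(["I want to eat Chinese food", "I want lunch"])
print(probs[("I", "want")])  # 1.0 in this tiny corpus
```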
Training Bigrams

Berkeley RP, V = 1616

Bigram counts for seven words:

          I     want   to    eat   Chinese  food   lunch
I         8     1087   0     13    0        0      0
want      3     0      786   0     6        8      6
to        3     0      10    860   3        0      12
eat       0     0      2     0     19       2      52
Chinese   2     0      0     0     0        120    1
food      19    0      17    0     0        0      0
lunch     4     0      0     0     0        1      0

Unigram counts:

I 3437   want 1215   to 3256   eat 938   Chinese 213   food 1506   lunch 459
Bigram probabilities:

          I       want    to      eat    Chinese  food    lunch
I         .0023   .32     0       .0038  0        0       0
want      .0025   0       .65     0      .0049    .0066   .0049
to        .00092  0       .0031   .26    .00092   0       .0037
eat       0       0       .0021   0      .020     .0021   .055
Chinese   .0094   0       0       0      0        .56     .0047
food      .013    0       .011    0      0        0       0
lunch     .0087   0       0       0      0        .0022   0
Training & Testing

Choose a corpus for training
- Is it too specific to the task?
- Is it too general?

Divide it into training and testing sets

Don't choose the test set from the training set

Use the test set to evaluate architectures

Cross-validation is often used (see the sketch below)
- Choose a portion of the corpus, say 9/10, for training
- Leave the remainder (1/10) as testing data
- Evaluate
- Now choose a different training set, with the old testing set becoming part of the new training set
- Re-evaluate and repeat until every portion has been used as test data
- Take the averages as an indicator of performance
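A rough sketch of the cross-validation loop. Here train_and_eval is a placeholder for whatever training-plus-evaluation routine is being assessed; it is not defined in the slides.

```python
def cross_validate(sentences, train_and_eval, k=10):
    """Rotate which 1/k of the corpus is held out, train on the rest,
    and average the evaluation scores over all k folds."""
    fold_size = len(sentences) // k
    scores = []
    for i in range(k):
        test = sentences[i * fold_size:(i + 1) * fold_size]
        train = sentences[:i * fold_size] + sentences[(i + 1) * fold_size:]
        scores.append(train_and_eval(train, test))
    return sum(scores) / len(scores)
```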
Add-one Smoothing

We don't want zero-probability N-grams.

For unigram probabilities:

  P(w_n) = C(w_n) / Σ_w C(w) = C(w_n) / N

Add-one smoothing:

  P(w_n) = (C(w_n) + 1) / Σ_w (C(w) + 1) = (C(w_n) + 1) / (N + V)

For bigram probabilities:

  P(w_n | w_{n-1}) = C(w_{n-1} w_n) / Σ_w C(w_{n-1} w)

Add-one smoothing:

  P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)
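A minimal sketch of the smoothed bigram estimate, assuming the bigram and unigram counts have already been collected; the function and argument names are illustrative.

```python
def add_one_bigram_prob(w_prev, w, bigram_counts, unigram_counts, V):
    """Add-one smoothed bigram probability:
    P(w | w_prev) = (C(w_prev w) + 1) / (C(w_prev) + V)."""
    return (bigram_counts.get((w_prev, w), 0) + 1) / (unigram_counts.get(w_prev, 0) + V)

# With the Berkeley RP counts: (786 + 1) / (1215 + 1616) ≈ 0.28,
# matching the smoothed probability table below.
print(add_one_bigram_prob("want", "to", {("want", "to"): 786}, {"want": 1215}, 1616))
```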
Smoothed Bigrams

Berkeley RP, V = 1616

Smoothed bigram counts for seven words:

          I     want   to    eat   Chinese  food   lunch
I         9     1088   1     14    1        1      1
want      4     1      787   1     7        9      7
to        4     1      11    861   4        1      13
eat       1     1      3     1     20       3      53
Chinese   3     1      1     1     1        121    2
food      20    1      18    1     1        1      1
lunch     5     1      1     1     1        2      1

Unigram counts (with V added):

I 5053   want 2831   to 4872   eat 2554   Chinese 1829   food 3122   lunch 2075
Smoothed bigram probabilities:

          I        want     to       eat      Chinese  food     lunch
I         .0018    .22      .00020   .0028    .00020   .00020   .00020
want      .0014    .00035   .28      .00035   .0025    .0032    .0025
to        .00082   .00021   .0023    .18      .00082   .00021   .0027
eat       .00039   .00039   .0012    .00039   .0078    .0012    .021
Chinese   .0016    .00055   .00055   .00055   .00055   .066     .0011
food      .0064    .00032   .0058    .00032   .00032   .00032   .00032
lunch     .0024    .00048   .00048   .00048   .00048   .00096   .00048

Unsmoothed bigram probabilities:

          I        want     to       eat      Chinese  food     lunch
I         .0023    .32      0        .0038    0        0        0
want      .0025    0        .65      0        .0049    .0066    .0049
to        .00092   0        .0031    .26      .00092   0        .0037
eat       0        0        .0021    0        .020     .0021    .055
Chinese   .0094    0        0        0        0        .56      .0047
food      .013     0        .011     0        0        0        0
lunch     .0087    0        0        0        0        .0022    0
Smoothed Counts

Counts become adjusted by smoothing.

Some "weight" is given to zero counts.

This comes from reducing the weight given to non-zero counts.

The adjusted count comes from adding one to the count and multiplying by a normalisation factor, N / (N + V):

  c_i* = (c_i + 1) × N / (N + V)

We define the discount as:

  d_i = c_i* / c_i
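These two formulas are direct to compute. The sketch below checks them against the Berkeley RP numbers for the bigram "I want" (C(I want) = 1087, C(I) = 3437, V = 1616); the function names are illustrative.

```python
def adjusted_count(c, N, V):
    """c* = (c + 1) * N / (N + V): the count implied by add-one smoothing."""
    return (c + 1) * N / (N + V)

def discount(c, N, V):
    """d = c* / c, defined for non-zero original counts."""
    return adjusted_count(c, N, V) / c

print(adjusted_count(1087, 3437, 1616))  # ~740, as in the smoothed count table below
print(discount(1087, 3437, 1616))        # ~0.68, as in the discount list below
```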
Smoothed bigram counts:

          I     want   to    eat   Chinese  food   lunch
I         6     740    .68   10    .68      .68    .68
want      2     .42    331   .42   3        4      3
to        3     .69    8     594   3        .69    9
eat       .37   .37    1     .37   7.4      1      20
Chinese   .36   .12    .12   .12   .12      15     .24
food      10    .48    9     .48   .48      .48    .48
lunch     1.1   .22    .22   .22   .22      .44    .22

Unsmoothed bigram counts:

          I     want   to    eat   Chinese  food   lunch
I         8     1087   0     13    0        0      0
want      3     0      786   0     6        8      6
to        3     0      10    860   3        0      12
eat       0     0      2     0     19       2      52
Chinese   2     0      0     0     0        120    1
food      19    0      17    0     0        0      0
lunch     4     0      0     0     0        1      0

Discounts:

I 0.68   want 0.42   to 0.69   eat 0.37   Chinese 0.12   food 0.48   lunch 0.22
Problem

Too much or too little weight can be given to zero counts.

In general, add-one smoothing is a poor smoothing method.

Witten-Bell discounting is a relatively simple but more "sensible" approach to smoothing. It assigns a more appropriate weighting to zero counts.

Beyond the scope of this module, but details are in Jurafsky & Martin.
Backoff

If we have no examples of a particular trigram w_{n-2} w_{n-1} w_n with which to calculate P(w_n | w_{n-2} w_{n-1}),
then "back off" and simply use the bigram probability P(w_n | w_{n-1}).

What if there are no examples of the bigram w_{n-1} w_n? Just use P(w_n)!

But remember: the probabilities for a given context must sum to one, Σ_{w_n} P(w_n | w_{n-2} w_{n-1}) = 1.

So we must use some discounting to adjust the probabilities of the lower-order models when we back off to them.

P(w_n | w_{n-2} w_{n-1}) =
  P~(w_n | w_{n-2} w_{n-1}),              if C(w_{n-2} w_{n-1} w_n) > 0
  α(w_{n-2}^{n-1}) P~(w_n | w_{n-1}),     else if C(w_{n-1} w_n) > 0
  α(w_{n-1}) P~(w_n),                     otherwise
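A structural sketch of this backoff scheme. Computing the discounted probabilities P~ and the α weights properly (as in Katz backoff) is not shown here; the dictionaries and names are illustrative placeholders.

```python
def backoff_prob(w2, w1, w, trigram_p, bigram_p, unigram_p, alpha):
    """Back off from trigram to bigram to unigram, weighting each
    lower-order estimate by a discounting factor alpha(context)."""
    if (w2, w1, w) in trigram_p:                          # discounted trigram estimate
        return trigram_p[(w2, w1, w)]
    if (w1, w) in bigram_p:                               # back off to the bigram
        return alpha.get((w2, w1), 1.0) * bigram_p[(w1, w)]
    return alpha.get((w1,), 1.0) * unigram_p.get(w, 0.0)  # back off to the unigram
```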
Deleted Interpolation

Rather than backing off, use a linear combination of trigram, bigram, and unigram probabilities:

  P^(w_n | w_{n-2} w_{n-1}) = λ_1 P(w_n | w_{n-2} w_{n-1}) + λ_2 P(w_n | w_{n-1}) + λ_3 P(w_n)

such that Σ_i λ_i = 1

The λ_i can be learned automatically from corpus training data (using HMMs).

The λ_i can be made to vary according to the particular trigram context:

  P^(w_n | w_{n-2} w_{n-1}) = λ_1(w_{n-2}^{n-1}) P(w_n | w_{n-2} w_{n-1})
                            + λ_2(w_{n-2}^{n-1}) P(w_n | w_{n-1})
                            + λ_3(w_{n-2}^{n-1}) P(w_n)
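A minimal sketch of the interpolated estimate. The λ values here are fixed illustrative placeholders; in practice they would be estimated from held-out data, e.g. with the EM-style procedure the slide mentions.

```python
def interpolated_prob(w2, w1, w, trigram_p, bigram_p, unigram_p,
                      lambdas=(0.6, 0.3, 0.1)):
    """Linear combination of trigram, bigram, and unigram estimates.
    The lambdas must sum to 1; here they are fixed illustrative values,
    not weights learned from held-out data."""
    l1, l2, l3 = lambdas
    return (l1 * trigram_p.get((w2, w1, w), 0.0)
            + l2 * bigram_p.get((w1, w), 0.0)
            + l3 * unigram_p.get(w, 0.0))
```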