Language Model. Introduction to N-grams

Size: px

Start display at page:

Download "Language Model. Introduction to N-grams"

Bruce Johnston
5 years ago
Views:

1 Language Model Introduction to N-grams

2 Probabilistic Language Model Goal: assign a probability to a sentence Application: Machine Translation P(high winds tonight) > P(large winds tonight) Spelling Correction P(about 15 minutes from) > P(about 15 minuets from) Speech Recognition P(I saw a van) > P(eyes awe of an)

3 How to Compute Language Modeling For a given sentence s =(w 1,...,w n ) Words w t are discrete Sequence length n is random Goal: probability of an upcoming word p(w t w 1,...,w t 1 )

4 Chain Rule of Probability p(s) =p(w 1 )p(w 2 w 1 )p(w 3 w 1,w 2 ) p(w n w 1,...,w n 1 ) Example p(its water is so transparent) =p(its) p(water its) p(is its water) p(so its water is) p(transparent its water is so) The probability of a sentence can be obtained via the Chain Rule of Probability

5 Evaluation of Language Model real frequently observed higher probability ungrammatical rarely observed lower probability

6 Evaluation via Perplexity PP(s) =(p((w 1,...,w n ))) 1 n = ny t=1 1 p(w t w 1,...,w t 1 )! 1 n The best language model is the one that best predicts an unseen sentence

7 Markov Assumption Too many possible combinations of Impossible to infer Approximation: Unigram Bigram p(w t w 1,...,w t 1 ) Higher order approximation w 1,...,w t p(w t w 1,,w t 1 ) P W2 W 1 (w t w t 1 ) p(w t w 1,,w t 1 ) P W3 W 1,W 2 (w t w t 1,w t 2 )

8 Parameter Estimation In a bigram language model, parameters are P W2 W 1 (w 2 w 1 ), 8 w 1,w 2 2 vocabulary Straight forward: the ML estimator P W2 W 1 (w 2 w 1 )= count(w 2,w 1 ) count(w 1 ) count(, ) count( ) is the number of cooccurrence of a word pair. is the number occurrence of a word.

9 An Example Training corpus: <s> I am Sam </s> <s> Sam I am </s> <s> I do not like green eggs and ham </s> Induced parameters: P W2 W 1 (I hsi) = 2 3 P W2 W 1 (Sam hsi) = 1 3 P W2 W 1 (am I) = 2 3 P W2 W 1 (h/si Sam) = 1 2 P W2 W 1 (Sam am) = 1 2 P W2 W 1 (do I) = 1 3

Online Resource https://research.googleblog.

10 Online Resource

11 Shakespeare as Corpus Corpus has 884,647 tokens Vocabulary size V=29,066 Shakespeare produced 300,000 bigrams out of V^2 = 844 millions possible bigrams 99.96% of the bigram tables are zero (why?)

12 Practical Issues Sparsity Things that don t occur in the training corpus, but occur in real life. Training Corpus denied the allegations denied the reports denied the claims denied the request Real Applications denied the offer P (o er denied the) = 0

13 Smoothing the Zero s When we have sparse statistics, steal probability mass to generalize better. Reference: Chen, Stanley F., and Joshua Goodman. "An empirical study of smoothing techniques for language modeling."

14 Good Turing Smoothing Consider a scenario, one is fishing and caught 18 fishes x10 x3 x2 Carp perch whitefish x1 x1 x1 trout salmon eel How likely is that next species is trout? Assume there are new species, how likely is it that next species is new? Now how likely is that next species is trout?

15 Leave-One-Validation Take each of one of the fish out in turn 18 training sets of size 17, held-out of size 1 The fraction of held-out fishes are unseen in the training? # of fishes occur once / 18 = 3/18 The fraction of held-out fishes are seen k times in training? (# of fishes occur (k+1) times)*(k+1) / 18 Use things-we-saw-(k+1)-times to estimate things-we-saw-k-times

Good Turing Smoothing Consider a scenario, one is fishing and caught 18 fishes x10 x3 x2 Carp perch whitefish x1 x1 x1 trout salmon eel How likely is that next

16 Good Turing Smoothing Consider a scenario, one is fishing and caught 18 fishes x10 x3 x2 Carp perch whitefish x1 x1 x1 trout salmon eel How likely is that next species is trout? 1/18 Assume there are new species, how likely is it that next species is new? 3/18 Now how likely is that next species is trout? 1/18 * (2/3) = 1/27

17 Good Turing Smoothing N i is the number of words that occur i times is the number of samples N = X i in i A word (with occurrence i ) should occur with probability (i + 1)N i+1 N i N Good Turing Smoothing

18 NIPS 2017 Good Turing

19 Absolute Discounting Smoothing Steal probability mass to unseen samples P (AD) discounted bigram W 2 W 1 (w 2 w 1 )= count(w 2,w 1 ) count(w 1 ) d Interpolation weight + (w 1 )p W (w 2 ) unigram distribution

20 Absolute Discounting P (AD) glasses sun W 2 W 1 (w 2 w 1 )= count(w 2,w 1 ) count(w 1 ) d + (w 1 )p W (w 2 ) Some word (e.g. Fransisco) always occurs with other words (e.g. San), but this contributes to the unigram distribution. Principle of probability X w 1 P (AD) W 2 W 1 (w 2 w 1 )P W (w 1 ) 6= P W (w 2 ) Choice of continuation distribution! P(glasses) << P(Fransisco)

21 Kneser-Ney Smoothing P (KN) (w W 2 W 1 2 w 1 )= count(w 2,w 1 ) count(w 1 ) d + (w 1 )P cont (w 2 ) The normalized discount; the probability mass we ve discounted (w) = P cont (w) = d c(w) {w0 : count(w, w 0 ) > 0} {w 0 : count(w, w 0 ) > 0} P w 0 2vocabulary {w0 : count(w, w 0 ) > 0} The number of word types that can follow w P(glasses) > P(Fransisco) Marginal constraint gives an only solution to the interpolated distribution.

22 Trigram and More Recursive formulation of KN smoothing P (KN) (w n w 1,w 2,...,w n 1 )= max{count(kn) (w 1,w 2,...,w n ) d, 0} count (KN) (w 1,w 2,...,w n 1 ) + (w 1,...,w n 1 )P (KN) (w n w 2,...,w n 1 ) count (KN) ( ) = # of occurrence of for the highest order # of unique word types for for lower order

23 Bayesian Interpretation of KN Smoothing Corpus KN Smoothing? Bayesian Inference Parameters

24 Smoothing via Context Tree Parameterized the conditional distribution of Markov model P (w u =(w 1,...,w n 1 )) = G u (w) G u is a probability vector associated with context u G ; G a G in a G find a G about a G toad in a G stuck in a Smoothing equals the dependency between G u and G parent(u)

25 Chinese Restaurant Process Chinese Restaurant Process CRP(d,,G 0 ) is a distribution over distributions over a probability space CRP is defined over draws from G 1 =CPR(d,,G 0 ) Sample space is all (unbounded) tables in a restaurant A sequence of customers visit this restaurant, and randomly pick a table to sit The first customer sits at the first table The i-th customer chooses his seat after observing the seating arrangement. Samples from CRP is equivalent to the seating arrangements of infinite customers

26 Sample Generation in CRP Let Let Let x i c k t. be the table the i-th customer sits be the number of customer sitting at table be the number of occupied tables +dt. +(i 1) G 0 W.p. this customer sits from ; otherwise, he chooses his seat based on current seating arrangement. k x i x 1,...,x i Xt. k=1 1, seating arrangement c k d +(i 1) table + + dt. k +(i 1) G 0 Interpolated weight Absolute discounting Continuation probability

27 CRP in Language Model CRP is defined over draws from N-gram distribution is built on (N-1)-gram distribution base distribution Output distribution CRP P (w w 2,...,w n ) P (w w 1,w 2,...,w n ) G u (w) CRP(d u, u,g parent(u) (w))

28 Neural Network in Language Model N-gram models can be thought as classifications history (w 1,...,w n 1 ) predict current word w n Discriminative model via neural networks

29 FNN in Language Model N-gram models are inherently classification problem: Given the context, predict the next word (one of V classes) output word softmax as probability distribution context of N words Reference:

Curse of N-gram Models Language has long-distance dependencies The computer which I had just put into the machine room on the fifth floor crashed.

30 Curse of N-gram Models Language has long-distance dependencies The computer which I had just put into the machine room on the fifth floor crashed. Modeling via RNN Use this to predict the next word Reference: mikolov_interspeech2010_is pdf

Language Modeling. Introduction to N-grams. Many Slides are adapted from slides by Dan Jurafsky

Language Modeling. Introduction to N-grams. Many Slides are adapted from slides by Dan Jurafsky Language Modeling Introduction to N-grams Many Slides are adapted from slides by Dan Jurafsky Probabilistic Language Models Today s goal: assign a probability to a sentence Why? Machine Translation: P(high