Lecture 2: N-gram. Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net. Course webpage: http://kwchang.net/teaching/nlp16
This lecture
- Language models: what are N-gram models?
- How to use probabilities: what does P(Y | X) mean? How can I manipulate it? How can I estimate its value in practice?
What is a language model?
- A probability distribution over sentences (i.e., word sequences): P(W) = P(w_1 w_2 w_3 w_4 ... w_k)
- Can use it to generate strings: P(w_k | w_1 w_2 w_3 w_4 ... w_{k-1})
- Can rank possible sentences: P("Today is Tuesday") > P("Tuesday Today is"), and P("Today is Tuesday") > P("Today is Virginia")
Language model applications: context-sensitive spelling correction, autocomplete, smart reply, and language generation (e.g., https://pdos.csail.mit.edu/archive/scigen/).
Bag-of-Words with N-grams. An n-gram is a contiguous sequence of n tokens from a given piece of text. http://recognize-speech.com/language-model/n-gram-model/comparison
N-Gram Models
- Unigram model: P(w_1) P(w_2) P(w_3) ... P(w_n)
- Bigram model: P(w_1) P(w_2 | w_1) P(w_3 | w_2) ... P(w_n | w_{n-1})
- Trigram model: P(w_1) P(w_2 | w_1) P(w_3 | w_2, w_1) ... P(w_n | w_{n-1}, w_{n-2})
- N-gram model: P(w_1) P(w_2 | w_1) ... P(w_n | w_{n-1}, w_{n-2}, ..., w_{n-N+1})
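To make these factorizations concrete, here is a minimal Python sketch scoring one sentence under the unigram and bigram models; the probability tables are made-up illustrative values, not estimates from any corpus.

    # Minimal sketch: score the same sentence under the unigram and bigram factorizations.
    # The probability tables below are made-up illustrative values, not real estimates.
    unigram_p = {"today": 0.05, "is": 0.10, "tuesday": 0.01}
    bigram_p = {("today", "is"): 0.30, ("is", "tuesday"): 0.05}

    def unigram_prob(words):
        # P(w_1) * P(w_2) * ... * P(w_n)
        p = 1.0
        for w in words:
            p *= unigram_p[w]
        return p

    def bigram_prob(words):
        # P(w_1) * P(w_2 | w_1) * ... * P(w_n | w_{n-1})
        p = unigram_p[words[0]]
        for prev, w in zip(words, words[1:]):
            p *= bigram_p[(prev, w)]
        return p

    sentence = ["today", "is", "tuesday"]
    print(unigram_prob(sentence))  # 0.05 * 0.10 * 0.01 = 5e-05
    print(bigram_prob(sentence))   # 0.05 * 0.30 * 0.05 = 7.5e-04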
Random language via n-grams: http://www.cs.jhu.edu/~jason/465/powerpoint/lect01,3tr-ngram-gen.pdf. Behind the scenes: probability theory.
Sampling with replacement
[Figure: an urn of colored balls. Warm-up questions: marginals such as P(red) and P(blue), joints such as P(red, ·), conditionals such as P(red | ·) and P(· | red), and the probability of a particular combination of draws, e.g. P(2 × ·, 3 × ·, 4 × ·).]
Sampling words with replacement (example from Julia Hockenmaier, Intro to NLP)
Implementation: how to sample?
Sample from a discrete distribution p(x), assuming n outcomes in the event space X:
1. Divide the interval [0, 1] into n sub-intervals whose lengths are the probabilities of the outcomes.
2. Generate a random number r between 0 and 1.
3. Return the outcome x_i whose interval contains r.
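A minimal Python sketch of this procedure (the example distribution is made up):

    import random

    def sample(distribution):
        # Sample one outcome from a discrete distribution given as {outcome: probability}.
        r = random.random()              # step 2: uniform random number in [0, 1)
        cumulative = 0.0
        for outcome, p in distribution.items():
            cumulative += p              # step 1: the intervals laid end to end
            if r < cumulative:
                return outcome           # step 3: r falls into this outcome's interval
        return outcome                   # guard against floating-point round-off

    p = {"the": 0.5, "cat": 0.3, "sat": 0.2}
    print([sample(p) for _ in range(10)])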
Conditional on the previous word (example from Julia Hockenmaier, Intro to NLP)
Recap: probability theory
- Conditional probability: P(B | A) = P(B, A) / P(A), e.g. P(blue | ·) = ?
- Bayes' rule: P(B | A) = P(A | B) P(B) / P(A). Verify with the urn example: P(red | ·), P(· | red), P(·), P(red).
- Independence: P(B | A) = P(B). Prove that this implies P(A, B) = P(A) P(B).
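A one-line answer to the proof asked for above, using the chain rule from the next slide: if P(B | A) = P(B), then P(A, B) = P(B | A) P(A) = P(B) P(A).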
The Chain Rule
- The joint probability can be expressed in terms of the conditional probability: P(X, Y) = P(X | Y) P(Y)
- More variables: P(X, Y, Z) = P(X | Y, Z) P(Y, Z) = P(X | Y, Z) P(Y | Z) P(Z)
- In general: P(X_1, X_2, ..., X_n) = P(X_1) P(X_2 | X_1) P(X_3 | X_2, X_1) ... P(X_n | X_1, ..., X_{n-1}) = P(X_1) ∏_{i=2}^{n} P(X_i | X_1, ..., X_{i-1})
Language model for text
- A probability distribution over sentences. Chain rule (from conditional to joint probability): p(w_1 w_2 ... w_n) = p(w_1) p(w_2 | w_1) p(w_3 | w_1, w_2) ... p(w_n | w_1, w_2, ..., w_{n-1})
- Complexity: O(V^n), where n is the maximum sentence length. We need independence assumptions!
- How large is this? There are 475,000 main headwords in Webster's Third New International Dictionary, and the average English sentence length is 14.3 words. A rough estimate: O(475000^14) parameters, i.e. roughly 3.38e66 TB at 8 bytes per parameter.
Probability models
Building a probability model involves:
- defining the model (making independence assumptions), e.g. a trigram model, defined in terms of parameters like P("is" | "today")
- estimating the model's parameters (the parameter values Θ)
- using the model (making inferences)
A model = a definition of P + parameter values Θ.
Independence assumption: even though X and Y are not actually independent, we treat them as independent. This makes the model compact (e.g., from 100k^14 parameters to 100k^2).
Language model with N-grams
- The chain rule: P(X_1, X_2, ..., X_n) = P(X_1) P(X_2 | X_1) P(X_3 | X_2, X_1) ... P(X_n | X_1, ..., X_{n-1})
- An N-gram language model assumes each word depends only on the last n-1 words (Markov assumption).
Language model with N-grams
Example: trigram (3-gram)
- P(w_n | w_1, ..., w_{n-1}) ≈ P(w_n | w_{n-2}, w_{n-1})
- P(w_1, ..., w_n) ≈ P(w_1) P(w_2 | w_1) ... P(w_n | w_{n-2}, w_{n-1})
- P("Today is a sunny day") = P("Today") P("is" | "Today") P("a" | "is", "Today") ... P("day" | "sunny", "a")
Unigram model
Bigram model: condition on the previous word
N-gram model
More examples
Yoav's blog post: http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139
A 10-gram character-level LM generates:
First Citizen: Nay, then, that was hers, It speaks against your other service: But since the youth of the circumstance be spoken: Your uncle and one Baptista's daughter.
SEBASTIAN: Do I stand till the break off.
BIRON: Hide thy head.
More examples: linux/kernel/time.c
Yoav's blog post: http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139
A 10-gram character-level LM trained on Linux kernel source generates output such as:
/* * Please report this on hardware. */
void irq_mark_irq(unsigned long old_entries, eval);
/* * Divide only 1000 for ns^2 -> us^2 conversion values don't overflow:
seq_puts(m, "\ttramp: %ps",
if (likely(t->flags & WQ_UNBOUND)) {
/* * Update inode information. If the * slowpath and sleep time (abs or rel) * @rmtp: remaining (either due * to consume the state of ring buffer size. */
header_size - size, in bytes, of the chain. */
BUG_ON(!error); } while (cgrp) { if (old) { if (kdb_continue_catastrophic; #endif (void *)class->contending_point]++;
Questions?
Maximum likelihood estimation
"Best" means the data likelihood reaches its maximum: θ̂ = argmax_θ P(X | θ)
Example: estimate a unigram language model p(w | θ) from a document (a paper with 100 words in total) with counts: text 10, mining 5, association 3, database 3, algorithm 2, query 1, efficient 1.
Estimates: p(text) = 10/100, p(mining) = 5/100, p(association) = 3/100, p(database) = 3/100, ..., p(query) = 1/100, ...
Which bag of words is more likely to generate "aaadaaakoaaaa"?
Bag 1: a, K, a, a, E, a, o, a, d, a, a
Bag 2: a, K, a, D, b, E, P, F, o, n
Parameter estimation
General setting:
- Given a (hypothesized and probabilistic) model that governs the random experiment
- The model gives a probability p(X | θ) to any data X, which depends on the parameter θ
- Now, given actual sample data X = {x_1, ..., x_n}, what can we say about the value of θ?
- Intuitively, take our best guess of θ, where "best" means best explaining/fitting the data
- Generally this is an optimization problem
Maximum likelihood estimation
- Data: a collection of words w_1, w_2, ..., w_n
- Model: a multinomial distribution p(w) with parameters θ_i = p(w_i)
- Maximum likelihood estimator: θ̂ = argmax_{θ ∈ Θ} p(W | θ)
p(W | θ) = [n! / (c(w_1)! ... c(w_N)!)] ∏_{i=1}^{N} θ_i^{c(w_i)} ∝ ∏_{i=1}^{N} θ_i^{c(w_i)}
log p(W | θ) = Σ_{i=1}^{N} c(w_i) log θ_i + const
θ̂ = argmax_{θ ∈ Θ} Σ_{i=1}^{N} c(w_i) log θ_i
Maximum likelihood estimation
θ̂ = argmax_{θ ∈ Θ} Σ_{i=1}^{N} c(w_i) log θ_i, subject to the probability constraint Σ_{i=1}^{N} θ_i = 1
Use a Lagrange multiplier: L(W, θ) = Σ_{i=1}^{N} c(w_i) log θ_i + λ (Σ_{i=1}^{N} θ_i − 1)
Set the partial derivatives to zero: ∂L/∂θ_i = c(w_i)/θ_i + λ = 0, so θ_i = −c(w_i)/λ
Since Σ_{i=1}^{N} θ_i = 1, we have λ = −Σ_{i=1}^{N} c(w_i)
ML estimate: θ_i = c(w_i) / Σ_{i=1}^{N} c(w_i)
Maximum likelihood estimation
For N-gram language models:
p(w_i | w_{i-1}, ..., w_{i-n+1}) = c(w_i, w_{i-1}, ..., w_{i-n+1}) / c(w_{i-1}, ..., w_{i-n+1})
(In the unigram case the denominator is Σ_i c(w_i) = N, the length of the document or the total number of words in the corpus.)
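A minimal sketch of this count-and-divide estimator in plain Python (written for this note, not taken from any toolkit; the <S>/</S> sentence padding follows the bigram example on the next slide):

    from collections import Counter

    def ngram_mle(corpus, n):
        # corpus: a list of sentences, each a list of word tokens.
        # Returns P(w_i | w_{i-n+1}, ..., w_{i-1}) as relative frequencies.
        ngram_counts = Counter()
        history_counts = Counter()
        for sentence in corpus:
            tokens = ["<S>"] * (n - 1) + sentence + ["</S>"]
            for i in range(n - 1, len(tokens)):
                history = tuple(tokens[i - n + 1:i])
                ngram_counts[history + (tokens[i],)] += 1
                history_counts[history] += 1
        return {ngram: count / history_counts[ngram[:-1]]
                for ngram, count in ngram_counts.items()}

    corpus = [["I", "am", "Sam"], ["I", "am", "legend"], ["Sam", "I", "am"]]
    bigram_probs = ngram_mle(corpus, 2)
    print(bigram_probs[("<S>", "I")])   # 2/3 on this toy corpus
    print(bigram_probs[("am", "Sam")])  # 1/3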
A bi-gram example
<S> I am Sam </S>
<S> I am legend </S>
<S> Sam I am </S>
P(I | <S>) = ?  P(am | I) = ?  P(Sam | am) = ?  P(</S> | Sam) = ?
P(<S> I am Sam </S> | bigram model) = ?
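For reference, a hand computation from the counts in these three sentences gives: P(I | <S>) = 2/3, P(am | I) = 3/3 = 1, P(Sam | am) = 1/3, P(</S> | Sam) = 1/2, so P(<S> I am Sam </S>) = 2/3 × 1 × 1/3 × 1/2 = 1/9 ≈ 0.11 under the bigram model.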
Practical issues
- We do everything in log space: log(p_1 · p_2) = log p_1 + log p_2
  - Avoids underflow
  - Adding is faster than multiplying
- Toolkits:
  - KenLM: https://kheafield.com/code/kenlm/
  - SRILM: http://www.speech.sri.com/projects/srilm
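A minimal sketch of log-space scoring (the bigram probabilities below are the MLE values from the bi-gram example on the previous slide):

    import math

    # MLE bigram probabilities from the <S> I am Sam </S> toy corpus.
    bigram_p = {("<S>", "I"): 2/3, ("I", "am"): 1.0, ("am", "Sam"): 1/3, ("Sam", "</S>"): 1/2}

    def sentence_logprob(words):
        # Sum log-probabilities instead of multiplying probabilities:
        # no underflow on long sentences, and addition is cheaper than multiplication.
        logp = 0.0
        prev = "<S>"
        for w in words + ["</S>"]:
            logp += math.log(bigram_p[(prev, w)])
            prev = w
        return logp

    lp = sentence_logprob(["I", "am", "Sam"])
    print(lp, math.exp(lp))  # exp(log-prob) recovers P = 1/9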
More resources
Google n-gram: https://research.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html
- File sizes: approx. 24 GB compressed (gzip'ed) text files
- Number of tokens: 1,024,908,267,229
- Number of sentences: 95,119,665,584
- Number of unigrams: 13,588,391
- Number of bigrams: 314,843,401
- Number of trigrams: 977,069,902
- Number of fourgrams: 1,313,818,354
- Number of fivegrams: 1,176,470,663
More resources
Google n-gram viewer: https://books.google.com/ngrams/
Data: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
Sample entries: "circumvallate 1978 335 91", "circumvallate 1979 261 91"
What about unseen words/phrases?
Example: the Shakespeare corpus consists of N = 884,647 word tokens and a vocabulary of V = 29,066 word types.
- Only about 30,000 word types occurred, so words not in the training data get 0 probability.
- Only 0.04% of all possible bigrams (V^2 ≈ 845 million) occurred, so the vast majority of bigrams also get 0 probability.
Next lecture
Dealing with unseen n-grams. Key idea: reserve some probability mass for events that don't occur in the training data. How much probability mass should we reserve?
Recap
- N-gram language models
- How to generate text from a language model
- How to estimate a language model
Reading: Speech and Language Processing, Chapter 4: N-Grams