Lecture 2: N-gram. Kai-Wei Chang, University of Virginia. Course webpage:

1 Lecture 2: N-gram
Kai-Wei Chang, University of Virginia, kw@kwchang.net
Course webpage:
CS 6501: Natural Language Processing

2 This lecture
Language models
  What are N-gram models?
How to use probabilities
  What does $P(Y \mid X)$ mean?
  How can I manipulate it?
  How can I estimate its value in practice?

3 What is a language model?
Probability distributions over sentences (i.e., word sequences): $P(W) = P(w_1 w_2 w_3 w_4 \ldots w_k)$
Can use them to generate strings: $P(w_k \mid w_2 w_3 w_4 \ldots w_{k-1})$
Rank possible sentences:
  P("Today is Tuesday") > P("Tuesday Today is")
  P("Today is Tuesday") > P("Today is Virginia")

4 Language model applications: context-sensitive spelling correction

5 Language model applications: autocomplete

6 Language model applications: Smart Reply

7 Language model applications: language generation

8 Bag-of-Words with N-grams
N-grams: a contiguous sequence of n tokens from a given piece of text
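The definition above can be made concrete with a short sketch. The function name `ngrams` and the whitespace tokenization are illustrative assumptions, not something the slides prescribe.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows of `tokens` as tuples."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # A sequence of length L has L - n + 1 windows (none if n > L).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Assumption: tokens are obtained by splitting on whitespace.
tokens = "today is a sunny day".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A five-token sentence yields four bigrams and three trigrams; asking for an n larger than the sentence returns an empty list.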

9 N-Gram Models
Unigram model: $P(w_1)\, P(w_2)\, P(w_3) \cdots P(w_n)$
Bigram model: $P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2) \cdots P(w_n \mid w_{n-1})$
Trigram model: $P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2, w_1) \cdots P(w_n \mid w_{n-1}, w_{n-2})$
N-gram model: $P(w_1)\, P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1}, w_{n-2}, \ldots, w_{n-N+1})$

10 Random language via n-gram
int/lect01,3tr-ngram-gen.pdf
Behind the scenes: probability theory

11 Sampling with replacement
[Figure: the items being sampled; the blanks below stand for pictured items]
1. P( ) = ?
2. P( ) = ?
3. P(red, ) = ?
4. P(blue) = ?
5. P(red | ) = ?
6. P( | red) = ?
7. P( ) = ?
8. P( ) = ?
9. P(2 x, 3 x, 4 x ) = ?

12 Sampling words with replacement
Example from Julia Hockenmaier, Intro to NLP

13 Implementation: how to sample?
Sample from a discrete distribution p(x), assuming n outcomes in the event space X:
1. Divide the interval [0,1] into n intervals according to the probabilities of the outcomes
2. Generate a random number r between 0 and 1
3. Return the outcome x_i whose interval r falls into
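The three steps above can be sketched as follows. The function name `sample_discrete` and the optional `r` argument (which lets the caller fix the random number) are assumptions made for illustration.

```python
import random
from itertools import accumulate

def sample_discrete(outcomes, probs, r=None):
    """Return the outcome whose sub-interval of [0, 1) contains r."""
    if r is None:
        r = random.random()              # step 2: r is uniform in [0, 1)
    edges = list(accumulate(probs))      # step 1: right edge of each interval
    for outcome, edge in zip(outcomes, edges):
        if r < edge:                     # step 3: first interval containing r
            return outcome
    return outcomes[-1]                  # guard against rounding in the last edge

# Hypothetical distribution over three outcomes.
outcomes = ["a", "b", "c"]
probs = [0.5, 0.3, 0.2]
print(sample_discrete(outcomes, probs))
```

In practice the standard library call `random.choices(outcomes, weights=probs)` does the same job; the explicit loop is shown only to mirror the steps.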

14 Conditional on the previous word
Example from Julia Hockenmaier, Intro to NLP

15 Conditional on the previous word
Example from Julia Hockenmaier, Intro to NLP

16 Recap: probability theory
Conditional probability: $P(B \mid A) = P(B, A) / P(A)$
  P(blue | ) = ?
Bayes rule: $P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)}$
  Verify: P(red | ), P( | red), P( ), P(red)
Independence: $P(B \mid A) = P(B)$
  Prove: $P(A, B) = P(A)\, P(B)$
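One way to fill in the two exercises on this slide, using only the definition of conditional probability (and assuming $P(A) > 0$ and $P(B) > 0$ so the conditionals are defined):

```latex
% Bayes rule from the definition of conditional probability
\begin{align*}
P(B \mid A) &= \frac{P(A, B)}{P(A)}
             = \frac{P(A \mid B)\, P(B)}{P(A)}
\end{align*}

% Independence implies the joint factorizes
\begin{align*}
P(A, B) &= P(B \mid A)\, P(A) && \text{(definition of conditional probability)} \\
        &= P(B)\, P(A)        && \text{(independence: } P(B \mid A) = P(B)\text{)}
\end{align*}
```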

17 The Chain Rule
The joint probability can be expressed in terms of the conditional probability:
$P(X, Y) = P(X \mid Y)\, P(Y)$
More variables:
$P(X, Y, Z) = P(X \mid Y, Z)\, P(Y, Z) = P(X \mid Y, Z)\, P(Y \mid Z)\, P(Z)$
$P(X_1, X_2, \ldots, X_n) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_2, X_1) \cdots P(X_n \mid X_1, \ldots, X_{n-1}) = P(X_1) \prod_{i=2}^{n} P(X_i \mid X_1, \ldots, X_{i-1})$

18 Language model for text
Probability distribution over sentences, via the chain rule (from conditional probability to joint probability):
$p(w_1 w_2 \ldots w_n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, w_2, \ldots, w_{n-1})$
Complexity: $O(V^n)$, where $V$ is the vocabulary size and $n$ is the maximum sentence length
475,000 main headwords in Webster's Third New International Dictionary
Average English sentence length is 14.3 words
A rough estimate: $O(475000^{14})$
How large is this? At 8 bytes per entry, at least on the order of 1e66 TB
We need independence assumptions!
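The size of the full table can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch: rounding the sentence length to 14, storing one 8-byte number per entry, and using binary terabytes are all assumptions, and the exact exponent shifts with those conventions.

```python
V = 475_000          # headwords in Webster's Third (from the slide)
n = 14               # average sentence length 14.3, rounded down (assumption)
BYTES_PER_ENTRY = 8  # assumption: one 8-byte float per probability
TB = 1024 ** 4       # assumption: binary terabyte

num_entries = V ** n                          # exact integer, about 3e79
size_tb = num_entries * BYTES_PER_ENTRY / TB  # float, in terabytes
print(f"{float(num_entries):.2e} entries, about {size_tb:.2e} TB")
```

Whatever conventions are chosen, the result is dozens of orders of magnitude beyond any real storage, which is the point of the slide.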

19 Probability models
Building a probability model:
  defining the model (making independence assumptions)
  estimating the model's parameters
  using the model (making inference)
[Diagram: parameter values Θ and the definition of P together give the trigram model, which is defined in terms of parameters like P("is" | "today")]

20 Independence assumption
Even though X and Y are not actually independent, we treat them as independent
This makes the model compact (e.g., from $100\text{k}^{14}$ to $100\text{k}^{2}$)

21 Language model with N-gram
The chain rule:
$P(X_1, X_2, \ldots, X_n) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_2, X_1) \cdots P(X_n \mid X_1, \ldots, X_{n-1})$
An N-gram language model assumes each word depends only on the last n-1 words (Markov assumption)

22 Language model with N-gram
Example: trigram (3-gram)
$P(w_n \mid w_1, \ldots, w_{n-1}) = P(w_n \mid w_{n-2}, w_{n-1})$
$P(w_1, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-2}, w_{n-1})$
P("Today is a sunny day") = P("Today") P("is" | "Today") P("a" | "is", "Today") ... P("day" | "sunny", "a")
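To see which factors the trigram assumption produces for a given sentence, a small helper can list each word with the (at most two) words it is conditioned on. The name `trigram_factors` is hypothetical, and contexts are printed in left-to-right order.

```python
def trigram_factors(tokens):
    """List (word, context) pairs, where context is up to two previous words."""
    return [(w, tuple(tokens[max(0, i - 2):i])) for i, w in enumerate(tokens)]

factors = trigram_factors("Today is a sunny day".split())
for word, context in factors:
    given = " | " + ", ".join(context) if context else ""
    print(f"P({word}{given})")
```

The first two factors have shorter contexts because there are not yet two preceding words; every later factor conditions on exactly two.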

23 Unigram model

24 Bigram model
Condition on the previous word

25 N-gram model

26 More examples
Yoav's blog post: 21dfde
Character-level n-gram LM:
    First Citizen:
    Nay, then, that was hers,
    It speaks against your other service:
    But since the youth of the circumstance be spoken:
    Your uncle and one Baptista's daughter.
    SEBASTIAN:
    Do I stand till the break off.
    BIRON:
    Hide thy head.
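A character-level n-gram model of this kind can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code from the blog post: the names `train_char_lm` and `generate` are hypothetical, `order` (the history length) must be at least 1, and no smoothing is applied.

```python
import random
from collections import Counter, defaultdict

def train_char_lm(text, order):
    """Count which character follows each `order`-character history."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        history, nxt = text[i:i + order], text[i + order]
        counts[history][nxt] += 1
    return counts

def generate(counts, seed, order, length, rng=random):
    """Extend `seed` by sampling each next character given the last `order` chars."""
    out = seed
    for _ in range(length):
        followers = counts.get(out[-order:])
        if not followers:              # unseen history: stop early
            break
        chars, weights = zip(*followers.items())
        out += rng.choices(chars, weights=weights)[0]
    return out

# Toy training text (assumption); a real run would use a large corpus.
lm = train_char_lm("abababab", order=1)
print(generate(lm, "a", order=1, length=5))
```

Sampling in proportion to the counts is the same as sampling from the maximum likelihood estimate of the next-character distribution.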

27 More examples
Yoav's blog post: 21dfde
10-gram character-level LM:
    ~~/*
     * linux/kernel/time.c
     * Please report this on hardware.
     */
    void irq_mark_irq(unsigned long old_entries, eval);
    /*
     * Divide only 1000 for ns^2 -> us^2 conversion values don't overflow:
    seq_puts(m, "\ttramp: %ps",
    if (likely(t->flags & WQ_UNBOUND)) {
    /*
     * Update inode information. If the
     * slowpath and sleep time (abs or rel)
     * remaining (either due
     * to consume the state of ring buffer size. */
    header_size - size, in bytes, of the chain. */
    BUG_ON(!error);
    } while (cgrp) {
    if (old) {
    if (kdb_continue_catastrophic;
    #endif
    (void *)class->contending_point]++;

28 Questions?

29 Maximum likelihood estimation
"Best" means the data likelihood reaches its maximum: $\hat{\theta} = \arg\max_{\theta} P(X \mid \theta)$
Unigram language model: $p(w \mid \theta) = ?$
Estimation from a document, a paper with 100 words in total:
    word          count   estimate
    text          10      10/100
    mining        5       5/100
    association   3       3/100
    database      3       3/100
    algorithm     2
    query         1       1/100
    efficient     1

30 Which bag of words is more likely to generate: aaadaaakoaaaa
[Figure: two bags of letters; extracted contents: a K a a E a o ad a a a K a D b E P F o n]

31 Parameter estimation
General setting:
  Given a (hypothesized and probabilistic) model that governs the random experiment
  The model gives a probability $p(x \mid \theta)$ of any data, which depends on the parameter θ
  Now, given actual sample data $X = \{x_1, \ldots, x_n\}$, what can we say about the value of θ?
Intuitively, take our best guess of θ, where "best" means best explaining/fitting the data
Generally an optimization problem

32 Maximum likelihood estimation
Data: a collection of words $w_1, w_2, \ldots, w_n$
Model: multinomial distribution $p(W)$ with parameters $\theta_i = p(w_i)$
Maximum likelihood estimator: $\hat{\theta} = \arg\max_{\theta \in \Theta} p(W \mid \theta)$
$p(W \mid \theta) = \binom{N}{c(w_1), \ldots, c(w_N)} \prod_{i=1}^{N} \theta_i^{c(w_i)} \propto \prod_{i=1}^{N} \theta_i^{c(w_i)}$
$\log p(W \mid \theta) = \sum_{i=1}^{N} c(w_i) \log \theta_i + \text{const}$
$\hat{\theta} = \arg\max_{\theta \in \Theta} \sum_{i=1}^{N} c(w_i) \log \theta_i$

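33 Maximum likelihood estimation
$\hat{\theta} = \arg\max_{\theta \in \Theta} \sum_{i=1}^{N} c(w_i) \log \theta_i$
Lagrange multiplier: $L(W, \theta) = \sum_{i=1}^{N} c(w_i) \log \theta_i + \lambda \left( \sum_{i=1}^{N} \theta_i - 1 \right)$
Set partial derivatives to zero: $\frac{\partial L}{\partial \theta_i} = \frac{c(w_i)}{\theta_i} + \lambda = 0 \;\Rightarrow\; \theta_i = -\frac{c(w_i)}{\lambda}$
Requirement from probability: since $\sum_{i=1}^{N} \theta_i = 1$, we have $\lambda = -\sum_{i=1}^{N} c(w_i)$
ML estimate: $\theta_i = \frac{c(w_i)}{\sum_{i=1}^{N} c(w_i)}$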
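The closed-form result says the maximum likelihood unigram estimate is just relative frequency. A minimal sketch, assuming whitespace tokenization and a hypothetical helper name `mle_unigram`; exact fractions are used so the normalization can be checked without rounding.

```python
from collections import Counter
from fractions import Fraction

def mle_unigram(tokens):
    """Maximum likelihood unigram estimate: count of each word over total count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: Fraction(c, total) for w, c in counts.items()}

# Toy document (assumption), not the 100-word paper from the slides.
theta = mle_unigram("text mining text database text query".split())
for word, prob in theta.items():
    print(word, prob)
```

The estimates sum to exactly 1, which is the constraint the Lagrange multiplier enforces in the derivation.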

34 Maximum likelihood estimation
For N-gram language models:
$p(w_i \mid w_{i-1}, \ldots, w_{i-n+1}) = \frac{c(w_i, w_{i-1}, \ldots, w_{i-n+1})}{c(w_{i-1}, \ldots, w_{i-n+1})}$
In the unigram case the denominator is $N$, the length of the document or the total number of words in the corpus

35 A bi-gram example
Corpus:
    <S> I am Sam </S>
    <S> I am legend </S>
    <S> Sam I am </S>
P(I | <S>) = ?
P(am | I) = ?
P(Sam | am) = ?
P(</S> | Sam) = ?
P(<S> I am Sam </S> | bigram model) = ?
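The questions on this slide can be answered by counting. The sketch below applies the count-ratio estimate to the three-sentence corpus; the helper names `p` and `sentence_prob` are illustrative, and exact fractions are used to avoid rounding.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "<S> I am Sam </S>",
    "<S> I am legend </S>",
    "<S> Sam I am </S>",
]

bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1      # times `prev` is followed by some word

def p(word, prev):
    """MLE bigram probability c(prev, word) / c(prev); `prev` must be a seen context."""
    return Fraction(bigram_counts[(prev, word)], context_counts[prev])

def sentence_prob(sentence):
    """Product of bigram probabilities over adjacent token pairs."""
    prob = Fraction(1)
    tokens = sentence.split()
    for prev, word in zip(tokens, tokens[1:]):
        prob *= p(word, prev)
    return prob

print(p("I", "<S>"), p("am", "I"), p("Sam", "am"), p("</S>", "Sam"))
print(sentence_prob("<S> I am Sam </S>"))
```

With these counts, P(I | <S>) = 2/3, P(am | I) = 1, P(Sam | am) = 1/3, P(</S> | Sam) = 1/2, and the sentence probability is their product, 1/9.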

36 Practical Issues
We do everything in the log space: $\log(p_1 \times p_2) = \log p_1 + \log p_2$
  Avoid underflow
  Adding is faster than multiplying
Toolkits
  KenLM:
  SRILM:
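The underflow problem is easy to reproduce. In this sketch the 1000 factors of 1e-5 are made-up values standing in for the per-word probabilities of a long text.

```python
import math

probs = [1e-5] * 1000   # illustrative per-word probabilities (assumption)

naive = 1.0
for q in probs:
    naive *= q          # the running product underflows to 0.0 in double precision

log_prob = sum(math.log(q) for q in probs)   # the log-space sum stays finite

print(naive)
print(log_prob)
```

The direct product is indistinguishable from an impossible event, while the log probability remains usable for comparing sentences.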

37 More resources
Google n-gram:
  File sizes: approx. 24 GB compressed (gzip'ed) text files
  Number of tokens: 1,024,908,267,229
  Number of sentences: 95,119,665,584
  Number of unigrams: 13,588,391
  Number of bigrams: 314,843,401
  Number of trigrams: 977,069,902
  Number of fourgrams: 1,313,818,354
  Number of fivegrams: 1,176,470,663

38 More resources
Google n-gram viewer
Data: books/datasetsv2.html
Example word from the data: circumvallate

43 How about unseen words/phrases?
Example: the Shakespeare corpus consists of N=884,647 word tokens and a vocabulary of V=29,066 word types
Only about 30,000 word types occurred
  Words not in the training data get 0 probability
Only 0.04% of all possible bigrams occurred
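The sparsity figures can be put in absolute terms with a little arithmetic. The counts N and V come from the slide; treating the corpus as one long token sequence (ignoring sentence boundaries) when bounding the number of bigrams is an assumption.

```python
N = 884_647   # word tokens in the corpus (from the slide)
V = 29_066    # word types in the vocabulary (from the slide)

possible_bigrams = V ** 2                            # every ordered pair of types
observed_estimate = possible_bigrams * 4 // 10_000   # 0.04% of the possible pairs
max_bigram_tokens = N - 1                            # adjacent pairs in N tokens
coverage_ceiling = max_bigram_tokens / possible_bigrams

print(possible_bigrams, observed_estimate)
print(f"{coverage_ceiling:.4%}")
```

There are about 845 million possible bigrams, and a corpus of this size contains fewer than a million adjacent pairs, so even if no pair ever repeated it could cover only about 0.1% of them. Most bigrams are unseen simply because the corpus is far too small.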

44 Next Lecture
Dealing with unseen n-grams
Key idea: reserve some probability mass for events that don't occur in the training data
How much probability mass should we reserve?

45 Recap
N-gram language models
How to generate text from a language model
How to estimate a language model
Reading: Speech and Language Processing, Chapter 4: N-Grams

Language Modeling. Introduction to N-grams. Many slides are adapted from slides by Dan Jurafsky. Probabilistic Language Models. Today's goal: assign a probability to a sentence. Why? Machine Translation: P(high