9/10/18. CS 6120/CS4120: Natural Language Processing: Probabilistic Language Models and N-grams


Outline
CS 6120/CS4120: Natural Language Processing
Instructor: Prof. Lu Wang
College of Computer and Information Science, Northeastern University
Webpage: www.ccs.neu.edu/home/luwang
- Probabilistic language model and n-grams
- Estimating n-gram probabilities
- Language model evaluation and perplexity
- Generalization and zeros
- Smoothing: add-one
- Interpolation, backoff, and web-scale LMs
- Smoothing: Kneser-Ney Smoothing
[Modified from Dan Jurafsky's slides]

Probabilistic Language Models
Assign a probability to a sentence.
Machine Translation: P(high winds tonight) > P(large winds tonight)
Spell Correction: "The office is about fifteen minuets from my house"; P(about fifteen minutes from) > P(about fifteen minuets from)
Speech Recognition: P(I saw a van) >> P(eyes awe of an)
Text generation in general: summarization, question-answering

Probabilistic Language Modeling
Goal: compute the probability of a sentence or sequence of words: P(W) = P(w1, w2, w3, w4, w5, ..., wn)
Related task: probability of an upcoming word: P(w5 | w1, w2, w3, w4)
A model that computes either of these, P(W) or P(wn | w1, w2, ..., wn-1), is called a language model.
A better name would be "the grammar", but "language model" (or LM) is standard.

How to compute P(W)
How to compute this joint probability: P(its, water, is, so, transparent, that)?
Intuition: let's rely on the Chain Rule of Probability.

Quick Review: Probability
Recall the definition of conditional probabilities: P(B | A) = P(A, B) / P(A)
Rewriting: P(A, B) = P(A) P(B | A)
More variables: P(A, B, C, D) = P(A) P(B | A) P(C | A, B) P(D | A, B, C)

The Chain Rule in General
P(x1, x2, x3, ..., xn) = P(x1) P(x2 | x1) P(x3 | x1, x2) ... P(xn | x1, ..., xn-1)

The Chain Rule applied to compute the joint probability of words in a sentence
P(w1 w2 ... wn) = ∏_i P(wi | w1 w2 ... wi-1)
P("its water is so transparent") = P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so)

How to estimate these probabilities
Could we just count and divide?
P(the | its water is so transparent that) = Count(its water is so transparent that the) / Count(its water is so transparent that)
No! Too many possible sentences! We'll never see enough data for estimating these.

Markov Assumption
Simplifying assumption: P(the | its water is so transparent that) ≈ P(the | that)
Or maybe: P(the | its water is so transparent that) ≈ P(the | transparent that)
In general: P(w1 w2 ... wn) ≈ ∏_i P(wi | wi-k ... wi-1)
In other words, we approximate each component in the product: P(wi | w1 ... wi-1) ≈ P(wi | wi-k ... wi-1)

Simplest case: Unigram model
P(w1 w2 ... wn) ≈ ∏_i P(wi)
Some automatically generated sentences from a unigram model:
"fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass"
"thrift, did, eighty, said, hard, 'm, july, bullish"
"that, or, limited, the"

Bigram model
Condition on the previous word: P(wi | w1 w2 ... wi-1) ≈ P(wi | wi-1)
Some automatically generated sentences from a bigram model:
"texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen"
"outside, new, car, parking, lot, of, the, agreement, reached"
"this, would, be, a, record, november"

N-gram models
We can extend to trigrams, 4-grams, 5-grams.
In general this is an insufficient model of language, because language has long-distance dependencies:
"The computer(s) which I had just put into the machine room on the fifth floor is (are) crashing."
But we can often get away with N-gram models.

Today's Outline
- Probabilistic language model and n-grams
- Estimating n-gram probabilities
- Language model evaluation and perplexity
- Generalization and zeros
- Smoothing: add-one
- Interpolation, backoff, and web-scale LMs
- Smoothing: Kneser-Ney Smoothing

Estimating bigram probabilities
The Maximum Likelihood Estimate for a bigram probability:
P(wi | wi-1) = count(wi-1, wi) / count(wi-1) = c(wi-1, wi) / c(wi-1)
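As a concrete illustration of the maximum likelihood estimate above, here is a minimal Python sketch (my own illustration, not the course's reference implementation) that counts unigrams and bigrams in a tokenized corpus and computes P(wi | wi-1) = c(wi-1, wi) / c(wi-1). The <s> and </s> boundary padding follows the convention used in the example on the next page.

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Estimate unsmoothed bigram probabilities by maximum likelihood.
    `sentences` is a list of lists of tokens (without boundary markers)."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))
    # P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
    probs = {(prev, w): c / unigram_counts[prev]
             for (prev, w), c in bigram_counts.items()}
    return probs, unigram_counts, bigram_counts
```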

An example
P(wi | wi-1) = c(wi-1, wi) / c(wi-1)
Training corpus:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
From these counts, for example: P(I | <s>) = 2/3, P(am | I) = 2/3, P(Sam | am) = 1/2, P(</s> | Sam) = 1/2.

More examples: Berkeley Restaurant Project sentences
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i'm looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i'm looking for a good place to eat breakfast
when is caffe venezia open during the day

Raw bigram counts
(Table of bigram counts, out of 9222 sentences.)

Raw bigram probabilities
Normalize by unigrams (table of unigram counts); result: table of bigram probabilities.

Bigram estimates of sentence probabilities
P(<s> I want english food </s>) = P(I | <s>) × P(want | I) × P(english | want) × P(food | english) × P(</s> | food)
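To make the arithmetic above reproducible, here is a small self-contained Python sketch (my own illustration, not from the slides) that builds the counts from the three-sentence corpus and multiplies bigram probabilities to score a padded sentence.

```python
from collections import Counter

corpus = [["I", "am", "Sam"],
          ["Sam", "I", "am"],
          ["I", "do", "not", "like", "green", "eggs", "and", "ham"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p(word, prev):
    """Unsmoothed MLE bigram probability P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p("I", "<s>"), p("am", "I"))   # 2/3 and 2/3

def sentence_prob(sent):
    """Probability of a sentence as a product of its bigram probabilities."""
    tokens = ["<s>"] + sent + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= p(word, prev)
    return prob

print(sentence_prob(["I", "am", "Sam"]))   # 2/3 * 2/3 * 1/2 * 1/2
```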

Knowledge
P(english | want) = .0011
P(chinese | want) = .0065
P(to | want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P(i | <s>) = .25

Practical Issues
We do everything in log space:
- avoids underflow
- (also, adding is faster than multiplying)
log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4

Language Modeling Toolkits
SRILM
Google N-Gram Release, August 2006

Google N-Gram Release
serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234

Today's Outline
- Probabilistic language model and n-grams
- Estimating n-gram probabilities
- Language model evaluation and perplexity
- Generalization and zeros
- Smoothing: add-one
- Interpolation, backoff, and web-scale LMs
- Smoothing: Kneser-Ney Smoothing
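A minimal sketch of the log-space trick: sum log probabilities instead of multiplying raw probabilities, so very small products do not underflow. The probability values below are made-up numbers for illustration only.

```python
import math

probs = [0.25, 0.66, 0.0011, 0.28]   # illustrative per-word probabilities

# A naive product can underflow for long sentences; the log-space sum does not.
log_prob = sum(math.log(p) for p in probs)
print(log_prob)             # log P(sentence)
print(math.exp(log_prob))   # back to a probability if ever needed
```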

Evaluation: How good is our model?
Does our language model prefer good sentences to bad ones?
Does it assign higher probability to "real" or frequently observed sentences than to ungrammatical or rarely observed sentences?
We train the parameters of our model on a training set.
We test the model's performance on data we haven't seen:
- A test set is an unseen dataset that is different from our training set, totally unused.
- An evaluation metric tells us how well our model does on the test set.

Training on the test set
We can't allow test sentences into the training set; we would assign them an artificially high probability when we see them in the test set.
Training on the test set is bad science! (And violates the honor code.)

Extrinsic evaluation of N-gram models
Best evaluation for comparing models A and B:
- Put each model in a task (spelling corrector, speech recognizer, MT system).
- Run the task and get an accuracy for A and for B (how many misspelled words were corrected properly, how many words were translated correctly).
- Compare the accuracy of A and B.

Difficulty of extrinsic evaluation of N-gram models
Extrinsic evaluation is time-consuming; it can take days or weeks.
So sometimes we use an intrinsic evaluation: perplexity.
This is a bad approximation unless the test data looks just like the training data, so it is generally only useful in pilot experiments.
But it is helpful to think about.

Intuition of Perplexity
The Shannon Game: How well can we predict the next word?
- I always order pizza with cheese and ____
- The 33rd President of the US was ____
- I saw a ____
Unigrams are terrible at this game. (Why?)
For example, after "I always order pizza with cheese and": mushrooms 0.1, pepperoni 0.1, anchovies 0.01, ..., fried rice ..., and 1e-100.
A better model of a text is one which assigns a higher probability to the word that actually occurs.

Perplexity
The best language model is one that best predicts an unseen test set (gives the highest P(sentence)).
Perplexity is the inverse probability of the test set, normalized by the number of words:
PP(W) = P(w1 w2 ... wN)^(-1/N), i.e. the N-th root of 1 / P(w1 w2 ... wN)
Chain rule: PP(W) = ( ∏_i 1 / P(wi | w1 ... wi-1) )^(1/N)
For bigrams: PP(W) = ( ∏_i 1 / P(wi | wi-1) )^(1/N)
Minimizing perplexity is the same as maximizing probability.

Perplexity as branching factor
Let's suppose a sentence consisting of random digits. What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
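The formula above is computed stably in log space. Below is a minimal sketch, assuming we already have the per-word conditional probabilities that the model assigns to a test sequence; perplexity is then the exponential of the negative average log probability, which is algebraically the same as the N-th root expression.

```python
import math

def perplexity(word_probs):
    """Perplexity of a test sequence given the model's per-word
    conditional probabilities P(w_i | history). Computed in log space:
    PP = exp(-(1/N) * sum_i log P(w_i | history))."""
    n = len(word_probs)
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / n)

# Illustrative numbers only: a model that assigns 1/4 to every word
print(perplexity([0.25] * 12))   # 4.0
```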

Perplexity as branching factor
Let's suppose a sentence consisting of random digits. What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
PP(W) = ( (1/10)^N )^(-1/N) = 10

Lower perplexity = better model
Training on 38 million words, testing on 1.5 million words of WSJ, the reported perplexities by n-gram order are 962 (unigram), 170 (bigram), and 109 (trigram).

Today's Outline
- Probabilistic language model and n-grams
- Estimating n-gram probabilities
- Language model evaluation and perplexity
- Generalization and zeros
- Smoothing: add-one
- Interpolation, backoff, and web-scale LMs
- Smoothing: Kneser-Ney Smoothing

The perils of overfitting
N-grams only work well for word prediction if the test corpus looks like the training corpus.
In real life, it often doesn't.
We need to train robust models that generalize!
One kind of generalization: zeros! Things that don't ever occur in the training set but do occur in the test set.

Zeros
In the training set, we see:
... denied the allegations
... denied the reports
... denied the claims
... denied the request
But in the test set:
... denied the offer
... denied the loan
P(offer | denied the) = 0
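A quick numerical check of the branching-factor claim, reusing the same log-space perplexity computation: a model that assigns probability 1/10 to every digit has perplexity exactly 10, regardless of sentence length.

```python
import math

def perplexity(word_probs):
    # PP = exp(-(1/N) * sum_i log P(w_i | history))
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))

for n in (5, 20, 100):             # different sentence lengths
    digits = [0.1] * n             # the model assigns P = 1/10 to each digit
    print(n, perplexity(digits))   # always 10.0, up to float rounding
```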

Zero probability bigrams
Bigrams with zero probability mean that we will assign 0 probability to the test set!
And hence we cannot compute perplexity (we can't divide by 0)!

Today's Outline
- Probabilistic language model and n-grams
- Estimating n-gram probabilities
- Language model evaluation and perplexity
- Generalization and zeros
- Smoothing: add-one
- Interpolation, backoff, and web-scale LMs
- Smoothing: Kneser-Ney Smoothing

The intuition of smoothing (from Dan Klein)
When we have sparse statistics, e.g. counts for P(w | denied the): 3 allegations, 2 reports, 1 claims, 1 request (7 total),
we steal probability mass to generalize better: 2.5 allegations, 1.5 reports, 0.5 claims, 0.5 request, 2 other (7 total).

Add-one estimation
Also called Laplace smoothing.
Pretend we saw each word one more time than we did: just add one to all the counts! (Instead of taking counts away.)
MLE estimate: P_MLE(wi | wi-1) = c(wi-1, wi) / c(wi-1)
Add-1 estimate: P_Add-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)
Why add V?

Berkeley Restaurant Corpus: Laplace smoothed bigram counts
(Table of add-1 smoothed bigram counts.)
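A minimal sketch of the add-1 estimate above, built on plain bigram and unigram counts. Adding V (the vocabulary size) to the denominator is what keeps the smoothed probabilities summing to 1 over the vocabulary. The counts and vocabulary size in the usage example are made up.

```python
from collections import Counter

def laplace_bigram_prob(word, prev, bigram_counts, unigram_counts, vocab_size):
    """P_Add-1(word | prev) = (c(prev, word) + 1) / (c(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

# Toy usage with made-up counts
unigrams = Counter({"denied": 10, "the": 30})
bigrams = Counter({("denied", "the"): 7})
V = 1000   # assumed vocabulary size
print(laplace_bigram_prob("the", "denied", bigrams, unigrams, V))    # (7+1)/(10+1000)
print(laplace_bigram_prob("offer", "denied", bigrams, unigrams, V))  # (0+1)/(10+1000), no longer zero
```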

Laplace-smoothed bigrams
(Table of Laplace-smoothed bigram probabilities and reconstituted counts.)

Add-1 estimation is a blunt instrument
So add-1 isn't used for N-grams: we'll see better methods. (Nowadays neural LMs have become popular; we will discuss them later.)
But add-1 is used to smooth other NLP models:
- for text classification (coming soon!)
- in domains where the number of zeros isn't so huge.

Today's Outline
- Probabilistic language model and n-grams
- Estimating n-gram probabilities
- Language model evaluation and perplexity
- Generalization and zeros
- Smoothing: add-one
- Interpolation, backoff, and web-scale LMs
- Smoothing: Kneser-Ney Smoothing

Backoff and Interpolation
Sometimes it helps to use less context: condition on less context for contexts you haven't learned much about.
Backoff: use the trigram if you have good evidence, otherwise the bigram, otherwise the unigram.
Interpolation: mix unigram, bigram, and trigram.
In general, interpolation works better.

Linear Interpolation
Simple interpolation:
P̂(wn | wn-2 wn-1) = λ1 P(wn | wn-2 wn-1) + λ2 P(wn | wn-1) + λ3 P(wn), with ∑_i λi = 1

How to set the lambdas?
Use a held-out corpus (split the data into training data, held-out data, and test data).
Choose the λs to maximize the probability of the held-out data:
- Fix the N-gram probabilities (on the training data).
- Then search for the λs that give the largest probability to the held-out set:
log P(w1 ... wn | M(λ1 ... λk)) = ∑_i log P_{M(λ1 ... λk)}(wi | wi-1)
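A minimal sketch of simple linear interpolation with fixed lambdas (the lambda values below are placeholders, not tuned on held-out data). It assumes MLE unigram, bigram, and trigram estimates are available as callables; the hard-coded probability functions in the usage example are stand-ins.

```python
def interpolated_prob(word, prev2, prev1, p_uni, p_bi, p_tri,
                      lambdas=(0.2, 0.3, 0.5)):
    """Simple linear interpolation:
    P_hat(w_n | w_{n-2} w_{n-1}) =
        l_tri * P(w_n | w_{n-2} w_{n-1}) + l_bi * P(w_n | w_{n-1}) + l_uni * P(w_n),
    with the lambdas summing to 1 (placeholder values here)."""
    l_uni, l_bi, l_tri = lambdas
    return (l_tri * p_tri(word, prev2, prev1)
            + l_bi * p_bi(word, prev1)
            + l_uni * p_uni(word))

# Toy usage with hard-coded estimates standing in for real MLE models
p_uni = lambda w: {"food": 0.01}.get(w, 0.001)
p_bi = lambda w, prev1: {("want", "food"): 0.3}.get((prev1, w), 0.0)
p_tri = lambda w, prev2, prev1: 0.0   # unseen trigram
print(interpolated_prob("food", "i", "want", p_uni, p_bi, p_tri))
```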

A Common Method: Grid Search
Take a list of possible values, e.g. [0.1, 0.2, ..., 0.9].
Try all combinations.

Linear Interpolation
Simple interpolation: P̂(wn | wn-2 wn-1) = λ1 P(wn | wn-2 wn-1) + λ2 P(wn | wn-1) + λ3 P(wn), with ∑_i λi = 1
Lambdas conditional on context: each λ can depend on the context wn-2 wn-1.
How to estimate them?

Unknown words: Open versus closed vocabulary tasks
If we know all the words in advance, the vocabulary V is fixed: a closed vocabulary task.
Often we don't know this: Out Of Vocabulary (OOV) words, an open vocabulary task.
Instead: create an unknown word token <UNK>.
Training of <UNK> probabilities:
- Create a fixed lexicon L of size V (e.g. by selecting high-frequency words).
- At the text normalization phase, any training word not in L is changed to <UNK>.
- Now we train its probabilities like a normal word.
At test time: use the <UNK> probabilities for any word not seen in training.

Smoothing for Web-scale N-grams
"Stupid backoff" (Brants et al.): no discounting, just use relative frequencies.
S(wi | wi-k+1 ... wi-1) = count(wi-k+1 ... wi) / count(wi-k+1 ... wi-1) if count(wi-k+1 ... wi) > 0, otherwise 0.4 × S(wi | wi-k+2 ... wi-1)
Back off until the unigram level: S(wi) = count(wi) / N.
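A minimal recursive sketch of stupid backoff as described above: no discounting, just relative frequencies, backing off with a constant factor (0.4 here, following the description) until the unigram relative frequency. Counts are assumed to be stored in a dictionary keyed by n-gram tuples; the counts in the usage example are made up.

```python
def stupid_backoff(ngram, counts, total_tokens, alpha=0.4):
    """Score S(w_i | w_{i-k+1} ... w_{i-1}) for ngram = (w_{i-k+1}, ..., w_i).
    Note: S is a score, not a true probability (it is not normalized)."""
    if len(ngram) == 1:
        return counts.get(ngram, 0) / total_tokens      # unigram relative frequency
    context = ngram[:-1]
    if counts.get(ngram, 0) > 0 and counts.get(context, 0) > 0:
        return counts[ngram] / counts[context]
    # Back off: drop the earliest context word, apply the constant factor.
    return alpha * stupid_backoff(ngram[1:], counts, total_tokens, alpha)

# Toy usage with made-up counts
counts = {("i", "want", "food"): 0, ("want", "food"): 3, ("want",): 10, ("food",): 5}
print(stupid_backoff(("i", "want", "food"), counts, total_tokens=1000))  # 0.4 * 3/10
```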

Today's Outline
- Probabilistic language model and n-grams
- Estimating n-gram probabilities
- Language model evaluation and perplexity
- Generalization and zeros
- Smoothing: add-one
- Interpolation, backoff, and web-scale LMs
- Smoothing: Kneser-Ney Smoothing

Absolute discounting: just subtract a little from each count
Suppose we wanted to subtract a little from a count of 4 to save probability mass for the zeros.
How much to subtract?
Church and Gale (1991)'s clever idea: divide up 22 million words of AP Newswire into a training set and a held-out set; for each bigram in the training set, see the actual count in the held-out set.
(Table: bigram count in training vs. average bigram count in the held-out set.)
It sure looks like c* = (c - .75).

Absolute Discounting Interpolation
Save ourselves some time and just subtract 0.75 (or some d)!
P_AbsoluteDiscounting(wi | wi-1) = (c(wi-1, wi) - d) / c(wi-1)  [discounted bigram]  + λ(wi-1) P(wi)  [interpolation weight × unigram]
But should we really just use the regular unigram P(wi)?

Kneser-Ney Smoothing I
Better estimate for the probabilities of lower-order unigrams!
Shannon game: "I can't see without my reading ____." "Francisco" is more common than "glasses", but "Francisco" always follows "San".
The unigram is useful exactly when we haven't seen this bigram!
Instead of P(w) ("how likely is w"), use P_continuation(w): how likely is w to appear as a novel continuation?
For each word, count the number of unique bigrams it completes; every unique bigram was a novel continuation the first time it was seen.
P_CONTINUATION(w) ∝ |{wi-1 : c(wi-1, w) > 0}|  (the number of unique bigrams w completes)

Kneser-Ney Smoothing II
How many times does w appear as a novel continuation (unique bigrams), normalized by the total number of word bigram types (all unique bigrams in the corpus):
P_CONTINUATION(w) = |{wi-1 : c(wi-1, w) > 0}| / |{(wj-1, wj) : c(wj-1, wj) > 0}|

Kneser-Ney Smoothing III
Alternative metaphor: the number of unique words seen to precede w, |{wi-1 : c(wi-1, w) > 0}|, normalized by the number of words preceding all words:
P_CONTINUATION(w) = |{wi-1 : c(wi-1, w) > 0}| / ∑_{w'} |{w'i-1 : c(w'i-1, w') > 0}|
A frequent word ("Francisco") occurring in only one context ("San") will have a low continuation probability.
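A minimal sketch of the continuation probability defined above: for each word, count the distinct left contexts it appears with, and normalize by the total number of distinct bigram types. The bigram table in the usage example is made up to mirror the San Francisco / glasses example.

```python
from collections import defaultdict

def continuation_probs(bigram_counts):
    """P_CONTINUATION(w) = |{w_prev : c(w_prev, w) > 0}| / (number of bigram types).
    `bigram_counts` maps (w_prev, w) pairs to counts."""
    left_contexts = defaultdict(set)
    for (w_prev, w), c in bigram_counts.items():
        if c > 0:
            left_contexts[w].add(w_prev)
    total_bigram_types = sum(len(ctx) for ctx in left_contexts.values())
    return {w: len(ctx) / total_bigram_types for w, ctx in left_contexts.items()}

# Toy usage: "Francisco" follows only "San"; "glasses" follows several words
bigrams = {("San", "Francisco"): 50, ("reading", "glasses"): 2,
           ("my", "glasses"): 1, ("sun", "glasses"): 3}
print(continuation_probs(bigrams))   # glasses gets 3/4, Francisco only 1/4
```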

Kneser-Ney Smoothing IV
P_KN(wi | wi-1) = max(c(wi-1, wi) - d, 0) / c(wi-1) + λ(wi-1) P_CONTINUATION(wi)
λ is a normalizing constant: the probability mass we've discounted,
λ(wi-1) = (d / c(wi-1)) × |{w : c(wi-1, w) > 0}|
Here d / c(wi-1) is the normalized discount, and |{w : c(wi-1, w) > 0}| is the number of word types that can follow wi-1 (= the number of word types we discounted = the number of times we applied the normalized discount).

Kneser-Ney Smoothing: Recursive formulation
P_KN(wi | wi-n+1 ... wi-1) = max(c_KN(wi-n+1 ... wi) - d, 0) / c_KN(wi-n+1 ... wi-1) + λ(wi-n+1 ... wi-1) P_KN(wi | wi-n+2 ... wi-1)
where c_KN(•) = count(•) for the highest order, and continuation count(•) for lower orders.
Continuation count = the number of unique single-word contexts for •.

Language Modeling Homework
- Probabilistic language model and n-grams
- Estimating n-gram probabilities
- Language model evaluation and perplexity
- Generalization and zeros
- Smoothing: add-one
- Interpolation, backoff, and web-scale LMs
- Smoothing: Kneser-Ney Smoothing
Reading: J&M ch1 and ch
Start thinking about the course project and find a team.
Project proposal due Oct 1st. The format of the proposal will be posted on Piazza.
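Putting the pieces above together, here is a minimal sketch of the interpolated Kneser-Ney bigram estimate (the bigram level only, not the full recursive formulation): absolute discounting of the bigram count, with the left-over mass λ(wi-1) spread according to the continuation probability. The discount d = 0.75 follows the Church and Gale observation quoted earlier; the bigram table in the usage example is made up.

```python
from collections import defaultdict

def kneser_ney_bigram(word, prev, bigram_counts, d=0.75):
    """P_KN(word | prev) = max(c(prev, word) - d, 0) / c(prev)
                           + lambda(prev) * P_CONTINUATION(word)."""
    # Counts and type statistics derived from the bigram table
    c_prev = sum(c for (p, _), c in bigram_counts.items() if p == prev)
    followers = {w for (p, w), c in bigram_counts.items() if p == prev and c > 0}
    left_contexts = defaultdict(set)
    for (p, w), c in bigram_counts.items():
        if c > 0:
            left_contexts[w].add(p)
    total_bigram_types = sum(len(s) for s in left_contexts.values())

    discounted = max(bigram_counts.get((prev, word), 0) - d, 0) / c_prev
    lam = (d / c_prev) * len(followers)                      # discounted probability mass
    p_cont = len(left_contexts[word]) / total_bigram_types   # continuation probability
    return discounted + lam * p_cont

bigrams = {("San", "Francisco"): 50, ("reading", "glasses"): 2,
           ("my", "glasses"): 1, ("sun", "glasses"): 3}
print(kneser_ney_bigram("glasses", "reading", bigrams))
print(kneser_ney_bigram("Francisco", "reading", bigrams))   # low, despite the frequent unigram
```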
