Language Modeling. Mausam. (Based on slides of Michael Collins, Dan Jurafsky, Dan Klein, Chris Manning, Luke Zettlemoyer)


1 Language Modeling. Mausam. (Based on slides of Michael Collins, Dan Jurafsky, Dan Klein, Chris Manning, Luke Zettlemoyer)

2 Outline: Motivation; Task Definition; Probability Estimation; Evaluation; Smoothing (Simple Interpolation and Back-off; Advanced Algorithms)

3 The Language Modeling Problem. Setup: Assume a finite vocabulary of words. We can construct an infinite set of strings. Data: given a training set of example sentences. Problem: estimate a probability distribution.

4 The Noisy-Channel Model. We want to predict a sentence given acoustics. The noisy-channel approach: Acoustic model: distributions over acoustic waves given a sentence. Language model: distributions over sequences of words (sentences).

5 Acoustically Scored Hypotheses: the station signs are in deep in english / the stations signs are in deep in english / the station signs are in deep into english / the station 's signs are in deep in english / the station signs are in deep in the english (-474) / the station signs are indeed in english / the station 's signs are indeed in english / the station signs are indians in english / the station signs are indian in english / the stations signs are indians in english / the stations signs are indians and english (-485)

6 ASR System Components: Language Model + Acoustic Model. Source w with P(w); the channel produces acoustics a with P(a | w); the decoder sees the observed a and picks the best w: w* = argmax_w P(w | a) = argmax_w P(a | w) P(w)

7 Translation: Codebreaking? "Also knowing nothing official about, but having guessed and inferred considerable about, the powerful new mechanized methods in cryptography methods which I believe succeed even when one does not know what language has been coded one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode." Warren Weaver (1955:18, quoting a letter he wrote in 1947)

8 MT System Components: Language Model + Translation Model. Source e with P(e); the channel produces f with P(f | e); the decoder sees the observed f and picks the best e: e* = argmax_e P(e | f) = argmax_e P(f | e) P(e)

9 Probabilistic Language Models: Other Applications. Why assign a probability to a sentence? Machine Translation: P(high winds tonite) > P(large winds tonite). Speech Recognition: P(I saw a van) >> P(eyes awe of an). Spell Correction: "The office is about fifteen minuets from my house"; P(about fifteen minutes from) > P(about fifteen minuets from). Plus summarization, question-answering, etc., etc.!!

10 Outline: Motivation; Task Definition; Probability Estimation; Evaluation; Smoothing (Simple Interpolation and Back-off; Advanced Algorithms)

11 Probabilistic Language Modeling. Goal: compute the probability of a sentence or sequence of words: P(W) = P(w1, w2, w3, w4, w5 ... wn). Related task: probability of an upcoming word: P(w5 | w1, w2, w3, w4). A model that computes either of these, P(W) or P(wn | w1, w2 ... wn-1), is called a language model. Better: the grammar. But "language model" or "LM" is standard.

12 How to compute P(W). How to compute this joint probability: P(its, water, is, so, transparent, that). The chain rule: P(its water is so transparent) = P(its) P(water | its) P(is | its water) P(so | its water is) P(transparent | its water is so)
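
The chain-rule decomposition is easy to mirror in code. A tiny sketch (mine, not the lecture's), with a placeholder cond_prob function standing in for whatever conditional estimates a model provides:

```python
def sentence_prob(tokens, cond_prob):
    """P(w1..wn) = prod_i P(wi | w1..wi-1), via the chain rule."""
    prob = 1.0
    for i, w in enumerate(tokens):
        prob *= cond_prob(w, tokens[:i])   # P(w_i | all previous words)
    return prob

# Example with a (useless) model that ignores history entirely:
print(sentence_prob("its water is so transparent".split(),
                    lambda w, hist: 0.1))   # 0.1**5 = 1e-05
```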

13 How to estimate these probabilities. Could we just count and divide? P(the | its water is so transparent that) = Count(its water is so transparent that the) / Count(its water is so transparent that). No! Too many possible sentences! We'll never see enough data for estimating these.

14 Markov Assumption. Simplifying assumption (Andrei Markov): P(the | its water is so transparent that) ≈ P(the | that). Or maybe: P(the | its water is so transparent that) ≈ P(the | transparent that)

15 Markov Assumption. P(w1 w2 ... wn) ≈ ∏_i P(wi | wi-k ... wi-1). In other words, we approximate each component in the product: P(wi | w1 w2 ... wi-1) ≈ P(wi | wi-k ... wi-1)

16 Simplest Case: Unigram Models. Simplest case: unigrams: P(w1 w2 ... wn) ≈ ∏_i P(wi). Generative process: pick a word, pick a word, ... until you pick </s>. Graphical model: w1 w2 ... wn-1 </s>. Examples: fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass; thrift, did, eighty, said, hard, 'm, july, bullish; that, or, limited, the. Big problem with unigrams: P(the the the the) >> P(I like ice cream)!

17 Bigram Models. Condition on the previous single word: P(w1 w2 ... wn) ≈ ∏_i P(wi | wi-1). Generative process: pick <s>, pick a word conditioned on the previous one, repeat until you pick </s>. Graphical model: <s> w1 w2 ... wn-1 </s>. Examples: texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen; outside, new, car, parking, lot, of, the, agreement, reached; this, would, be, a, record, november

18 N-Gram Models. We can extend to trigrams, 4-grams, 5-grams. N-gram models are (weighted) regular languages. Many linguistic arguments that language isn't regular: long-distance effects ("The computer which I had just put into the machine room on the fifth floor."), recursive structure. We often get away with n-gram models. PCFG LM (later): [This, quarter, is, surprisingly, independent, attack, paid, off, the, risk, involving, IRS, leaders, and, transportation, prices, .] [It, could, be, announced, sometime, .] [Mr., Toseland, believes, the, average, defense, economy, is, drafted, from, slightly, more, than, 2, stocks, .]

19 Proof: Unigram LMs are Well-Defined Distributions*. Simplest case: unigrams. Generative process: pick a word, pick a word, ... until you pick </s>. For all strings x (of any length): p(x) ≥ 0. Claim: the sum over strings of all lengths is 1: Σ_x p(x) = 1. Step 1: decompose the sum over length (p(n) is the prob. of a sentence with n words). Step 2: For each length, the inner sum is 1. Step 3: For stopping prob. p_s = P(</s>), we get a geometric series that sums to 1.

20 Outline: Motivation; Task Definition; Probability Estimation; Evaluation; Smoothing (Simple Interpolation and Back-off; Advanced Algorithms)

21 Estimating bigram probabilities. The Maximum Likelihood Estimate: P(wi | wi-1) = count(wi-1, wi) / count(wi-1) = c(wi-1, wi) / c(wi-1)

22 An example. P(wi | wi-1) = c(wi-1, wi) / c(wi-1). Corpus: <s> I am Sam </s> / <s> Sam I am </s> / <s> I do not like green eggs and ham </s>
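
To make the maximum likelihood estimate concrete, here is a minimal Python sketch (mine, not the slides') that computes bigram MLE probabilities from the three toy sentences above; the variable names are illustrative.

```python
from collections import Counter

# Toy corpus from the slide, with sentence boundary markers.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in corpus:
    tokens = sent.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(w, prev):
    """P(w | prev) = c(prev, w) / c(prev)."""
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_mle("I", "<s>"))   # 2/3
print(p_mle("Sam", "am"))  # 1/2
print(p_mle("do", "I"))    # 1/3
```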

23 More examples: Berkeley Restaurant Project sentences. can you tell me about any good cantonese restaurants close by / mid priced thai food is what i'm looking for / tell me about chez panisse / can you give me a listing of the kinds of food that are available / i'm looking for a good place to eat breakfast / when is caffe venezia open during the day

24 Raw bigram counts (out of 9222 sentences)

25 Raw bigram probabilities. Normalize the bigram counts by the unigram counts to get the result (table of bigram probabilities).

26 Bigram estimates of sentence probabilities. P(<s> I want english food </s>) = P(I | <s>) P(want | I) P(english | want) P(food | english) P(</s> | food) = .000031

27 What kinds of knowledge? P(english | want) = .0011, P(chinese | want) = .0065, P(to | want) = .66, P(eat | to) = .28, P(food | to) = 0, P(want | spend) = 0, P(i | <s>) = .25. Some of these reflect world knowledge, some grammatical knowledge.

28 Practical Issues. We do everything in log space. Avoid underflow (also, adding is faster than multiplying): log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
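
A small illustration (mine, not the slides') of why log space matters: multiplying many small probabilities underflows to 0.0 in floating point, while summing their logs stays well-behaved.

```python
import math

probs = [1e-5] * 100          # 100 word probabilities of 1e-5 each

product = 1.0
for p in probs:
    product *= p              # 1e-500 underflows to 0.0 in a double

log_sum = sum(math.log(p) for p in probs)   # about -1151.3, no underflow

print(product)   # 0.0
print(log_sum)   # -1151.29...
```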

29 Language Modeling Toolkits: SRILM

30 Google N-Gram Release, August 2006

31 Google N-Gram Release: serve as the incoming 92 / serve as the incubator 99 / serve as the independent 794 / serve as the index 223 / serve as the indication 72 / serve as the indicator 120 / serve as the indicators 45 / serve as the indispensable 111 / serve as the indispensible 40 / serve as the individual 234

32 Google Book N-grams

33 When do Noun Phrases start appearing regularly in text? The space of noun phrases: unlinkable non-entity noun phrases vs. unlinkable entities. Hypothesis: in a set of unlinkable noun phrases, phrases going back 100s of years are more often non-entity. Examples with the year they start appearing regularly (source: Google Books ngrams): South America 1762, strange things 1794, Serious change, Abraham Lincoln 1849, Typical ones, apple juice 1865, Computer Science 1943, Bradley Hospital 1976, wheatgrass juice 1983, Hurricane Rita, FY.

34 Temporally-Aware features. Feature: usage frequency in text over time (Google Books n-grams). Example non-entity ("Some people"): slope of the best-fit line over year. Example entity ("Starbucks Coffee"): continuous usage since <year>.

35 Outline: Motivation; Task Definition; Probability Estimation; Evaluation; Smoothing (Simple Interpolation and Back-off; Advanced Algorithms)

36 Evaluation: How good is our model? Does our language model prefer good sentences to bad ones? Does it assign higher probability to real or frequently observed sentences than to ungrammatical or rarely observed sentences? We train the parameters of our model on a training set. We test the model's performance on data we haven't seen. A test set is an unseen dataset that is different from our training set, totally unused. An evaluation metric tells us how well our model does on the test set.

37 Extrinsic evaluation of N-gram models. Best evaluation for comparing models A and B: put each model in a task (spelling corrector, speech recognizer, MT system), run the task, get an accuracy for A and for B (how many misspelled words corrected properly, how many words translated correctly), and compare the accuracy of A and B.

38 Difficulty of extrinsic (in-vivo) evaluation of N-gram models. Extrinsic evaluation is time-consuming; it can take days or weeks. So we sometimes use an intrinsic evaluation: perplexity. It is a bad approximation unless the test data looks just like the training data, so it is generally only useful in pilot experiments. But it is helpful to think about.

39 Intuition of Perplexity. The Shannon Game: How well can we predict the next word? "I always order pizza with cheese and ___", "The 33rd President of the US was ___", "I saw a ___". Unigrams are terrible at this game. Why? A better model of a text is one which assigns a higher probability to the word that actually occurs (e.g. mushrooms 0.1, pepperoni 0.1, anchovies 0.01, ..., fried rice 0.0001, ..., and 1e-100).

40 Perplexity. The best language model is one that best predicts an unseen test set (gives the highest P(sentence)). Perplexity is the inverse probability of the test set, normalized by the number of words: PP(W) = P(w1 w2 ... wN)^(-1/N). By the chain rule: PP(W) = (∏_i 1 / P(wi | w1 ... wi-1))^(1/N). For bigrams: PP(W) = (∏_i 1 / P(wi | wi-1))^(1/N). Minimizing perplexity is the same as maximizing probability.
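
The perplexity definition above translates directly into code. A minimal sketch (mine), assuming a bigram probability function like the MLE one sketched earlier and a test sequence with no zero-probability bigrams:

```python
import math

def perplexity(tokens, bigram_prob):
    """PP(W) = P(w1..wN)^(-1/N), computed in log space for stability."""
    log_prob = 0.0
    for prev, w in zip(tokens, tokens[1:]):
        log_prob += math.log(bigram_prob(w, prev))
    n = len(tokens) - 1                      # number of predicted words
    return math.exp(-log_prob / n)

# With a uniform model over 10 digits, the perplexity is exactly 10:
digits = "<s> 3 7 1 4 9 2".split()
print(perplexity(digits, lambda w, prev: 1 / 10))   # 10.0
```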

41 The Shannon Game intuition for perplexity (from Josh Goodman). How hard is the task of recognizing digits 0,1,2,3,4,5,6,7,8,9? Perplexity 10. How hard is recognizing 30,000 names at Microsoft? Perplexity = 30,000. If a system has to recognize Operator (1 in 4), Sales (1 in 4), Technical Support (1 in 4), or one of 30,000 names (1 in 120,000 each), the perplexity is 53. Perplexity is the weighted equivalent branching factor.

42 Perplexity as branching factor. Let's suppose a sentence consisting of random digits. What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit? PP(W) = ((1/10)^N)^(-1/N) = 10.

43 Another form of Perplexity. Lower is better! Example: a uniform model's perplexity is N. Interpretation: effective vocabulary size, accounting for statistical regularities. Typical values for newspaper text: Uniform: 20,000; Unigram: 1000s; Bigram and Trigram: progressively lower. Important note: it's easy to get bogus perplexities by having bogus probabilities that sum to more than one over their event spaces. Be careful!

44 Lower perplexity = better model. Training: 38 million words; test: 1.5 million words (WSJ). Perplexity by N-gram order: Unigram 962, Bigram 170, Trigram 109.

45 Outline: Motivation; Task Definition; Probability Estimation; Evaluation; Smoothing (Simple Interpolation and Back-off; Advanced Algorithms)

46 The Shannon Visualization Method. Choose a random bigram (<s>, w) according to its probability. Now choose a random bigram (w, x) according to its probability. And so on until we choose </s>. Then string the words together: <s> I / I want / want to / to eat / eat Chinese / Chinese food / food </s> → I want to eat Chinese food
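
A rough sketch (mine) of the Shannon visualization method: sample one word at a time from a bigram distribution until </s> is drawn. The bigram_probs dictionary here is a made-up toy model, not the Berkeley Restaurant data.

```python
import random

# Toy bigram model: P(next | current), made up for illustration.
bigram_probs = {
    "<s>":     {"I": 1.0},
    "I":       {"want": 1.0},
    "want":    {"to": 0.5, "Chinese": 0.5},
    "to":      {"eat": 1.0},
    "eat":     {"Chinese": 1.0},
    "Chinese": {"food": 1.0},
    "food":    {"</s>": 1.0},
}

def generate(max_len=20):
    word, sentence = "<s>", []
    for _ in range(max_len):
        dist = bigram_probs[word]
        word = random.choices(list(dist), weights=dist.values())[0]
        if word == "</s>":
            break
        sentence.append(word)
    return " ".join(sentence)

print(generate())   # e.g. "I want to eat Chinese food"
```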

47 Approximating Shakespeare

48 Shakespeare as corpus. N = 884,647 tokens, V = 29,066. Shakespeare produced 300,000 bigram types out of V^2 = 844 million possible bigrams. So 99.96% of the possible bigrams were never seen (have zero entries in the table). Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare.

49 The Wall Street Journal is not Shakespeare (no offense)

50 The perils of overfitting. N-grams only work well for word prediction if the test corpus looks like the training corpus. In real life, it often doesn't. We need to train robust models that generalize! One kind of generalization: zeros! Things that don't ever occur in the training set but occur in the test set.

51 Zeros. Training set: ...denied the allegations, ...denied the reports, ...denied the claims, ...denied the request. Test set: ...denied the offer, ...denied the loan. P(offer | denied the) = 0

52 Zero probability bigrams. Bigrams with zero probability mean that we'll assign 0 probability to the test set! And hence we cannot compute perplexity (can't divide by 0!).

53 The intuition of smoothing. When we have sparse statistics, e.g. P(w | denied the): 3 allegations, 2 reports, 1 claims, 1 request (7 total), we steal probability mass to generalize better: P(w | denied the): 2.5 allegations, 1.5 reports, 0.5 claims, 0.5 request, 2 other (7 total). (Histogram over: allegations, reports, claims, request, attack, man, outcome.)

54 Add-one estimation. Also called Laplace smoothing. Pretend we saw each word one more time than we did; just add one to all the counts! MLE estimate: P_MLE(wi | wi-1) = c(wi-1, wi) / c(wi-1). Add-1 estimate: P_Add-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)
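
A minimal sketch of the Add-1 estimate, reusing counters like those in the earlier MLE example; the counter contents and names below are illustrative assumptions, not the slides' code.

```python
from collections import Counter

def p_add_one(w, prev, bigram_counts, unigram_counts, vocab_size):
    """P_Add-1(w | prev) = (c(prev, w) + 1) / (c(prev) + V)."""
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + vocab_size)

# Tiny example: "the offer" is unseen but no longer gets zero probability.
unigram_counts = Counter({"the": 7})
bigram_counts = Counter({("the", "allegations"): 3, ("the", "reports"): 2,
                         ("the", "claims"): 1, ("the", "request"): 1})
V = 1000
print(p_add_one("offer", "the", bigram_counts, unigram_counts, V))        # 1/1007
print(p_add_one("allegations", "the", bigram_counts, unigram_counts, V))  # 4/1007
```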

55 Berkeley Restaurant Corpus: Laplace-smoothed bigram counts

56 Laplace-smoothed bigrams

57 Reconstituted counts

58 Compare with raw bigram counts

59 More general formulations: Add-k. P_Add-k(wi | wi-1) = (c(wi-1, wi) + k) / (c(wi-1) + kV). Equivalently, with m = kV: P_Add-k(wi | wi-1) = (c(wi-1, wi) + m(1/V)) / (c(wi-1) + m)

60 Outline: Motivation; Task Definition; Probability Estimation; Evaluation; Smoothing (Simple Interpolation and Back-off; Advanced Algorithms)

61 Backoff and Interpolation. Sometimes it helps to use less context: condition on less context for contexts you haven't learned much about. Backoff: use the trigram if you have good evidence, otherwise the bigram, otherwise the unigram. Interpolation: mix unigram, bigram, and trigram. Interpolation often works better.

62 Backoff. Divide the words into seen and unseen. Backoff: P_BO(wi | wi-1) = c(wi-1, wi) / c(wi-1) if wi was seen after wi-1; = α P(wi) otherwise. Problem? Not a probability distribution.

63 Katz Backoff. Divide the words into seen and unseen: A(wi-1) = {v : c(wi-1, v) > k}, B(wi-1) = {v : c(wi-1, v) ≤ k}. Backoff: P_BO(wi | wi-1) = P*(wi | wi-1) if wi ∈ A(wi-1); = α(wi-1) P_ML(wi) / Σ_{w ∈ B(wi-1)} P_ML(w) if wi ∈ B(wi-1), where P*(wi | wi-1) = c*(wi-1, wi) / c(wi-1) is a discounted estimate.

64 Linear Interpolation. Simple interpolation: P_interp(wi | wi-2 wi-1) = λ1 P(wi | wi-2 wi-1) + λ2 P(wi | wi-1) + λ3 P(wi), with Σ_j λj = 1. Lambdas conditional on context: λj = λj(wi-2, wi-1).

65 How to set the lambdas? Use a held-out corpus: Training Data | Held-Out Data | Test Data. Choose the λs to maximize the probability of the held-out data: fix the N-gram probabilities on the training data, then search for the λs that give the largest probability to the held-out set: log P(w1 ... wn | M(λ1 ... λk)) = Σ_i log P_{M(λ1 ... λk)}(wi | wi-1)
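
A small sketch (my own, with made-up component models) of simple linear interpolation of a bigram and a unigram model, plus a grid search for the λ pair that maximizes held-out log-probability:

```python
import math

def interp_prob(w, prev, p_bigram, p_unigram, lambdas):
    """P_interp(w | prev) = l1 * P(w | prev) + l2 * P(w), with l1 + l2 = 1."""
    l1, l2 = lambdas
    return l1 * p_bigram(w, prev) + l2 * p_unigram(w)

def heldout_logprob(pairs, p_bigram, p_unigram, lambdas):
    """Log-probability of held-out (prev, w) bigram pairs under the mixture."""
    return sum(math.log(interp_prob(w, prev, p_bigram, p_unigram, lambdas))
               for prev, w in pairs)

def best_lambdas(pairs, p_bigram, p_unigram, step=0.1):
    """Grid search over l1 (with l2 = 1 - l1) to maximize held-out probability."""
    candidates = [(i * step, 1 - i * step) for i in range(1, 10)]
    return max(candidates,
               key=lambda ls: heldout_logprob(pairs, p_bigram, p_unigram, ls))

# Toy demo: the bigram model fits the held-out pairs well, so a large l1 wins.
p_bigram = lambda w, prev: 0.5 if (prev, w) == ("I", "want") else 0.01
p_unigram = lambda w: 0.001
print(best_lambdas([("I", "want")] * 5, p_bigram, p_unigram))   # l1 close to 0.9
```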

66 Add-k + Interpolation. P_Add-k(wi | wi-1) = (c(wi-1, wi) + m(1/V)) / (c(wi-1) + m). Unigram prior: P_UnigramPrior(wi | wi-1) = (c(wi-1, wi) + m P(wi)) / (c(wi-1) + m)

67 Unknown words: Open vs. closed vocabulary tasks. If we know all the words in advance: the vocabulary V is fixed (closed vocabulary task). Often we don't know this: Out Of Vocabulary = OOV words (open vocabulary task). Instead: create an unknown word token <UNK>. Training of <UNK> probabilities: create a fixed lexicon L of size V; at the text normalization phase, any training word not in L is changed to <UNK>; now we train its probabilities like a normal word. At decoding time: if text input, use <UNK> probabilities for any word not in training.
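
A hedged sketch of the <UNK> recipe described above: build a fixed lexicon from frequent training words, map everything else to <UNK>, and apply the same mapping at decoding time. The function names and the count threshold are illustrative assumptions.

```python
from collections import Counter

UNK = "<UNK>"

def build_lexicon(train_tokens, min_count=2):
    """Keep words seen at least min_count times; everything else becomes <UNK>."""
    counts = Counter(train_tokens)
    return {w for w, c in counts.items() if c >= min_count}

def normalize(tokens, lexicon):
    return [w if w in lexicon else UNK for w in tokens]

train = "the cat sat on the mat the cat slept".split()
lexicon = build_lexicon(train)                    # {'the', 'cat'}
print(normalize(train, lexicon))                  # unknowns replaced before training
print(normalize("the dog sat".split(), lexicon))  # ['the', '<UNK>', '<UNK>']
```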

68 Huge web-scale n-grams. How to deal with, e.g., the Google N-gram corpus. Pruning: only store N-grams with count > threshold (remove singletons of higher-order n-grams); entropy-based pruning. Efficiency: efficient data structures like tries; Bloom filters (approximate language models); store words as indexes, not strings (use Huffman coding to fit large numbers of words into two bytes); quantize probabilities (4-8 bits instead of an 8-byte float).

69 Smoothing for Web-scale N-grams. Stupid backoff (Brants et al.). No discounting, just use relative frequencies: S(wi | wi-k+1 ... wi-1) = count(wi-k+1 ... wi) / count(wi-k+1 ... wi-1) if count(wi-k+1 ... wi) > 0; otherwise 0.4 · S(wi | wi-k+2 ... wi-1). Base case: S(wi) = count(wi) / N.
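
A recursive sketch of stupid backoff as described above (it produces a score, not a normalized probability). The 0.4 back-off factor and the count-table layout are assumptions for illustration.

```python
def stupid_backoff(ngram, counts, total_words, alpha=0.4):
    """S(w | context) = count(ngram)/count(context) if the n-gram was seen,
    else alpha * S(w | shorter context); unigram base case count(w)/N."""
    if len(ngram) == 1:
        return counts.get(ngram, 0) / total_words
    context = ngram[:-1]
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / counts[context]
    return alpha * stupid_backoff(ngram[1:], counts, total_words, alpha)

# counts maps word tuples of any order to frequencies, e.g.:
counts = {("I", "want", "food"): 0, ("I", "want"): 3, ("want", "food"): 1,
          ("want",): 5, ("food",): 2}
print(stupid_backoff(("I", "want", "food"), counts, total_words=100))  # 0.4 * 1/5
```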

70 Outline: Motivation; Task Definition; Probability Estimation; Evaluation; Smoothing (Simple Interpolation and Back-off; Advanced Algorithms)

71 Advanced smoothing algorithms. Many advanced smoothing algorithms (Good-Turing, Kneser-Ney, Witten-Bell) use the count of things we've seen once to help estimate the count of things we've never seen.

72 Notation: N_c = frequency of frequency c, i.e. the count of things we've seen c times. "Sam I am I am Sam I do not eat": I 3, sam 2, am 2, do 1, not 1, eat 1; so N_1 = 3, N_2 = 2, N_3 = 1.

73 Good-Turing smoothing intuition. You ask people their favorite game (a scenario inspired by Josh Goodman) and find: 10 cricket, 3 tennis, 2 soccer, 1 carrom, 1 chess, 1 rummy = 18 games. How likely is it that the next answer is carrom? 1/18. How likely is it that the next game is new (i.e. staapu or TT)? Let's use our estimate of things-we-saw-once to estimate the new things: 3/18 (because N_1 = 3). Assuming so, how likely is it that the next game is carrom? Must be less than 1/18. How to estimate?

74 Good-Turing calculations. P*_GT(things with zero frequency) = N_1 / N and c* = (c+1) N_{c+1} / N_c. Unseen (staapu or TT): c = 0; MLE p = 0/18 = 0; P*_GT(unseen) = N_1/N = 3/18. Seen once (carrom): c = 1; MLE p = 1/18; c*(carrom) = 2 × N_2/N_1 = 2 × 1/3 = 2/3; P*_GT(carrom) = (2/3)/18 = 1/27.
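
The slide's carrom/staapu arithmetic can be reproduced in a few lines. A sketch (mine) of the basic Good-Turing re-estimate c* = (c+1) N_{c+1} / N_c:

```python
from collections import Counter

# Favorite-game answers from the slide: 18 answers in total.
answers = ["cricket"] * 10 + ["tennis"] * 3 + ["soccer"] * 2 + ["carrom", "chess", "rummy"]
counts = Counter(answers)
N = len(answers)                  # 18
Nc = Counter(counts.values())     # N_1 = 3, N_2 = 1, N_3 = 1, N_10 = 1

def gt_count(c):
    """c* = (c + 1) * N_{c+1} / N_c"""
    return (c + 1) * Nc[c + 1] / Nc[c]

p_unseen = Nc[1] / N              # 3/18
p_carrom = gt_count(1) / N        # (2 * 1/3) / 18 = 1/27
print(p_unseen, p_carrom)
```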

75 Ney et al.'s Good-Turing Intuition. H. Ney, U. Essen, and R. Kneser, 1995. On the estimation of 'small' probabilities by leaving-one-out. IEEE Trans. PAMI 17(12). Held-out words (figure).

76 Ney et al. Good-Turing Intuition, from leave-one-out validation. Take each of the T training words out in turn, giving T training sets of size T-1 with held-out sets of size 1. What fraction of held-out words are unseen in training? N_1/T. What fraction of held-out words are seen c times in training? (c+1) N_{c+1}/T. So in the future we expect (c+1) N_{c+1}/T of the words to be those with training count c. There are N_c words with training count c, so each specific word should occur with probability ((c+1) N_{c+1}/T) / N_c, or expected count c* = (c+1) N_{c+1} / N_c. (Figure: training counts N_1, N_2, N_3, ... vs. held-out counts N_0, N_1, N_2, ...)

77 Good-Turing complications. Problem: what about "the"? (say c = 4417). For small c, N_c > N_{c+1}; for large c, the counts are too jumpy, and zeros wreck the estimates. Simple Good-Turing [Gale and Sampson]: replace the empirical N_c with a best-fit power law once counts get unreliable.

78 Resulting Good-Turing numbers. Numbers from Church and Gale, 22 million words of AP Newswire: c* = (c+1) N_{c+1} / N_c. (Table: count c vs. Good-Turing c*.)

79 Resulting Good-Turing numbers. Numbers from Church and Gale, 22 million words of AP Newswire: c* = (c+1) N_{c+1} / N_c. It sure looks like c* = c - .75. (Table: count c vs. Good-Turing c*.)

80 Absolute Discounting. Save ourselves some time and just subtract 0.75 (or some d): P_AbsoluteDiscounting(wi | wi-1) = (c(wi-1, wi) - d) / c(wi-1) (discounted bigram). Maybe keep a couple of extra values of d for counts 1 and 2. Rarely used by itself.

81 Absolute Discounting with Backoff. Divide the words into seen and unseen. Backoff: P_AD(wi | wi-1) = (c(wi-1, wi) - d) / c(wi-1) if wi was seen after wi-1; = α(wi-1) P(wi) / Σ_{w ∈ B(wi-1)} P(w) otherwise. Can consider hierarchical formulations: a trigram is recursively backed off to the Katz bigram estimate, etc. Can also have multiple count thresholds (instead of just 0 and >0).

82 Absolute Discounting Interpolation. P_AbsoluteDiscounting(wi | wi-1) = (c(wi-1, wi) - d) / c(wi-1) + λ(wi-1) P(wi): a discounted bigram plus an interpolation weight times the unigram. But should we really just use the regular unigram P(w)?

83 Kneser-Ney Smoothing I. Better estimate for the probabilities of lower-order unigrams! Shannon game: "I can't see without my reading ___?" "Francisco" is more common than "glasses", but "Francisco" always follows "San". The unigram is useful exactly when we haven't seen this bigram! Instead of P(w) (how likely is w?), use P_continuation(w): how likely is w to appear as a novel continuation? For each word, count the number of bigram types it completes; every bigram type was a novel continuation the first time it was seen: P_CONTINUATION(w) ∝ |{wi-1 : c(wi-1, w) > 0}|

84 Kneser-Ney Smoothing II. How many times does w appear as a novel continuation: P_CONTINUATION(w) ∝ |{wi-1 : c(wi-1, w) > 0}|, normalized by the total number of word bigram types |{(wj-1, wj) : c(wj-1, wj) > 0}|: P_CONTINUATION(w) = |{wi-1 : c(wi-1, w) > 0}| / |{(wj-1, wj) : c(wj-1, wj) > 0}|

85 Kneser-Ney Smoothing III. Alternative metaphor: the number of word types seen to precede w, |{wi-1 : c(wi-1, w) > 0}|, normalized by the number of word types preceding all words: P_CONTINUATION(w) = |{wi-1 : c(wi-1, w) > 0}| / Σ_{w'} |{w'i-1 : c(w'i-1, w') > 0}|. A frequent word (Francisco) occurring in only one context (San) will have a low continuation probability.

86 Kneser-Ney Smoothing IV. P_KN(wi | wi-1) = max(c(wi-1, wi) - d, 0) / c(wi-1) + λ(wi-1) P_CONTINUATION(wi). λ is a normalizing constant, the probability mass we've discounted: λ(wi-1) = (d / c(wi-1)) · |{w : c(wi-1, w) > 0}|, i.e. the normalized discount times the number of word types that can follow wi-1 (= the number of word types we discounted = the number of times we applied the normalized discount).

87 Kneser-Ney Smoothing: Recursive formulation. P_KN(wi | wi-n+1 ... wi-1) = max(c_KN(wi-n+1 ... wi) - d, 0) / c_KN(wi-n+1 ... wi-1) + λ(wi-n+1 ... wi-1) P_KN(wi | wi-n+2 ... wi-1), where c_KN(·) is the ordinary count for the highest order and the continuation count for lower orders. Continuation count = number of unique single-word contexts for ·.
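
To make interpolated Kneser-Ney concrete for the bigram case, here is a rough sketch (my own, not the lecture's code): discounted bigram counts plus λ(wi-1) times the continuation probability.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, d=0.75):
    """Return P_KN(w | prev) for an interpolated KN bigram model (sketch)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    followers = defaultdict(set)      # word types seen after each context
    preceders = defaultdict(set)      # contexts each word completes
    for prev, w in bigrams:
        followers[prev].add(w)
        preceders[w].add(prev)
    num_bigram_types = len(bigrams)

    def p_continuation(w):
        # |{prev : c(prev, w) > 0}| / |{(prev, w) : c(prev, w) > 0}|
        return len(preceders[w]) / num_bigram_types

    def p_kn(w, prev):
        discounted = max(bigrams[(prev, w)] - d, 0) / unigrams[prev]
        lam = (d / unigrams[prev]) * len(followers[prev])   # discounted mass
        return discounted + lam * p_continuation(w)

    return p_kn

p_kn = kneser_ney_bigram("<s> I want to eat I want food </s>".split())
print(p_kn("want", "I"))   # high: 'want' follows 'I' twice
print(p_kn("food", "I"))   # low, but nonzero via the continuation term
```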

88 What Actually Works? Trigrams and beyond: unigrams and bigrams are generally useless; trigrams are much better when there is enough data; 4- and 5-grams are really useful in MT, but not so much for speech. Discounting: absolute discounting, Good-Turing, held-out estimation, Witten-Bell, etc. See the [Chen+Goodman] reading for tons of graphs. [Graphs from Joshua Goodman]

89 N-gram Smoothing Summary. Add-1 smoothing: OK for text categorization, not for language modeling. The most commonly used method: interpolated Kneser-Ney (modified to three discounts). For very large N-grams like the Web: stupid backoff.

90 Data vs. Method? Having more data is better, but so is using a better estimator. Another issue: N > 3 has huge costs in speech recognizers. (Figure: entropy vs. n-gram order for 100,000 / 1,000,000 / 10,000,000 / all training words, Katz vs. KN.)

91 Tons of Data? Tons of data closes the gap, for extrinsic MT evaluation.

92 Beyond N-Gram LMs. Lots of ideas we won't have time to discuss: Caching models: recent words are more likely to appear again. Trigger models: recent words trigger other words. Topic models. A few recent ideas: Syntactic models: use tree models to capture long-distance syntactic effects [Chelba and Jelinek, 98]. Discriminative models: set n-gram weights to improve final task accuracy rather than fit training-set density [Roark, 05, for ASR; Liang et al., 06, for MT]. Structural zeros: some n-grams are syntactically forbidden, keep estimates at zero [Mohri and Roark, 06]. Bayesian document and IR models [Daume 06].
