N-Grams and Corpus Linguistics

Colin Pope
6 years ago

Lecture #5
September 9

Transition
Up to this point we've mostly been discussing words in isolation.
Now we're switching to sequences of words, and we're going to worry about assigning probabilities to sequences of words.

Who Cares?
Why would you want to assign a probability to a sentence, or why would you want to predict the next word?
Lots of applications.

Real-Word Spelling Errors
Mental confusions:
  their/they're/there
  to/too/two
  weather/whether
  peace/piece
  you're/your
Typos that result in real words:
  "lave" for "have"

Real Word Spelling Errors
Collect a set of common pairs of confusions.
Whenever a member of this set is encountered, compute the probability of the sentence in which it appears.
Substitute the other possibilities and compute the probability of the resulting sentence.
Choose the higher one.

Next Word Prediction
From a NY Times story...
  Stocks ...
  Stocks plunged this ...
  Stocks plunged this morning, despite a cut in interest rates ...
  Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall ...
  Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began ...

  Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last ...
  Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last Tuesday's terrorist attacks.

Human Word Prediction
Clearly, at least some of us have the ability to predict future words in an utterance. How?
  Domain knowledge
  Syntactic knowledge
  Lexical knowledge

Claim
A useful part of the knowledge needed to allow word prediction can be captured using simple statistical techniques.
In particular, we'll rely on the notion of the probability of a sequence (a phrase, a sentence).

Applications
Why do we want to predict a word, given some preceding words?
Rank the likelihood of sequences containing various alternative hypotheses, e.g. for ASR:
  Theatre owners say popcorn/unicorn sales have doubled...
Assess the likelihood/goodness of a sentence, e.g. for text generation or machine translation:
  The doctor recommended a cat scan.
  El doctor recommendó una exploración del gato. (literally, "an examination of the cat")

N-Gram Models of Language
Use the previous N-1 words in a sequence to predict the next word.
Language Model (LM): unigrams, bigrams, trigrams, ...
How do we train these models? Very large corpora.

Counting Words in Corpora
What is a word?
  e.g., are "cat" and "cats" the same word?
  "September" and "Sept"?
  "zero" and "oh"?
  Is _ a word? What about * or ?
  How many words are there in "don't"? In "gonna"?
  In Japanese and Chinese text: how do we identify a word?
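The idea of using the previous N-1 words to predict the next one presupposes that the text has already been cut into N-word windows. A minimal sketch of that step follows; the helper name `ngrams` and the whitespace tokenization are assumptions made for illustration, and whitespace splitting deliberately sidesteps the "what is a word?" questions raised above.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Naive whitespace tokenization: punctuation and contractions are not handled.
tokens = "stocks plunged this morning".split()

print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

A sequence shorter than n simply yields no n-grams.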

Terminology
Sentence: unit of written language
Utterance: unit of spoken language
Word Form: the inflected form that appears in the corpus
Lemma: an abstract form, shared by word forms having the same stem, part of speech, and word sense
Types: number of distinct words in a corpus (vocabulary size)
Tokens: total number of words

Corpora
Corpora are online collections of text and speech:
  Brown Corpus
  Wall Street Journal
  AP news
  Hansards
  DARPA/NIST text/speech corpora (Call Home, ATIS, Switchboard, Broadcast News, TDT, Communicator)
  TRAINS, Radio News

Chain Rule
Recall the definition of conditional probabilities:
  $P(A \mid B) = \frac{P(A \wedge B)}{P(B)}$
Rewriting:
  $P(A \wedge B) = P(A \mid B)\,P(B)$

Example
The big red dog
  P(The) * P(big | the) * P(red | the big) * P(dog | the big red)
Better: P(The | <Beginning of sentence>), written as P(The | <S>)

General Case
The word sequence from position 1 to n is $w_1^n$.
So the probability of a sequence is
  $P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$

Unfortunately
That doesn't help, since it's unlikely we'll ever gather the right statistics for the prefixes.
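The distinction between types and tokens is easy to check on a small text. The sketch below is illustrative only: the function name `type_token_counts` is hypothetical, and lowercasing plus whitespace splitting is one assumed answer to what counts as a word.

```python
from collections import Counter

def type_token_counts(text):
    """Return (number of types, number of tokens) for a whitespace-tokenized text."""
    tokens = text.lower().split()
    counts = Counter(tokens)  # one entry per distinct word form
    return len(counts), len(tokens)

types, tokens = type_token_counts("the big red dog chased the small red cat")
print(types, tokens)
```

Here "the" and "red" each occur twice, so the nine tokens collapse to seven types.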

Markov Assumption
Assume that the entire prefix history isn't necessary.
In other words, an event doesn't depend on all of its history, just a fixed-length near history.

Markov Assumption
So for each component in the product, replace it with the approximation (assuming a prefix of N):
  $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$

N-Grams
The big red dog
  Unigrams: P(dog)
  Bigrams: P(dog | red)
  Trigrams: P(dog | big red)
  Four-grams: P(dog | the big red)
In general, we'll be dealing with P(Word | Some fixed prefix).

Caveat
The formulation P(Word | Some fixed prefix) is not really appropriate in many applications.
It is if we're dealing with real-time speech, where we only have access to prefixes.
But if we're dealing with text, we already have the right and left contexts. There's no a priori reason to stick to left contexts.

Training and Testing
N-gram probabilities come from a training corpus.
  Overly narrow corpus: probabilities don't generalize
  Overly general corpus: probabilities don't reflect task or domain
A separate test corpus is used to evaluate the model, typically using standard metrics.
  Held-out test set; development test set
  Cross validation
  Results tested for statistical significance

A Simple Example
P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese)
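The factorization in the simple example can be produced mechanically: walk through the sentence and keep only the last N-1 words as the context of each word. The following is a sketch under that reading of the Markov assumption; the name `markov_factors` and the single `<start>` padding symbol are assumptions for illustration.

```python
def markov_factors(words, n):
    """List (word, context) pairs, with each context cut to the last n-1 words."""
    padded = ["<start>"] + list(words)
    factors = []
    for i in range(1, len(padded)):
        # Keep at most n-1 preceding words; earlier history is discarded.
        context = tuple(padded[max(0, i - (n - 1)):i])
        factors.append((padded[i], context))
    return factors

for word, context in markov_factors("I want to eat Chinese food".split(), 2):
    print(f"P({word} | {' '.join(context)})")
```

With `n=2` this prints the six bigram terms of the example; with `n=3` each context grows to two words where that much history exists.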

A Bigram Grammar Fragment from BERP
  eat on          .16     eat Thai            .03
  eat some        .06     eat breakfast       .03
  eat lunch       .06     eat in              .02
  eat dinner      .05     eat Chinese         .02
  eat at          .04     eat Mexican         .02
  eat a           .04     eat tomorrow        .01
  eat Indian      .04     eat dessert         .007
  eat today       .03     eat British         .001

  <start> I       .25     want some           .04
  <start> I'd     .06     want Thai           .01
  <start> Tell    .04     to eat              .26
  <start> I'm     .02     to have             .14
  I want          .32     to spend            .09
  I would         .29     to be               .02
  I don't         .08     British food        .60
  I have          .04     British restaurant  .15
  want to         .65     British cuisine     .01
  want a          .05     British lunch       .01

P(I want to eat British food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(British | eat) P(food | British)
  = .25 * .32 * .65 * .26 * .001 * .60 = .000008
vs. P(I want to eat Chinese food) = .00015
Probabilities seem to capture "syntactic" facts and "world knowledge":
  "eat" is often followed by an NP
  British food is not too popular
N-gram models can be trained by counting and normalization.

An Aside on Logs
You don't really do all those multiplies. The numbers are too small and lead to underflows.
Convert the probabilities to logs and then do additions.
To get the real probability (if you need it), go back to the antilog.

How do we get the N-gram probabilities?
N-gram models can be trained by counting and normalization.

BERP Bigram Counts
[Table: bigram counts for the words I, want, to, eat, Chinese, food, lunch]
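The aside on logs can be made concrete with the two example sentences. The sketch below sums log probabilities and exponentiates only at the end; the bigram values in `BIGRAM_P` are the BERP figures for this example as taken here (treat them as assumed inputs), and the function name `sentence_logprob` is hypothetical.

```python
import math

# Assumed BERP bigram probabilities for the two example sentences.
BIGRAM_P = {
    ("<start>", "I"): 0.25, ("I", "want"): 0.32, ("want", "to"): 0.65,
    ("to", "eat"): 0.26, ("eat", "British"): 0.001, ("British", "food"): 0.60,
    ("eat", "Chinese"): 0.02, ("Chinese", "food"): 0.56,
}

def sentence_logprob(words):
    """Add log bigram probabilities instead of multiplying raw probabilities."""
    padded = ["<start>"] + list(words)
    return sum(math.log(BIGRAM_P[pair]) for pair in zip(padded, padded[1:]))

british = sentence_logprob("I want to eat British food".split())
chinese = sentence_logprob("I want to eat Chinese food".split())

# Take the antilog only when the raw probability is actually needed.
print(math.exp(british))  # roughly .000008
print(math.exp(chinese))  # roughly .00015
print(chinese > british)  # comparisons can stay in log space
```

Because the logarithm is monotonic, ranking sentences by log probability gives the same order as ranking by probability, so the antilog step is often unnecessary.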

BERP Bigram Probabilities
Normalization: divide each row's counts by the appropriate unigram counts for $w_{n-1}$.
[Table: BERP bigram probabilities for the words I, want, to, eat, Chinese, food, lunch]
Computing the bigram probability of "I I":
  C(I, I) / C(all I)
  p(I | I) = 8 / 3437 = .0023
Maximum Likelihood Estimation (MLE): relative frequency, e.g. $\frac{\mathrm{freq}(w_1, w_2)}{\mathrm{freq}(w_1)}$

What do we learn about the language?
What's being captured with...
  P(want | I) = .32
  P(to | want) = .65
  P(eat | to) = .26
  P(food | Chinese) = .56
  P(lunch | eat) = .055
What about...
  P(I | I) = .0023
  P(I | want) = .0025
  P(I | food) = .013

  P(I | I) = .0023: "I I I I want"
  P(I | want) = .0025: "I want I want"
  P(I | food) = .013: "the kind of food I want is"

Generation (just a test)
Choose N-grams according to their probabilities and string them together.
For bigrams: start by generating a word that has a high probability of starting a sentence, then choose a bigram that is high given the first word selected, and so on.
See how we get better with higher-order n-grams.

Approximating Shakespeare
As we increase the value of N, the accuracy of the n-gram model increases, since the choice of next word becomes increasingly constrained.
Generating sentences with random unigrams...
  Every enter now severally so, let
  Hill he late speaks; or! a more to leg less first you enter
With bigrams...
  What means, sir. I confess she? then all sorts, he is trim, captain.
  Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry.

Trigrams
  Sweet prince, Falstaff shall die.
  This shall forbid it should be branded, if renown made it empty.
Quadrigrams
  What! I will go seek the traitor Gloucester.
  Will you not tell me who I am?

There are 884,647 tokens, with 29,066 word form types, in about a one-million-word Shakespeare corpus.
Shakespeare produced 300,000 bigram types out of 844 million possible bigrams: so 99.96% of the possible bigrams were never seen (they have zero entries in the table).
Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare.

N-Gram Training Sensitivity
If we repeated the Shakespeare experiment but trained our n-grams on a Wall Street Journal corpus, what would we get?
This has major implications for corpus selection or design.

Some Useful Observations
A small number of events occur with high frequency.
  You can collect reliable statistics on these events with relatively small samples.
A large number of events occur with small frequency.
  You might have to wait a long time to gather statistics on the low-frequency events.

Some Useful Observations
Some zeroes are really zeroes.
  Meaning that they represent events that can't or shouldn't occur.
On the other hand, some zeroes aren't really zeroes.
  They represent low-frequency events that simply didn't occur in the corpus.

Smoothing Techniques
Every n-gram training matrix is sparse, even for very large corpora (Zipf's law).
Solution: estimate the likelihood of unseen n-grams.
Problem: how do you adjust the rest of the corpus to accommodate these "phantom" n-grams?
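Training by counting and normalization, and the generation experiment, can both be sketched in a few lines. This is a toy illustration, not the setup used for the Shakespeare samples: the three-sentence corpus is invented, the names `train_bigram_mle` and `generate` are hypothetical, and the `<start>`/`<end>` padding symbols are an assumption.

```python
import random
from collections import Counter, defaultdict

def train_bigram_mle(sentences):
    """Count bigrams, then divide each count by the total for its first word."""
    bigram_counts = defaultdict(Counter)
    for sentence in sentences:
        padded = ["<start>"] + sentence.split() + ["<end>"]
        for prev, word in zip(padded, padded[1:]):
            bigram_counts[prev][word] += 1
    return {
        prev: {word: c / sum(nexts.values()) for word, c in nexts.items()}
        for prev, nexts in bigram_counts.items()
    }

def generate(model, rng, max_words=20):
    """Sample one word at a time from P(word | previous word)."""
    prev, out = "<start>", []
    while len(out) < max_words:
        words = list(model[prev])
        weights = [model[prev][w] for w in words]
        prev = rng.choices(words, weights=weights)[0]
        if prev == "<end>":
            break
        out.append(prev)
    return out

# Invented toy corpus.
corpus = ["I want to eat Chinese food", "I want to eat lunch", "I want Chinese food"]
model = train_bigram_mle(corpus)
print(model["want"])
print(" ".join(generate(model, random.Random(0))))
```

Every generated sentence is stitched together from bigrams that occur in the training data, which is the same effect, in miniature, that makes the quadrigram output look like Shakespeare.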

Problem
Let's assume we're using N-grams.
How can we assign a probability to a sequence where one of the component n-grams has a value of zero?
Assume all the words are known and have been seen.
  Go to a lower-order n-gram
  Back off from bigrams to unigrams
  Replace the zero with something else

Add-One (Laplace)
Make the zero counts 1.
Rationale: they're just events you haven't seen yet. If you had seen them, chances are you would only have seen them once... so make the count equal to 1.

Add-one Smoothing
For unigrams:
  Add 1 to every word (type) count
  Normalize by N (tokens) / (N (tokens) + V (types))
  Smoothed count (adjusted for additions to N) is $c_i^* = (c_i + 1)\,\frac{N}{N + V}$
  Normalize by N to get the new unigram probability: $p_i^* = \frac{c_i + 1}{N + V}$
For bigrams:
  Add 1 to every bigram: $c(w_{n-1} w_n) + 1$
  Increase the unigram count by the vocabulary size: $c(w_{n-1}) + V$

Original BERP Counts
[Table: original BERP bigram counts]

BERP Table: Bigram Probabilities
[Table: BERP bigram probabilities]

BERP After Add-One
[Table: BERP bigram probabilities after add-one smoothing]
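The add-one recipe for bigrams can be sketched as follows. The counts below are invented toy values, not the BERP figures, and the function names are hypothetical; the point is only to show that the smoothed probabilities for one prefix still sum to one and that the reconstituted count of a frequent bigram shrinks.

```python
from collections import Counter

def add_one_bigram_prob(bigram_counts, unigram_counts, vocab_size, prev, word):
    """Add-one estimate: (c(prev, word) + 1) / (c(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

def reconstituted_count(bigram_counts, unigram_counts, vocab_size, prev, word):
    """Smoothed probability mapped back onto the original count scale."""
    p = add_one_bigram_prob(bigram_counts, unigram_counts, vocab_size, prev, word)
    return p * unigram_counts[prev]

# Toy counts, invented for illustration (a Counter returns 0 for unseen bigrams).
unigrams = Counter({"want": 10, "to": 8, "eat": 6})
bigrams = Counter({("want", "to"): 8, ("want", "eat"): 2})
V = len(unigrams)

for word in unigrams:
    print(word, add_one_bigram_prob(bigrams, unigrams, V, "want", word))

new_count = reconstituted_count(bigrams, unigrams, V, "want", "to")
print(new_count, new_count / bigrams[("want", "to")])  # new count and its discount
```

Even in this tiny example the seen bigram "want to" loses mass to the unseen "want want"; with a realistic vocabulary the shift is far larger.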

Add-One Smoothed BERP Reconstituted
[Table: BERP bigram counts reconstituted after add-one smoothing]

Discount: ratio of new counts to old (e.g. add-one smoothing changes the BERP count for "want to" from 786 to 331 ($d_c$ = .42) and p(to | want) from .65 to .28).
Problem: add-one smoothing changes counts drastically:
  too much weight is given to unseen n-grams
  in practice, unsmoothed bigrams often work better!

Witten-Bell Discounting
A zero n-gram is just an n-gram you haven't seen yet... but every n-gram in the corpus was unseen once... so...
  How many times did we see an n-gram for the first time? Once for each n-gram type (T).
  Estimate the total probability mass of unseen bigrams as $\frac{T}{N + T}$.
  View the training corpus as a series of events, one for each token (N) and one for each new type (T).
We can divide the probability mass equally among unseen bigrams... or we can condition the probability of an unseen bigram on the first word of the bigram.
Discount values for Witten-Bell are much more reasonable than for Add-One.

Witten-Bell
Think about the occurrence of an unseen item (word, bigram, etc.) as an event.
The probability of such an event can be measured in a corpus by just looking at how often it happens.
Just take the single-word case first.
Assume a corpus of N tokens and T types.
How many times was an as-yet-unseen type encountered?

Witten-Bell
First compute the probability of an unseen event occurring.
Then distribute that probability mass among the as-yet-unseen types (the ones with zero counts).

Probability of an Unseen Event
Simple case of unigrams.
T is the number of events that are seen for the first time in the corpus.
  This is just the number of types, since each type had to occur for a first time once.
N is just the number of observations.
  $\frac{T}{N + T}$

Distributing Evenly
The amount to be distributed is $\frac{T}{N + T}$.
The number of events with count zero is $Z$.
So distributing evenly gets us $\frac{1}{Z} \cdot \frac{T}{N + T}$.

Caveat
The unigram case is weird...
  Z is the number of things with count zero.
  OK, so that's the number of things we didn't see at all. Huh?
Fortunately, it makes more sense in the N-gram case.
  Take Shakespeare... Recall that he produced only 29,000 types. So there are potentially 29,000^2 bigrams, of which only 300k occur, so Z is 29,000^2 - 300k.

Witten-Bell
In the case of bigrams, not all conditioning events are equally promiscuous.
  P(x | the) vs. P(x | going)
So distribute the mass assigned to the zero-count bigrams according to their promiscuity.
  This means conditioning the redistribution on how many different types occurred with a given prefix.

Distributing Among the Zeros
If a bigram $w_x w_i$ has a zero count:
  $P(w_i \mid w_x) = \frac{1}{Z(w_x)} \cdot \frac{T(w_x)}{N(w_x) + T(w_x)}$
  $T(w_x)$: number of bigram types starting with $w_x$
  $Z(w_x)$: number of bigrams starting with $w_x$ that were not seen
  $N(w_x)$: actual frequency of bigrams beginning with $w_x$

Original BERP Counts
[Table: original BERP bigram counts]
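The prefix-conditioned redistribution can be sketched for a single conditioning word. The zero-count case follows the formula above; for seen bigrams the sketch uses c / (N + T), the usual Witten-Bell companion formula, which is an assumption here since the slides only spell out the zero-count case. The function name and toy counts are also invented, and the prefix must have been seen at least once.

```python
from collections import Counter

def witten_bell_bigram(bigram_counts, vocab, prev):
    """Witten-Bell distribution over the words that may follow `prev`."""
    seen = {w: c for (p, w), c in bigram_counts.items() if p == prev and c > 0}
    n = sum(seen.values())  # N(prev): bigram tokens starting with prev
    t = len(seen)           # T(prev): bigram types starting with prev
    z = len(vocab) - t      # Z(prev): bigram types never seen after prev
    probs = {}
    for w in vocab:
        if w in seen:
            probs[w] = seen[w] / (n + t)      # discounted seen bigram (assumed form)
        else:
            probs[w] = t / (z * (n + t))      # equal share of the unseen mass
    return probs

# Toy counts, invented for illustration.
vocab = ["the", "big", "red", "dog"]
bigrams = Counter({("the", "big"): 3, ("the", "red"): 1})

probs = witten_bell_bigram(bigrams, vocab, "the")
print(probs)
print(sum(probs.values()))
```

A prefix followed by many different types (large T relative to N) hands more mass to its unseen continuations, which is the "promiscuity" idea.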

Witten-Bell Smoothed and Reconstituted Counts
[Table: BERP bigram counts after Witten-Bell smoothing, reconstituted]

Good-Turing Discounting
Re-estimate the amount of probability mass for zero (or low-count) n-grams by looking at n-grams with higher counts.
  Estimate $c^* = (c + 1)\,\frac{N_{c+1}}{N_c}$
  E.g. $N_0$'s adjusted count is a function of the count of n-grams that occur once, $N_1$.
Assumes:
  word bigrams follow a binomial distribution
  we know the number of unseen bigrams (V x V minus seen)

Backoff methods (e.g. Katz '87)
For e.g. a trigram model:
  Compute unigram, bigram and trigram probabilities.
  In use: where the trigram is unavailable, back off to the bigram if available, otherwise to the unigram probability.
  E.g. "An omnivorous unicorn"

Summary
N-gram probabilities can be used to estimate the likelihood
  of a word occurring in a context (N-1)
  of a sentence occurring at all
Smoothing techniques deal with problems of unseen words in a corpus.
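The Good-Turing re-estimate c* = (c + 1) N_{c+1} / N_c can be sketched directly from a table of counts. This is a bare illustration: the function name, the toy counts, and the choice to leave a count unadjusted when N_{c+1} is zero are all assumptions, and practical implementations smooth the N_c values first.

```python
from collections import Counter

def good_turing_adjusted_counts(ngram_counts, total_possible):
    """Map each count c to c* = (c + 1) * N_{c+1} / N_c where both N values are nonzero."""
    freq_of_freq = Counter(ngram_counts.values())         # N_c for c >= 1
    freq_of_freq[0] = total_possible - len(ngram_counts)  # N_0: unseen n-grams
    adjusted = {}
    for c, n_c in freq_of_freq.items():
        n_next = freq_of_freq.get(c + 1, 0)
        if n_c > 0 and n_next > 0:
            adjusted[c] = (c + 1) * n_next / n_c
    return adjusted

# Toy bigram counts, invented for illustration; assume 10 bigrams are possible.
counts = {
    ("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1,
    ("d", "e"): 2, ("e", "f"): 2,
    ("f", "g"): 3,
}
print(good_turing_adjusted_counts(counts, total_possible=10))
```

Here N_0 = 4, N_1 = 3, N_2 = 2 and N_3 = 1, so each unseen bigram receives an adjusted count of 1 * 3 / 4, taken from the bigrams seen once.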
