Part-of-Speech Tagging with Hidden Markov Models

Jonathon Read
October 7, 20

1 Last week: probability theory and n-gram language models

Last week we discussed some concepts from probability theory, such as conditional probabilities and how they can be manipulated to determine the joint probability, using the Chain Rule:

$$P(A_1 \ldots A_N) = P(A_1)\, P(A_2 \mid A_1)\, P(A_3 \mid A_1 A_2) \cdots P(A_N \mid A_1^{N-1})$$

where $A_1^{N-1}$ abbreviates the history $A_1 \ldots A_{N-1}$. This says that the joint probability of several events occurring in sequence, $P(A_1 \ldots A_N)$, is equal to the product of the conditional probabilities, e.g. $P(A_N \mid A_1^{N-1})$, of each event occurring given the history of events.

To apply this to natural language we can think of a sentence as being a sequence, where each word is an event, e.g.

$$P(\text{I like bananas}) = P(\text{I})\, P(\text{like} \mid \text{I})\, P(\text{bananas} \mid \text{I like})$$

Estimating such conditional probabilities is simply a case of calculating relative frequencies of counts in a suitable corpus, for example:

$$P(\text{bananas} \mid \text{I like}) = \frac{C(\text{I like bananas})}{C(\text{I like})}$$

This is called Maximum Likelihood Estimation (MLE).

This is fine for three-word toy sentences, but in most cases (such as our somewhat unusual example, "Hold the newsreader's nose squarely...") it becomes impossible to estimate the conditional probabilities, because the precise sequence of words will not have occurred previously:

$$P(\text{hold the newsreader's nose squarely}) = P(\text{hold})\, P(\text{the} \mid \text{hold})\, P(\text{newsreader} \mid \text{hold the})\, P(\text{'s} \mid \text{hold the newsreader})\, P(\text{nose} \mid \text{hold the newsreader 's})\, P(\text{squarely} \mid \text{hold the newsreader 's nose})$$

Instead we can make a Markov assumption, that is, assume that the probability of a word only depends on the immediately preceding words within some window of length n:

$$P(w_k \mid w_1^{k-1}) \approx P(w_k \mid w_{k-n+1}^{k-1})$$

such that our estimate for the example becomes (if n = 2, for example):

$$P(\text{hold the newsreader's nose squarely}) \approx P(\text{hold})\, P(\text{the} \mid \text{hold})\, P(\text{newsreader} \mid \text{the})\, P(\text{'s} \mid \text{newsreader})\, P(\text{nose} \mid \text{'s})\, P(\text{squarely} \mid \text{nose})$$

These estimated probabilities of sentences are useful in tasks that involve identifying words in noisy or ambiguous input. For example, speech recognition has to determine the correct interpretation of homophones (words that sound the same but have different meanings); probability estimates can indicate that "their house is over there" is much more likely than "there house is over their".

2 Parts-of-speech

Parts-of-speech (also known as POS, word classes, morphological classes, or lexical tags) are used to describe collections of words that serve a similar purpose in language. All parts-of-speech fall into one of two categories: open-class and closed-class. Open-class parts-of-speech are continually changing, with words going in and out of fashion. In contrast, closed-class parts-of-speech are relatively static and tend to perform some grammatical function.

There are four major open classes in English:

Nouns typically refer to entities in the world, like people, concepts and things (e.g. dog, language, idea). Proper nouns name specific entities (e.g. University of Oslo). Count nouns occur in both singular (dog) and plural (dogs) forms and can be counted (one dog, two dogs). In contrast, mass nouns, which are used to describe a homogeneous concept, are not counted (e.g. *two snows).

Verbs are those words that refer to actions and processes (e.g. eat, speak, think). English verbs have a number of forms (e.g. eat, eats, eating, eaten, ate).

Adjectives describe qualities of entities (e.g. white, old, bad).

Adverbs modify other words, typically verbs (hence the name), but also other adverbs and even whole phrases. Some examples are directional or locative (here, downhill), others are to do with degree (very, somewhat), and others are temporal (yesterday).
There are many more closed classes, including:

Determiners modify nouns to make reference to an instance or instances of the noun (e.g. a, the, some).

Pronouns substitute for nouns, often serving to avoid excessive repetition (e.g. "Bill had many papers. He read them.").

Conjunctions connect words and phrases together (e.g. and, or, but).

Prepositions indicate relations (e.g. on, under, over, near, by, at, from, to, with).

Auxiliaries are closed-class verbs that indicate certain semantic properties of a main verb, such as tense (e.g. "he will go").

So the parts-of-speech in last week's example are:

Hold/VERB the/DETERMINER newsreader/NOUN 's/DETERMINER nose/NOUN squarely/ADVERB

The above lists cover the parts-of-speech that tend to be taught at an elementary level in English schools, but are by no means exhaustive. There are many different lists (known as tagsets) used in a variety of corpora. For instance, in the exercises we will be using the Penn Treebank tagset, which is relatively modest in size with only 45 tags. Consider Jurafsky & Martin's examples, from the Penn Treebank version of the Brown corpus:

1. The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
2. There/EX are/VBP 70/CD children/NNS there/RB

Most parts-of-speech in example (1) are instances of classes discussed above, but tags are made more specific by the inclusion of extra letters. For instance, NN indicates a singular or mass noun, whereas NNS would indicate a plural noun; furthermore VBD indicates a verb in past tense, whereas VB would be its base form. Example (2) shows the use of EX to mark an existential "there". For a full list of tags see Figure 5.6 on page 65 in Jurafsky & Martin.

3 Part-of-speech tagging

Part-of-speech tagging is the process of labeling each word in a text with the appropriate part-of-speech. The input to a tagger is a string of words and the desired tagset. Part-of-speech information is very important for a number of tasks in natural language processing:

Parsing is the task of determining the syntactic structure of sentences: recognising noun phrases, verb phrases, etc. Determining parts-of-speech is a necessary prerequisite.

Lemmatisation involves finding the canonical form of a word. Knowing the word's part-of-speech can aid this, because it can tell us what affixes might have been applied to the word.

Word sense disambiguation is needed when a word can have more than one sense (e.g. "They felt the plane bank" vs. "Shares in the bank fell").
Part-of-speech information can help in some instances, such as this example, where a plane banking (changing direction) is a verb, while the other example is of a noun (the financial entity).

Machine translation can benefit in a similar manner, for example when translating a phrase containing the Norwegian word "sky", knowing whether it is a noun, adjective or verb can tell us whether to translate it as "cloud", "shy" or "avoid".

However, it is not a trivial task, because there are frequently several possible parts of speech for a word. Part-of-speech tagging is therefore a disambiguation task, which involves determining the correct tag for a particular occurrence of a word given its context.

Rule-based approaches tend to employ a two-stage process. Firstly, a lexicon of words and their known parts-of-speech is consulted to enumerate all possibilities for words in the input. Secondly, a large set of constraints is applied that one-by-one eliminate all possible readings that are inconsistent with the context.

4 HMM part-of-speech tagging

We can view part-of-speech tagging as a sequence classification task, wherein we are given a sequence of observed words $w_1^n$ and determine a corresponding sequence of classes $\hat{t}_1^n$. We want to choose, from all sequences of n tags $t_1^n$, the sequence which has the highest probability $P(t_1^n \mid w_1^n)$:

$$\hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid w_1^n)$$

A quick note on notation: $\arg\max_x f(x)$ means the $x$ such that $f(x)$ is maximised.

While the above equation is valid, it is not immediately clear how to use it, because we can't directly estimate the probability of a sequence of tags given a sequence of words. We can begin to make the equation computable by applying Bayes' rule,

$$P(x \mid y) = \frac{P(y \mid x)\, P(x)}{P(y)}$$

which enables us to transform the conditional probability of a sequence of tags given a sequence of words into something more practical:

$$\hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid w_1^n) = \arg\max_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)}$$

Also, as the denominator $P(w_1^n)$ is the same for every candidate $t_1^n$, it has no effect on the arg max output and can be safely discarded to further simplify the computation, leaving us with:

$$\hat{t}_1^n = \arg\max_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n)$$

The two terms are the prior probability of the tag sequence, $P(t_1^n)$ (the probability of seeing the tag sequence irrespective of the word sequence), and the likelihood of the word sequence given such a sequence, $P(w_1^n \mid t_1^n)$.

But recall last week's discussion of the creativity of language: how can we estimate the probabilities of sequences of words and tags if we're unlikely to observe most utterances in a corpus? Similar to those made last week, we make simplifying assumptions. First, that the probability of a word appearing is only dependent on its own part-of-speech tag, and is not influenced by other words and tags:

$$P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$$

And secondly, that the probability of a tag is only dependent on the immediately previous tag (as opposed to the entire tag sequence):

$$P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$$

Which leaves us with a tractable formulation for the search problem:

$$\hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid w_1^n) \approx \arg\max_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$

We can now compute maximum likelihood estimates of the terms using a training corpus of previously tagged text, using counts of observed tags and words. For example, one might have the intuition that determiners (DT) are frequently followed by common nouns, e.g. that/DT flight/NN.
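This can be assessed with the maximum likelihood estimate of the tag transition probability:

$$P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})} \qquad P(\mathrm{NN} \mid \mathrm{DT}) = \frac{C(\mathrm{DT}, \mathrm{NN})}{C(\mathrm{DT})} = 0.49$$

The word likelihood probabilities, $P(w_i \mid t_i)$, represent the association of a given tag with a given word. Suppose we are interested in the likelihood of the verb "is" when the tag is VBZ:

$$P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)} \qquad P(\text{is} \mid \mathrm{VBZ}) = \frac{C(\mathrm{VBZ}, \text{is})}{C(\mathrm{VBZ})}$$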
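The two relative-frequency estimates above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-sentence corpus, the start symbol "<s>", and the function names transition_prob and emission_prob are all hypothetical, and no smoothing is applied, so an unseen conditioning tag would cause a division by zero.

```python
from collections import Counter

# Hypothetical toy corpus of (word, tag) pairs; a real estimate would be
# computed from a large tagged corpus such as a treebank.
corpus = [
    [("the", "DT"), ("flight", "NN"), ("is", "VBZ"), ("late", "JJ")],
    [("that", "DT"), ("flight", "NN"), ("leaves", "VBZ")],
    [("the", "DT"), ("old", "JJ"), ("dog", "NN"), ("is", "VBZ"), ("here", "RB")],
]

tag_counts = Counter()         # C(t)
transition_counts = Counter()  # C(t_{i-1}, t_i)
emission_counts = Counter()    # C(t_i, w_i)

for sentence in corpus:
    previous = "<s>"           # assumed sentence-start pseudo-tag
    tag_counts[previous] += 1
    for word, tag in sentence:
        tag_counts[tag] += 1
        transition_counts[(previous, tag)] += 1
        emission_counts[(tag, word)] += 1
        previous = tag


def transition_prob(prev_tag, tag):
    """MLE of P(tag | prev_tag) = C(prev_tag, tag) / C(prev_tag)."""
    return transition_counts[(prev_tag, tag)] / tag_counts[prev_tag]


def emission_prob(word, tag):
    """MLE of P(word | tag) = C(tag, word) / C(tag)."""
    return emission_counts[(tag, word)] / tag_counts[tag]


print(transition_prob("DT", "NN"))  # 2 of the 3 determiners precede a noun
print(emission_prob("is", "VBZ"))   # 2 of the 3 VBZ tokens are "is"
```

On this toy corpus both estimates come out as 2/3; the figures say nothing about real English and only show how the counts enter the formulas.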

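5 An example

Here is an example of a search in action, focusing on resolving the ambiguity presented by the word "race" in the sentence "Secretariat is expected to race tomorrow". The diagram presents two interpretations that are mostly similar; the only differences in probabilities are highlighted in bold. So to simplify the example, we'll just consider these probabilities.

[Diagram: the two candidate tag sequences for the sentence, with "race" tagged as VB or as NN.]

Firstly, the tag transition probabilities indicate that a verb following TO is about 500 times more likely than a noun:

$$P(\mathrm{NN} \mid \mathrm{TO}) = \text{[value missing]} \qquad P(\mathrm{VB} \mid \mathrm{TO}) = 0.83$$

Turning to $P(w_i \mid t_i)$, the lexical likelihood of "race" being a verb or a noun, it seems that the verb sense of "race" is less likely than the noun sense:

$$P(\text{race} \mid \mathrm{NN}) = \text{[value missing]} \qquad P(\text{race} \mid \mathrm{VB}) = \text{[value missing]}$$

Finally, we represent the tag sequence probability for the next tag (NR):

$$P(\mathrm{NR} \mid \mathrm{VB}) = \text{[value missing]} \qquad P(\mathrm{NR} \mid \mathrm{NN}) = \text{[value missing]}$$

Taking the product of the lexical likelihoods and tag sequence probabilities, we find that the probability of the sequence with a verb is higher, despite the noun sense being more likely for "race":

$$P(\mathrm{VB} \mid \mathrm{TO})\, P(\mathrm{NR} \mid \mathrm{VB})\, P(\text{race} \mid \mathrm{VB}) = \text{[value missing]}$$

$$P(\mathrm{NN} \mid \mathrm{TO})\, P(\mathrm{NR} \mid \mathrm{NN})\, P(\text{race} \mid \mathrm{NN}) = \text{[value missing]}$$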
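The comparison can be sketched in Python as follows. Only P(VB | TO) = 0.83 is taken from the text; every other number is a placeholder assumed for illustration, chosen merely to respect the qualitative relations stated above (a verb after TO is roughly 500 times more likely than a noun, and "race" is more likely as a noun than as a verb). The names transition, emission and local_score are likewise hypothetical.

```python
# Tag transition probabilities P(tag | previous tag).
# Only the 0.83 comes from the text; the rest are assumed placeholders.
transition = {
    ("TO", "VB"): 0.83,
    ("TO", "NN"): 0.0017,   # assumed: about 500 times smaller than 0.83
    ("VB", "NR"): 0.0027,   # assumed
    ("NN", "NR"): 0.0012,   # assumed
}

# Lexical likelihoods P(word | tag); both values are assumed placeholders,
# with the noun reading of "race" more likely than the verb reading.
emission = {
    ("race", "VB"): 0.00012,
    ("race", "NN"): 0.00057,
}


def local_score(prev_tag, tag, word, next_tag):
    """Product of the three factors that differ between the two readings:
    P(tag | prev_tag) * P(word | tag) * P(next_tag | tag)."""
    return (
        transition[(prev_tag, tag)]
        * emission[(word, tag)]
        * transition[(tag, next_tag)]
    )


# Score "race" as a verb and as a noun, between TO ("to") and NR ("tomorrow").
scores = {tag: local_score("TO", tag, "race", "NR") for tag in ("VB", "NN")}
best_tag = max(scores, key=scores.get)

for tag, score in scores.items():
    print(tag, score)
print("preferred tag for 'race':", best_tag)
```

With these placeholder values the verb reading wins, because the much larger transition probability after TO outweighs the smaller lexical likelihood. All other factors in the two tag sequences are identical, so comparing these local products is enough to pick the better sequence.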
