LECTURER: BURCU CAN 2018-2019 Spring
Open class (lexical) words:
- Nouns: Proper (IBM, Italy) and Common (cat/cats, snow)
- Verbs: Main (see, registered)
- Adjectives: yellow
- Adverbs: slowly
- Numbers: 122,312; one
- ... more

Closed class (functional) words:
- Modals: can, had
- Determiners: the, some
- Prepositions: to, with
- Conjunctions: and, or
- Particles: off, up
- Pronouns: he, its
- ... more
English PoS Tags
CC (conjunction, coordinating): and, both, but, either, or
CD (numeral, cardinal): mid-1890, nine-thirty, 0.5, one
DT (determiner): a, all, an, every, no, that, the
EX (existential there): there
FW (foreign word): gemeinschaft, hund, ich, jeux
IN (preposition or subordinating conjunction): among, whether, out, on, by, if
JJ (adjective or ordinal numeral): third, ill-mannered, regrettable
JJR (adjective, comparative): braver, cheaper, taller
JJS (adjective, superlative): bravest, cheapest, tallest
MD (modal auxiliary): can, may, might, will, would
NN (noun, common, singular or mass): cabbage, thermostat, investment, subhumanity
NNP (noun, proper, singular): Motown, Cougar, Yvette, Liverpool
NNPS (noun, proper, plural): Americans, Materials, States
NNS (noun, common, plural): undergraduates, bric-a-brac, averages
POS (genitive marker): ', 's
PRP (pronoun, personal): hers, himself, it, we, them
PRP$ (pronoun, possessive): her, his, mine, my, our, ours, their, thy, your
RB (adverb): occasionally, maddeningly, adventurously
RBR (adverb, comparative): further, gloomier, heavier, less-perfectly
RBS (adverb, superlative): best, biggest, nearest, worst
RP (particle): aboard, away, back, by, on, open, through
TO ("to" as preposition or infinitive marker): to
UH (interjection): huh, howdy, uh, whammo, shucks, heck
VB (verb, base form): ask, bring, fire, see, take
VBD (verb, past tense): pleaded, swiped, registered, saw
VBG (verb, present participle or gerund): stirring, focusing, approaching, erasing
VBN (verb, past participle): dilapidated, imitated, reunified, unsettled
VBP (verb, present tense, not 3rd person singular): twist, appear, comprise, mold, postpone
VBZ (verb, present tense, 3rd person singular): bases, reconstructs, marks, uses
WDT (WH-determiner): that, what, whatever, which, whichever
WP (WH-pronoun): that, what, whatever, which, who, whom
WP$ (WH-pronoun, possessive): whose
WRB (WH-adverb): however, whenever, where, why
Turkish PoS Tags: see "Disambiguating Main POS Tags for Turkish" (Ehsani et al., 2012).
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
Uses:
- text-to-speech (how do we pronounce "lead"?)
- can write regexps like (Det) Adj* N+ over the output
- if you know the tag, you can back off to it in other tasks
Useful:
- Text-to-speech: record, lead (in English); yüz, bodrum (in Turkish)
- Lemmatization: saw[v] → see, saw[n] → saw
- As a pre-processing step for parsing: less tag ambiguity means fewer parses. However, some tag choices are better decided by parsers.
Approaches:
- Supervised: rule-based systems, statistical PoS tagging, neural networks (RNNs)
- Unsupervised: statistical PoS tagging, neural networks (RNNs)
- Partly supervised
Classification methods:
- Decision trees
- Naïve Bayes
- Logistic regression / maximum entropy (MaxEnt)
- Perceptron or neural networks
- Support vector machines
- Nearest neighbour
Sequence labelling: the labels of tokens depend on the labels of other tokens. Two standard models: Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs).
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
How many tags are correct? About 97% currently. But the baseline is already 90%. The baseline is the simplest possible method:
- Tag every word with its most frequent tag
- Tag unknown words as nouns
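The baseline just described is easy to implement; a minimal sketch (the tiny tagged corpus below is invented for illustration):

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Learn each word's most frequent tag from a tagged corpus."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(words, most_frequent_tag, unknown_tag="NN"):
    """Tag known words with their most frequent tag; unknown words as nouns."""
    return [(w, most_frequent_tag.get(w, unknown_tag)) for w in words]

# Invented two-sentence corpus, just to exercise the functions
corpus = [[("the", "DT"), ("lead", "NN"), ("paint", "NN"), ("is", "VBZ"), ("unsafe", "JJ")],
          [("lead", "NN"), ("is", "VBZ"), ("a", "DT"), ("metal", "NN")]]
model = train_baseline(corpus)
print(tag_baseline(["the", "lead", "pigment"], model))
```

Despite its simplicity, this per-word lookup is the ~90% baseline that any sequence model has to beat.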
Example: the sentence "Fed raises interest rates 0.5 percent", where each word admits several candidate tags (drawn from VBD, VB, VBN, VBZ, VBP, NNP, NNS, NN, CD).
What Should We Look At?

Sentence:     Bill directed a cortege of autos through the dunes
Correct tags: PN   Verb     Det Noun   Prep Noun Prep   Det Noun
(each word also has other possible tags, maybe more)

Each unknown tag is constrained by its word and by the tags to its immediate left and right. But those tags are unknown too.
(Slide credit: 600.465 - Intro to NLP - J. Eisner)
Introduce Hidden Markov Models (HMMs) for part-of-speech tagging / sequence classification, and cover the three fundamental questions of HMMs:
- How do we fit the model parameters of an HMM?
- Given an HMM, how do we efficiently calculate the likelihood of an observation w?
- Given an HMM and an observation w, how do we efficiently calculate the most likely state sequence for w?
Green circles are hidden states Dependent only on the previous state (bigram)
Purple nodes are observed states Dependent only on their corresponding hidden state
An HMM is specified by {S, K, P, A, B}:
S = {s_1, …, s_N} are the values for the hidden states
K = {k_1, …, k_M} are the values for the observations
P = {p_i} are the initial state probabilities
A = {a_ij} are the state transition probabilities
B = {b_ik} are the observation (emission) probabilities
[Diagram: example HMM with states start, PropNoun, Det, Noun, Verb, stop; transitions include start→PropNoun 0.4, PropNoun→Verb 0.8, Verb→Det 0.25, Det→Noun 0.95, Noun→stop 0.1]

P(PropNoun Verb Det Noun) = 0.4 × 0.8 × 0.25 × 0.95 × 0.1 = 0.0076
[Diagram sequence: generating from the HMM. Each state emits a word from its lexicon (Det: the, a, that; PropNoun: Tom, John, Mary, Alice, Jerry; Noun: cat, dog, car, pen, bed, apple; Verb: bit, ate, played, saw, hit, gave). Sampling the path start → PropNoun → Verb → Det → Noun → stop generates the sentence "John bit the apple".]
We want a model of sequences s and observations w: hidden states s_0, s_1, s_2, …, s_n, where each state s_i (i ≥ 1) emits the observed word w_i.
Assumptions:
- States are tag n-grams
- Usually a dedicated start and end state / word
- The tag/state sequence is generated by a Markov model
- Words are chosen independently, conditioned only on the tag/state
These are totally broken assumptions: why?
Transitions P(s | s') encode well-formed tag sequences. In a bigram tagger, states = tags: s_0 = <•>, s_1 = <t_1>, s_2 = <t_2>, …, s_n = <t_n>, with emissions w_1, w_2, …, w_n.
Use standard smoothing methods to estimate transitions:

P(t_i | t_{i−1}, t_{i−2}) = λ₂ P̂(t_i | t_{i−1}, t_{i−2}) + λ₁ P̂(t_i | t_{i−1}) + (1 − λ₁ − λ₂) P̂(t_i)
Emissions are trickier:
- Words we've never seen before
- Words which occur with tags we've never seen
- Issue: words aren't black boxes: 343,127.23, 11-year, Minteria, reintroducibly
Can do surprisingly well just looking at a word by itself:
- Word: the → the: DT
- Lowercased word: Importantly → importantly: RB
- Prefixes: unfathomable → un-: JJ
- Suffixes: Importantly → -ly: RB
- Capitalization: Meridian → CAP: NNP
- Word shapes: 35-year → d-x: JJ
Then build a MaxEnt (or similar) model to predict the tag. MaxEnt P(t | w): 93.7% / 82.6% (known / unknown words). (See the next class for MaxEnt models.)
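These per-word features are straightforward to extract. A sketch (function and feature names are my own, not from the lecture) that computes them for a single word, ready to feed into a MaxEnt classifier:

```python
import re

def word_features(word):
    """Extract tag-predictive features from a single word in isolation."""
    # Word shape: collapse runs of upper-case, lower-case, and digit
    # characters, e.g. "35-year" -> "d-x", "Meridian" -> "Xx".
    shape = re.sub(r"[A-Z]+", "X", word)
    shape = re.sub(r"[a-z]+", "x", shape)
    shape = re.sub(r"\d+", "d", shape)
    return {
        "word": word,
        "lower": word.lower(),
        "prefix2": word[:2],
        "suffix2": word[-2:],
        "is_capitalized": word[0].isupper(),
        "shape": shape,
    }

print(word_features("35-year"))
```

Each returned dictionary would become one feature vector for the classifier; note that digits are collapsed last so the "d" placeholder is not itself rewritten by the lower-case rule.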
Consider all possible state sequences Q of length T that the model could have traversed in generating the given observation sequence. Compute the probability of each state sequence from A, and multiply it by the probabilities of generating each of the given observations in the corresponding states, giving P(O, Q | λ) = P(O | Q, λ) P(Q | λ). Sum this over all possible state sequences to get P(O | λ). Computationally complex: O(T N^T).
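For exposition, this exhaustive summation can be written directly. In the sketch below, the toy two-state HMM uses invented numbers, and a dedicated final state is omitted for simplicity:

```python
from itertools import product

def brute_force_likelihood(obs, states, start_p, trans_p, emit_p):
    """P(O | lambda): sum P(O, Q | lambda) over all N^T state sequences.
    Exponential in T -- for exposition only."""
    total = 0.0
    for seq in product(states, repeat=len(obs)):
        # P(O, Q | lambda) = P(Q | lambda) * P(O | Q, lambda)
        p = start_p[seq[0]] * emit_p[seq[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans_p[seq[t - 1]][seq[t]] * emit_p[seq[t]][obs[t]]
        total += p
    return total

# Toy two-state HMM (all numbers invented for illustration)
states = ["N", "V"]
start_p = {"N": 0.8, "V": 0.2}
trans_p = {"N": {"N": 0.5, "V": 0.5}, "V": {"N": 0.5, "V": 0.5}}
emit_p = {"N": {"fish": 0.8, "sleep": 0.2}, "V": {"fish": 0.5, "sleep": 0.5}}
print(brute_force_likelihood(["fish", "sleep"], states, start_p, trans_p, emit_p))
```

With T observations and N states, the loop visits N^T sequences, which is exactly why the forward algorithm is needed.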
Due to the Markov assumption, the probability of being in any state at time t relies only on the probabilities of being in each of the possible states at time t − 1. The Forward algorithm uses dynamic programming to exploit this fact and compute the observation likelihood in O(T N^2) time: it builds a forward trellis that compactly and implicitly encodes information about all possible state paths.
[Trellis diagram: states s_1 … s_N (with initial state s_0 and final state s_F) against time steps t_1, t_2, t_3, …, t_{T−1}, t_T.] Continue forward in time until reaching the final time point, and sum the probability of ending in the final state.
Let α_t(j) be the probability of being in state j after seeing the first t observations (summing over all initial paths leading to j):

α_t(j) = P(o_1, o_2, …, o_t, q_t = s_j | λ)
[Diagram: every state s_1 … s_N at time t−1 feeds into s_j via transitions a_{1j}, a_{2j}, …, a_{Nj}.] Consider all possible ways of getting to s_j at time t from each possible state s_i, and determine the probability of each. Sum these to get the total probability of being in state s_j at time t while accounting for the first t − 1 observations; then multiply by the probability of actually observing o_t in s_j.
Initialization:  α_1(j) = a_{0j} b_j(o_1),   1 ≤ j ≤ N
Recursion:       α_t(j) = [ Σ_{i=1}^{N} α_{t−1}(i) a_{ij} ] b_j(o_t),   1 ≤ j ≤ N, 1 < t ≤ T
Termination:     P(O | λ) = α_{T+1}(s_F) = Σ_{i=1}^{N} α_T(i) a_{iF}
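A direct implementation of these three steps might look like the sketch below; for simplicity the initial distribution plays the role of a_{0j}, and termination just sums the last column instead of using a dedicated final state (the toy numbers are invented):

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Forward algorithm: P(O | lambda) in O(T N^2) time.
    alpha[t][j] = P(o_1 .. o_t, q_t = j | lambda)."""
    # Initialization: alpha_1(j) = start(j) * b_j(o_1)
    alpha = [{j: start_p[j] * emit_p[j][obs[0]] for j in states}]
    # Recursion: alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] * b_j(o_t)
    for t in range(1, len(obs)):
        alpha.append({j: sum(alpha[t - 1][i] * trans_p[i][j] for i in states)
                         * emit_p[j][obs[t]]
                      for j in states})
    # Termination: sum over the final column
    return sum(alpha[-1][j] for j in states)

# Toy two-state HMM (numbers invented for illustration)
states = ["N", "V"]
start_p = {"N": 0.8, "V": 0.2}
trans_p = {"N": {"N": 0.5, "V": 0.5}, "V": {"N": 0.5, "V": 0.5}}
emit_p = {"N": {"fish": 0.8, "sleep": 0.2}, "V": {"fish": 0.5, "sleep": 0.5}}
print(forward(["fish", "sleep"], states, start_p, trans_p, emit_p))
```

Each trellis column reuses the previous one, so the work per time step is N^2 transitions rather than a fresh pass over all N^T paths.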
Requires only O(T N^2) time to compute the probability of an observed sequence given a model. Exploits the fact that all state sequences must merge into one of the N possible states at any point in time, and the Markov assumption that only the last state affects the next one.
What is the most likely sequence of tags t for the given sequence of words w?
Choosing the best tag sequence T = t_1, t_2, …, t_n for a given word sequence (sentence) W = w_1, w_2, …, w_n. By Bayes' rule:

T̂ = argmax_T P(T | W) = argmax_T P(W | T) P(T) / P(W)

Since P(W) is the same for every tag sequence:

T̂ = argmax_T P(W | T) P(T)
If we assume a tagged corpus and a trigram language model, then P(T) can be approximated as:

P(T) ≈ P(t_1) P(t_2 | t_1) ∏_{i=3}^{n} P(t_i | t_{i−1}, t_{i−2})

Evaluating this formula is simple: the probabilities come from simple word counting on the tagged corpus (plus smoothing).
To evaluate P(W | T), we make the simplifying assumption that each word depends only on its tag:

P(W | T) ≈ ∏_{i=1}^{n} P(w_i | t_i)

So we want the tag sequence that maximizes:

P(t_1) P(t_2 | t_1) [ ∏_{i=3}^{n} P(t_i | t_{i−1}, t_{i−2}) ] ∏_{i=1}^{n} P(w_i | t_i)

The best tag sequence can be found by the Viterbi algorithm.
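Under these assumptions, scoring one candidate tag sequence is just a product of trigram transition and emission probabilities. A minimal sketch in log space (the probability tables here are hypothetical placeholders, not estimated from any corpus):

```python
import math

def score_tag_sequence(words, tags, trans_p, emit_p):
    """log[ P(W | T) P(T) ] with a trigram tag model and per-tag emissions.
    trans_p maps a history (t_{i-2}, t_{i-1}) to a dict over next tags;
    "<s>" padding stands in for the dedicated start states."""
    logp = 0.0
    history = ("<s>", "<s>")
    for word, tag in zip(words, tags):
        logp += math.log(trans_p[history][tag]) + math.log(emit_p[tag][word])
        history = (history[1], tag)
    return logp

# Hypothetical toy tables, just to exercise the function
trans_p = {("<s>", "<s>"): {"DT": 0.5}, ("<s>", "DT"): {"NN": 0.9}}
emit_p = {"DT": {"the": 0.6}, "NN": {"dog": 0.1}}
print(score_tag_sequence(["the", "dog"], ["DT", "NN"], trans_p, emit_p))
```

Working in log space avoids underflow for long sentences, which is also how the log P values on the next slide should be read.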
Given these two multinomials, we can score any word / tag sequence pair:

Tags:   NNP  VBZ    NN       NNS   CD   NN      .
Words:  Fed  raises interest rates 0.5  percent .
States: <•,•> <•,NNP> <NNP,VBZ> <VBZ,NN> <NN,NNS> <NNS,CD> <CD,NN> <STOP>

P(NNP | <•,•>) P(Fed | NNP) P(VBZ | <•,NNP>) P(raises | VBZ) P(NN | <NNP,VBZ>) …

In principle, we're done: list all possible tag sequences, score each one, pick the best one (the Viterbi state sequence):
NNP VBZ NN NNS CD NN   log P = −23
NNP NNS NN NNS CD NN   log P = −29
NNP VBZ VB NNS CD NN   log P = −27
Too many trajectories (state sequences) to list.
Option 1: beam search. A beam is a set of partial hypotheses. Start with just the single empty trajectory <>. At each derivation step:
- Consider all continuations of the previous hypotheses: <> expands to Fed:NNP, Fed:VBN, Fed:VBD; Fed:NNP expands to Fed:NNP raises:NNS and Fed:NNP raises:VBZ; Fed:VBN expands to Fed:VBN raises:NNS and Fed:VBN raises:VBZ; …
- Discard most: keep the top k, or those within a factor of the best (or some combination)
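A minimal sketch of this beam search loop; the scoring function is a placeholder, since any bigram or trigram model could supply the log-probability increments:

```python
def beam_search_tag(words, tags, score_fn, k=3):
    """Beam search over tag sequences: keep only the k best partial
    hypotheses at each position.  score_fn(prev_tags, tag, word) returns
    a log-probability increment; its form (bigram, trigram, ...) is up
    to the model supplying it."""
    beam = [((), 0.0)]                     # the single empty trajectory
    for word in words:
        # Consider all continuations of the previous hypotheses
        candidates = [(prev + (tag,), logp + score_fn(prev, tag, word))
                      for prev, logp in beam for tag in tags]
        # Discard most: keep only the top k
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:k]
    return beam
```

Beam search is fast but inexact: a hypothesis pruned early can never be recovered, which is why the next slides develop an exact dynamic-programming alternative.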
We want a way of computing exactly the most likely state sequence given the model and some w. BUT there are exponentially many possible state sequences (T^n for a tagset of size T and sentence length n). We can, however, calculate it efficiently using dynamic programming (DP). The DP algorithm for efficient state-sequence inference uses a trellis of paths through state space; it is an instance of what in NLP is called the Viterbi algorithm.
Dynamic program for computing the score of a best path up to position i ending in state s:

δ_i(s) = max_{s_0 … s_{i−1}} P(s_0 … s_{i−1} s, w_1 … w_{i−1})

Base case:  δ_0(s) = 1 if s = <•,•>, 0 otherwise
Recursion:  δ_i(s) = max_{s'} P(s | s') P(w_{i−1} | s') δ_{i−1}(s')

Also store a back-trace, the most likely previous state for each state:

ψ_i(s) = argmax_{s'} P(s | s') P(w_{i−1} | s') δ_{i−1}(s')

Iterate on i, storing partial results as you go.
Fish sleep.
[Diagram: HMM with states start, noun, verb, end; transitions start→noun 0.8, start→verb 0.2, noun→verb 0.8, noun→noun 0.1, noun→end 0.1, verb→verb 0.1, verb→noun 0.2, verb→end 0.7]
A two-word language: "fish" and "sleep". Suppose in our training corpus, "fish" appears 8 times as a noun and 5 times as a verb, and "sleep" appears 2 times as a noun and 5 times as a verb.
Emission probabilities:
- Noun: P(fish | noun) = 0.8, P(sleep | noun) = 0.2
- Verb: P(fish | verb) = 0.5, P(sleep | verb) = 0.5
Empty Viterbi trellis:

        0    1    2    3
start
verb
noun
end
Initialization (time 0):

        0
start   1
verb    0
noun    0
end     0
Token 1: fish

        0    1
start   1    0
verb    0    0.2 × 0.5 = 0.1
noun    0    0.8 × 0.8 = 0.64
end     0    0
Token 2: sleep (if "fish" is a verb)

        0    1      2
start   1    0      0
verb    0    0.1    0.1 × 0.1 × 0.5 = 0.005
noun    0    0.64   0.1 × 0.2 × 0.2 = 0.004
end     0    0      -
Token 2: sleep (if "fish" is a noun)

        0    1      2
start   1    0      0
verb    0    0.1    0.64 × 0.8 × 0.5 = 0.256
noun    0    0.64   0.64 × 0.1 × 0.2 = 0.0128
end     0    0      -
Token 2: sleep - take the maximum, set back pointers

        0    1      2
start   1    0      0
verb    0    0.1    0.256  (back pointer: noun)
noun    0    0.64   0.0128 (back pointer: noun)
end     0    0      -
Token 3: end - take the maximum, set back pointers

        0    1      2       3
start   1    0      0       0
verb    0    0.1    0.256   -
noun    0    0.64   0.0128  -
end     0    0      -       max(0.256 × 0.7, 0.0128 × 0.1) = 0.1792  (back pointer: verb)
Decode by following the back pointers from the end state: fish = noun, sleep = verb.
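The trellis above can be reproduced by a direct Viterbi implementation. In the sketch below, the transition probabilities not stated in the text (noun→noun 0.1, noun→end 0.1, verb→verb 0.1, verb→noun 0.2) are read off the trellis computations above:

```python
def viterbi(obs, states, start_p, trans_p, emit_p, end_p):
    """Viterbi decoding: best state sequence and its probability,
    using back pointers to recover the path."""
    delta = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        delta.append({})
        back.append({})
        for s in states:
            # Take the maximum over predecessors, set the back pointer
            best_prev = max(states, key=lambda p: delta[t - 1][p] * trans_p[p][s])
            delta[t][s] = delta[t - 1][best_prev] * trans_p[best_prev][s] * emit_p[s][obs[t]]
            back[t][s] = best_prev
    # Transition into the dedicated end state
    last = max(states, key=lambda s: delta[-1][s] * end_p[s])
    prob = delta[-1][last] * end_p[last]
    # Follow the back pointers to recover the path
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), prob

# Parameters of the fish/sleep HMM
states = ["noun", "verb"]
start_p = {"noun": 0.8, "verb": 0.2}
trans_p = {"noun": {"noun": 0.1, "verb": 0.8},
           "verb": {"noun": 0.2, "verb": 0.1}}
end_p = {"noun": 0.1, "verb": 0.7}
emit_p = {"noun": {"fish": 0.8, "sleep": 0.2},
          "verb": {"fish": 0.5, "sleep": 0.5}}

path, prob = viterbi(["fish", "sleep"], states, start_p, trans_p, emit_p, end_p)
print(path, prob)   # best path noun, verb with probability 0.256 * 0.7
```

The intermediate delta values match the trellis cells (0.64 and 0.1 at token 1; 0.256 and 0.0128 at token 2), and the decode follows the back pointers exactly as on the slide.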
Choose the most common tag: 90.3% with a bad unknown-word model, 93.7% with a good one.
TnT (Brants, 2000): a carefully smoothed trigram tagger with suffix trees for emissions; 96.7% on WSJ text.
Unsupervised directions: "A Fully Bayesian Approach to Unsupervised PoS Tagging" (Goldwater & Griffiths, 2007); "Painless Unsupervised Learning with Features" (Berg-Kirkpatrick et al., 2010); "Unsupervised Part-of-Speech Tagging with Anchor Hidden Markov Models" (Stratos et al., 2016).
Noise in the data: there are many errors in the training and test corpora, probably about 2% guaranteed error from noise.
Roadmap of accuracies (known / unknown words):
- Most frequent tag: ~90% / ~50%
- Trigram HMM: ~95% / ~55%
- MaxEnt P(t | w): 93.7% / 82.6%
- TnT (HMM++): 96.2% / 86.0%
- MEMM tagger: 96.9% / 86.9%
- Bidirectional dependencies: 97.2% / 89.0%
Most errors are on unknown words. Upper bound: ~98% (human agreement).
Better features!
- "They left as soon as he arrived." tagged PRP VBD IN RB IN PRP VBD ., but the first "as" should be RB. We could fix this with a feature that looks at the next word.
- "Intrinsic flaws remained undetected." tagged NNP NNS VBD VBN ., but "Intrinsic" should be JJ. We could fix this by linking capitalized words to their lowercase versions.
More general solution: maximum-entropy Markov models (MEMMs).
Reality check: taggers are already pretty good on WSJ journal text. What the world needs is taggers that work on other text!
Slide credits: Dan Klein, Chris Manning, and Jason Eisner; Heng Ji, "POS Tagging and Syntactic Parsing"; Raymond Mooney, "POS Tagging and HMMs".