Statistical methods in NLP, lecture 7 Tagging and parsing

1 Statistical methods in NLP, lecture 7 Tagging and parsing Richard Johansson February 25, 2014

2 overview of today's lecture
HMM tagging recap
assignment 3
PCFG recap
dependency parsing
VG assignment 1

3 overview
HMM tagging recap
PCFGs, dependency parsing, and VG assignment 1
the next few weeks

4 statistical taggers
the typical probabilistic formulation of a tagger starts from Bayes' rule:
$$\arg\max_T P(T \mid W) = \arg\max_T \frac{P(W \mid T)\,P(T)}{P(W)} = \arg\max_T P(W \mid T)\,P(T)$$
$P(T)$ is like a language model, but for tag sequences instead of word sequences

5 making the probabilities practical
we need to make assumptions about $P(T)$ and $P(W \mid T)$
in a bigram tagger, the probability of the next tag depends only on the previous tag (Markov assumption):
$$P(t_n \mid t_1, \ldots, t_{n-1}) \approx P(t_n \mid t_{n-1})$$
this is called the transition probability
the probability of a word depends only on its tag:
$$P(w_n \mid \text{tags}, \text{other words}) \approx P(w_n \mid t_n)$$
this is called the emission probability
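
Combining the two assumptions gives the quantity a bigram tagger maximizes. The following is a sketch of that factorization; the start symbol $t_0$ is an assumption added here so that the first tag also has a transition, it is not named on the slide.

```latex
\begin{align*}
P(T) &\approx \prod_{n=1}^{N} P(t_n \mid t_{n-1}) \\
P(W \mid T) &\approx \prod_{n=1}^{N} P(w_n \mid t_n) \\
\hat{T} &= \arg\max_{T} \prod_{n=1}^{N} P(t_n \mid t_{n-1})\, P(w_n \mid t_n)
\end{align*}
```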

6 hidden Markov models
transition probabilities: $P(t_n \mid t_{n-1})$
emission probabilities: $P(w_n \mid t_n)$
a model where we have an unknown underlying sequence is called a hidden Markov model (HMM)

7 generative story in hidden Markov models
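
The generative story can be sketched in code: pick a tag given the previous tag, let that tag emit a word, and repeat until the end marker is drawn. This is a minimal illustration only; the probability tables, the `<S>` and `<E>` markers, and the function names are all made up for the example.

```python
import random

# Toy, made-up probabilities for illustration only.
TRANSITIONS = {
    "<S>": {"DT": 0.7, "NN": 0.3},
    "DT": {"NN": 1.0},
    "NN": {"VBD": 0.6, "<E>": 0.4},
    "VBD": {"DT": 0.5, "<E>": 0.5},
}
EMISSIONS = {
    "DT": {"the": 0.8, "a": 0.2},
    "NN": {"man": 0.5, "saw": 0.2, "dog": 0.3},
    "VBD": {"saw": 0.7, "barked": 0.3},
}


def draw(dist, rng):
    """Sample one key from an {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def generate(rng, max_len=20):
    """Generate (word, tag) pairs: choose a tag from the previous tag, then emit a word."""
    sentence = []
    prev = "<S>"
    while len(sentence) < max_len:
        tag = draw(TRANSITIONS[prev], rng)   # transition step
        if tag == "<E>":
            break
        word = draw(EMISSIONS[tag], rng)     # emission step
        sentence.append((word, tag))
        prev = tag
    return sentence


print(generate(random.Random(0)))
```

When tagging, only the words are observed; the tags are the hidden part that has to be recovered.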

12 how can we estimate the probabilities?
to estimate $P(t_n \mid t_{n-1})$ and $P(w_n \mid t_n)$, we need a corpus where the part-of-speech tags have been annotated (by humans)
The/DT rifles/NNS were/VBD n't/RB loaded/VBN ./.
As/IN interest/NN rates/NNS rose/VBD ,/, ...

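13 estimating the probabilities
just like we did with the Naive Bayes classifier, we estimate the probabilities by counting frequencies (MLE):
$$P_{MLE}(\text{noun} \mid \text{verb}) = \frac{\text{count}(\text{verb}, \text{noun})}{\text{count}(\text{verb})}$$
$$P_{MLE}(\text{cat} \mid \text{noun}) = \frac{\text{count}(\text{noun: cat})}{\text{count}(\text{noun})}$$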
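
The counting can be sketched as follows. This is one possible layout, not the assignment's reference solution: the function name `estimate`, the `<S>` start marker, and the choice to divide transitions by how often a tag occurs as a left context are assumptions of this example. The toy corpus is the two annotated fragments shown earlier.

```python
from collections import Counter


def estimate(tagged_sentences):
    """Return MLE transition and emission probability functions from a tagged corpus."""
    tag_counts = Counter()         # count(tag)
    context_counts = Counter()     # count(previous tag), i.e. tag seen as left context
    transition_counts = Counter()  # count(previous tag, tag)
    emission_counts = Counter()    # count(tag: word)
    for sentence in tagged_sentences:
        prev = "<S>"               # assumed start marker
        for word, tag in sentence:
            context_counts[prev] += 1
            transition_counts[prev, tag] += 1
            emission_counts[tag, word] += 1
            tag_counts[tag] += 1
            prev = tag

    def p_transition(tag, prev):
        if context_counts[prev] == 0:
            return 0.0
        return transition_counts[prev, tag] / context_counts[prev]

    def p_emission(word, tag):
        if tag_counts[tag] == 0:
            return 0.0
        return emission_counts[tag, word] / tag_counts[tag]

    return p_transition, p_emission


corpus = [
    [("The", "DT"), ("rifles", "NNS"), ("were", "VBD"), ("n't", "RB"), ("loaded", "VBN"), (".", ".")],
    [("As", "IN"), ("interest", "NN"), ("rates", "NNS"), ("rose", "VBD"), (",", ",")],
]
p_transition, p_emission = estimate(corpus)
print(p_transition("VBD", "NNS"))   # both NNS tokens are followed by VBD
print(p_emission("rifles", "NNS"))  # one of the two NNS tokens is "rifles"
```

Unseen events get probability 0 here, which is the problem smoothing addresses.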

14 smoothing...
the tag transition model may need to be smoothed, just like the language models
in particular, if the training corpus is small or the tagset contains many tags
here is how we do linear interpolation smoothing of the tag probabilities:
$$P_{LI}(t_n \mid t_{n-1}) = \lambda_2 \frac{\text{count}(t_{n-1}, t_n)}{\text{count}(t_{n-1})} + \lambda_1 \frac{\text{count}(t_n)}{\text{count}(\text{any tag})}$$
usually there is some special treatment for the emission probability $P(w_n \mid t_n)$ if $w_n$ is unseen in the training corpus
taking for instance punctuation, capitalization, numbers, suffixes into account
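
A minimal sketch of the interpolation formula, assuming the counts are kept in `Counter` objects. The weights 0.8 and 0.2 are placeholders chosen for the example (they sum to 1); in practice they would typically be tuned on held-out data. The counts are made up.

```python
from collections import Counter


def interpolated_transition(tag, prev, bigram_counts, unigram_counts,
                            lambda2=0.8, lambda1=0.2):
    """P_LI(tag | prev) = lambda2 * bigram estimate + lambda1 * unigram estimate."""
    total = sum(unigram_counts.values())  # count(any tag)
    if unigram_counts[prev]:
        bigram = bigram_counts[prev, tag] / unigram_counts[prev]
    else:
        bigram = 0.0                      # previous tag never seen
    unigram = unigram_counts[tag] / total if total else 0.0
    return lambda2 * bigram + lambda1 * unigram


# Made-up counts for illustration.
unigram_counts = Counter({"DT": 4, "NN": 5, "VBD": 1})
bigram_counts = Counter({("DT", "NN"): 4, ("NN", "VBD"): 1})

print(interpolated_transition("NN", "DT", bigram_counts, unigram_counts))
print(interpolated_transition("VBD", "DT", bigram_counts, unigram_counts))
```

The second call shows the point of smoothing: the bigram DT VBD was never observed, yet it still receives a small nonzero probability from the unigram term.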

15 tagging
how do we use our probability model to tag?
conceptually: enumerate all possible tag sequences; use the probabilities to find the best one
however, in long sentences, the number of possible tag sequences is very large (it grows exponentially with sentence length)
the Viterbi algorithm finds the most probable underlying tag sequence
because of the Markov assumption, this algorithm is very efficient
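
The conceptual approach can be written down directly, which also makes the cost visible: with K tags and n words there are K^n sequences to score. The sketch below uses made-up probability tables, an assumed `<S>` start marker, and a small floor probability for unseen events as a stand-in for proper smoothing.

```python
from itertools import product
from math import log

TAGS = ["DT", "NN", "VB"]
# Made-up probabilities: (previous tag, tag) -> P(tag | previous tag)
TRANS = {("<S>", "DT"): 0.8, ("<S>", "NN"): 0.2,
         ("DT", "NN"): 0.9, ("DT", "VB"): 0.1,
         ("NN", "VB"): 0.5, ("NN", "NN"): 0.3,
         ("VB", "DT"): 0.4, ("VB", "NN"): 0.3}
# Made-up probabilities: (tag, word) -> P(word | tag)
EMIT = {("DT", "the"): 1.0,
        ("NN", "man"): 0.6, ("NN", "saw"): 0.4,
        ("VB", "man"): 0.1, ("VB", "saw"): 0.9}
FLOOR = 1e-6  # assumed floor for unseen events, instead of real smoothing


def lp(table, key):
    """Log probability of an event, falling back to the floor if unseen."""
    return log(table.get(key, FLOOR))


def brute_force_tag(words):
    """Score every possible tag sequence and return the highest-scoring one."""
    best_score, best_sequence = float("-inf"), None
    for sequence in product(TAGS, repeat=len(words)):
        score, prev = 0.0, "<S>"
        for word, tag in zip(words, sequence):
            score += lp(TRANS, (prev, tag)) + lp(EMIT, (tag, word))
            prev = tag
        if score > best_score:
            best_score, best_sequence = score, list(sequence)
    return best_sequence


words = ["the", "man", "saw"]
print(len(TAGS) ** len(words), "candidate sequences")
print(brute_force_tag(words))
```

With a realistic tagset of a few dozen tags and a 20-word sentence, this enumeration is no longer feasible, which is why a dynamic programming algorithm is needed.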

16 the Viterbi algorithm
for each possible tag $t_i$ of a word $w_i$, we compute the best tag sequence leading to $t_i$
for instance: for the word saw, we find the best sequence ending with saw as a verb, and the best ending with saw as a noun

17 the trick
to compute the best path ending with saw as a verb, consider the best paths for the previous word and the transition probabilities
assume the previous word is e.g. man, which can be a noun or a verb
select the highest of the following, where LP is the log probability:
the LP of the best path ending in man as a verb + the LP of the transition verb → verb
the LP of the best path ending in man as a noun + the LP of the transition noun → verb
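
A single step of this comparison might look as follows. All numbers are hypothetical. The emission log probability of saw as a verb is the same for both candidates, so in this sketch it is added after the best predecessor has been chosen.

```python
from math import log

# Hypothetical LPs of the best paths ending at the previous word "man".
best_lp_man = {"noun": log(0.03), "verb": log(0.002)}
# Hypothetical transition LPs into the tag "verb".
lp_transition_to_verb = {"noun": log(0.4), "verb": log(0.05)}
# Hypothetical emission LP of "saw" given "verb".
lp_emit_saw_as_verb = log(0.01)


def best_predecessor(best_lp_prev, lp_transition):
    """Return (previous tag, score) maximizing best-path LP + transition LP."""
    return max(
        ((prev, best_lp_prev[prev] + lp_transition[prev]) for prev in best_lp_prev),
        key=lambda pair: pair[1],
    )


prev_tag, score = best_predecessor(best_lp_man, lp_transition_to_verb)
best_lp_saw_verb = score + lp_emit_saw_as_verb
print(prev_tag, best_lp_saw_verb)
```

The chosen `prev_tag` is what gets stored as the backpointer for saw as a verb.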

18 tagging a sentence
apply the Viterbi algorithm step by step
after the last token of the sentence, add a special dummy end token
this token will emit a dummy end tag with probability 1
the best tag sequence for the whole sentence is the best path ending in the dummy tag
finally, retrace your steps from the dummy item to get the tags
so you need backpointers
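
Put together, the procedure could be sketched like this. It is one possible implementation under stated assumptions, not the assignment solution: the probability tables are made up, `<S>` and `<E>` are assumed start and end markers, and unseen events get a small floor probability instead of real smoothing.

```python
from math import log

TAGS = ["DT", "NN", "VB"]
# Made-up probabilities: (previous tag, tag) -> P(tag | previous tag)
TRANS = {("<S>", "DT"): 0.8, ("<S>", "NN"): 0.2,
         ("DT", "NN"): 0.9, ("DT", "VB"): 0.1,
         ("NN", "VB"): 0.5, ("NN", "NN"): 0.3, ("NN", "<E>"): 0.2,
         ("VB", "DT"): 0.4, ("VB", "NN"): 0.3, ("VB", "<E>"): 0.3}
# Made-up probabilities: (tag, word) -> P(word | tag)
EMIT = {("DT", "the"): 1.0,
        ("NN", "man"): 0.6, ("NN", "saw"): 0.4,
        ("VB", "man"): 0.1, ("VB", "saw"): 0.9}
FLOOR = 1e-6  # assumed floor for unseen events


def lp(table, key):
    """Log probability of an event, falling back to the floor if unseen."""
    return log(table.get(key, FLOOR))


def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy bigram HMM."""
    if not words:
        return []
    # best[i][tag]: LP of the best path ending in `tag` at position i
    # back[i][tag]: previous tag on that path (the backpointer)
    best = [{tag: lp(TRANS, ("<S>", tag)) + lp(EMIT, (tag, words[0])) for tag in TAGS}]
    back = [{tag: None for tag in TAGS}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for tag in TAGS:
            prev = max(TAGS, key=lambda p: best[i - 1][p] + lp(TRANS, (p, tag)))
            best[i][tag] = (best[i - 1][prev] + lp(TRANS, (prev, tag))
                            + lp(EMIT, (tag, words[i])))
            back[i][tag] = prev
    # Dummy end token: it emits the end tag with probability 1 (LP 0),
    # so only the transition into <E> contributes.
    last = max(TAGS, key=lambda p: best[-1][p] + lp(TRANS, (p, "<E>")))
    # Retrace the backpointers from the end to recover the tags.
    tags = [last]
    for i in range(len(words) - 1, 0, -1):
        tags.append(back[i][tags[-1]])
    tags.reverse()
    return tags


print(viterbi(["the", "man", "saw"]))
```

Each position does work proportional to the square of the number of tags, so the total cost grows linearly with sentence length rather than exponentially.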

19 Viterbi example

31 using more context
tagging accuracy can possibly be improved by using more contextual information
in a trigram tagger, we use transition probabilities such as $P(t_n \mid t_{n-1}, t_{n-2})$
smoothing becomes more important as you use more context

32 Search spaces...
example: Will plays golf
candidate tags: Will: NN, MD, NNP; plays: VBZ, NNS; golf: NN, VB; end: <E>
tag-pair states, one row per position:
/NN, /MD, /NNP
NN/VBZ, NN/NNS, MD/VBZ, MD/NNS, NNP/VBZ, NNP/NNS
VBZ/NN, VBZ/VB, NNS/NN, NNS/VB
NN/<E>, VB/<E>
<E>/<E>
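
One reading of this slide is that a tagger using two tags of context searches over pairs of adjacent tags rather than single tags. Under that reading, the pair states can be enumerated as below. The `<S>` start marker and the function name `pair_states` are assumptions of this sketch; the candidate tags are the ones listed for the example.

```python
from itertools import product

candidates = [
    ("Will", ["NN", "MD", "NNP"]),
    ("plays", ["VBZ", "NNS"]),
    ("golf", ["NN", "VB"]),
]


def pair_states(candidates, start="<S>", end="<E>"):
    """List the tag-pair states at each position, padded with start and end markers."""
    columns = [[start]] + [tags for _, tags in candidates] + [[end], [end]]
    return [
        [f"{left}/{right}" for left, right in product(columns[i], columns[i + 1])]
        for i in range(len(columns) - 1)
    ]


states = pair_states(candidates)
for column in states:
    print(" ".join(column))
print([len(column) for column in states])
```

The number of states per position is the product of the candidate counts of two neighbouring words, so the search space grows as more context is used.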

33 assignment 3
write a bigram part-of-speech tagger in Python
estimate the emission and transition probabilities
implement the Viterbi algorithm
evaluate the tagger on a test set
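
For the evaluation step, a common measure is per-token accuracy: the fraction of tokens whose predicted tag matches the annotated one. The slide does not say which measure to use, so the sketch below is an assumption, as is the function name `tagging_accuracy`.

```python
def tagging_accuracy(gold_sentences, predicted_sentences):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        if len(gold) != len(predicted):
            raise ValueError("gold and predicted sentences must have the same length")
        for gold_tag, predicted_tag in zip(gold, predicted):
            correct += gold_tag == predicted_tag  # True counts as 1
            total += 1
    return correct / total if total else 0.0


gold = [["DT", "NNS", "VBD"], ["IN", "NN"]]
predicted = [["DT", "NN", "VBD"], ["IN", "NN"]]
print(tagging_accuracy(gold, predicted))  # 4 of 5 tags match
```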

36 the computer assignments
assignment 3: tagger implementation
February 27 and March 4
report deadline: March 18

37 next lectures
March 6: unsupervised and semisupervised methods
March 11: machine translation (with Prasanth)
March 18 and 20: VG assignment lab sessions (and catchup)
