Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs (based on slides by Sharon Goldwater and Philipp Koehn) 21 February 2018 Nathan Schneider ENLP Lecture 11 21 February 2018

What is part of speech tagging? Given a string: This is a simple sentence Identify parts of speech (syntactic categories): This/DET is/VB a/DET simple/ADJ sentence/NOUN Nathan Schneider ENLP Lecture 11 1

Why do we care about POS tagging? POS tagging is a first step towards syntactic analysis (which in turn, is often useful for semantic analysis). Simpler models and often faster than full parsing, but sometimes enough to be useful. For example, POS tags can be useful features in text classification (see previous lecture) or word sense disambiguation (see later in course). Illustrates the use of hidden Markov models (HMMs), which are also used for many other tagging (sequence labelling) tasks. Nathan Schneider ENLP Lecture 11 2

Examples of other tagging tasks Named entity recognition: e.g., label words as belonging to persons, organizations, locations, or none of the above: Barack/PER Obama/PER spoke/NON from/NON the/NON White/LOC House/LOC today/NON ./NON Information field segmentation: Given a specific type of text (classified advert, bibliography entry), identify which words belong to which fields (price/size/location, author/title/year): 3BR/SIZE flat/TYPE in/NON Bruntsfield/LOC ,/NON near/LOC main/LOC roads/LOC ./NON Bright/FEAT ,/NON well/FEAT maintained/FEAT... Nathan Schneider ENLP Lecture 11 3

Sequence labelling: key features In all of these tasks, deciding the correct label depends on (1) the word to be labeled: NER: Smith is probably a person; POS tagging: chair is probably a noun; and (2) the labels of surrounding words: NER: if the following word is an organization (say Corp.), then this word is more likely to be an organization too; POS tagging: if the preceding word is a modal verb (say will), then this word is more likely to be a verb. An HMM combines these sources of information probabilistically. Nathan Schneider ENLP Lecture 11 4

Parts of Speech: reminder Open class words (or content words): nouns, verbs, adjectives, adverbs; mostly content-bearing: they refer to objects, actions, and features in the world; open class, since there is no limit to what these words are, and new ones are added all the time (email, website). Closed class words (or function words): pronouns, determiners, prepositions, connectives, ...; there is a limited number of these; mostly functional: they tie the concepts of a sentence together. Nathan Schneider ENLP Lecture 11 5

How many parts of speech? Both linguistic and practical considerations; corpus annotators decide. Distinguish between proper nouns (names) and common nouns? Singular and plural nouns? Past and present tense verbs? Auxiliary and main verbs? Etc. Commonly used tagsets for English usually have 40-100 tags. For example, the Penn Treebank has 45 tags. Nathan Schneider ENLP Lecture 11 6

J&M Fig 5.6: Penn Treebank POS tags

POS tags in other languages Morphologically rich languages often have compound morphosyntactic tags, e.g. Noun+A3sg+P2sg+Nom (J&M, p.196): hundreds or thousands of possible combinations. Predicting these requires more complex methods than what we will discuss (e.g., may combine an FST with a probabilistic disambiguation system). Nathan Schneider ENLP Lecture 11 8

Why is POS tagging hard? The usual reasons! Ambiguity: glass of water/NOUN vs. water/VERB the plants; lie/VERB down vs. tell a lie/NOUN; wind/VERB down vs. a mighty wind/NOUN (homographs). How about time flies like an arrow? Sparse data: words we haven't seen before (at all, or in this context); word-tag pairs we haven't seen before (e.g., if we verb a noun). Nathan Schneider ENLP Lecture 11 9

Relevant knowledge for POS tagging Remember, we want a model that decides tags based on (1) the word itself: some words may only be nouns, e.g. arrow; some words are ambiguous, e.g. like, flies; probabilities may help, if one tag is more likely than another; and (2) tags of surrounding words: two determiners rarely follow each other; two base form verbs rarely follow each other; a determiner is almost always followed by an adjective or noun. Nathan Schneider ENLP Lecture 11 10

A probabilistic model for tagging To incorporate these sources of information, we imagine that the sentences we observe were generated probabilistically as follows. To generate a sentence of length n: Let t_0 = <s>. For i = 1 to n: choose a tag conditioned on the previous tag, P(t_i | t_{i-1}); choose a word conditioned on its tag, P(w_i | t_i). So, the model assumes: each tag depends only on the previous tag (a bigram tag model), and words are independent given tags. Nathan Schneider ENLP Lecture 11 11
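
To make the generative story concrete, here is a minimal Python sketch of it (not from the lecture; the nested-dictionary layout for the distributions and the helper names sample_from and generate are assumptions of these notes):

    import random

    def sample_from(dist):
        """Draw one outcome from a dict mapping outcomes to probabilities."""
        r = random.random()
        total = 0.0
        for outcome, p in dist.items():
            total += p
            if r <= total:
                return outcome
        return outcome  # guard against rounding: fall back to the last outcome

    def generate(n, transitions, emissions):
        """Generate a length-n tagged sentence from a bigram HMM.

        transitions[t_prev][t] = P(t | t_prev); emissions[t][w] = P(w | t).
        """
        prev = "<s>"                              # t_0 = <s>
        tagged = []
        for _ in range(n):
            tag = sample_from(transitions[prev])  # choose tag given previous tag
            word = sample_from(emissions[tag])    # choose word given its tag
            tagged.append((word, tag))
            prev = tag
        return tagged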

A probabilistic model for tagging In math: P(T, W) = ∏_{i=1}^{n} P(t_i | t_{i-1}) P(w_i | t_i) Nathan Schneider ENLP Lecture 11 12

A probabilistic model for tagging In math: P(T, W) = [∏_{i=1}^{n} P(t_i | t_{i-1}) P(w_i | t_i)] P(</s> | t_n), where t_0 = <s> and |W| = |T| = n. Nathan Schneider ENLP Lecture 11 13

A probabilistic model for tagging In math: P(T, W) = [∏_{i=1}^{n} P(t_i | t_{i-1}) P(w_i | t_i)] P(</s> | t_n), where t_0 = <s> and |W| = |T| = n. This can be thought of as a language model over words + tags. (Kind of a hybrid of a bigram language model and naïve Bayes.) But typically, we don't know the tags, i.e. they're hidden. It is therefore a bigram hidden Markov model (HMM). Nathan Schneider ENLP Lecture 11 14

Probabilistic finite-state machine One way to view the model: sentences are generated by walking through states in a graph. Each state represents a tag. Prob of moving from state s to s′ (transition probability): P(t_i = s′ | t_{i-1} = s) Nathan Schneider ENLP Lecture 11 15

Example transition probabilities
t_{i-1} \ t_i     NNP      MD       VB       JJ       NN      ...
<s>               0.2767   0.0006   0.0031   0.0453   0.0449  ...
NNP               0.3777   0.0110   0.0009   0.0084   0.0584  ...
MD                0.0008   0.0002   0.7968   0.0005   0.0008  ...
VB                0.0322   0.0005   0.0050   0.0837   0.0615  ...
JJ                0.0306   0.0004   0.0001   0.0733   0.4509  ...
...               ...      ...      ...      ...      ...     ...
Probabilities estimated from a tagged WSJ corpus, showing, e.g.: proper nouns (NNP) often begin sentences: P(NNP | <s>) ≈ 0.28; modal verbs (MD) are nearly always followed by bare verbs (VB); adjectives (JJ) are often followed by nouns (NN). Table excerpted from J&M draft 3rd edition, Fig 8.5. Nathan Schneider ENLP Lecture 11 16

Example transition probabilities (table repeated from the previous slide) This table is incomplete! In the full table, every row must sum to 1, because it is a distribution over the next state (given the previous one). Table excerpted from J&M draft 3rd edition, Fig 8.5. Nathan Schneider ENLP Lecture 11 17

Probabilistic finite-state machine: outputs When passing through each state, emit a word. [Figure: a VB state emitting words such as like and flies.] Prob of emitting w from state s (emission or output probability): P(w_i = w | t_i = s) Nathan Schneider ENLP Lecture 11 18

Example output probabilities
t_i \ w_i   Janet      will       back       the        ...
NNP         0.000032   0          0          0.000048   ...
MD          0          0.308431   0          0          ...
VB          0          0.000028   0.000672   0          ...
DT          0          0          0          0.506099   ...
...         ...        ...        ...        ...        ...
MLE probabilities from a tagged WSJ corpus, showing, e.g.: 0.0032% of proper nouns are Janet: P(Janet | NNP) = 0.000032; about half of determiners (DT) are the; the can also be a proper noun (annotation error?). Again, in the full table, rows would sum to 1. From J&M draft 3rd edition, Fig 8.6. Nathan Schneider ENLP Lecture 11 19

Graphical Model Diagram In graphical model notation, circles = random variables, and each arrow = a conditional probability factor in the joint likelihood: an arrow t_{i-1} → t_i = a lookup in the transition distribution, an arrow t_i → w_i = a lookup in the emission distribution. http://www.cs.virginia.edu/~hw5x/course/cs6501-text-mining/_site/mps/mp3.html Nathan Schneider ENLP Lecture 11 20

What can we do with this model? If we know the transition and output probabilities, we can compute the probability of a tagged sentence. That is, suppose we have sentence W = w_1 ... w_n and its tags T = t_1 ... t_n. What is the probability that our probabilistic FSM would generate exactly that sequence of words and tags, if we stepped through at random? Nathan Schneider ENLP Lecture 11 21

What can we do with this model? If we know the transition and output probabilities, we can compute the probability of a tagged sentence. Suppose we have sentence W = w_1 ... w_n and its tags T = t_1 ... t_n. What is the probability that our probabilistic FSM would generate exactly that sequence of words and tags, if we stepped through at random? This is the joint probability P(W, T) = [∏_{i=1}^{n} P(t_i | t_{i-1}) P(w_i | t_i)] P(</s> | t_n) Nathan Schneider ENLP Lecture 11 22

Example: computing joint prob. P(W, T) What's the probability of this tagged sentence? This/DET is/VB a/DET simple/JJ sentence/NN Nathan Schneider ENLP Lecture 11 23

Example: computing joint prob. P(W, T) What's the probability of this tagged sentence? This/DET is/VB a/DET simple/JJ sentence/NN First, add begin- and end-of-sentence <s> and </s>. Then: P(W, T) = [∏_{i=1}^{n} P(t_i | t_{i-1}) P(w_i | t_i)] P(</s> | t_n) = P(DET | <s>) P(VB | DET) P(DET | VB) P(JJ | DET) P(NN | JJ) P(</s> | NN) × P(This | DET) P(is | VB) P(a | DET) P(simple | JJ) P(sentence | NN). Then, plug in the probabilities we estimated from our corpus. Nathan Schneider ENLP Lecture 11 24
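
A minimal sketch of this computation (illustrative only; the dictionary layout matches the generation sketch above and is an assumption of these notes, and the probabilities themselves would come from your corpus):

    def joint_prob(words, tags, transitions, emissions):
        """P(W, T) under a bigram HMM, including the final transition to </s>."""
        p = 1.0
        prev = "<s>"
        for w, t in zip(words, tags):
            p *= transitions[prev].get(t, 0.0) * emissions[t].get(w, 0.0)
            prev = t
        return p * transitions[prev].get("</s>", 0.0)

    # e.g. joint_prob("This is a simple sentence".split(),
    #                 ["DET", "VB", "DET", "JJ", "NN"], transitions, emissions)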

But... tagging? Normally, we want to use the model to find the best tag sequence for an untagged sentence. Thus, the name of the model: hidden Markov model. Markov: because of the Markov independence assumption (each tag/state only depends on a fixed number of previous tags/states; here, just one). Hidden: because at test time we only see the words/emissions; the tags/states are hidden (or latent) variables. FSM view: given a sequence of words, what is the most probable state path that generated them? Nathan Schneider ENLP Lecture 11 25

Hidden Markov Model (HMM) HMM is actually a very general model for sequences. Elements of an HMM: a set of states (here: the tags); an output alphabet (here: words); an initial state (here: beginning of sentence); state transition probabilities (here: P(t_i | t_{i-1})); symbol emission probabilities (here: P(w_i | t_i)). Nathan Schneider ENLP Lecture 11 26

Relationship to previous models N-gram model: a model for sequences that also makes a Markov assumption, but has no hidden variables. Naïve Bayes: a model with hidden variables (the classes) but no sequential dependencies. HMM: a model for sequences with hidden variables. Like many other models with hidden variables, we will use Bayes Rule to help us infer the values of those variables. (In NLP, we usually assume hidden variables are observed during training though there are unsupervised methods that do not.) Nathan Schneider ENLP Lecture 11 27

Side note for those interested: Relationship to other models Naïve Bayes: a generative model (use Bayes Rule, strong independence assumptions) for classification. MaxEnt: a discriminative model (model P (y x) directly, use arbitrary features) for classification. HMM: a generative model for sequences with hidden variables. MEMM: a discriminative model for sequences with hidden variables. Other sequence models can also use more features than HMM: e.g., Conditional Random Field (CRF) or structured perceptron. Nathan Schneider ENLP Lecture 11 28

Formalizing the tagging problem Find the best tag sequence T for an untagged sentence W: argmax_T P(T | W) Nathan Schneider ENLP Lecture 11 29

Formalizing the tagging problem Find the best tag sequence T for an untagged sentence W: argmax_T P(T | W). Bayes' rule gives us: P(T | W) = P(W | T) P(T) / P(W). We can drop P(W) if we are only interested in argmax_T: argmax_T P(T | W) = argmax_T P(W | T) P(T) Nathan Schneider ENLP Lecture 11 30

Decomposing the model Now we need to compute P(W | T) and P(T) (actually, their product P(W | T) P(T) = P(W, T)). We already defined how! P(T) is the state transition sequence: P(T) = ∏_i P(t_i | t_{i-1}). P(W | T) are the emission probabilities: P(W | T) = ∏_i P(w_i | t_i). Nathan Schneider ENLP Lecture 11 31

Search for the best tag sequence We have defined a model, but how do we use it? Given: word sequence W. Wanted: best tag sequence T. For any specific tag sequence T, it is easy to compute P(W, T) = P(W | T) P(T) = ∏_i P(w_i | t_i) P(t_i | t_{i-1}). So, can't we just enumerate all possible T, compute their probabilities, and choose the best one? Nathan Schneider ENLP Lecture 11 32

Enumeration won't work Suppose we have c possible tags for each of the n words in the sentence. How many possible tag sequences? Nathan Schneider ENLP Lecture 11 33

Enumeration won't work Suppose we have c possible tags for each of the n words in the sentence. How many possible tag sequences? There are c^n possible tag sequences: the number grows exponentially in the length n. For all but small n, too many sequences to efficiently enumerate. Nathan Schneider ENLP Lecture 11 34
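
To make the growth concrete (an illustrative calculation, not from the slides): with the Penn Treebank's 45 tags and a 20-word sentence, there are 45^20 ≈ 1.2 × 10^33 candidate tag sequences, far too many to score one by one.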

The Viterbi algorithm We'll use a dynamic programming algorithm to solve the problem. Dynamic programming algorithms order the computation efficiently so partial values can be computed once and reused. The Viterbi algorithm finds the best tag sequence without explicitly enumerating all sequences. Partial results are stored in a chart to avoid recomputing them. Details next time. Nathan Schneider ENLP Lecture 11 35
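
Although the details are left for next time, here is a minimal Viterbi sketch for this bigram HMM (same assumed dictionary layout as the sketches above; probabilities are multiplied directly rather than summed in log space, purely for readability):

    def viterbi(words, tagset, transitions, emissions):
        """Most probable tag sequence for words under a bigram HMM."""
        best = [{}]   # best[i][t] = highest prob of any tag sequence ending in t at position i
        back = [{}]   # back[i][t] = previous tag on that best path
        for t in tagset:
            best[0][t] = transitions["<s>"].get(t, 0.0) * emissions[t].get(words[0], 0.0)
            back[0][t] = "<s>"
        for i in range(1, len(words)):
            best.append({})
            back.append({})
            for t in tagset:
                prob, prev = max(((best[i-1][tp] * transitions[tp].get(t, 0.0), tp)
                                  for tp in tagset), key=lambda x: x[0])
                best[i][t] = prob * emissions[t].get(words[i], 0.0)
                back[i][t] = prev
        # include the final transition to </s>, then follow the backpointers
        last = max(tagset, key=lambda t: best[-1][t] * transitions[t].get("</s>", 0.0))
        tags = [last]
        for i in range(len(words) - 1, 0, -1):
            tags.append(back[i][tags[-1]])
        return list(reversed(tags))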

Viterbi as a decoder The problem of finding the best tag sequence for a sentence is sometimes called decoding. Because, like spelling correction etc., an HMM can also be viewed as a noisy channel model. Someone wants to send us a sequence of tags: P(T). During encoding, noise converts each tag to a word: P(W | T). We try to decode the observed words back to the original tags. In fact, decoding is a general term in NLP for inferring the hidden variables in a test instance (so, finding the correct spelling of a misspelled word is also decoding). Nathan Schneider ENLP Lecture 11 36

Computing marginal prob. P(W) Recall that the HMM can be thought of as a language model over words AND tags. What about estimating probabilities of JUST the words of a sentence? Nathan Schneider ENLP Lecture 11 37

Computing marginal prob. P(W) Recall that the HMM can be thought of as a language model over words AND tags. What about estimating probabilities of JUST the words of a sentence? P(W) = Σ_T P(W, T) Nathan Schneider ENLP Lecture 11 38

Computing marginal prob. P(W) Recall that the HMM can be thought of as a language model over words AND tags. What about estimating probabilities of JUST the words of a sentence? P(W) = Σ_T P(W, T). Again, we cannot enumerate all possible taggings T. Instead, use the forward algorithm (a dynamic programming algorithm closely related to Viterbi; see the textbook if interested). It could be used to measure perplexity of held-out data. Nathan Schneider ENLP Lecture 11 39
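
For comparison, a sketch of the forward algorithm under the same assumptions; it has the same shape as the Viterbi sketch above but sums over previous tags instead of maximising:

    def forward_prob(words, tagset, transitions, emissions):
        """P(W): total probability of the words, summed over all tag sequences."""
        # alpha[t] = probability of the words so far, ending in tag t
        alpha = {t: transitions["<s>"].get(t, 0.0) * emissions[t].get(words[0], 0.0)
                 for t in tagset}
        for w in words[1:]:
            alpha = {t: emissions[t].get(w, 0.0) *
                        sum(alpha[tp] * transitions[tp].get(t, 0.0) for tp in tagset)
                     for t in tagset}
        return sum(alpha[t] * transitions[t].get("</s>", 0.0) for t in tagset)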

Supervised learning The HMM consists of transition probabilities and emission probabilities. How can these be learned? Nathan Schneider ENLP Lecture 11 40

Supervised learning The HMM consists of transition probabilities and emission probabilities. How can these be estimated? From counts in a treebank such as the Penn Treebank! Nathan Schneider ENLP Lecture 11 41

Supervised learning The HMM consists of transition probabilities (given tag t, what is the probability that tag t′ follows?) and emission probabilities (given tag t, what is the probability that the word is w?). How can these be estimated? From counts in a treebank such as the Penn Treebank! Nathan Schneider ENLP Lecture 11 42
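
A sketch of that estimation, assuming the treebank has been read into a list of sentences, each a list of (word, tag) pairs (this data layout is an assumption of these notes; the estimates are plain, unsmoothed relative frequencies):

    from collections import Counter, defaultdict

    def estimate_hmm(tagged_sentences):
        """MLE transition and emission probabilities from a tagged corpus."""
        trans_counts = defaultdict(Counter)   # trans_counts[t_prev][t]
        emit_counts = defaultdict(Counter)    # emit_counts[t][w]
        for sentence in tagged_sentences:
            prev = "<s>"
            for word, tag in sentence:
                trans_counts[prev][tag] += 1
                emit_counts[tag][word] += 1
                prev = tag
            trans_counts[prev]["</s>"] += 1   # count the final transition to </s>
        def normalise(counts):
            return {ctx: {x: c / sum(dist.values()) for x, c in dist.items()}
                    for ctx, dist in counts.items()}
        return normalise(trans_counts), normalise(emit_counts)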

Do transition & emission probs. need smoothing? Nathan Schneider ENLP Lecture 11 43

Do transition & emission probs. need smoothing? Emissions: yes, because if there is any word w in the test data such that P(w_i = w | t_i = t) = 0 for all tags t, the whole joint probability will go to 0. Transitions: not necessarily, but if any transition probabilities are estimated as 0, that tag bigram will never be predicted. What are some transitions that should NEVER occur in a bigram HMM? Nathan Schneider ENLP Lecture 11 44

Do transition & emission probs. need smoothing? Emissions: yes, because if there is any word w in the test data such that P(w_i = w | t_i = t) = 0 for all tags t, the whole joint probability will go to 0. Transitions: not necessarily, but if any transition probabilities are estimated as 0, that tag bigram will never be predicted. What are some transitions that should NEVER occur in a bigram HMM? For example, <s> → </s> (an empty sentence) and </s> → <s>. Nathan Schneider ENLP Lecture 11 45
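
As one simple illustration of emission smoothing (the lecture does not prescribe a particular method, so the add-alpha scheme and the fixed vocabulary with an UNK token below are assumptions of these notes):

    def smoothed_emissions(emit_counts, vocab, alpha=0.1):
        """Add-alpha smoothed P(w | t) over a fixed vocabulary (include an UNK token)."""
        probs = {}
        for tag, dist in emit_counts.items():
            total = sum(dist.values()) + alpha * len(vocab)
            probs[tag] = {w: (dist.get(w, 0) + alpha) / total for w in vocab}
        return probs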

Unsupervised learning With the number of hidden tags specified but no tagged training data, the learning is unsupervised. The Forward-Backward algorithm, a.k.a. Baum-Welch EM, clusters the data into tags that will give the training data high probability under the HMM. This is used in speech recognition. See the textbook for details if interested. Nathan Schneider ENLP Lecture 11 46

Higher-order HMMs The Markov part means we ignore history of more than a fixed number of words. Equations thus far have been for a bigram HMM: i.e., transitions are P(t_i | t_{i-1}). But as with language models, we can increase the N in the N-gram: trigram HMM transition probabilities are P(t_i | t_{i-2}, t_{i-1}), etc. As usual, smoothing the transition distributions becomes more important with higher-order models. Nathan Schneider ENLP Lecture 11 47
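
For instance (a straightforward generalisation, not written out on the slide), the trigram HMM pads the tag sequence with two start symbols, and the joint probability becomes: P(T, W) = [∏_{i=1}^{n} P(t_i | t_{i-2}, t_{i-1}) P(w_i | t_i)] P(</s> | t_{n-1}, t_n), with t_{-1} = t_0 = <s>.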

Summary Part-of-speech tagging is a sequence labelling task. An HMM uses two sources of information to help resolve ambiguity in a word's POS tag: the word itself, and the tags assigned to surrounding words. It can be viewed as a probabilistic FSM. Given a tagged sentence, it is easy to compute its probability. But finding the best tag sequence will need a clever algorithm. Nathan Schneider ENLP Lecture 11 48