ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging

Stephen Clark
Natural Language and Information Processing (NLIP) Group
sc609@cam.ac.uk

The POS Tagging Problem

England/NNP 's/POS fencers/NNS won/VBD gold/NN on/IN day/NN 4/CD in/IN Delhi/NNP with/IN a/DT medal/JJ -winning/JJ performance/NN ./.

This/DT is/VBZ Dr./NNP Black/NNP 's/POS second/JJ gold/NN of/IN the/DT Games/NNP ./.

The problem is difficult because of ambiguity.

Part-of-Speech (POS) Tagging

Task: given a set of POS tags and a sentence, assign a POS tag to each word

What knowledge is required and where does it come from?
- a tag dictionary plus contextual statistical models
- the dictionary and probabilities are obtained from labelled data

What's the algorithm for assigning the tags?
- the Viterbi algorithm for labelled sequences

Probabilistic Formulation

$y^* = \arg\max_{y \in \mathcal{Y}} P(y \mid x)$

where $x = (x_1, \ldots, x_n)$ is a sentence and $y = (y_1, \ldots, y_n) \in \mathcal{Y}$ is a possible tag sequence for $x$

Two problems:
- where do the probabilities come from? (an age-old question in statistical approaches to AI)
- how do we find the arg max?

Problem 1 is the problem of model estimation; Problem 2 is the search problem.
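To make the search problem concrete, here is a minimal sketch (not from the slides) of the naive solution: enumerate every tag sequence and keep the best one under some scoring function standing in for $P(y \mid x)$. The function name and interface are illustrative assumptions.

```python
from itertools import product

def brute_force_tag(sentence, tagset, score):
    """Return the highest-scoring tag sequence for `sentence` by exhaustive
    enumeration. `score(sentence, tags)` stands in for P(y | x); this is
    feasible only for tiny inputs, since there are |tagset|**len(sentence)
    candidate sequences."""
    best_tags, best = None, float("-inf")
    for tags in product(tagset, repeat=len(sentence)):
        s = score(sentence, tags)
        if s > best:
            best_tags, best = tags, s
    return list(best_tags)
```

With a 45-tag tagset (roughly the size of the Penn Treebank tagset) and a 20-word sentence, this loop would have to visit $45^{20}$ sequences, which is why a dynamic programming algorithm such as Viterbi is needed.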

Statistical Methods in NLP (and AI)

In 1990, fewer than 5% of papers at an ACL conference used statistical methods; now it's more like 95%.

How did this paradigm change come about?

Our Statistical NLP Hero

Fred Jelinek (1932-2010)

An Historical Aside

Speech recognition
- originally used a rule-based approach based on linguistic expertise
- work in the 70s at IBM showed that a data-driven approach worked much better

Statistical MT
- IBM applied similar statistical models to translation in the early 90s
- initially there was a lot of scepticism and resistance, but it is now the dominant approach (and used by Google)

Noisy Channel Model

[Diagram: SOURCE -> words -> noisy channel -> noisy words -> DECODER -> guess at original words]

Speech Recognition as a Noisy Channel

[Diagram: SOURCE (Language Model $P(W)$) -> $W$ -> NOISY CHANNEL (Acoustic Model $P(A \mid W)$) -> $A$ -> DECODER -> $W^* = \arg\max_W P(W \mid A)$]

- the speaker has word sequence $W$
- $W$ is articulated as acoustic sequence $A$
- this process introduces noise: variation in pronunciation, acoustic variation due to the microphone, etc.

Bayes' theorem gives us:

$W^* = \arg\max_W P(W \mid A) = \arg\max_W \underbrace{P(A \mid W)}_{\text{likelihood}} \; \underbrace{P(W)}_{\text{prior}}$

Machine Translation as a Noisy Channel

[Diagram: SOURCE (Language Model $P(e)$) -> $e$ -> NOISY CHANNEL (Translation Model $P(f \mid e)$) -> $f$ -> DECODER -> $e^* = \arg\max_e P(e \mid f)$]

- translating a French sentence ($f$) into an English sentence ($e$)
- the French speaker has an English sentence in mind ($P(e)$)
- the English sentence comes out as French via the noisy channel ($P(f \mid e)$)

POS Tagging

We can use the same mathematics of the noisy channel to model the POS tagging problem

Breaking the problem into two parts makes the modelling easier:
- we can focus on tag transition and word probabilities separately
- it allows convenient independence assumptions to be made

$T^* = \arg\max_T P(T \mid W) = \arg\max_T P(W \mid T)\, P(T)$

Hidden Markov Models (HMMs)

$P(T \mid W) = \frac{P(W \mid T)\, P(T)}{P(W)}$   (Bayes' theorem)

$\arg\max_T P(T \mid W) = \arg\max_T P(W \mid T)\, P(T)$   ($W$ is constant)

Using the Chain Rule and (Markov) independence assumptions:

$P(W \mid T) = P(w_1, \ldots, w_n \mid t_1, \ldots, t_n) = P(w_1 \mid t_1, \ldots, t_n)\, P(w_2 \mid w_1, t_1, \ldots, t_n) \cdots P(w_n \mid w_{n-1}, \ldots, w_1, t_1, \ldots, t_n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$

$P(T) = P(t_1, \ldots, t_n) = P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_2, t_1) \cdots P(t_n \mid t_{n-1}, \ldots, t_1) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$
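As a concrete illustration of these two factorisations, the following minimal sketch (an illustration, not code from the lecture) scores a word/tag sequence pair under the bigram HMM. Here `trans` and `emit` are assumed to be dictionaries holding the transition and emission probabilities, and the `<s>` start symbol is one conventional way of handling $P(t_1)$.

```python
def hmm_joint_prob(words, tags, trans, emit, start="<s>"):
    """P(W, T) = P(W | T) P(T) under the HMM independence assumptions:
    P(T) ~= prod_i P(t_i | t_{i-1}) and P(W | T) ~= prod_i P(w_i | t_i).
    trans[(t_prev, t)] and emit[(w, t)] are probability look-up tables
    (illustrative data structures, not specified in the lecture)."""
    prob, prev = 1.0, start
    for w, t in zip(words, tags):
        prob *= trans.get((prev, t), 0.0) * emit.get((w, t), 0.0)
        prev = t
    return prob
```

In practice, log probabilities are used instead of raw products to avoid numerical underflow on long sentences.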

N-gram Generative Taggers

- a tagger which conditions on the previous tag is called a bigram tagger
- trigram taggers are typically used (conditioning on the previous 2 tags)
- HMM taggers use a generative model, so-called because the tags and words can be thought of as being generated according to some stochastic process
- more sophisticated discriminative models (e.g. CRFs) can condition on more aspects of the context, e.g. suffix information

Parameter Estimation

Two sets of parameters:
- $P(t_i \mid t_{i-1})$: tag transition probabilities
- $P(w_i \mid t_i)$: word emission probabilities
  - note this is not $P(t_i \mid w_i)$ (it is reversed because of the use of Bayes' theorem)
  - one of the original papers on stochastic POS tagging reportedly got this wrong

Estimation is based on counting from manually labelled corpora, so we have a supervised machine learning approach

For this problem, the simple counting (relative frequency) method gives the maximum likelihood estimates

Relative Frequency Estimation

$\hat{P}(t_i \mid t_{i-1}) = \frac{f(t_{i-1}, t_i)}{f(t_{i-1})}$

where $f(t_{i-1}, t_i)$ is the number of times $t_i$ follows $t_{i-1}$ in the training data, and $f(t_{i-1})$ is the number of times $t_{i-1}$ appears in the data

$\hat{P}(w_i \mid t_i) = \frac{f(w_i, t_i)}{f(t_i)}$

where $f(w_i, t_i)$ is the number of times $w_i$ has tag $t_i$ in the training data

It turns out that for an HMM these intuitive relative frequency estimates are the estimates which maximise the probability of the training data

What if the numerator (or denominator) is zero?
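A minimal counting sketch of these relative frequency estimates, assuming the labelled corpus is available as a list of (word, tag) sequences (the data format, function name, and `<s>` start symbol are illustrative assumptions, not from the slides):

```python
from collections import Counter

def estimate_hmm(tagged_sentences, start="<s>"):
    """Relative frequency (maximum likelihood) estimates of the transition
    and emission probabilities from tagged sentences, where each sentence
    is a list of (word, tag) pairs."""
    tag_counts = Counter()      # f(t)
    trans_counts = Counter()    # f(t_{i-1}, t_i)
    emit_counts = Counter()     # f(w_i, t_i)
    for sentence in tagged_sentences:
        prev = start
        tag_counts[start] += 1
        for word, tag in sentence:
            trans_counts[(prev, tag)] += 1
            emit_counts[(word, tag)] += 1
            tag_counts[tag] += 1
            prev = tag
    trans = {(p, t): n / tag_counts[p] for (p, t), n in trans_counts.items()}
    emit = {(w, t): n / tag_counts[t] for (w, t), n in emit_counts.items()}
    return trans, emit
```

Any tag pair or word/tag pair never seen in training gets probability zero here, which is exactly the zero-count question raised above; smoothing techniques address it.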

Search

Why is there a search problem?
- there is an exponential number of tag sequences for a sentence (exponential in the length)
- finding the highest scoring sequence of tags is complicated by the n-th order Markov assumption (n > 0)

More on this next time
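As a preview, here is a minimal sketch of the Viterbi algorithm mentioned earlier, for a bigram HMM (an illustrative implementation, not the lecture's own code): it finds the arg max in $O(n \cdot |\text{tagset}|^2)$ time by dynamic programming instead of enumerating all $|\text{tagset}|^n$ sequences. `trans` and `emit` are assumed to be the probability tables estimated above.

```python
def viterbi(words, tagset, trans, emit, start="<s>"):
    """Return arg max_T P(W | T) P(T) for a bigram HMM by dynamic programming.
    delta[t] holds the best score of any tag sequence for the words seen so
    far that ends in tag t; back-pointers record how that score was reached."""
    delta = {t: trans.get((start, t), 0.0) * emit.get((words[0], t), 0.0)
             for t in tagset}
    backpointers = []
    for w in words[1:]:
        new_delta, back = {}, {}
        for t in tagset:
            prev = max(tagset, key=lambda p: delta[p] * trans.get((p, t), 0.0))
            new_delta[t] = (delta[prev] * trans.get((prev, t), 0.0)
                            * emit.get((w, t), 0.0))
            back[t] = prev
        delta, backpointers = new_delta, backpointers + [back]
    # Recover the best sequence by following back-pointers from the best final tag
    best = max(tagset, key=lambda t: delta[t])
    tags = [best]
    for back in reversed(backpointers):
        tags.append(back[tags[-1]])
    return list(reversed(tags))
```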

Other Models for POS Tagging

Generative models suffer from the need for restrictive independence assumptions
- how would you modify the generative process to account for the fact that a word ending in -ing is likely to be VBG?

Discriminative models, e.g. Conditional Random Fields, are similar to HMMs but model the conditional probability $P(T \mid W)$ directly, rather than via Bayes' theorem and a generative story
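For example, a discriminative tagger scores each tagging decision with an arbitrary feature function over the input, which makes it easy to include the suffix information mentioned above. A minimal illustrative sketch (the feature names and interface are assumptions, not from the lecture):

```python
def features(words, i, prev_tag, tag):
    """Example feature map for a discriminative sequence model (MEMM/CRF):
    unlike the HMM's P(w_i | t_i), any property of the word and its context
    can be used as evidence for the tag."""
    word = words[i]
    return {
        f"word={word},tag={tag}": 1.0,
        f"prev_tag={prev_tag},tag={tag}": 1.0,
        f"suffix=ing,tag={tag}": 1.0 if word.endswith("ing") else 0.0,
        f"capitalised,tag={tag}": 1.0 if word[0].isupper() else 0.0,
    }
```

A CRF then scores a whole tag sequence via a weighted sum of such features over all positions, normalised over all possible tag sequences.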

Reading for Today's Lecture

- Jurafsky and Martin, Speech and Language Processing, chapter on Word Classes and Part-of-Speech Tagging
- Manning and Schütze, Foundations of Statistical Natural Language Processing, chapter on Part-of-Speech Tagging, and also Mathematical Foundations
- Historical: "A Statistical Approach to Machine Translation", Peter Brown et al., 1990