Lecture 9: Hidden Markov Model


Lecture 9: Hidden Markov Model
Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net
Course webpage: http://kwchang.net/teaching/nlp16
CS6501 Natural Language Processing

This lecture
- Hidden Markov Model
- Different views of HMMs
- HMMs in the supervised learning setting

Recap: Parts of speech
- Traditional parts of speech
- ~8 of them

Recap: Tagset
- Penn Treebank tagset, 45 tags: PRP$, WRB, WP$, VBG, ...
- Penn POS annotations: The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
- Universal tagset, 12 tags: NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, ., X

Recap: POS tagging vs. word clustering
- Words often have more than one POS: back
  - The back door = JJ
  - On my back = NN
  - Win the voters back = RB
  - Promised to back the bill = VB
- Syntax vs. semantics (details later)
(These examples are from Dekang Lin.)

Recap: POS tag sequences
- Some tag sequences occur more often than others
- POS n-gram view: https://books.google.com/ngrams/graph?content=_ADJ_+_NOUN_%2C+_ADV_+_NOUN_%2C+_ADV_+_VERB_
- Existing methods often model POS tagging as a sequence tagging problem

Evaluation
- How many words in the unseen test data can be tagged correctly?
- Usually evaluated on the Penn Treebank
- State of the art: ~97%
- Trivial baseline (most likely tag per word): ~94%
- Human performance: ~97%
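The evaluation metric above is plain token-level accuracy. A minimal sketch (the function name and the toy gold/predicted tag sequences are illustrative, not from the lecture):

```python
def tagging_accuracy(gold, pred):
    """Token-level accuracy: fraction of words whose predicted tag matches the gold tag."""
    assert len(gold) == len(pred), "sequences must be aligned"
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)

# Toy example: one tag wrong out of seven.
gold = ["DT", "JJ", "NN", "VBD", "IN", "DT", "NN"]
pred = ["DT", "JJ", "NN", "VBN", "IN", "DT", "NN"]
print(tagging_accuracy(gold, pred))  # 6/7, about 0.857
```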

Building a POS tagger
- Supervised learning
- Assume linguists have annotated several examples
- Tag set: DT, JJ, NN, VBD, ...
- POS tagger output: The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

POS induction
- Unsupervised learning
- Assume we only have an unannotated corpus
- Tag set: DT, JJ, NN, VBD, ...
- Input: The grand jury commented on a number of other topics.

Today: Hidden Markov Model
- We focus on the supervised learning setting
- What is the most likely sequence of tags for the given sequence of words w?
- We will talk about other ML models for this type of prediction task later.

Let's try
Don't worry! There is no problem with your eyes or computer.
- a/DT d6g/NN 0s/VBZ chas05g/VBG a/DT cat/NN ./.
- a/DT f6x/NN 0s/VBZ 9u5505g/VBG ./.
- a/DT b6y/NN 0s/VBZ s05g05g/VBG ./.
- a/DT ha77y/JJ b09d/NN
What is the POS tag sequence of the following sentence?
a ha77y cat was s05g05g .

Let's try
- a/DT d6g/NN 0s/VBZ chas05g/VBG a/DT cat/NN ./. → a/DT dog/NN is/VBZ chasing/VBG a/DT cat/NN ./.
- a/DT f6x/NN 0s/VBZ 9u5505g/VBG ./. → a/DT fox/NN is/VBZ running/VBG ./.
- a/DT b6y/NN 0s/VBZ s05g05g/VBG ./. → a/DT boy/NN is/VBZ singing/VBG ./.
- a/DT ha77y/JJ b09d/NN → a/DT happy/JJ bird/NN
- a ha77y cat was s05g05g . → a happy cat was singing .

How do you predict the tags?
- Two types of information are useful:
  - Relations between words and tags
  - Relations between tags and tags (e.g., DT–NN, DT–JJ–NN)

Statistical POS tagging
- What is the most likely sequence of tags for the given sequence of words w?
- P(DT JJ NN | "a smart dog") = P(DT JJ NN, "a smart dog") / P("a smart dog")
- P(DT JJ NN, "a smart dog") = P(DT JJ NN) P("a smart dog" | DT JJ NN)

Transition probability
- Joint probability: P(t, w) = P(t) P(w | t)
- P(t) = P(t_1, t_2, ..., t_n)
       = P(t_1) P(t_2 | t_1) P(t_3 | t_2, t_1) ... P(t_n | t_1, ..., t_{n-1})
       ≈ P(t_1) P(t_2 | t_1) P(t_3 | t_2) ... P(t_n | t_{n-1})   (Markov assumption)
       = ∏_{i=1}^{n} P(t_i | t_{i-1})
- A bigram model over POS tags! (Similarly, we can define an n-gram model over POS tags; this is usually called a higher-order HMM.)
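The bigram factorization of P(t) can be sketched in a few lines of Python. The transition table below is a made-up toy example (the probabilities and the "&lt;s&gt;" start symbol are assumptions for illustration, not estimates from data):

```python
# Toy transition probabilities P(tag | previous tag); "<s>" marks sentence start.
trans = {
    ("<s>", "DT"): 0.6,
    ("DT", "JJ"): 0.3,
    ("DT", "NN"): 0.5,
    ("JJ", "NN"): 0.7,
}

def tag_seq_prob(tags, trans):
    """P(t) = prod_i P(t_i | t_{i-1}) under the first-order Markov assumption."""
    p, prev = 1.0, "<s>"
    for t in tags:
        p *= trans.get((prev, t), 0.0)  # unseen transitions get probability 0
        prev = t
    return p

print(tag_seq_prob(["DT", "JJ", "NN"], trans))  # 0.6 * 0.3 * 0.7 ≈ 0.126
```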

Emission probability
- Joint probability: P(t, w) = P(t) P(w | t)
- Assume words depend only on their POS tag:
  P(w | t) ≈ P(w_1 | t_1) P(w_2 | t_2) ... P(w_n | t_n) = ∏_{i=1}^{n} P(w_i | t_i)   (independence assumption)
- i.e., P("a smart dog" | DT JJ NN) = P(a | DT) P(smart | JJ) P(dog | NN)

Put them together
- Joint probability: P(t, w) = P(t) P(w | t)
- P(t, w) = P(t_1) P(t_2 | t_1) P(t_3 | t_2) ... P(t_n | t_{n-1}) P(w_1 | t_1) P(w_2 | t_2) ... P(w_n | t_n)
          = ∏_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i-1})
- e.g., P("a smart dog", DT JJ NN) = P(DT | start) P(JJ | DT) P(NN | JJ) P(a | DT) P(smart | JJ) P(dog | NN)
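Combining the two factors, the joint P(t, w) is a single product over positions. A minimal sketch with toy tables (all probabilities below are illustrative assumptions, not values from the lecture):

```python
# Toy transition P(tag | prev tag) and emission P(word | tag) tables.
trans = {("<s>", "DT"): 0.6, ("DT", "JJ"): 0.3, ("JJ", "NN"): 0.7}
emit = {("DT", "a"): 0.4, ("JJ", "smart"): 0.01, ("NN", "dog"): 0.02}

def joint_prob(tags, words, trans, emit):
    """P(t, w) = prod_i P(w_i | t_i) * P(t_i | t_{i-1})."""
    p, prev = 1.0, "<s>"
    for t, w in zip(tags, words):
        p *= trans.get((prev, t), 0.0) * emit.get((t, w), 0.0)
        prev = t
    return p

print(joint_prob(["DT", "JJ", "NN"], ["a", "smart", "dog"], trans, emit))
```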

Put them together
- Two independence assumptions:
  - Approximate P(t) by a bigram (or n-gram) model
  - Assume each word depends only on its POS tag
- Initial probability: p(t_1)

HMMs as a probabilistic FSA
[Figure; slide credit: Julia Hockenmaier, Intro to NLP]

Table representation
Let λ = {A, B, π} represent all parameters.

Hidden Markov Models (formal)
- States T = t_1, t_2, ..., t_N
- Observations W = w_1, w_2, ..., w_N; each observation is a symbol from a vocabulary V = {v_1, v_2, ..., v_|V|}
- Transition probabilities: transition probability matrix A = {a_ij}, where a_ij = P(t_k = j | t_{k-1} = i), 1 ≤ i, j ≤ N
- Observation likelihoods: output probability matrix B = {b_i(k)}, where b_i(k) = P(w_m = v_k | t_m = i)
- Special initial probability vector π, with π_i = P(t_1 = i), 1 ≤ i ≤ N
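The parameters λ = {A, B, π} map naturally onto arrays. A minimal sketch of the table representation for a toy 2-state, 3-symbol HMM (all numbers are illustrative assumptions):

```python
import numpy as np

# Toy HMM with states {0, 1} and vocabulary {v_0, v_1, v_2}.
A = np.array([[0.8, 0.2],        # A[i, j] = P(next state j | current state i)
              [0.3, 0.7]])
B = np.array([[0.7, 0.2, 0.1],   # B[i, k] = P(symbol v_k | state i)
              [0.1, 0.2, 0.7]])
pi = np.array([0.5, 0.5])        # pi[i] = P(first state is i)

# Each row of A and B, and the vector pi, is a probability distribution.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)
```

Storing the parameters this way makes the later algorithms (Forward, Viterbi) simple matrix-indexing loops.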

How to build a second-order HMM?
- Second-order HMM: a trigram model over POS tags
- P(t) = ∏_{i=1}^{n} P(t_i | t_{i-1}, t_{i-2})
- P(w, t) = ∏_{i=1}^{n} P(t_i | t_{i-1}, t_{i-2}) P(w_i | t_i)

Probabilistic FSA for a second-order HMM
[Figure; slide credit: Julia Hockenmaier, Intro to NLP]

Prediction in a generative model
- Inference: what is the most likely sequence of tags for the given sequence of words w?
- What are the latent states that most likely generate the sequence of words w?

Example: the verb "race"
- Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
- People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
- How do we pick the right tag?

Disambiguating "race"

Disambiguating "race"
- P(NN | TO) = .00047
- P(VB | TO) = .83
- P(race | NN) = .00057
- P(race | VB) = .00012
- P(NR | VB) = .0027
- P(NR | NN) = .0012
- P(VB | TO) P(NR | VB) P(race | VB) = .00000027
- P(NN | TO) P(NR | NN) P(race | NN) = .00000000032
- So we (correctly) choose the verb reading.
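The comparison above is just two three-factor products. Reproducing the arithmetic:

```python
# Score of each reading of "race" after "to", using the probabilities on the slide.
p_vb = 0.83 * 0.0027 * 0.00012      # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)

print(f"verb reading: {p_vb:.2e}, noun reading: {p_nn:.2e}")
assert p_vb > p_nn  # the verb reading wins by roughly three orders of magnitude
```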

Jason and his ice creams
- You are a climatologist in the year 2799, studying global warming
- You can't find any records of the weather in Baltimore, MD for the summer of 2007
- But you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer
- Our job: figure out how hot it was
http://videolectures.net/hltss2010_eisner_plm/
http://www.cs.jhu.edu/~jason/papers/eisner.hmm.xls

(C)old day vs. (H)ot day

                C      H      START
  p(1 | ·)     0.7    0.1
  p(2 | ·)     0.2    0.2
  p(3 | ·)     0.1    0.7
  p(C | ·)     0.8    0.1    0.5
  p(H | ·)     0.1    0.8    0.5
  p(STOP | ·)  0.1    0.1    0

[Figure: "Weather States that Best Explain Ice Cream Consumption" — ice creams eaten and p(H) plotted against diary day.]
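The ice-cream HMM's parameters fit in two small dictionaries, and the joint probability of a weather sequence together with the observed cone counts is the same product-of-factors as before. A sketch (the STOP handling follows the table's last row; the example weather/cone sequences are made up):

```python
# Transition probabilities P(next | current), including START and STOP.
trans = {("START", "C"): 0.5, ("START", "H"): 0.5,
         ("C", "C"): 0.8, ("C", "H"): 0.1, ("C", "STOP"): 0.1,
         ("H", "C"): 0.1, ("H", "H"): 0.8, ("H", "STOP"): 0.1}
# Emission probabilities P(#cones | weather).
emit = {("C", 1): 0.7, ("C", 2): 0.2, ("C", 3): 0.1,
        ("H", 1): 0.1, ("H", 2): 0.2, ("H", 3): 0.7}

def joint(weather, cones):
    """P(weather sequence, cone counts) under the toy ice-cream HMM."""
    p, prev = 1.0, "START"
    for state, c in zip(weather, cones):
        p *= trans[(prev, state)] * emit[(state, c)]
        prev = state
    return p * trans[(prev, "STOP")]  # end the sequence

print(joint(["H", "H", "C"], [3, 2, 1]))
```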

Three basic problems for HMMs
- Likelihood of the input: compute P(w | λ) for the input w and HMM λ ("How likely is the sentence 'I love cat'?")
- Decoding (tagging) the input: find the best tag sequence argmax_t P(t | w, λ) (the POS tags of "I love cat")
- Estimation (learning): find the best model parameters ("How do we learn the model?")
  - Case 1: supervised — tags are annotated
  - Case 2: unsupervised — only unannotated text

Three basic problems for HMMs
- Likelihood of the input: Forward algorithm
- Decoding (tagging) the input: Viterbi algorithm
- Estimation (learning): find the best model parameters
  - Case 1: supervised — tags are annotated: maximum likelihood estimation (MLE)
  - Case 2: unsupervised — only unannotated text: forward-backward algorithm

Learning from labeled data
- Let's play a game!
- We count how often we see t_{i-1} t_i and w_i with t_i, then normalize.
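The count-and-normalize recipe above is all there is to supervised MLE for an HMM. A minimal sketch on a two-sentence toy corpus (the corpus and the "&lt;s&gt;" start tag are illustrative assumptions):

```python
from collections import Counter

# Toy labeled corpus: lists of (word, tag) pairs.
corpus = [[("a", "DT"), ("dog", "NN"), ("barks", "VBZ")],
          [("a", "DT"), ("cat", "NN"), ("runs", "VBZ")]]

trans_counts, emit_counts = Counter(), Counter()
prev_counts, tag_counts = Counter(), Counter()
for sent in corpus:
    prev = "<s>"  # start-of-sentence tag
    for word, tag in sent:
        trans_counts[(prev, tag)] += 1   # how often t_{i-1} t_i occurs
        prev_counts[prev] += 1
        emit_counts[(tag, word)] += 1    # how often w_i occurs with t_i
        tag_counts[tag] += 1
        prev = tag

# Normalize counts into conditional probabilities.
trans = {k: v / prev_counts[k[0]] for k, v in trans_counts.items()}
emit = {k: v / tag_counts[k[0]] for k, v in emit_counts.items()}

print(trans[("<s>", "DT")], emit[("NN", "dog")])  # 1.0 0.5
```

In practice these raw MLE estimates are usually smoothed, since any unseen word/tag or tag/tag pair would otherwise get probability zero.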

Three basic problems for HMMs
- Likelihood of the input: Forward algorithm
- Decoding (tagging) the input: Viterbi algorithm
- Estimation (learning): find the best model parameters
  - Case 1: supervised — tags are annotated: maximum likelihood estimation (MLE)
  - Case 2: unsupervised — only unannotated text: forward-backward algorithm
- We need dynamic programming for the other problems.