Today: Finish word sense disambiguation. Midterm Review

Today Finish word sense disambiguation. Midterm Review. The midterm is Thursday during class time. H students and 20 regular students are assigned to 517 Hamilton. The make-up for those with a midterm at the same time is Thursday 6pm. For those with serious problems there is a second make-up Monday 6pm, but it will be harder. RECOMMENDATION: Thursday 6pm if you can. Review questions and answers are on the NLP website. On Thursday I will not be here; I am at a meeting in Maryland. TAs will proctor.

Naïve Bayes Test: On a corpus of examples of uses of the word "line", naïve Bayes achieved about 73% correct. Good?

Decision Lists: another popular method. A case statement.

Learning Decision Lists: Restrict the lists to rules that test a single feature (1-decision-list rules). Evaluate each possible test and rank them based on how well they work. Glue the top-n tests together and call that your decision list.

Yarowsky: On a binary (homonymy) distinction, used the following metric to rank the tests: the (log) ratio P(Sense1 | Feature) / P(Sense2 | Feature). This gives about 95% on this test.
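As a concrete illustration of this slide and the previous one, here is a minimal sketch of decision-list training, assuming the training examples arrive as (feature-set, sense) pairs; the function names and the smoothing constant are illustrative, not from the lecture:

```python
import math
from collections import defaultdict

def train_decision_list(examples, smoothing=0.1):
    """Rank 1-feature tests by how strongly they indicate one sense.
    `examples` is a list of (features, sense) pairs, sense in {'sense1', 'sense2'}."""
    counts = defaultdict(lambda: {'sense1': 0.0, 'sense2': 0.0})
    for features, sense in examples:
        for f in features:
            counts[f][sense] += 1

    rules = []
    for f, c in counts.items():
        p1 = (c['sense1'] + smoothing) / (c['sense1'] + c['sense2'] + 2 * smoothing)
        p2 = 1.0 - p1
        score = abs(math.log(p1 / p2))   # Yarowsky-style log ratio of the two senses
        rules.append((score, f, 'sense1' if p1 > p2 else 'sense2'))

    rules.sort(reverse=True)             # strongest tests first
    return rules                         # the top-n of these form the decision list

def classify(rules, features, default='sense1'):
    for _, f, sense in rules:
        if f in features:                # first matching test wins, like a case statement
            return sense
    return default
```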

WSD Evaluations and baselines. In vivo versus in vitro evaluation; in vitro evaluation is most common now. Exact match accuracy: % of words tagged identically with manual sense tags. Usually evaluate using held-out data from the same labeled corpus. Problems? Why do we do it anyhow? Baselines: most frequent sense; the Lesk algorithm.
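Exact-match accuracy is simple enough to state as code; a two-line sketch (the function name is mine, not the lecture's):

```python
def exact_match_accuracy(predicted, gold):
    """Fraction of words whose predicted sense equals the manual sense tag."""
    assert len(predicted) == len(gold)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```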

Most Frequent Sense: WordNet senses are ordered in frequency order, so most frequent sense in WordNet = take the first sense. Sense frequencies come from SemCor.
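A minimal sketch of this baseline using NLTK's WordNet interface (assuming NLTK and its WordNet data are installed; NLTK lists synsets in WordNet sense order, which the slide notes comes from SemCor frequencies):

```python
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def most_frequent_sense(word, pos=None):
    """First-sense baseline: take the first synset WordNet lists for the word."""
    synsets = wn.synsets(word, pos=pos)
    return synsets[0] if synsets else None

print(most_frequent_sense('bass'))      # e.g. Synset('bass.n.01'), the musical sense
```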

Ceiling: Human inter-annotator agreement. Compare annotations of two humans, on the same data, given the same tagging guidelines. Human agreement on all-words corpora with WordNet-style senses: 75%-80%.

Problems: Given these general ML approaches, how many classifiers do I need to perform WSD robustly? One for each ambiguous word in the language. How do you decide what set of tags/labels/senses to use for a given word? Depends on the application.

WordNet Bass: Tagging with this set of senses is an impossibly hard task that's probably overkill for any realistic application.
1. bass - (the lowest part of the musical range)
2. bass, bass part - (the lowest part in polyphonic music)
3. bass, basso - (an adult male singer with the lowest voice)
4. sea bass, bass - (flesh of lean-fleshed saltwater fish of the family Serranidae)
5. freshwater bass, bass - (any of various North American lean-fleshed freshwater fishes especially of the genus Micropterus)
6. bass, bass voice, basso - (the lowest adult male singing voice)
7. bass - (the member with the lowest range of a family of musical instruments)
8. bass - (nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes)

Senseval History: ACL-SIGLEX workshop (1997), Yarowsky and Resnik paper. SENSEVAL-I (1998): Lexical Sample for English, French, and Italian. SENSEVAL-II (Toulouse, 2001): Lexical Sample and All Words; organization: Kilgarriff (Brighton). SENSEVAL-III (2004). SENSEVAL-IV -> SEMEVAL (2007). SEMEVAL (2010). SEMEVAL 2017: http://alt.qcri.org/semeval2017/index.php?id=tasks SLIDE ADAPTED FROM CHRIS MANNING

WSD Performance: Varies widely depending on how difficult the disambiguation task is. Accuracies of over 90% are commonly reported on some of the classic, often fairly easy, WSD tasks (pike, star, interest). Senseval brought careful evaluation of difficult WSD (many senses, different POS). Senseval 1 (more fine-grained senses, wider range of types): overall about 75% accuracy; nouns about 80% accuracy; verbs about 70% accuracy.

Summary: Lexical semantics: homonymy, polysemy, synonymy; thematic roles. Computational resource for lexical semantics: WordNet. Task: word sense disambiguation. After the midterm: semantic parsing, distributional semantics, neural nets.

Requested topics: POS tagging, HMMs, the Earley parsing algorithm.

POS tagging

POS tagging as a sequence classification task: We are given a sentence (an "observation" or sequence of observations): Secretariat is expected to race tomorrow. What is the best sequence of tags which corresponds to this sequence of observations? Probabilistic view: consider all possible sequences of tags, and choose the tag sequence which is most probable given the observation sequence of n words w1 … wn.

Getting to HMM: Out of all sequences of n tags t1 … tn, we want the single tag sequence such that P(t1 … tn | w1 … wn) is highest. The hat (^) means our estimate of the best one. argmax_x f(x) means the x such that f(x) is maximized.

Using Bayes Rule

Likelihood and prior
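The equations on these slides were lost in transcription; the standard derivation they walk through (Bayes rule, dropping the constant denominator, then the bigram and output-independence approximations) is:

```latex
\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n)
            = \operatorname*{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)}
            = \operatorname*{argmax}_{t_1^n} \underbrace{P(w_1^n \mid t_1^n)}_{\text{likelihood}}\;\underbrace{P(t_1^n)}_{\text{prior}}
      \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```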

Two kinds of probabilities (1): Tag transition probabilities p(t_i | t_i-1). Determiners are likely to precede adjectives and nouns: That/DT flight/NN, The/DT yellow/JJ hat/NN. So we expect P(NN | DT) and P(JJ | DT) to be high, but P(DT | JJ) to be low. Compute P(NN | DT) by counting in a labeled corpus: P(t_i | t_i-1) = C(t_i-1, t_i) / C(t_i-1).

Two kinds of probabilities (2): Word likelihood probabilities p(w_i | t_i). VBZ (3sg Pres verb) is likely to be "is". Compute P(is | VBZ) by counting in a labeled corpus: P(w_i | t_i) = C(t_i, w_i) / C(t_i).
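Both kinds of counts are easy to collect; a sketch, assuming the corpus is a list of sentences of (word, tag) pairs (the dict-of-dicts layout is my choice, not the lecture's):

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences):
    """MLE estimates from a tagged corpus:
       P(t_i | t_i-1) = C(t_i-1, t_i) / C(t_i-1)
       P(w_i | t_i)   = C(t_i, w_i)   / C(t_i)"""
    trans = defaultdict(Counter)    # trans[prev_tag][tag] = C(prev_tag, tag)
    emit = defaultdict(Counter)     # emit[tag][word]      = C(tag, word)
    for sent in tagged_sentences:
        prev = '<s>'                # sentence-start pseudo-tag
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    A = {p: {t: c / sum(row.values()) for t, c in row.items()} for p, row in trans.items()}
    B = {t: {w: c / sum(row.values()) for w, c in row.items()} for t, row in emit.items()}
    return A, B

# e.g. A['DT'].get('NN', 0) should come out high, A['JJ'].get('DT', 0) low.
```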

An Example: the verb "race". Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR. People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN. How do we pick the right tag?

Disambiguating race [figures contrasting the two candidate tag sequences, race/VB versus race/NN]

P(NN | TO) = .00047, P(VB | TO) = .83
P(race | NN) = .00057, P(race | VB) = .00012
P(NR | VB) = .0027, P(NR | NN) = .0012
P(VB | TO) P(NR | VB) P(race | VB) = .00000027
P(NN | TO) P(NR | NN) P(race | NN) = .00000000032
So we (correctly) choose the verb reading.
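A quick arithmetic check of the two products, using the values as given on the slide:

```python
p_verb = 0.83 * 0.0027 * 0.00012       # P(VB|TO) * P(NR|VB) * P(race|VB)
p_noun = 0.00047 * 0.0012 * 0.00057    # P(NN|TO) * P(NR|NN) * P(race|NN)
print(f"{p_verb:.2e}")                 # ~2.7e-07
print(f"{p_noun:.2e}")                 # ~3.2e-10  -> the verb reading wins
```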

Problem: Observation likelihood: Promise. Count(#promise, VB) / #all verbs. I promise to back the bill (N). I promise to back the bill (V).

HMMs

Hidden Markov Models: We don't observe POS tags; we infer them from the words we see. The words are the observed events; the POS tags are the hidden events.

Hidden Markov Model: For Markov chains, the output symbols are the same as the states. See hot weather: we're in state hot. But in part-of-speech tagging (and other things) the output symbols are words and the hidden states are part-of-speech tags, so we need an extension! A Hidden Markov Model is an extension of a Markov chain in which the input symbols are not the same as the states. This means we don't know which state we are in.

Hidden Markov Models
States: Q = q_1, q_2, …, q_N
Observations: O = o_1, o_2, …, o_N; each observation is a symbol from a vocabulary V = {v_1, v_2, …, v_|V|}
Transition probabilities: transition probability matrix A = {a_ij}, where a_ij = P(q_t = j | q_t-1 = i), 1 ≤ i, j ≤ N
Observation likelihoods: output probability matrix B = {b_i(k)}, where b_i(k) = P(X_t = o_k | q_t = i)
Special initial probability vector π: π_i = P(q_1 = i), 1 ≤ i ≤ N

Hidden Markov Models: Some constraints
Σ_{j=1..N} a_ij = 1, for 1 ≤ i ≤ N
Σ_{k=1..M} b_i(k) = 1
π_i = P(q_1 = i), 1 ≤ i ≤ N, and Σ_{j=1..N} π_j = 1
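These constraints are easy to verify mechanically; a tiny sketch, assuming the dict-of-dicts parameters used in the estimation sketch above:

```python
def check_hmm(A, B, pi, tol=1e-8):
    """Check that every row of A, every b_i(.), and pi each sum to 1."""
    assert all(abs(sum(row.values()) - 1.0) < tol for row in A.values())
    assert all(abs(sum(row.values()) - 1.0) < tol for row in B.values())
    assert abs(sum(pi.values()) - 1.0) < tol
```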

Assumptions
Markov assumption: P(q_i | q_1 … q_i-1) = P(q_i | q_i-1)
Output-independence assumption: P(o_t | o_1 … o_t-1, q_1 … q_t) = P(o_t | q_t)

Three fundamental problems for HMMs. Likelihood: given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ). Decoding: given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q. Learning: given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B. What kind of data would we need to learn the HMM parameters?

Decoding: the best hidden sequence. The weather sequence in the ice cream task; the POS sequence given an input sentence. We could use argmax over the probability of each possible hidden state sequence. Why not? Viterbi algorithm: a dynamic programming algorithm that uses a dynamic programming trellis. Each trellis cell v_t(j) represents the probability that the HMM is in state j after seeing the first t observations and passing through the most likely state sequence.

Viterbi intuition: we are looking for the best path. [Trellis figure for "promised to back the bill" (positions S1-S5), with candidate tags such as NN, VBN, VBD, TO, JJ, VB, DT, NNP, RB at each position.] Slide from Dekang Lin

Intuition: The value in each cell is computed by taking the MAX over all paths that lead to this cell. An extension of a path from state i at time t-1 is computed by multiplying the previous path probability v_t-1(i), the transition probability a_ij, and the observation likelihood b_j(o_t).
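In symbols, this is the standard Viterbi recurrence (a reconstruction; the slide's own formula did not survive):

```latex
v_t(j) = \max_{1 \le i \le N} \; v_{t-1}(i)\; a_{ij}\; b_j(o_t)
```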

The Viterbi Algorithm
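The pseudocode figure for this slide did not survive, so here is a minimal log-space sketch of the algorithm, assuming the dict-of-dicts parameters (pi[s], A[prev][s], B[s][w]) from the estimation sketch above; it is an illustration, not the lecture's own pseudocode:

```python
import math

def viterbi(obs, states, pi, A, B):
    """Return (best_state_sequence, best_log_prob) for the observation list `obs`.
    Missing dictionary entries are treated as probability 0."""
    def logp(x):
        return math.log(x) if x > 0 else float('-inf')

    V = [{s: logp(pi.get(s, 0)) + logp(B.get(s, {}).get(obs[0], 0)) for s in states}]
    back = [{s: None for s in states}]

    for t in range(1, len(obs)):
        V.append({}); back.append({})
        for s in states:
            # best extension: previous path probability times transition, then emission
            prev_best = max(states, key=lambda p: V[t-1][p] + logp(A.get(p, {}).get(s, 0)))
            V[t][s] = (V[t-1][prev_best] + logp(A.get(prev_best, {}).get(s, 0))
                       + logp(B.get(s, {}).get(obs[t], 0)))
            back[t][s] = prev_best

    last = max(V[-1], key=V[-1].get)          # best final state
    path = [last]
    for t in range(len(obs) - 1, 0, -1):      # follow backpointers
        path.append(back[t][path[-1]])
    path.reverse()
    return path, V[-1][last]
```

With the A and B matrices from these slides (and a suitable π), this fills in exactly the kind of trellis worked through in the example below.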

The A matrix for the POS HMM. What is P(VB | TO)? What is P(NN | TO)? Why does this make sense? What is P(TO | VB)? What is P(TO | NN)? Why does this make sense?

The B matrix for the POS HMM. Look at P(want | VB) and P(want | NN). Give an explanation for the difference in the probabilities.

Problem: "I want to race" (possible states: PPS VB TO NN)

Viterbi example (t = 1, the word "I"): fill in the first column of the trellis. [Trellis figures: the cell for NN at t = 1 multiplies the start-to-NN transition probability (.041) by an emission probability of 0, giving 0, while another cell in the t = 1 column gets the value .025. The A and B matrices for the POS HMM are shown again for reference, with the same question about P(want | VB) versus P(want | NN).]

Viterbi example (continued): Show the 4 formulas you would use to compute the value at this node and the max.
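As a hint for that exercise (not worked out on the slides): with the four states above, a cell at time t = 2 for tag j has one candidate formula per possible predecessor, and the cell value is the max over them:

```latex
v_2(j) = \max\bigl\{\, v_1(\mathrm{PPS})\,a_{\mathrm{PPS},j},\;
                        v_1(\mathrm{VB})\,a_{\mathrm{VB},j},\;
                        v_1(\mathrm{TO})\,a_{\mathrm{TO},j},\;
                        v_1(\mathrm{NN})\,a_{\mathrm{NN},j} \,\bigr\}\; b_j(o_2)
```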

Earley Algorithm

Earley Algorithm: March through the chart left-to-right. At each step, apply 1 of 3 operators. Predictor: create new states representing top-down expectations. Scanner: match word predictions (rule with word after dot) to words. Completer: when a state is complete, see what rules were looking for that completed constituent.

Predictor: Given a state with a non-terminal to the right of the dot (not a part-of-speech category), create a new state for each expansion of the non-terminal. Place these new states into the same chart entry as the generating state, beginning and ending where the generating state ends. So the predictor looking at S → . VP [0,0] results in VP → . Verb [0,0] and VP → . Verb NP [0,0].

Scanner: Given a state with a non-terminal to the right of the dot that is a part-of-speech category, if the next word in the input matches this POS, create a new state with the dot moved over the non-terminal. So the scanner looking at VP → . Verb NP [0,0]: if the next word, book, can be a verb, add the new state VP → Verb . NP [0,1]. Add this state to the chart entry following the current one. Note: the Earley algorithm uses top-down input to disambiguate POS! Only a POS predicted by some state can get added to the chart!

Completer: Applied to a state when its dot has reached the right end of the rule. The parser has discovered a category over some span of input. Find and advance all previous states that were looking for this category: copy the state, move the dot, insert in the current chart entry. Given NP → Det Nominal . [1,3] and VP → Verb . NP [0,1], add VP → Verb NP . [0,3].

How do we know we are done? Find an S state in the final column that spans from 0 to n+1 and is complete: S → α . [0, n+1]. If that's the case, you're done.

Earley, more specifically:
1. Predict all the states you can upfront.
2. Read a word:
   1. Extend states based on matches.
   2. Add new predictions.
   3. Go to 2.
3. Look at N+1 to see if you have a winner.

Example: Book that flight. We should find an S from 0 to 3 that is a completed state.

CFG for Fragment of English
S → NP VP
S → Aux NP VP
VP → V
PP → Prep NP
NP → Det Nom
N → old | dog | footsteps | young | flight
NP → PropN
V → dog | include | prefer | book
Nom → Adj Nom
Nom → N
Aux → does
Prep → from | to | on | of
Nom → N Nom
PropN → Bush | McCain | Obama
Nom → Nom PP
VP → V NP
Det → that | this | a | the
Adj → old | green | red

S → NP VP, S → VP
S → Aux NP VP
VP → V
PP → Prep NP
NP → Det Nom
N → old | dog | footsteps | young | flight
NP → PropN, NP → Pro
Nom → N
V → dog | include | prefer | book
Aux → does
Prep → from | to | on | of
Nom → N Nom
PropN → Bush | McCain | Obama
Nom → Nom PP
VP → V NP, VP → V NP PP, VP → V PP, VP → VP PP
Det → that | this | a | the
Adj → old | green | red

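Putting the Predictor, Scanner, and Completer together, here is a compact recognizer sketch in the spirit of these slides (the dict-based grammar/lexicon encoding, the dummy gamma start state, and the function names are my assumptions, not the lecture's):

```python
from collections import namedtuple

# A chart state: rule lhs -> rhs, dot position, and the input span [start, end).
State = namedtuple('State', 'lhs rhs dot start end')

def is_complete(s):
    return s.dot == len(s.rhs)

def next_cat(s):
    return None if is_complete(s) else s.rhs[s.dot]

def earley_recognize(words, grammar, lexicon, root='S'):
    """`grammar` maps a non-terminal to a list of right-hand sides (tuples of
    categories); `lexicon` maps a word to the set of POS categories it can have."""
    n = len(words)
    chart = [[] for _ in range(n + 1)]
    pos_cats = {c for cats in lexicon.values() for c in cats}

    def enqueue(state, i):
        if state not in chart[i]:
            chart[i].append(state)

    enqueue(State('gamma', (root,), 0, 0, 0), 0)      # dummy start state

    for i in range(n + 1):
        j = 0
        while j < len(chart[i]):                      # chart[i] grows as we go
            s = chart[i][j]
            j += 1
            cat = next_cat(s)
            if cat is not None and cat not in pos_cats:        # PREDICTOR
                for rhs in grammar.get(cat, []):
                    enqueue(State(cat, tuple(rhs), 0, i, i), i)
            elif cat is not None:                              # SCANNER
                if i < n and cat in lexicon.get(words[i], set()):
                    enqueue(State(s.lhs, s.rhs, s.dot + 1, s.start, i + 1), i + 1)
            else:                                              # COMPLETER
                for old in chart[s.start]:
                    if next_cat(old) == s.lhs:
                        enqueue(State(old.lhs, old.rhs, old.dot + 1, old.start, i), i)

    return any(s.lhs == 'gamma' and is_complete(s) for s in chart[n])
```

With the augmented grammar above (which includes S → VP) encoded as, for example, grammar = {'S': [('NP', 'VP'), ('VP',), ('Aux', 'NP', 'VP')], ...} and lexicon = {'book': {'V'}, 'that': {'Det'}, 'flight': {'N'}}, calling earley_recognize(['book', 'that', 'flight'], grammar, lexicon) should succeed, mirroring the "Book that flight" example.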

Example [chart figures walking through "Book that flight"; two of the slides highlight the Completer steps]

Details: What kind of algorithms did we just describe? Not parsers, recognizers. The presence of an S state with the right attributes in the right place indicates a successful recognition. But no parse tree, no parser. That's how we solve (not) an exponential problem in polynomial time.

Converting Earley from Recognizer to Parser: With the addition of a few pointers we have a parser. Augment the Completer to point to where we came from.

Augmenting the chart with structural information [chart figure: states S8-S13 annotated with backpointers, e.g., to S8 and S9]