Lecture 7 (Fall 2007): Tagging. Part-of-Speech Tagging and Named Entity Recognition


Part-of-Speech Tagging

INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.

OUTPUT: Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N Alan/N Mulally/N announced/V first/ADJ quarter/N results/N ./.

N = Noun, V = Verb, P = Preposition, ADV = Adverb, ADJ = Adjective, ...

Named Entity Recognition

INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.

OUTPUT: Profits soared at [Company Boeing Co.], easily topping forecasts on [Location Wall Street], as their CEO [Person Alan Mulally] announced first quarter results.

Overview

The Tagging Problem
Hidden Markov Model (HMM) taggers
Log-linear taggers
Log-linear models for parsing and other problems

Named Entity Extraction as Tagging

INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.

OUTPUT: Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP Mulally/CP announced/NA first/NA quarter/NA results/NA ./NA

NA = No entity, SC = Start Company, CC = Continue Company, SL = Start Location, CL = Continue Location, ...

Our Goal

Training set:

1 Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
2 Mr./NNP Vinken/NNP is/VBZ chairman/NN of/IN Elsevier/NNP N.V./NNP ,/, the/DT Dutch/NNP publishing/VBG group/NN ./.
3 Rudolph/NNP Agnew/NNP ,/, 55/CD years/NNS old/JJ and/CC chairman/NN of/IN Consolidated/NNP Gold/NNP Fields/NNP PLC/NNP ,/, was/VBD named/VBN a/DT nonexecutive/JJ director/NN of/IN this/DT British/JJ industrial/JJ conglomerate/NN ./.
...
38,219 It/PRP is/VBZ also/RB pulling/VBG 20/CD people/NNS out/IN of/IN Puerto/NNP Rico/NNP ,/, who/WP were/VBD helping/VBG Huricane/NNP Hugo/NNP victims/NNS ,/, and/CC sending/VBG them/PRP to/TO San/NNP Francisco/NNP instead/RB ./.

From the training set, induce a function/algorithm that maps new sentences to their tag sequences.

Two Types of Constraints

Influential/JJ members/NNS of/IN the/DT House/NNP Ways/NNP and/CC Means/NNP Committee/NNP introduced/VBD legislation/NN that/WDT would/MD restrict/VB how/WRB the/DT new/JJ savings-and-loan/NN bailout/NN agency/NN can/MD raise/VB capital/NN ./.

"Local": e.g., can is more likely to be a modal verb MD rather than a noun NN
"Contextual": e.g., a noun is much more likely than a verb to follow a determiner
Sometimes these preferences are in conflict: The trash can is in the garage

A Naive Approach

Use a machine learning method to build a classifier that maps each word individually to its tag.
A problem: this does not take contextual constraints into account.
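
Reading entity spans back off a sequence of start/continue tags is a short procedure. Below is a minimal sketch of such a decoder, assuming the NA/SC/CC/SL/CL/SP/CP tag inventory from the slide above; the function name and its well-formedness assumptions are our own.

```python
def tags_to_entities(words, tags):
    """Recover (entity_type, phrase) spans from start/continue tags.

    Tags follow the slide's convention: NA = no entity, SC/CC = start/continue
    Company, SL/CL = start/continue Location, SP/CP = start/continue Person.
    Assumes a well-formed tag sequence (every C* tag follows an S* or C* tag).
    """
    type_of = {"C": "Company", "L": "Location", "P": "Person"}
    entities, current = [], None
    for word, tag in zip(words, tags):
        if tag == "NA":
            if current:
                entities.append(current)
                current = None
        elif tag[0] == "S":                 # start a new entity
            if current:
                entities.append(current)
            current = (type_of[tag[1]], [word])
        else:                               # tag[0] == "C": continue the entity
            current[1].append(word)
    if current:
        entities.append(current)
    return [(etype, " ".join(ws)) for etype, ws in entities]


words = "Profits soared at Boeing Co. , easily topping forecasts on Wall Street ,".split()
tags = "NA NA NA SC CC NA NA NA NA NA SL CL NA".split()
print(tags_to_entities(words, tags))
# [('Company', 'Boeing Co.'), ('Location', 'Wall Street')]
```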

Overview

The Tagging Problem
Hidden Markov Model (HMM) taggers
- Basic definitions
- Parameter estimation
- The Viterbi Algorithm
Log-linear taggers
Log-linear models for parsing and other problems

Hidden Markov Models

We have an input sentence S = w_1, w_2, ..., w_n (w_i is the i'th word in the sentence).
We have a tag sequence T = t_1, t_2, ..., t_n (t_i is the i'th tag in the sentence).
We'll use an HMM to define P(t_1, t_2, ..., t_n, w_1, w_2, ..., w_n) for any sentence S and tag sequence T of the same length.
Then the most likely tag sequence for S is T* = argmax_T P(T, S).

A Trigram HMM Tagger: How to model P(T, S)?

P(T, S) = P(END | w_1 ... w_n, t_1 ... t_n) ∏_{j=1}^{n} [ P(t_j | w_1 ... w_{j-1}, t_1 ... t_{j-1}) P(w_j | w_1 ... w_{j-1}, t_1 ... t_j) ]   (Chain rule)
        = P(END | t_{n-1}, t_n) ∏_{j=1}^{n} [ P(t_j | t_{j-2}, t_{j-1}) P(w_j | t_j) ]   (Independence assumptions)

END is a special tag that terminates the sequence.
We take t_0 = t_{-1} = *, where * is a special "padding" symbol.

Independence Assumptions in the Trigram HMM Tagger

1st independence assumption: each tag only depends on the previous two tags
P(t_j | w_1 ... w_{j-1}, t_1 ... t_{j-1}) = P(t_j | t_{j-2}, t_{j-1})

2nd independence assumption: each word only depends on the underlying tag
P(w_j | w_1 ... w_{j-1}, t_1 ... t_j) = P(w_j | t_j)
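
To make the decomposition concrete, here is a small sketch that scores a (sentence, tag sequence) pair under the trigram HMM. The dictionaries q and e holding transition and emission probabilities are hypothetical stand-ins for estimated parameters, and the toy numbers below are illustrative only.

```python
import math

def log_prob_trigram_hmm(words, tags, q, e):
    """log P(T, S) = log P(END | t_{n-1}, t_n)
                     + sum_j [ log q(t_j | t_{j-2}, t_{j-1}) + log e(w_j | t_j) ].

    q[(u, v, s)] = P(s | u, v)  (trigram transition, including the END tag)
    e[(w, s)]    = P(w | s)     (emission)
    '*' is the padding tag standing in for t_0 = t_{-1}.
    """
    padded = ["*", "*"] + list(tags)
    logp = 0.0
    for j, word in enumerate(words):
        u, v, s = padded[j], padded[j + 1], padded[j + 2]
        logp += math.log(q[(u, v, s)]) + math.log(e[(word, s)])
    logp += math.log(q[(padded[-2], padded[-1], "END")])
    return logp

# Toy parameters for "the boy laughed" / DT NN VBD (made-up probabilities):
q = {("*", "*", "DT"): 0.8, ("*", "DT", "NN"): 0.7,
     ("DT", "NN", "VBD"): 0.5, ("NN", "VBD", "END"): 0.6}
e = {("the", "DT"): 0.6, ("boy", "NN"): 0.01, ("laughed", "VBD"): 0.02}
print(log_prob_trigram_hmm("the boy laughed".split(), ["DT", "NN", "VBD"], q, e))
```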

An Example

S = the boy laughed
T = DT NN VBD

P(T, S) = P(DT | START, START) P(NN | START, DT) P(VBD | DT, NN) P(END | NN, VBD)
          P(the | DT) P(boy | NN) P(laughed | VBD)

How to model P(T, S)?

Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/NN

Probability of generating base/NN: P(NN | DT, JJ) P(base | NN)

Why the Name?

P(T, S) = [ P(END | t_{n-1}, t_n) ∏_{j=1}^{n} P(t_j | t_{j-2}, t_{j-1}) ] [ ∏_{j=1}^{n} P(w_j | t_j) ]

The first bracketed term is a (hidden) Markov chain over tags; in the second term the w_j's are observed.

Overview

The Tagging Problem
Hidden Markov Model (HMM) taggers
- Basic definitions
- Parameter estimation
- The Viterbi Algorithm
Log-linear taggers
Log-linear models for parsing and other problems

Smoothed Estimation

P(NN | DT, JJ) = λ_1 Count(DT, JJ, NN) / Count(DT, JJ)
               + λ_2 Count(JJ, NN) / Count(JJ)
               + λ_3 Count(NN) / Count()

where λ_1 + λ_2 + λ_3 = 1, and for all i, λ_i ≥ 0

P(base | NN) = Count(NN, base) / Count(NN)

Dealing with Low-Frequency Words

A common method is as follows:

Step 1: Split the vocabulary into two sets
- Frequent words = words occurring ≥ 5 times in training
- Low-frequency words = all other words

Step 2: Map low-frequency words into a small, finite set, depending on prefixes, suffixes etc.

Dealing with Low-Frequency Words: An Example

[Bikel et al. 1999] (named-entity recognition)

Word class                Example                  Intuition
twoDigitNum               90                       Two-digit year
fourDigitNum              1990                     Four-digit year
containsDigitAndAlpha     A8956-67                 Product code
containsDigitAndDash      09-96                    Date
containsDigitAndSlash     11/9/89                  Date
containsDigitAndComma     23,000.00                Monetary amount
containsDigitAndPeriod    1.00                     Monetary amount, percentage
otherNum                  456789                   Other number
allCaps                   BBN                      Organization
capPeriod                 M.                       Person name initial
firstWord                 first word of sentence   No useful capitalization information
initCap                   Sally                    Capitalized word
lowercase                 can                      Uncapitalized word
other                     ,                        Punctuation marks, all other words

Dealing with Low-Frequency Words: An Example (continued)

Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP Mulally/CP announced/NA first/NA quarter/NA results/NA ./NA

becomes

firstWord/NA soared/NA at/NA initCap/SC Co./CC ,/NA easily/NA lowercase/NA forecasts/NA on/NA initCap/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP initCap/CP announced/NA first/NA quarter/NA results/NA ./NA

NA = No entity, SC = Start Company, CC = Continue Company, SL = Start Location, CL = Continue Location, ...
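
A minimal sketch of the interpolated estimate, together with a low-frequency word mapping in the spirit of [Bikel et al. 1999]. The function names, the λ values, and the particular subset of word classes shown are our own choices, not a reference implementation.

```python
import re
from collections import Counter

def smoothed_q(s, u, v, tri_c, bi_c, uni_c, total, lambdas=(0.6, 0.3, 0.1)):
    """Interpolated estimate of P(s | u, v) from tag n-gram counts,
    mirroring the lambda_1 / lambda_2 / lambda_3 formula above."""
    l1, l2, l3 = lambdas
    p_tri = tri_c[(u, v, s)] / bi_c[(u, v)] if bi_c[(u, v)] else 0.0
    p_bi = bi_c[(v, s)] / uni_c[v] if uni_c[v] else 0.0
    p_uni = uni_c[s] / total
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

def map_low_frequency(word, position, frequent_words):
    """Replace a low-frequency word by a coarse class; only a few of the
    [Bikel et al. 1999]-style classes are shown here."""
    if word in frequent_words:
        return word
    if re.fullmatch(r"\d\d", word):
        return "twoDigitNum"
    if re.fullmatch(r"\d{4}", word):
        return "fourDigitNum"
    if position == 0:
        return "firstWord"
    if word[0].isupper():
        return "initCap"
    if word.islower():
        return "lowercase"
    return "other"

# Toy counts from a six-tag "corpus":
tags = "DT JJ NN DT JJ NN".split()
uni_c = Counter(tags)
bi_c = Counter(zip(tags, tags[1:]))
tri_c = Counter(zip(tags, tags[1:], tags[2:]))
print(smoothed_q("NN", "DT", "JJ", tri_c, bi_c, uni_c, total=len(tags)))
print(map_low_frequency("1990", position=3, frequent_words={"the", "can"}))
```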

Overview

The Tagging Problem
Hidden Markov Model (HMM) taggers
- Basic definitions
- Parameter estimation
- The Viterbi Algorithm
Log-linear taggers
Log-linear models for parsing and other problems

The Viterbi Algorithm

Question: how do we calculate the following?

T* = argmax_T log P(T, S)

Define n to be the length of the sentence.
Define a dynamic programming table
π[i, u, v] = maximum log probability of a tag sequence ending in tags u, v at position i
Our goal is to calculate max_{u,v ∈ T} π[n, u, v]

The Viterbi Algorithm: Recursive Definitions

Base case:
π[0, *, *] = log 1 = 0
π[0, u, v] = log 0 = -∞ for all other u, v
Here * is a special tag padding the beginning of the sentence.

Recursive case: for i = 1 ... n, for all u, v,
π[i, u, v] = max_{t ∈ T ∪ {*}} { π[i-1, t, u] + Score(S, i, t, u, v) }

Backpointers allow us to recover the max probability sequence:
BP[i, u, v] = argmax_{t ∈ T ∪ {*}} { π[i-1, t, u] + Score(S, i, t, u, v) }

where Score(S, i, t, u, v) = log P(v | t, u) + log P(w_i | v)

Complexity is O(n k^3), where n is the length of the sentence and k is the number of possible tags.

The Viterbi Algorithm: Running Time

O(n |T|^3) time to calculate Score(S, i, t, u, v) for all i, t, u, v.
n |T|^2 entries in π to be filled in.
O(|T|) time to fill in one entry
=> O(n |T|^3) time
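
A compact sketch of the recursions above in code. The functions log_q and log_e are assumed to return log transition and log emission scores (for example, from the smoothed estimates earlier); the dictionary-based π and backpointer tables are an implementation choice, and the constant-score call at the end is purely illustrative.

```python
import math

def viterbi_trigram(words, tagset, log_q, log_e):
    """Trigram Viterbi, following the recursions above.

    pi[(i, u, v)] = max log prob of any tag sequence ending in tags u, v at
    position i; bp[(i, u, v)] is the tag achieving that max.
    log_q(s, t, u) ~ log P(s | t, u); log_e(w, s) ~ log P(w | s).
    """
    n = len(words)
    tags = lambda i: ["*"] if i <= 0 else tagset
    pi, bp = {(0, "*", "*"): 0.0}, {}
    for i in range(1, n + 1):
        for u in tags(i - 1):
            for v in tags(i):
                best_t, best = None, -math.inf
                for t in tags(i - 2):
                    score = (pi.get((i - 1, t, u), -math.inf)
                             + log_q(v, t, u) + log_e(words[i - 1], v))
                    if score > best:
                        best_t, best = t, score
                pi[(i, u, v)], bp[(i, u, v)] = best, best_t
    # Termination: include the END transition, then follow backpointers.
    (u, v), best = max(
        (((u, v), pi[(n, u, v)] + log_q("END", u, v))
         for u in tags(n - 1) for v in tags(n)),
        key=lambda item: item[1])
    y = [None] * (n + 1)
    y[n] = v
    if n > 1:
        y[n - 1] = u
    for k in range(n - 2, 0, -1):
        y[k] = bp[(k + 2, y[k + 1], y[k + 2])]
    return y[1:], best

# Purely illustrative call with constant scores:
print(viterbi_trigram("the boy laughed".split(), ["DT", "NN", "VBD"],
                      lambda s, t, u: math.log(0.25),
                      lambda w, s: math.log(0.1)))
```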

Pros and Cons

Hidden Markov model taggers are very simple to train (just need to compile counts from the training corpus).
They perform relatively well (over 90% performance on named entities).
The main difficulty is modeling P(word | tag): this can be very difficult if "words" are complex.

Overview

The Tagging Problem
Hidden Markov Model (HMM) taggers
Log-linear taggers
Log-linear models for parsing and other problems

Log-Linear Models

We have an input sentence S = w_1, w_2, ..., w_n (w_i is the i'th word in the sentence).
We have a tag sequence T = t_1, t_2, ..., t_n (t_i is the i'th tag in the sentence).
We'll use a log-linear model to define P(t_1, t_2, ..., t_n | w_1, w_2, ..., w_n) for any sentence S and tag sequence T of the same length.
(Note: contrast with the HMM, which defines P(t_1, t_2, ..., t_n, w_1, w_2, ..., w_n).)
Then the most likely tag sequence for S is T* = argmax_T P(T | S).

How to model P(T | S)? A Trigram Log-Linear Tagger

P(T | S) = ∏_{j=1}^{n} P(t_j | w_1 ... w_n, t_1 ... t_{j-1})   (Chain rule)
         = ∏_{j=1}^{n} P(t_j | w_1, ..., w_n, t_{j-2}, t_{j-1})   (Independence assumptions)

We take t_0 = t_{-1} = *.
Independence assumption: each tag only depends on the previous two tags
P(t_j | w_1, ..., w_n, t_1, ..., t_{j-1}) = P(t_j | w_1, ..., w_n, t_{j-2}, t_{j-1})

An Example

Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from which Spain expanded its empire into the rest of the Western Hemisphere.

There are many possible tags in the position ??
Y = {NN, NNS, Vt, Vi, IN, DT, ...}
The input domain X is the set of all possible histories (or contexts).
We need to learn a function from (history, tag) pairs to a probability P(tag | history).

Representation: Histories

A history is a 4-tuple <t_{-2}, t_{-1}, w_{[1:n]}, i>
- t_{-2}, t_{-1} are the previous two tags.
- w_{[1:n]} are the n words in the input sentence.
- i is the index of the word being tagged.
X is the set of all possible histories.

Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from which Spain expanded its empire into the rest of the Western Hemisphere.

- t_{-2}, t_{-1} = DT, JJ
- w_{[1:n]} = <Hispaniola, quickly, became, ..., Hemisphere, .>
- i = 6

Feature Vector Representations

We have some input domain X, and a finite label set Y. The aim is to provide a conditional probability P(y | x) for any x ∈ X and y ∈ Y.
A feature is a function f : X × Y → R (often binary features or indicator functions f : X × Y → {0, 1}).
Say we have m features f_k for k = 1 ... m
=> a feature vector f(x, y) ∈ R^m for any x ∈ X and y ∈ Y.

An Example (continued)

X is the set of all possible histories of the form <t_{-2}, t_{-1}, w_{[1:n]}, i>
Y = {NN, NNS, Vt, Vi, IN, DT, ...}
We have m features f_k : X × Y → R for k = 1 ... m

For example:

f_1(h, t) = 1 if the current word w_i is base and t = Vt; 0 otherwise
f_2(h, t) = 1 if the current word w_i ends in ing and t = VBG; 0 otherwise
...

f_1(<DT, JJ, <Hispaniola, ..., Hemisphere, .>, 6>, Vt) = 1
f_2(<DT, JJ, <Hispaniola, ..., Hemisphere, .>, 6>, Vt) = 0
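
To make the history and feature-vector notation concrete, here is a tiny sketch in which a history is a named tuple and features are indicator functions mirroring f_1 and f_2 above; the container and helper names are our own.

```python
from collections import namedtuple

# A history is <t_minus2, t_minus1, words, i> as defined above (i is 1-based).
History = namedtuple("History", ["t_minus2", "t_minus1", "words", "i"])

def f1(h, t):
    """1 if the current word is 'base' and the proposed tag is Vt."""
    return 1 if h.words[h.i - 1] == "base" and t == "Vt" else 0

def f2(h, t):
    """1 if the current word ends in 'ing' and the proposed tag is VBG."""
    return 1 if h.words[h.i - 1].endswith("ing") and t == "VBG" else 0

FEATURES = [f1, f2]

def feature_vector(h, t):
    """Stack the indicator features into the vector f(h, t)."""
    return [f(h, t) for f in FEATURES]

words = ["Hispaniola", "quickly", "became", "an", "important", "base"]
h = History("DT", "JJ", words, 6)          # tagging position i = 6
print(feature_vector(h, "Vt"))             # [1, 0]
```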

The Full Set of Features in [Ratnaparkhi, 96]

Word/tag features for all word/tag pairs, e.g.,
f_100(h, t) = 1 if the current word w_i is base and t = Vt; 0 otherwise

Spelling features for all prefixes/suffixes of length ≤ 4, e.g.,
f_101(h, t) = 1 if the current word w_i ends in ing and t = VBG; 0 otherwise
f_102(h, t) = 1 if the current word w_i starts with pre and t = NN; 0 otherwise

Contextual features, e.g.,
f_103(h, t) = 1 if <t_{-2}, t_{-1}, t> = <DT, JJ, Vt>; 0 otherwise
f_104(h, t) = 1 if <t_{-1}, t> = <JJ, Vt>; 0 otherwise
f_105(h, t) = 1 if <t> = <Vt>; 0 otherwise
f_106(h, t) = 1 if the previous word w_{i-1} = the and t = Vt; 0 otherwise
f_107(h, t) = 1 if the next word w_{i+1} = the and t = Vt; 0 otherwise

Log-Linear Models

We have some input domain X, and a finite label set Y. The aim is to provide a conditional probability P(y | x) for any x ∈ X and y ∈ Y.
A feature is a function f : X × Y → R (often binary features or indicator functions f : X × Y → {0, 1}).
Say we have m features f_k for k = 1 ... m
=> a feature vector f(x, y) ∈ R^m for any x ∈ X and y ∈ Y.
We also have a parameter vector v ∈ R^m.
We define

P(y | x, v) = e^{v · f(x, y)} / Σ_{y' ∈ Y} e^{v · f(x, y')}

Training the Log-Linear Model

To train a log-linear model, we need a training set (x_i, y_i) for i = 1 ... n. Then search for

v* = argmax_v [ Σ_i log P(y_i | x_i, v)  -  (1 / (2σ^2)) Σ_k v_k^2 ]

where the first term is the log-likelihood and the second is a Gaussian prior (see the last lecture on log-linear models).

The training set is simply all history/tag pairs seen in the training data.
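
The conditional probability and the regularized objective can be written down directly. The sketch below uses dense feature vectors and plain Python arithmetic; in practice the objective would be maximized with a gradient-based optimizer (not shown), and the helper names and toy values are our own.

```python
import math

def conditional_prob(y, x, v, labels, feat):
    """P(y | x, v) = exp(v . f(x, y)) / sum_{y'} exp(v . f(x, y'))."""
    dot = lambda f: sum(vk * fk for vk, fk in zip(v, f))
    scores = {yp: dot(feat(x, yp)) for yp in labels}
    z = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[y]) / z

def regularized_log_likelihood(data, v, labels, feat, sigma2=1.0):
    """sum_i log P(y_i | x_i, v)  -  (1 / (2 sigma^2)) sum_k v_k^2."""
    ll = sum(math.log(conditional_prob(y, x, v, labels, feat)) for x, y in data)
    return ll - sum(vk * vk for vk in v) / (2.0 * sigma2)

# Toy usage: two labels, two indicator features over the current word only.
labels = ["NN", "Vt"]
feat = lambda x, y: [1.0 if (x == "base" and y == "Vt") else 0.0,
                     1.0 if (x.endswith("ing") and y == "VBG") else 0.0]
v = [0.5, 1.2]
print(conditional_prob("Vt", "base", v, labels, feat))
print(regularized_log_likelihood([("base", "Vt"), ("running", "NN")], v, labels, feat))
```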

The Viterbi Algorithm for Log-Linear Models

Question: how do we calculate the following?

T* = argmax_T log P(T | S)

Define n to be the length of the sentence.
Define a dynamic programming table
π[i, u, v] = maximum log probability of a tag sequence ending in tags u, v at position i
Our goal is to calculate max_{u,v ∈ T} π[n, u, v]

The Viterbi Algorithm: Recursive Definitions

Base case:
π[0, *, *] = log 1 = 0
π[0, u, v] = log 0 = -∞ for all other u, v
Here * is a special tag padding the beginning of the sentence.

Recursive case: for i = 1 ... n, for all u, v,
π[i, u, v] = max_{t ∈ T ∪ {*}} { π[i-1, t, u] + Score(S, i, t, u, v) }

Backpointers allow us to recover the max probability sequence:
BP[i, u, v] = argmax_{t ∈ T ∪ {*}} { π[i-1, t, u] + Score(S, i, t, u, v) }

where Score(S, i, t, u, v) = log P(v | t, u, w_1, ..., w_n, i)

Identical to Viterbi for HMMs, except for the definition of Score(S, i, t, u, v).

FAQ Segmentation: McCallum et al.

McCallum et al. compared HMM and log-linear taggers on a FAQ segmentation task.
Main point: in an HMM, modeling P(word | tag) is difficult in this domain.

FAQ Segmentation: McCallum et al.

<head>X-NNTP-Poster: NewsHound v1.33
<head>
<head>Archive name: acorn/faq/part2
<head>Frequency: monthly
<head>
<question>2.6) What configuration of serial cable should I use
<answer>
<answer> Here follows a diagram of the necessary connections
<answer>programs to work properly. They are as far as I know t
<answer>agreed upon by commercial comms software developers fo
<answer>
<answer> Pins 1, 4, and 8 must be connected together inside
<answer>is to avoid the well known serial port chip bugs. The
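
The changed local score is the only moving part: Score(S, i, t, u, v) is now a conditional log-probability computed by the log-linear model. A sketch of that computation, with the feature function passed in; the names and signatures below are our own.

```python
import math

def log_p_next_tag(v, t, u, words, i, weights, feats, tagset):
    """log P(v | t, u, w_1..w_n, i) for a trigram log-linear (MEMM) tagger.

    feats(t, u, words, i, y) returns a feature vector for the history/tag pair;
    weights plays the role of the parameter vector from the previous slides.
    This is exactly the quantity used as Score(S, i, t, u, v) in the Viterbi
    recursions above, in place of the HMM's log q(v|t,u) + log e(w_i|v).
    """
    score = lambda y: sum(w * f for w, f in zip(weights, feats(t, u, words, i, y)))
    log_z = math.log(sum(math.exp(score(y)) for y in tagset))
    return score(v) - log_z

# Toy usage with a single indicator feature:
feats = lambda t, u, words, i, y: [1.0 if (words[i - 1] == "base" and y == "Vt") else 0.0]
print(log_p_next_tag("Vt", "DT", "JJ", ["an", "important", "base"], 3,
                     [1.5], feats, ["NN", "Vt"]))
```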

FAQ Segmentation: Line Features

begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-subject, blank, contains-alphanum, contains-bracketed-number, contains-http, contains-non-space, contains-number, contains-pipe, contains-question-mark, ends-with-question-mark, first-alpha-is-capitalized, indented-1-to-4, indented-5-to-10, more-than-one-third-space, only-punctuation, prev-is-blank, prev-begins-with-ordinal, shorter-than-30

FAQ Segmentation: An HMM Tagger

<question>2.6) What configuration of serial cable should I use

First solution for P(word | tag):

P("2.6) What configuration of serial cable should I use" | question)
  = P(2.6) | question) P(What | question) P(configuration | question) P(of | question) P(serial | question) ...

i.e. have a language model for each tag.

FAQ Segmentation: McCallum et al.

Second solution: first map each sentence to a string of features:

<question>2.6) What configuration of serial cable should I use
  becomes
<question>begins-with-number contains-alphanum contains-nonspace contains-number prev-is-blank

Use a language model again:

P("2.6) What configuration of serial cable should I use" | question)
  = P(begins-with-number | question) P(contains-alphanum | question) P(contains-nonspace | question) P(contains-number | question) P(prev-is-blank | question)

FAQ Segmentation: The Log-Linear Tagger

<head>X-NNTP-Poster: NewsHound v1.33
<head>
<head>Archive name: acorn/faq/part2
<head>Frequency: monthly
<head>
<question>2.6) What configuration of serial cable should I use
 Here follows a diagram of the necessary connections
programs to work properly. They are as far as I know t
agreed upon by commercial comms software developers fo
 Pins 1, 4, and 8 must be connected together inside
is to avoid the well known serial port chip bugs. The

Features for the question line (previous line tagged head):

tag=question;prev=head;begins-with-number
tag=question;prev=head;contains-alphanum
tag=question;prev=head;contains-nonspace
tag=question;prev=head;contains-number
tag=question;prev=head;prev-is-blank
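
A sketch of the "second solution": map each line to binary features and score it as a product of per-feature probabilities given the tag. The feature extractors below cover only a small, illustrative subset of the line features listed above, and the probability table is made up for the example.

```python
import math
import re

def line_features(line, prev_line=""):
    """A few of the line features from the list above (illustrative subset)."""
    feats = []
    if re.match(r"\s*\d", line): feats.append("begins-with-number")
    if re.search(r"[A-Za-z0-9]", line): feats.append("contains-alphanum")
    if re.search(r"\S", line): feats.append("contains-non-space")
    if re.search(r"\d", line): feats.append("contains-number")
    if prev_line.strip() == "": feats.append("prev-is-blank")
    return feats

def log_p_line_given_tag(line, prev_line, tag, p_feat_given_tag):
    """log P(line | tag) approximated as a sum over the line's features
    of log P(feature | tag), i.e. a 'language model' over features."""
    return sum(math.log(p_feat_given_tag[(f, tag)])
               for f in line_features(line, prev_line))

# Made-up per-feature probabilities for the 'question' tag:
p = {("begins-with-number", "question"): 0.30,
     ("contains-alphanum", "question"): 0.90,
     ("contains-non-space", "question"): 0.95,
     ("contains-number", "question"): 0.40,
     ("prev-is-blank", "question"): 0.60}
print(log_p_line_given_tag("2.6) What configuration of serial cable should I use",
                           "", "question", p))
```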

FAQ Segmentation: Results

Method         Precision   Recall
ME-Stateless   0.038       0.362
TokenHMM       0.276       0.140
FeatureHMM     0.413       0.529
MEMM           0.867       0.681

Precision and recall results are for recovering segments.
ME-Stateless is a log-linear model that treats every sentence separately (no dependence between adjacent tags).
TokenHMM is an HMM with the first solution we've just seen.
FeatureHMM is an HMM with the second solution we've just seen.
MEMM is a log-linear trigram tagger (MEMM stands for "Maximum-Entropy Markov Model").

Log-Linear Taggers: Summary

The input sentence is S = w_1 ... w_n.
Each tag sequence T has a conditional probability

P(T | S) = ∏_{j=1}^{n} P(t_j | w_1 ... w_n, t_1 ... t_{j-1})   (Chain rule)
         = ∏_{j=1}^{n} P(t_j | w_1 ... w_n, t_{j-2}, t_{j-1})   (Independence assumptions)

Estimate P(t_j | w_1 ... w_n, t_{j-2}, t_{j-1}) using log-linear models.
Use the Viterbi algorithm to compute argmax_{T ∈ T^n} log P(T | S).

Overview

The Tagging Problem
Hidden Markov Model (HMM) taggers
Log-linear taggers
Log-linear models for parsing and other problems

A General Approach: (Conditional) History-Based Models

We've shown how to define P(T | S) where T is a tag sequence.
How do we define P(T | S) if T is a parse tree (or another structure)?

A General Approach: (Conditional) History-Based Models

Step 1: represent a tree as a sequence of decisions d_1 ... d_m

T = <d_1, d_2, ..., d_m>

m is not necessarily the length of the sentence.

Step 2: the probability of a tree is

P(T | S) = ∏_{i=1}^{m} P(d_i | d_1 ... d_{i-1}, S)

Step 3: Use a log-linear model to estimate P(d_i | d_1 ... d_{i-1}, S)

Step 4: Search?? (answer we'll get to later: beam or heuristic search)

Ratnaparkhi's Parser: Three Layers of Structure

1. Part-of-speech tags
2. Chunks
3. Remaining structure

An Example Tree

(Tree diagram for an eight-word example sentence: an S node spanning an NP and a VP, where the VP contains a verb, an NP, and a PP, and the PP contains a preposition and an NP.)

Layer 1: Part-of-Speech Tags

Step 1: represent a tree as a sequence of decisions d_1 ... d_m

T = <d_1, d_2, ..., d_m>

The first n decisions are tagging decisions:
d_1 ... d_n = <DT, NN, Vt, DT, NN, IN, DT, NN>
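
Step 2 can be written as a generic scorer over decision sequences, with the per-decision model supplied as a function; the names below are our own and the constant model in the example is purely illustrative.

```python
import math

def log_prob_decisions(decisions, sentence, log_p_decision):
    """log P(T | S) = sum_i log P(d_i | d_1..d_{i-1}, S), where T is encoded
    as the decision sequence d_1..d_m (tags, chunk actions, Start/Join/Check...)."""
    total = 0.0
    for i, d in enumerate(decisions):
        total += log_p_decision(d, decisions[:i], sentence)
    return total

# Illustrative call with a constant per-decision model:
decisions = ["DT", "NN", "Start(NP)", "Join(NP)", "Start(S)", "Check=NO"]
print(log_prob_decisions(decisions, "the boy".split(),
                         lambda d, hist, S: math.log(0.5)))
```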

Layer 2: Chunks

(Tree diagram: the three two-word noun phrases are marked with Start(NP)/Join(NP), and the remaining words with Other.)

Chunks are defined as any phrase where all the children are part-of-speech tags.
(Other common chunks are ADJP, QP.)

Layer 2: Chunks (continued)

Step 1: represent a tree as a sequence of decisions d_1 ... d_m

T = <d_1, d_2, ..., d_m>

The first n decisions are tagging decisions; the next n decisions are chunk tagging decisions:

d_1 ... d_{2n} = <DT, NN, Vt, DT, NN, IN, DT, NN, Start(NP), Join(NP), Other, Start(NP), Join(NP), Other, Start(NP), Join(NP)>

Layer 3: Remaining Structure

Alternate between two classes of actions:
- Join(X) or Start(X), where X is a label (NP, S, VP etc.)
- Check=YES or Check=NO

Meaning of these actions:
- Start(X) starts a new constituent with label X (always acts on the leftmost constituent with no start or join label above it)
- Join(X) continues a constituent with label X (always acts on the leftmost constituent with no start or join label above it)
- Check=NO does nothing
- Check=YES takes the previous Join or Start action, and converts it into a completed constituent

(Tree diagrams, steps in the derivation: Start(S); Check=NO; Start(VP); Check=NO.)

(Tree diagrams, continued: Join(VP); Check=NO; Start(PP); Check=NO.)

(Tree diagrams, continued: Join(PP); Check=YES completes the PP; Join(VP); Check=YES completes the VP.)

(Tree diagrams, concluded: Join(S); Check=YES completes the S, giving the full tree with S spanning the NP and the VP, and the VP containing the second NP and the PP.)

The Final Sequence of Decisions

d_1 ... d_m = <DT, NN, Vt, DT, NN, IN, DT, NN,
               Start(NP), Join(NP), Other, Start(NP), Join(NP), Other, Start(NP), Join(NP),
               Start(S), Check=NO, Start(VP), Check=NO, Join(VP), Check=NO, Start(PP), Check=NO,
               Join(PP), Check=YES, Join(VP), Check=YES, Join(S), Check=YES>

A General Approach: (Conditional) History-Based Models

Step 1: represent a tree as a sequence of decisions d_1 ... d_m

T = <d_1, d_2, ..., d_m>

m is not necessarily the length of the sentence.

Step 2: the probability of a tree is

P(T | S) = ∏_{i=1}^{m} P(d_i | d_1 ... d_{i-1}, S)

Step 3: Use a log-linear model to estimate P(d_i | d_1 ... d_{i-1}, S)

Step 4: Search?? (answer we'll get to later: beam or heuristic search)

Applying a Log-Linear Model

Step 3: Use a log-linear model to estimate P(d_i | d_1 ... d_{i-1}, S)

A reminder:

P(d_i | d_1 ... d_{i-1}, S) = e^{f(<d_1 ... d_{i-1}, S>, d_i) · v} / Σ_{d ∈ A} e^{f(<d_1 ... d_{i-1}, S>, d) · v}

where:
- <d_1 ... d_{i-1}, S> is the history
- d_i is the outcome
- f maps a history/outcome pair to a feature vector
- v is a parameter vector
- A is the set of possible actions

Applying a Log-Linear Model (continued)

The big question: how do we define f?

Ratnaparkhi's method defines f differently depending on whether the next decision is:
- a tagging decision (same features as before for POS tagging!)
- a chunking decision
- a start/join decision after chunking
- a check=no/check=yes decision

Layer 3: Join or Start

Looks at the head word, constituent (or POS) label, and start/join annotation of the n'th tree relative to the decision, where n = -2, -1
Looks at the head word, constituent (or POS) label of the n'th tree relative to the decision, where n = 0, 1, 2
Looks at bigram features of the above for (-1, 0) and (0, 1)
Looks at trigram features of the above for (-2, -1, 0), (-1, 0, 1) and (0, 1, 2)
The above features with all combinations of head words excluded
Various punctuation features

Layer 3: Check=NO or Check=YES

A variety of questions concerning the proposed constituent

The Search Problem

In POS tagging, we could use the Viterbi algorithm because

P(t_j | w_1 ... w_n, j, t_1 ... t_{j-1}) = P(t_j | w_1 ... w_n, j, t_{j-2} ... t_{j-1})

Now: decision d_i could depend on arbitrary decisions in the "past", so there is no chance for dynamic programming.

Instead, Ratnaparkhi uses a beam search method.
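
A minimal beam-search sketch over decision histories, assuming the parser supplies a per-decision log-probability, the set of legal next decisions, and a completion test; all names below are our own, and the toy call at the end just tags a three-word sentence.

```python
import math

def beam_search(sentence, log_p_decision, possible_decisions, is_complete, beam_size=10):
    """Keep only the beam_size highest-scoring partial decision histories at each step."""
    beam = [([], 0.0)]                      # (decision history, log probability)
    while not all(is_complete(hist, sentence) for hist, _ in beam):
        candidates = []
        for hist, logp in beam:
            if is_complete(hist, sentence):
                candidates.append((hist, logp))
                continue
            for d in possible_decisions(hist, sentence):
                candidates.append((hist + [d],
                                   logp + log_p_decision(d, hist, sentence)))
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beam[0]

# Toy usage: "parse" a 3-word sentence by choosing one tag per word.
sent = "the boy laughed".split()
gold = ["DT", "NN", "VBD"]
print(beam_search(
    sent,
    log_p_decision=lambda d, hist, S: math.log(0.9 if d == gold[len(hist)] else 0.05),
    possible_decisions=lambda hist, S: ["DT", "NN", "VBD", "IN"],
    is_complete=lambda hist, S: len(hist) == len(S),
    beam_size=5))
```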
