MACHINE LEARNING FOR NATURAL LANGUAGE PROCESSING
Outline: Some Sample NLP Tasks [Noah Smith]; Structured Prediction for NLP; Structured Prediction Methods: Conditional Random Fields, Structured Perceptron; Discussion
MOTIVATING STRUCTURED-OUTPUT PREDICTION FOR NLP
Types of machine learning: a classifier maps input attributes to a prediction of a categorical output or class (one of a few discrete values); a density estimator maps input attributes to a probability; a regressor maps input attributes to a prediction of a real-valued output. [Diagram copyright Andrew W. Moore]
Outline: Some Sample NLP Tasks [Noah Smith]. What do these have in common? We're making many related predictions... Structured Prediction for NLP; Structured Prediction Methods: Structured Perceptron, Conditional Random Fields; Deep Learning and NLP
NER via begin-inside-outside token tagging: begin-X in-X* delimits an entity of type X. Example tag sequence: begin-gpe out out out out begin-gf in-gf
NER via begin-inside-outside token tagging: begin-X in-X* delimits an entity of type X. Features for the current token: thistoken = english, thistokenshape = X+, prevtoken = the, prevprevtoken = across, nexttoken = channel, nextnexttoken = Monday, nexttokenshape = X+, prevtokenshape = +. As binary features: thistoken_english = 1, thistokenshape_X+ = 1, prevtoken_the = 1, prevprevtoken_across = 1, ... Predicting the tag (begin-gf, in-gf, out, ...) for each token is a classification task!
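To make "a classification task!" concrete, here is a minimal sketch of turning a token in context into the sparse binary features above; the shape convention and helper names are assumptions, not from the slides:

import re

def shape(token):
    # Crude word shape: collapse uppercase runs to "X", lowercase runs to "+".
    # "English" -> "X+", "the" -> "+" (convention assumed from the slide).
    s = re.sub(r"[A-Z]+", "X", token)
    return re.sub(r"[a-z]+", "+", s)

def features(tokens, i):
    # Sparse binary features for classifying the tag of tokens[i].
    f = {"thistoken_" + tokens[i].lower(): 1,
         "thistokenshape_" + shape(tokens[i]): 1}
    if i > 0:
        f["prevtoken_" + tokens[i - 1].lower()] = 1
        f["prevtokenshape_" + shape(tokens[i - 1])] = 1
    if i > 1:
        f["prevprevtoken_" + tokens[i - 2].lower()] = 1
    if i + 1 < len(tokens):
        f["nexttoken_" + tokens[i + 1].lower()] = 1
        f["nexttokenshape_" + shape(tokens[i + 1])] = 1
    if i + 2 < len(tokens):
        f["nextnexttoken_" + tokens[i + 2].lower()] = 1
    return f

# features("swam across the English Channel Monday".split(), 3)
# -> {"thistoken_english": 1, "thistokenshape_X+": 1, "prevtoken_the": 1, ...}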
NER via begin-inside-outside token tagging. A problem: the instances are not i.i.d., nor are the classes: in-X usually happens after begin-X or in-X; begin-X usually happens after out. E.g., for consecutive tokens: the (thistoken_the = 1, thistokenshape_+ = 1, prevtoken_across = 1) tagged out; english (thistoken_english = 1, thistokenshape_X+ = 1, prevtoken_the = 1, prevprevtoken_across = 1) tagged begin-gf; channel (thistoken_channel = 1, thistokenshape_X+ = 1, prevtoken_english = 1, prevprevtoken_the = 1) tagged in-gf.
Structured Output Prediction: definition: a classifier where the output is structured, and the input might be structured too. Example: predict a sequence of begin-/in-/out tags from a sequence of tokens (represented by a sequence of feature vectors).
NER via begin-inside-outside token tagging. A problem: with K begin-inside-outside tags and N words, we have K^N possible structured labels (e.g., K = 3 and N = 20 gives 3^20, about 3.5 billion sequences). How can we learn to choose among so many possible outputs? y = begin-gpe out out out out begin-gf in-gf
HMMS AND NER
E.g., HMMs map a sequence of tokens to a sequence of outputs (the hidden states; here: out, begin-gpe, in-gpe, begin-gf, in-gf, begin-per, in-per): y = begin-gpe out out out out begin-gf in-gf
Borkar et al.'s HMMs for segmentation. Example: addresses, bib records. Problem: some DBs may split records up differently (e.g., no mail-stop field, combine address and apt. #, ...) or not at all. Solution: learn to segment the textual form of records. Example, with fields Author / Year / Title / Journal / Volume / Page: P. P. Wangikar, T. P. Graycar, D. A. Estell, D. S. Clark, J. S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.
IE with Hidden Markov Models: must learn transition and emission probabilities. [Diagram: an HMM with states Author, Year, Title, Journal; transition probabilities on the edges (e.g., Author -> Year 0.9, Year -> Title 0.8) and emission probabilities per state, e.g. Smith 0.06, Cohen, Jordan, ... for Author; digit patterns dddd, dd for Year; Learning, Convex, ... for Title; Comm., Trans., Chemical, ... for Journal.]
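As a sketch of what "must learn transition, emission probabilities" amounts to with labeled segmentation data (maximum-likelihood counting; all names here are illustrative, and smoothing is omitted):

from collections import Counter, defaultdict

def train_hmm(tagged_seqs):
    # tagged_seqs: list of sequences of (token, state) pairs, e.g.
    # [[("P.P.Wangikar", "Author"), ("(1993)", "Year"), ...], ...]
    trans, emit, start = Counter(), defaultdict(Counter), Counter()
    for seq in tagged_seqs:
        prev = None
        for token, state in seq:
            emit[state][token] += 1
            if prev is None:
                start[state] += 1
            else:
                trans[(prev, state)] += 1
            prev = state
    # Normalize counts into probabilities.
    n_start = sum(start.values())
    start_p = {s: c / n_start for s, c in start.items()}
    out_totals = Counter()
    for (a, b), c in trans.items():
        out_totals[a] += c
    trans_p = {(a, b): c / out_totals[a] for (a, b), c in trans.items()}
    emit_p = {s: {w: c / sum(ws.values()) for w, c in ws.items()}
              for s, ws in emit.items()}
    return start_p, trans_p, emit_p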
HMMs' limitation: not well-suited to modeling a set of features for each token, e.g. thistoken = english, thistokenshape = X+, prevtoken = the, prevprevtoken = across, nexttoken = channel, nextnexttoken = Monday, nexttokenshape = X+, prevtokenshape = +; as binary features, thistoken_english = 1, thistokenshape_X+ = 1, prevtoken_the = 1, prevprevtoken_across = 1, ... a classification task!
What is a symbol? Ideally we would like to use many, arbitrary, overlapping features of words: identity of the word; ends in "-ski"; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; ... [Diagram: an HMM chain S_{t-1} -> S_t -> S_{t+1} with emissions O_{t-1}, O_t, O_{t+1}, and features such as "is Wisniewski", "part of noun phrase", "ends in -ski" attached to one position.] Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, ...
Outline: Some Sample NLP Tasks [Noah Smith]; Structured Prediction for NLP; Structured Prediction Methods: HMMs, Conditional Random Fields, Structured Perceptron; Discussion
CONDITIONAL RANDOM FIELDS
What is a symbol? Ideally we would like to use many, arbitrary, overlapping features of words: identity of the word; ends in "-ski"; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; ... [Diagram: an HMM chain S_{t-1} -> S_t -> S_{t+1} with emissions O_{t-1}, O_t, O_{t+1}, and features such as "is Wisniewski", "part of noun phrase", "ends in -ski" attached to one position.] Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, ...
Inference for linear-chain MRFs. Example sentence: "When will prof Cohen post the notes". Idea 1: features are properties of two adjacent tokens, and the pair of labels assigned to them: (y(i)==B or y(i)==I) and (token(i) is capitalized); (y(i)==I and y(i-1)==B) and (token(i) is hyphenated); (y(i)==B and y(i-1)==B), e.g. "tell Ziv William is on the way". Idea 2: construct a graph where each path is a possible sequence labeling. (A code sketch of Idea 1 follows.)
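Idea 1 in code, a toy sketch with hypothetical predicate names mirroring the three example features:

def pair_features(tokens, i, y_prev, y):
    # Indicator features over the label pair (y_prev, y) and properties
    # of tokens[i], as in the examples on the slide.
    t = tokens[i]
    f = {}
    if y in ("B", "I") and t[:1].isupper():
        f["B_or_I_and_capitalized"] = 1
    if y == "I" and y_prev == "B" and "-" in t:
        f["I_after_B_and_hyphenated"] = 1
    if y == "B" and y_prev == "B":   # e.g. "tell Ziv William is on the way"
        f["B_after_B"] = 1
    return f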
Inference for a linear-chain MRF. [Trellis: one column of candidate labels B, I, O for each token of "When will prof Cohen post the notes".] Inference: find the highest-weight path. Learning: optimize the feature weights so that this highest-weight path is correct relative to the data.
Inference for a linear-chain MRF over "When will prof Cohen post the notes": features plus weights define edge weights in the trellis, e.g. weight(y_i, y_{i+1}) = 2*[(y_i = B or I) and iscap(x_i)] + 1*[y_i = B and isfirstname(x_i)] - 5*[y_{i+1} != B and islower(x_i) and isupper(x_{i+1})]
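Given such edge weights, "find the highest-weight path" is the Viterbi algorithm over the trellis. A minimal sketch, assuming node_weight and edge_weight sum the weighted features at each position (both names are placeholders, not from the slides):

def viterbi(tokens, labels, node_weight, edge_weight):
    # Highest-weight label sequence through the trellis.
    # node_weight(y, tokens, i) scores label y at position i;
    # edge_weight(y_prev, y, tokens, i) scores the transition into position i.
    score = {y: node_weight(y, tokens, 0) for y in labels}
    back = []
    for i in range(1, len(tokens)):
        new_score, ptr = {}, {}
        for y in labels:
            best = max(labels,
                       key=lambda yp: score[yp] + edge_weight(yp, y, tokens, i))
            new_score[y] = (score[best] + edge_weight(best, y, tokens, i)
                            + node_weight(y, tokens, i))
            ptr[y] = best
        score = new_score
        back.append(ptr)
    y = max(labels, key=lambda lab: score[lab])
    path = [y]
    for ptr in reversed(back):   # follow back-pointers to recover the path
        y = ptr[y]
        path.append(y)
    return list(reversed(path))

# e.g. viterbi("When will prof Cohen post the notes".split(),
#              ["B", "I", "O"], node_weight, edge_weight)

This costs O(N * K^2) for K labels and N tokens, instead of scoring all K^N paths.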
CRF learning, from Sha & Pereira: something like forward-backward. Idea: define a matrix of y, y' affinities at stage i: M_i[y, y'] = unnormalized probability of a transition from y to y' at stage i. Then M_i * M_{i+1} = unnormalized probability of any path through stages i and i+1.
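The matrix idea, sketched with numpy: multiplying the per-stage affinity matrices sums over all intermediate label choices, which is why the partition function Z (and, with a backward pass, the marginals needed for the CRF gradient) costs O(n * K^2) rather than O(K^n). The uniform start vector is an assumption of this toy sketch:

import numpy as np

def partition_function(M):
    # M is a list of K x K matrices; M[i][y, y2] is the unnormalized
    # affinity of moving from label y to label y2 at stage i.
    alpha = np.ones(M[0].shape[0])   # forward vector (uniform start assumed)
    for Mi in M:
        alpha = alpha @ Mi           # alpha[y2] = sum_y alpha[y] * Mi[y, y2]
    return float(alpha.sum())        # Z: total weight of all label paths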
Sha & Pereira results: CRF beats MEMM (McNemar's test); MEMM probably beats voted perceptron.
Sha & Pereira results: training times in minutes, on 375k examples.
THE STRUCTURED PERCEPTRON
Review: the perceptron. A sends instance x_i to B; B computes ŷ_i = sign(v_k . x_i). If B made a mistake (ŷ_i != y_i): v_{k+1} = v_k + y_i x_i.
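The same update as a runnable sketch (the standard perceptron, nothing slide-specific):

import numpy as np

def perceptron(X, y, epochs=10):
    # Classic mistake-driven perceptron: X is an (n, d) array, y in {-1, +1}.
    v = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if v @ x_i >= 0 else -1
            if y_hat != y_i:        # on a mistake, move v toward y_i * x_i
                v = v + y_i * x_i
    return v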
[Figures from the geometric mistake-bound argument: (3a) the guess v_2 after the two positive examples, v_2 = v_1 + x_2; (3b) the guess v_2 after the one positive and one negative example. Both are drawn against the target vector u, which separates the data with margin γ.]
Mistake bound: the number of mistakes is at most R^2 / γ^2, where R bounds the norm of the examples and γ is the margin of the target vector u.
The voted perceptron for ranking. A sends B a list of instances x_1, x_2, x_3, x_4, ... B computes ŷ_i = v_k . x_i and returns the index b* of the best-scoring x_i. If B made a mistake (b* is not the correct index b): v_{k+1} = v_k + x_b - x_{b*}.
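One round of the ranking version as code (a sketch; the names are illustrative):

import numpy as np

def rank_and_update(v, candidates, b):
    # candidates: list of feature vectors x_1..x_m; b: index of the correct one.
    # Returns the (possibly updated) weights and B's guess b*.
    scores = [v @ x for x in candidates]
    b_star = int(np.argmax(scores))
    if b_star != b:                  # mistake: v <- v + x_b - x_{b*}
        v = v + candidates[b] - candidates[b_star]
    return v, b_star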
Ranking some x's with the target vector u. [Figure: instances ordered by their projection onto u, with margin γ.]
Ranking some x's with some guess vector v, part 1. [Figure: the same instances ordered by their projection onto v.]
Ranking some x's with some guess vector v, part 2. [Figure.] The purple-circled x is x_{b*}, the one the learner has chosen to rank highest. The green-circled x is x_b, the right answer.
Correcting v by adding x_b - x_{b*}. [Figure.]
Correcting v by adding x_b - x_{b*} (part 2): the updated vector is v_{k+1} = v_k + x_b - x_{b*}. [Figure.]
[Figures: the same geometric argument carried over to the ranking case. (3a) the guess v_2 after the two positive examples: v_2 = v_1 + x_2, drawn against the target vector u with margin γ.]
Notice this doesn't depend at all on the number of x's being ranked. [Figure: (3a) the guess v_2 after the two positive examples: v_2 = v_1 + x_2, against the target vector u with margin γ.] Neither proof depends on the dimension of the x's.
The voted perceptron for ranking. A sends instances x_1, x_2, x_3, x_4, ...; B computes ŷ_i = v_k . x_i and returns the index b* of the best x_i; on a mistake, v_{k+1} = v_k + x_b - x_{b*}. Change number one is notation: replace x with g.
The voted perceptron for NER. A sends B instances g_1, g_2, g_3, g_4; B computes ŷ_i = v_k . g_i and returns the index b* of the best g_i; on a mistake, v_{k+1} = v_k + g_b - g_{b*}. 1. A sends B the Sha & Pereira paper, some feature functions, and instructions for creating the instances g: A sends a word vector x_i. Then B could create the instances g_1 = F(x_i, y_1), g_2 = F(x_i, y_2), ..., but instead B just returns the y* that gives the best score for the dot product v_k . F(x_i, y*), by using Viterbi. 2. A sends B the correct label sequence y_i. 3. On errors, B sets v_{k+1} = v_k + g_b - g_{b*} = v_k + F(x_i, y) - F(x_i, y*).
The voted perceptron for NER. A sends B instances z_1, z_2, z_3, z_4; B computes ŷ_i = v_k . z_i and returns the index b* of the best z_i; on a mistake, v_{k+1} = v_k + z_b - z_{b*}. 1. A sends a word vector x_i. 2. B just returns the y* that gives the best score for v_k . F(x_i, y*). 3. A sends B the correct label sequence y_i. 4. On errors, B sets v_{k+1} = v_k + z_b - z_{b*} = v_k + F(x_i, y) - F(x_i, y*). So, this algorithm can also be viewed as an approximation to the CRF learning algorithm, where we're using a Viterbi approximation to the expectations and stochastic gradient descent to optimize the likelihood.
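Putting the pieces together, a sketch of this NER learner, assuming a feature map F(x, y) returning a sparse dict and a viterbi_decode(v, x) returning argmax_y of v . F(x, y) (e.g., built from the Viterbi sketch earlier); both names are placeholders:

from collections import defaultdict

def structured_perceptron(data, F, viterbi_decode, epochs=5):
    # data: list of (x, y) pairs, where y is the gold label sequence.
    v = defaultdict(float)
    for _ in range(epochs):
        for x, y in data:
            y_star = viterbi_decode(v, x)        # best sequence under current v
            if y_star != y:                      # mistake-driven update:
                for feat, c in F(x, y).items():        # v <- v + F(x, y)
                    v[feat] += c
                for feat, c in F(x, y_star).items():   # v <- v - F(x, y*)
                    v[feat] -= c
    return v

Collins' version additionally averages the weight vectors over all updates, which is the "with averaging" condition in the experiments below.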
And back to the paper... (EMNLP 2002, best paper.)
Collins' experiments: POS tagging (with MXPOST features); NP chunking (words and POS tags from Brill's tagger as features, BIO output tags). Compared maxent tagging/MEMMs (trained with iterative scaling) against voted-perceptron-trained HMMs, with and without averaging, with and without feature selection (count > 5).
Collins' results
Outline: Some Sample NLP Tasks [Noah Smith]; Structured Prediction for NLP; Structured Prediction Methods: Conditional Random Fields, Structured Perceptron; Discussion
Discussion. Structured prediction is just one part of NLP: there is lots of work on learning and NLP, and it would be much easier to summarize the non-learning-related NLP. CRFs/HMMs/structured perceptrons are just one part of structured output prediction (see Chris Dyer's course). Hot topic now: representation learning and deep learning for NLP (NAACL 2013 tutorial: Deep Learning for NLP, Manning).