Log-Linear Models with Structured Outputs

Log-Linear Models with Structured Outputs Natural Language Processing CS 4120/6120 Spring 2016 Northeastern University David Smith (some slides from Andrew McCallum)

Overview Sequence labeling task (cf. POS tagging); independent classifiers; HMMs; (conditional) maximum entropy Markov models; conditional random fields; beyond sequence labeling

Sequence Labeling Inputs: x = (x_1, ..., x_n) Labels: y = (y_1, ..., y_n) Typical goal: given x, predict y. Example sequence labeling tasks: part-of-speech tagging; named-entity recognition (NER), i.e., labeling people, places, organizations

NER Example:

First Solution: Maximum Entropy Classifier Conditional model p(y | x). Do not waste effort modeling p(x), since x is given at test time anyway. Allows more complicated input features, since we do not need to model dependencies between them. Feature functions f(x, y): f_1(x, y) = { word is Boston & y = location } f_2(x, y) = { first letter capitalized & y = name } f_3(x, y) = { x is an HTML link & y = location }
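As a concrete illustration, here is a minimal sketch (with illustrative feature and label names, not code from the lecture) of indicator feature functions and the resulting conditional log-linear distribution p(y | x):

```python
import math

# Hypothetical indicator features f_k(x, y); x is a dict describing one input token.
def features(x, y):
    return [
        1.0 if x["word"] == "Boston" and y == "location" else 0.0,
        1.0 if x["word"][:1].isupper() and y == "name" else 0.0,
        1.0 if x["is_html_link"] and y == "location" else 0.0,
    ]

def p_y_given_x(x, weights, labels):
    """Log-linear model: p(y | x) proportional to exp(sum_k w_k * f_k(x, y))."""
    scores = {y: math.exp(sum(w * f for w, f in zip(weights, features(x, y))))
              for y in labels}
    z = sum(scores.values())  # normalize over the label set
    return {y: s / z for y, s in scores.items()}

print(p_y_given_x({"word": "Boston", "is_html_link": False},
                  weights=[1.5, 0.7, 0.3],
                  labels=["location", "name", "other"]))
```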

First Solution: MaxEnt Classifier How should we choose a classifier? Principle of maximum entropy We want a classifier that: Matches feature constraints from training data. Predictions maximize entropy. There is a unique, exponential family distribution that meets these criteria.

First Solution: MaxEnt Classifier Problem with using a maximum entropy classifier for sequence labeling: It makes decisions at each position independently!

Second Solution: HMM P(y, x) = ∏_t P(y_t | y_{t-1}) P(x_t | y_t) Defines a generative process. Can be viewed as a weighted finite state machine.
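A minimal sketch of that joint probability, with tiny made-up transition and emission tables (not parameters from the lecture):

```python
# HMM joint probability: P(y, x) = prod_t P(y_t | y_{t-1}) * P(x_t | y_t)
trans = {("<s>", "DT"): 0.6, ("DT", "NN"): 0.8, ("NN", "VB"): 0.4}      # P(y_t | y_{t-1})
emit = {("DT", "the"): 0.7, ("NN", "dog"): 0.1, ("VB", "barks"): 0.05}  # P(x_t | y_t)

def hmm_joint(tags, words):
    prob, prev = 1.0, "<s>"
    for y, x in zip(tags, words):
        prob *= trans.get((prev, y), 1e-6) * emit.get((y, x), 1e-6)
        prev = y
    return prob

print(hmm_joint(["DT", "NN", "VB"], ["the", "dog", "barks"]))  # 0.6*0.7 * 0.8*0.1 * 0.4*0.05
```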

Second Solution: HMM How can we represent multiple features in an HMM? Treat them as conditionally independent given the class label? The example features we talked about are not independent. Try to model a more complex generative process of the input features? We may lose tractability (i.e., lose dynamic programming for exact inference).

Second Solution: HMM Let's use a conditional model instead.

Third Solution: MEMM Use a series of maximum entropy classifiers that know the previous label. Define a Viterbi algorithm for inference. P(y | x) = ∏_t P_{y_{t-1}}(y_t | x)
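A minimal Viterbi sketch for an MEMM, assuming a hypothetical, already-trained local model local_p(prev_tag, word) that returns a distribution over tags:

```python
def viterbi_memm(words, tags, local_p):
    # best[t][y] = (probability of the best tag sequence ending in y at position t, backpointer)
    best = [{y: (local_p("<s>", words[0]).get(y, 1e-9), None) for y in tags}]
    for t in range(1, len(words)):
        col = {}
        for y in tags:
            col[y] = max(((best[t - 1][yp][0] * local_p(yp, words[t]).get(y, 1e-9), yp)
                          for yp in tags), key=lambda c: c[0])
        best.append(col)
    y = max(tags, key=lambda y: best[-1][y][0])   # best final tag
    path = [y]
    for t in range(len(words) - 1, 0, -1):        # follow backpointers
        y = best[t][y][1]
        path.append(y)
    return list(reversed(path))

uniformish = lambda prev, word: {"N": 0.6, "V": 0.4}   # placeholder local distribution
print(viterbi_memm(["time", "flies"], ["N", "V"], uniformish))
```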

Third Solution: MEMM Combines the advantages of maximum entropy and HMM! But there is a problem

Problem with MEMMs: Label Bias In some state-space configurations, MEMMs essentially completely ignore the inputs. (Example worked on the board: a small finite-state machine with two branches out of the start state, one spelling "rib" and one spelling "rob"; since each intermediate state has only one outgoing transition, the observation cannot change the decision once a branch is entered.) This is not a problem for HMMs, because the input sequence is generated by the model.

Fourth Solution: Conditional Random Field Conditionally-trained, undirected graphical model. For a standard linear-chain structure: P(y | x) = (1/Z(x)) ∏_t Φ(y_t, y_{t-1}, x), where Φ(y_t, y_{t-1}, x) = exp( Σ_k λ_k f_k(y_t, y_{t-1}, x) )

Fourth Solution: CRF CRFs have the advantages of MEMMs but avoid the label bias problem: CRFs are globally normalized, whereas MEMMs are locally normalized. Widely used and applied, CRFs give state-of-the-art results in many domains. Remember, Z is the normalization constant. How do we compute it?
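A minimal sketch of computing Z(x) with the forward algorithm, assuming the factor potentials Φ have already been evaluated into a table psi[t][(y_prev, y)] (names are illustrative):

```python
def partition_function(psi, tags, n):
    # alpha[y] = sum, over all label prefixes ending in y, of the product of potentials
    alpha = {y: psi[0][("<s>", y)] for y in tags}
    for t in range(1, n):
        alpha = {y: sum(alpha[yp] * psi[t][(yp, y)] for yp in tags) for y in tags}
    return sum(alpha.values())

# Toy check: with two tags and all potentials equal to 1, Z counts the label sequences.
tags, n = ["A", "B"], 3
psi = [{("<s>", y): 1.0 for y in tags}] + \
      [{(yp, y): 1.0 for yp in tags for y in tags} for _ in range(n - 1)]
print(partition_function(psi, tags, n))  # 8.0 = 2**3
```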

CRF Applications Part-of-speech tagging, named entity recognition, gene prediction, Chinese word segmentation, morphological disambiguation, citation parsing, document layout (e.g., table) classification, etc.

NER as Sequence Tagging O B-E O O O B-L I-L The Phoenicians came from the Red Sea

Example features supporting the B-E tag on Phoenicians: capitalized word; ends in s; ends in ans; previous word the; Phoenicians in gazetteer.

Example features at another position: not capitalized; tagged as VB; B-E to right.

Example features supporting the I-L tag on Sea: word sea; word sea preceded by the ADJ; hard constraint: I-L must follow B-L or I-L.
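A minimal sketch of token-level feature extraction of this kind (the feature names and the tiny gazetteer are illustrative, not from the lecture):

```python
# Hypothetical gazetteer and feature templates inspired by the example above.
GAZETTEER = {"Phoenicians", "Red Sea"}

def token_features(words, pos_tags, i):
    w = words[i]
    feats = {
        "capitalized": w[:1].isupper(),
        "ends_in_s": w.endswith("s"),
        "ends_in_ans": w.endswith("ans"),
        "prev_word_the": i > 0 and words[i - 1].lower() == "the",
        "in_gazetteer": w in GAZETTEER,
        "pos=" + pos_tags[i]: True,
    }
    return {k: v for k, v in feats.items() if v}   # keep only the active features

words = ["The", "Phoenicians", "came", "from", "the", "Red", "Sea"]
pos = ["DT", "NNPS", "VBD", "IN", "DT", "NNP", "NNP"]
print(token_features(words, pos, 1))
```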

Overview What computations do we need? Smoothing log-linear models MEMMs vs. CRFs again Action-based parsing and dependency parsing

Recipe for Conditional Training of p(y | x) 1. Gather constraints/features from training data. 2. Initialize all parameters to zero. 3. Classify training data with current parameters; calculate expectations. 4. The gradient is the difference between the observed (empirical) feature counts and the expected feature counts. 5. Take a step in the direction of the gradient. 6. Repeat from 3 until convergence. Where have we seen expected counts before? EM!

Gradient-Based Training λ := λ + rate * Gradient(F) After all training examples? (batch) After every example? (on-line) Use second derivative for faster learning? A big field: numerical optimization
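A minimal sketch of one such batch gradient step for the (unstructured) maximum entropy classifier, where each feature's gradient is its empirical count minus its expected count under the current model; the data format, features function, and label set are illustrative:

```python
from collections import defaultdict
import math

def gradient_step(data, weights, features, labels, rate=0.1):
    """One batch step: gradient_k = empirical count of f_k minus expected count of f_k."""
    grad = defaultdict(float)
    for x, y_gold in data:
        for k, v in features(x, y_gold).items():          # observed (empirical) counts
            grad[k] += v
        scores = {y: math.exp(sum(weights.get(k, 0.0) * v
                                  for k, v in features(x, y).items()))
                  for y in labels}
        z = sum(scores.values())
        for y in labels:                                   # expected counts under the model
            p = scores[y] / z
            for k, v in features(x, y).items():
                grad[k] -= p * v
    new_weights = dict(weights)
    for k, g in grad.items():
        new_weights[k] = new_weights.get(k, 0.0) + rate * g
    return new_weights
```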

Parsing as Structured Prediction

Shift-reduce parsing

Stack               Input remaining    Action
()                  Book that flight   shift
(Book)              that flight        reduce, Verb -> book (choice #1 of 2)
(Verb)              that flight        shift
(Verb that)         flight             reduce, Det -> that
(Verb Det)          flight             shift
(Verb Det flight)                      reduce, Noun -> flight
(Verb Det Noun)                        reduce, NOM -> Noun
(Verb Det NOM)                         reduce, NP -> Det NOM
(Verb NP)                              reduce, VP -> Verb NP
(VP)                                   reduce, S -> VP
(S)                                    SUCCESS!

Ambiguity may lead to the need for backtracking. Compare to an MEMM: train a log-linear model of p(action | context).
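A minimal sketch of greedy classifier-driven shift-reduce parsing in that spirit; score_action(context, action) stands in for a trained log-linear model of p(action | context), the grammar is passed in as (lhs, rhs) reduction rules, and unary-rule cycles and search errors are ignored (all names are illustrative):

```python
def greedy_shift_reduce(words, reductions, score_action):
    stack, buffer = [], list(words)
    while buffer or len(stack) > 1:
        context = (tuple(stack[-2:]), tuple(buffer[:1]))   # crude context features
        actions = (["shift"] if buffer else []) + \
                  [("reduce", lhs, rhs) for lhs, rhs in reductions
                   if tuple(stack[-len(rhs):]) == rhs]
        if not actions:
            return None            # dead end; a real parser would backtrack or beam-search
        best = max(actions, key=lambda a: score_action(context, a))
        if best == "shift":
            stack.append(buffer.pop(0))
        else:
            _, lhs, rhs = best
            del stack[-len(rhs):]
            stack.append(lhs)
    return stack[0]

reductions = [("Verb", ("Book",)), ("Det", ("that",)), ("Noun", ("flight",)),
              ("NOM", ("Noun",)), ("NP", ("Det", "NOM")), ("VP", ("Verb", "NP")), ("S", ("VP",))]
prefer_reduce = lambda ctx, a: 0 if a == "shift" else 1   # placeholder for a trained scorer
print(greedy_shift_reduce(["Book", "that", "flight"], reductions, prefer_reduce))  # S
```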

Structured Log-Linear Models Linear model for scoring structures: score(out, in) = θ · features(out, in). Get a probability distribution by normalizing: p(out | in) = (1/Z) e^{score(out, in)}, where Z = Σ_{out' ∈ GEN(in)} e^{score(out', in)}. Viz. logistic regression, Markov random fields, undirected graphical models. Computing Z is usually the bottleneck in NLP. Inference: sampling, variational methods, dynamic programming, local search, ... Training: maximum likelihood, minimum risk, etc.
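A minimal sketch of this setup for a case where GEN(in) is small enough to enumerate explicitly (all names are illustrative):

```python
import math

def p_out_given_in(inp, gen, weights, features):
    # score(out, in) = theta . features(out, in); normalize over GEN(in)
    scores = {out: sum(weights.get(k, 0.0) * v for k, v in features(out, inp).items())
              for out in gen(inp)}
    z = sum(math.exp(s) for s in scores.values())
    return {out: math.exp(s) / z for out, s in scores.items()}

gen = lambda inp: ["NP", "VP"]                        # toy candidate set
features = lambda out, inp: {f"{out}&{inp}": 1.0}     # one indicator per (out, in) pair
print(p_out_given_in("flight", gen, {"NP&flight": 2.0}, features))
```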

Structured Log-Linear Models With latent variables: several layers of linguistic structure and unknown correspondences, naturally handled by the probabilistic framework. Several inference setups, for example: p(out_1 | in) = Σ_{out_2, alignment} p(out_1, out_2, alignment | in). Another computational problem.

Edge-Factored Parsers No global features of a parse (McDonald et al. 2005). Each feature is attached to some edge. MST or CKY-like dynamic programming for fast O(n²) or O(n³) parsing. Byl jasný studený dubnový den a hodiny odbíjely třináctou It was a bright cold day in April and the clocks were striking thirteen

Edge-Factored Parsers Is this a good edge? Yes, lots of positive features: jasný → den ("bright day"); jasný → N ("bright NOUN"); A → N; A → N preceding a conjunction. (Part-of-speech tags for the sentence: Byl/V jasný/A studený/A dubnový/A den/N a/J hodiny/N odbíjely/V třináctou/C.)

Edge-Factored Parsers How about this competing edge? Not as good, lots of red: jasný → hodiny ("bright clocks") ... undertrained ...; jasn → hodi ("bright clock", stems only: byl jasn stud dubn den a hodi odbí třin); A_singular → N_plural; A → N where N follows a conjunction.

Edge-Factored Parsers Which edge is better, "bright day" or "bright clocks"? Score of an edge e = θ · features(e), where θ is our current weight vector. Standard algorithms then find the valid parse with maximum total score.
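A minimal sketch of edge-factored scoring in this style; the feature templates loosely mirror the examples above, and all names (and the crude "stem" truncation) are illustrative:

```python
def edge_features(sent, a, b):
    """Features for a candidate edge between word a and word b (indices into sent)."""
    x, y = sent[a], sent[b]
    return {
        f"wordpair={x['form']}->{y['form']}": 1.0,
        f"word+pos={x['form']}->{y['pos']}": 1.0,
        f"pospair={x['pos']}->{y['pos']}": 1.0,
        f"stempair={x['form'][:4]}->{y['form'][:4]}": 1.0,   # crude "stems only" backoff
    }

def edge_score(sent, a, b, theta):
    return sum(theta.get(k, 0.0) * v for k, v in edge_features(sent, a, b).items())

sent = [{"form": "jasný", "pos": "A"}, {"form": "den", "pos": "N"}, {"form": "hodiny", "pos": "N"}]
theta = {"pospair=A->N": 1.0, "wordpair=jasný->den": 2.0}
print(edge_score(sent, 0, 1, theta), edge_score(sent, 0, 2, theta))  # "bright day" vs "bright clocks"
```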

Local factors in a graphical model First, a familiar example: a Conditional Random Field (CRF) for POS tagging. The observed input sentence (shaded in the figure) is "find preferred tags". A possible tagging (i.e., assignment to the remaining variables) is v v v; another possible tagging is v a n.

A binary factor measures the compatibility of two adjacent tags:

      v  n  a
  v   0  2  1
  n   2  1  0
  a   0  3  1

The model reuses the same parameters at each position.

A unary factor evaluates each tag; its values depend on the corresponding word (and could be made to depend on the entire observed sentence). There is a different unary factor at each position: find: v 0.3, n 0.02, a 0; preferred: v 0.3, n 0, a 0.1; tags: v 0.2, n 0.2, a 0 (so "tags" can't be an adjective).

p(v a n) is proportional to the product of all the factors' values on v a n: 1 * 3 * 0.3 * 0.1 * 0.2.
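A minimal sketch that computes this unnormalized product of factor values from the toy tables above:

```python
binary = {("v", "v"): 0, ("v", "n"): 2, ("v", "a"): 1,
          ("n", "v"): 2, ("n", "n"): 1, ("n", "a"): 0,
          ("a", "v"): 0, ("a", "n"): 3, ("a", "a"): 1}
unary = {"find": {"v": 0.3, "n": 0.02, "a": 0},
         "preferred": {"v": 0.3, "n": 0, "a": 0.1},
         "tags": {"v": 0.2, "n": 0.2, "a": 0}}

def unnormalized_score(words, tags):
    score = 1.0
    for w, t in zip(words, tags):           # unary factors, one per word
        score *= unary[w][t]
    for t1, t2 in zip(tags, tags[1:]):      # binary factors, one per adjacent pair
        score *= binary[(t1, t2)]
    return score

print(unnormalized_score(["find", "preferred", "tags"], ["v", "a", "n"]))  # 1*3*0.3*0.1*0.2 = 0.018
```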

Graphical Models for Parsing First, the labeling example: the CRF for POS tagging (v a n). Now let's do dependency parsing! Use O(n²) boolean variables for the possible links over the sentence "find preferred links". A possible parse corresponds to an assignment of true/false values to these link variables; another parse is a different assignment. Some assignments are not legal parses: for example, one containing a cycle, or one with a cycle and multiple parents.

Local Factors for Parsing What factors determine parse probability? Unary factors score each link in isolation (each link variable gets a potential value for t and for f). But what if the best assignment isn't a tree?

Global Factors for Parsing A global TREE factor requires the links to form a legal tree (optionally also requiring the tree to be projective, i.e., no crossing links). It is a hard constraint: its potential is either 0 or 1 for each of the 64 possible assignments of the six link variables (ffffff -> 0, ffffft -> 0, fffftf -> 0, ..., fftfft -> 1, ..., tttttt -> 0). With unary link factors plus the TREE factor, this is equivalent to edge-factored parsing. Note: traditional parsers don't loop through this table to consider exponentially many trees one at a time; they use combinatorial algorithms, and so should we!

Local Factors for Parsing Second-order effects: factors on two variables, e.g., grandparent-parent-child chains (a 2x2 potential over the two link variables, say 1 everywhere except 3 when both links are present), no crossing links, siblings, hidden morphological tags, word senses and subcategorization frames.

Great Ideas in ML: Message Passing (adapted from MacKay (2003) textbook)

Count the soldiers. In a line of soldiers, each soldier passes a message forward ("1 behind you", "2 behind you", "3 behind you", ...) and backward ("1 before you", "2 before you", ...), each time adding "there's 1 of me". A soldier who only sees the incoming messages "2 before you", "there's 1 of me", and "3 behind you" forms the belief: there must be 2 + 1 + 3 = 6 of us. The next soldier, seeing "1 before you" and "4 behind you", likewise concludes 1 + 1 + 4 = 6.

On a tree, each soldier receives reports from all branches of the tree. For example, incoming reports of "3 here" and "7 here" plus "1 of me" are passed up one branch as "11 here" (= 7 + 3 + 1); reports of "3 here" and "3 here" become "7 here" (= 3 + 3 + 1) on another branch; and a soldier receiving "3 here", "7 here", and "3 here" forms the belief: there must be 14 of us. This wouldn't work correctly with a loopy (cyclic) graph.
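A tiny sketch of the soldier-counting idea as message passing on a tree (the tree itself is made up): the message a node sends toward a neighbor is 1 plus the messages it receives from all its other neighbors, and a node's belief is 1 plus all its incoming messages:

```python
edges = {1: [2, 3], 2: [1, 4, 5], 3: [1], 4: [2], 5: [2]}   # an undirected tree

def message(u, v):
    """How many soldiers are in the part of the tree hanging off u, away from v."""
    return 1 + sum(message(w, u) for w in edges[u] if w != v)

def belief(v):
    return 1 + sum(message(u, v) for u in edges[v])

print([belief(v) for v in edges])   # every node concludes there are 5 soldiers
```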

Great ideas in ML: Forward-Backward In the CRF, message passing = forward-backward = the sum-product algorithm. At each tag variable, an α message arrives from the left, a β message arrives from the right, and the unary factor contributes its scores; the belief at the variable is their componentwise product. For example, at "preferred": α = (v 3, n 1, a 6), unary = (v 0.3, n 0, a 0.1), β = (v 2, n 1, a 7), giving belief (v 1.8, n 0, a 4.2). The α and β messages are themselves computed by passing earlier messages (e.g., α = (v 7, n 2, a 1) from the left, β = (v 3, n 6, a 1) from the right) through the pairwise factor tables.
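A minimal, self-contained sketch of forward-backward (sum-product on a chain) over the toy unary and binary tables used above; the resulting beliefs, once divided by Z, are the posterior tag marginals:

```python
TAGS = ["v", "n", "a"]
binary = {("v", "v"): 0, ("v", "n"): 2, ("v", "a"): 1,
          ("n", "v"): 2, ("n", "n"): 1, ("n", "a"): 0,
          ("a", "v"): 0, ("a", "n"): 3, ("a", "a"): 1}
unary = {"find": {"v": 0.3, "n": 0.02, "a": 0},
         "preferred": {"v": 0.3, "n": 0, "a": 0.1},
         "tags": {"v": 0.2, "n": 0.2, "a": 0}}

def forward_backward(words):
    n = len(words)
    alpha = [{y: 0.0 for y in TAGS} for _ in range(n)]
    beta = [{y: 0.0 for y in TAGS} for _ in range(n)]
    for y in TAGS:
        alpha[0][y] = unary[words[0]][y]
        beta[n - 1][y] = 1.0
    for t in range(1, n):                     # forward pass
        for y in TAGS:
            alpha[t][y] = unary[words[t]][y] * sum(
                alpha[t - 1][yp] * binary[(yp, y)] for yp in TAGS)
    for t in range(n - 2, -1, -1):            # backward pass
        for y in TAGS:
            beta[t][y] = sum(binary[(y, yn)] * unary[words[t + 1]][yn] * beta[t + 1][yn]
                             for yn in TAGS)
    z = sum(alpha[n - 1][y] for y in TAGS)    # partition function
    return [{y: alpha[t][y] * beta[t][y] / z for y in TAGS} for t in range(n)], z

marginals, z = forward_backward(["find", "preferred", "tags"])
print(z, marginals[1])   # Z and the posterior tag marginals at "preferred"
```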

Sum-Product Equations Message from variable v to factor f: m_{v→f}(x) = ∏_{f' ∈ N(v)\{f}} m_{f'→v}(x). Message from factor f to variable v: m_{f→v}(x) = Σ_{x_{N(f)\{v}}} f(x_{N(f)}) ∏_{v' ∈ N(f)\{v}} m_{v'→f}(x_{v'}).

Great ideas in ML: Forward-Backward Extend the CRF to a skip chain to capture a non-local factor. The extra factor sends one more message (v 3, n 1, a 6) into the belief at "preferred", so there are more influences on the belief, which becomes (v 5.4, n 0, a 25.2). But the graph becomes loopy! The incoming messages are no longer independent? Pretend they are anyway: that is Loopy Belief Propagation.

Terminological Clarification loopy belief propagation: belief propagation run on a loopy (cyclic) graph, not the propagation of loopy beliefs.

Propagating Global Factors Loopy belief propagation is easy for local factors. But how do combinatorial factors (like TREE) compute the message to the link in question? Does the TREE factor think the link is probably t, given the messages it receives from all the other links? Old-school parsing to the rescue: that message is exactly the outside probability of the link in an edge-factored parser! The TREE factor computes all outgoing messages at once (given all incoming messages). Projective case: total O(n³) time by inside-outside. Non-projective case: total O(n³) time by inverting the Kirchhoff matrix.

Graph Theory to the Rescue! Tutte's Matrix-Tree Theorem (1948): the determinant of the Kirchhoff (a.k.a. Laplacian) adjacency matrix of a directed graph G, with row and column r struck out, is equal to the sum of scores of all directed spanning trees of G rooted at node r. Exactly the Z we need, and computable in O(n³) time!

Kirchhoff (Laplacian) Matrix Start with the matrix of edge scores, where s(i, j) is the score of choosing word j as the parent of word i (word 0 is the root, so it has no parent and its column is zero; the diagonal is also zero):

  0   s(1,0)   s(2,0)   ...   s(n,0)
  0   0        s(2,1)   ...   s(n,1)
  0   s(1,2)   0        ...   s(n,2)
  ...
  0   s(1,n)   s(2,n)   ...   0

Construction: negate the edge scores off the diagonal; set each diagonal entry to the sum of the scores in its column (all the candidate parents of that word), e.g., the (1,1) entry becomes Σ_{j≠1} s(1,j); strike out the root row and column; take the determinant. N.B.: this allows multiple children of the root, but see Koo et al. 2007.
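A minimal numeric sketch of using the theorem to get Z for non-projective parsing; here the scores are indexed as s[h, c] = exp(score of head h → child c) with 0 as the root (this head-first indexing is the code's own convention, not the slide's s(child, parent) notation):

```python
import numpy as np

def partition_function(s):
    """Sum of exp-scores of all dependency trees rooted at word 0, via Matrix-Tree."""
    L = -s.astype(float)                   # negate edge scores
    np.fill_diagonal(L, 0.0)               # ignore self-loops
    L[:, 0] = 0.0                          # no edges into the root
    np.fill_diagonal(L, -L.sum(axis=0))    # diagonal = sum of each child's incoming scores
    return np.linalg.det(L[1:, 1:])        # strike root row/column, take determinant: O(n^3)

# Toy check with 2 words and all edge scores 1: the trees rooted at 0 are
# {0->1, 0->2}, {0->1, 1->2}, {0->2, 2->1}, so Z = 3.
print(partition_function(np.ones((3, 3))))   # ~3.0
```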

Transition-Based Parsing Linear time Online Train a classifier to predict next action Deterministic or beam-search strategies But... generally less accurate

Transition-Based Parsing Arc-eager shift-reduce parsing (Nivre, 2003) Start state: ([ ], [1, ..., n], {}) Final state: (S, [ ], A) Shift: (S, i|B, A) ⇒ (S|i, B, A) Reduce: (S|i, B, A) ⇒ (S, B, A) Right-Arc: (S|i, j|B, A) ⇒ (S|i|j, B, A ∪ {i → j}) Left-Arc: (S|i, j|B, A) ⇒ (S, j|B, A ∪ {i ← j})

Transition-Based Parsing Arc-eager shift-reduce parsing (Nivre, 2003): parsing "who did you see"

Stack          Buffer                  Arcs                                              Action
[]             [who, did, you, see]    {}                                                (start)
[who]          [did, you, see]         {}                                                Shift
[]             [did, you, see]         { who <-OBJ- did }                                Left-arc OBJ
[did]          [you, see]              { who <-OBJ- did }                                Shift
[did, you]     [see]                   { who <-OBJ- did, did -SBJ-> you }                Right-arc SBJ
[did]          [see]                   { who <-OBJ- did, did -SBJ-> you }                Reduce
[did, see]     []                      { who <-OBJ- did, did -SBJ-> you, did -VG-> see } Right-arc VG
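A minimal sketch of the four arc-eager transitions as functions on (stack, buffer, arcs) configurations, replaying the derivation above; labels and word forms are plain strings, and this is an illustration rather than Nivre's reference implementation:

```python
# A configuration is (stack, buffer, arcs); arcs are (head, label, dependent) triples.
def shift(stack, buffer, arcs):
    return stack + [buffer[0]], buffer[1:], arcs

def reduce_(stack, buffer, arcs):
    return stack[:-1], buffer, arcs

def right_arc(stack, buffer, arcs, label):
    i, j = stack[-1], buffer[0]
    return stack + [j], buffer[1:], arcs | {(i, label, j)}

def left_arc(stack, buffer, arcs, label):
    i, j = stack[-1], buffer[0]
    return stack[:-1], buffer, arcs | {(j, label, i)}

config = ([], ["who", "did", "you", "see"], set())
config = shift(*config)
config = left_arc(*config, "OBJ")     # who <-OBJ- did
config = shift(*config)
config = right_arc(*config, "SBJ")    # did -SBJ-> you
config = reduce_(*config)
config = right_arc(*config, "VG")     # did -VG-> see
print(config)
```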

Transition-Based Parsing At each configuration, choose the action with the best classifier score, typically using 100k - 1M features. Very fast linear-time performance: WSJ section 23 (2k sentences) parsed in 3 s.