Personal Project: Shift-Reduce Dependency Parsing

1 Problem Statement

The goal of this project is to implement a shift-reduce dependency parser. This entails two subgoals:

- Inference: We must have a shift-reduce parser that finds the correct parse given an oracle.
- Learning: We must choose a model that approximates the oracle and train it with labeled data.

2 Framework

2.1 Projective Dependency Trees

Given a sentence $x = x_1 \ldots x_n$, we want to find its dependency tree structure. The $n$ words $x_1 \ldots x_n$ correspond to $n$ nodes in the tree. A valid dependency tree $y$ for $x$ is a directed tree over these nodes, rooted at a special node 0. We will characterize it as a set of arcs $(i, j, l)$ where the pair $(i, j)$ forms a directed edge and $l \in L$ is the label of the edge. An example is worth a thousand words:

  $x$ = I see .
  $y$ = {(0, 2, ROOT), (2, 1, SBJ), (2, 3, PU)}

[Figure: the dependency tree for "I see .", with nodes 0 (root), 1 (I), 2 (see), 3 (.) and labeled arcs ROOT, SBJ, PU.]

We will only consider projective dependency trees, in which for every edge $(i, j)$ there is a directed path from $i$ to all nodes between $i$ and $j$. As an illustration, tree (a) below is not projective since there is no path from 3 to 4 even though there is an edge (3, 5). In contrast, tree (b) is projective.

[Figure: two dependency trees over $x_1 \ldots x_5$; (a) is non-projective, (b) is projective.]

A projective dependency tree has the nested property: for every node, the set of nodes reachable from it forms a contiguous subsequence of the sentence. Note that tree (a) violates this property.

2.2 Shift-Reduce Parsing

At any point in parsing sentence $x$, a shift-reduce parser maintains a parser configuration $c = (S, Q, A)$ with respect to $x$, where

- $S$ is a stack $[\ldots i]_S$ of nodes that have been processed,
- $Q$ is a queue $[j \ldots]_Q$ of nodes that are yet to be processed,
- $A$ is the set of arcs built so far.
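To make the arc representation and the nested property concrete, here is a minimal Python sketch of a projectivity check; the names (`Arc`, `is_projective`) are illustrative choices, not from the original text.

```python
from collections import namedtuple

# An arc (head, dependent, label); node 0 is the artificial root.
Arc = namedtuple("Arc", ["head", "dep", "label"])

def is_projective(arcs, n):
    """Check projectivity of a dependency tree over nodes 0..n.

    A tree is projective iff for every arc (i, j), every node strictly
    between i and j is reachable from i (the nested property).
    """
    children = {k: [] for k in range(n + 1)}
    for a in arcs:
        children[a.head].append(a.dep)

    def reachable(i):
        # All nodes in the subtree rooted at i (including i itself).
        seen, stack = {i}, [i]
        while stack:
            for c in children[stack.pop()]:
                if c not in seen:
                    seen.add(c)
                    stack.append(c)
        return seen

    for a in arcs:
        lo, hi = sorted((a.head, a.dep))
        span = reachable(a.head)
        if any(k not in span for k in range(lo + 1, hi)):
            return False
    return True

# The "I see ." example: y = {(0,2,ROOT), (2,1,SBJ), (2,3,PU)}
y = [Arc(0, 2, "ROOT"), Arc(2, 1, "SBJ"), Arc(2, 3, "PU")]
print(is_projective(y, 3))  # True
```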

The parser moves from one configuration to the next by performing one of the following transitions:

  left-arc(l):  $([\ldots i, j]_S,\; Q,\; A) \Rightarrow ([\ldots j]_S,\; Q,\; A \cup \{(j, i, l)\})$
  right-arc(l): $([\ldots i, j]_S,\; Q,\; A) \Rightarrow ([\ldots i]_S,\; Q,\; A \cup \{(i, j, l)\})$
  shift:        $([\ldots]_S,\; [i \ldots]_Q,\; A) \Rightarrow ([\ldots i]_S,\; [\ldots]_Q,\; A)$

Let $T$ be the set of all transitions. Given $x = x_1 \ldots x_n$, the parser initializes the configuration as $c = ([0]_S, [1 \ldots n]_Q, \{\})$ and applies a sequence of transitions $t \in T$ to reach the goal configuration $c = ([0]_S, []_Q, A)$ for some final set of arcs $A$. Then it returns $y = A$ as the predicted dependency tree.

3 Inference

Let $o$ be an oracle that predicts the correct transition $o(x, c) = t \in T$ for any configuration $c$ with respect to sentence $x$. Using this oracle, we can find the true dependency tree for any sentence with the algorithm Shift-ReduceParse. Note that the running time is linear in the length of the sentence $n$, since there can be at most $2n$ transitions before reaching the goal configuration.

Shift-ReduceParse
Input: a sentence $x = x_1 \ldots x_n$, an oracle $o$
Output: a dependency tree $y$, transition history $H$

  Initialize $c \leftarrow ([0]_S, [1 \ldots n]_Q, \{\})$ and $H \leftarrow \{\}$.
  While $|S| > 1$ or $Q \neq []$,
    $t \leftarrow o(x, c)$
    $H \leftarrow H \cup \{(c, t)\}$
    $c = (S, Q, A) \leftarrow t(c)$
  Return $y = A$ and $H$.

4 Learning

Given $Q$ training examples $(x^{(1)}, y^{(1)}) \ldots (x^{(Q)}, y^{(Q)})$, where $x^{(q)}$ is a sentence and $y^{(q)}$ is the dependency tree associated with it, we want to train a predictor $\hat{o}$ that approximates the oracle $o$.

4.1 Sample Extraction

Since the oracle receives a sentence $x$ and a parser configuration $c$ with respect to $x$ as input and returns a transition $t \in T$ as output, we need to prepare samples of the form $((q, c), t)$ where $q \in [Q]$ points to the relevant sentence $x^{(q)}$. For this purpose, we make use of an auxiliary function NextTransition.

NextTransition
Input: a dependency tree $y$, a configuration $c = (S, Q, A)$
Output: the next transition to be applied to $c$ based on $y$

  1. Return shift if $|S| < 2$.
  2. Otherwise, $S = [\ldots i, j]_S$ for some $i < j$.
     (a) Return left-arc(l) if $(j, i, l) \in y$.
     (b) Return right-arc(l) if $(i, j, l) \in y$ and every $(j, j', l') \in y$ is also in $A$.
     (c) Return shift otherwise.
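The transition system and the oracle-driven parse loop can be sketched in a few lines of Python; the function names (`initial_config`, `shift_reduce_parse`, `next_transition`, ...) are our own, and the oracle here is passed the configuration only (it can close over the sentence).

```python
# Minimal sketch of the transition system and Shift-ReduceParse / NextTransition.
# Configurations are (stack, queue, arcs); arcs is a set of (head, dep, label).

def initial_config(n):
    """Start state: root (node 0) on the stack, words 1..n on the queue."""
    return ([0], list(range(1, n + 1)), set())

def left_arc(config, label):
    stack, queue, arcs = config
    i, j = stack[-2], stack[-1]
    return (stack[:-2] + [j], queue, arcs | {(j, i, label)})

def right_arc(config, label):
    stack, queue, arcs = config
    i, j = stack[-2], stack[-1]
    return (stack[:-1], queue, arcs | {(i, j, label)})

def shift(config, label=None):
    stack, queue, arcs = config
    return (stack + [queue[0]], queue[1:], arcs)

def shift_reduce_parse(n, oracle):
    """Run the parser with an oracle; return the arc set and the transition history."""
    config, history = initial_config(n), []
    while len(config[0]) > 1 or config[1]:
        action, label = oracle(config)          # the oracle picks the next transition
        history.append((config, (action, label)))
        config = {"left-arc": left_arc,
                  "right-arc": right_arc,
                  "shift": shift}[action](config, label)
    return config[2], history

def next_transition(gold_arcs, config):
    """Static oracle (NextTransition): derive the correct move from a gold tree."""
    stack, _, arcs = config
    if len(stack) < 2:
        return "shift", None
    i, j = stack[-2], stack[-1]
    for (h, d, l) in gold_arcs:
        if (h, d) == (j, i):
            return "left-arc", l
        # right-arc only once j has collected all of its gold children
        if (h, d) == (i, j) and all(a in arcs for a in gold_arcs if a[0] == j):
            return "right-arc", l
    return "shift", None

# Parse "I see ." (n = 3) with the static oracle for its gold tree.
gold = {(0, 2, "ROOT"), (2, 1, "SBJ"), (2, 3, "PU")}
arcs, history = shift_reduce_parse(3, lambda c: next_transition(gold, c))
assert arcs == gold
```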

The extra condition in 2(b) makes sure that node $j$ has collected all its children before it is removed from the stack. This is not necessary in 2(a): if the tree $y$ is projective, node $i$ must have collected all its children before reaching $j$ in order to satisfy the nested property.

Now we can extract a set of samples $E = \{((q^{(z)}, c^{(z)}), t^{(z)})\}_{z=1}^{Z}$ of some size $Z$ using the algorithm ExtractSamples.

ExtractSamples
Input: training examples $(x^{(1)}, y^{(1)}) \ldots (x^{(Q)}, y^{(Q)})$
Output: a set of samples $E = \{((q^{(z)}, c^{(z)}), t^{(z)})\}_{z=1}^{Z}$

  $E \leftarrow \{\}$
  For $q = 1 \ldots Q$,
    Define oracle $o_q$ for $x^{(q)}$ as follows: given configuration $c$, the oracle predicts
      $o_q(x^{(q)}, c) = \text{NextTransition}(y^{(q)}, c)$
    $y_q, H_q \leftarrow \text{Shift-ReduceParse}(x^{(q)}, o_q)$   // $y_q = y^{(q)}$ must hold
    $E \leftarrow E \cup \{((q, c), t) : (c, t) \in H_q\}$
  Return $E$.

4.2 Feature Representation

Now that we have labeled samples $((q, c), t)$, it is straightforward to train a multiclass classifier that mimics the oracle. But first, we must decide how to represent the input $(q, c)$. Let $\phi$ be a feature function that maps a sentence-configuration pair $(x, c)$ to a $d$-dimensional vector $\phi(x, c) \in \mathbb{R}^d$. We can use any features of $x$ and $c = (S, Q, A)$ that are useful for making the prediction, such as

- part-of-speech tags of the nodes on the stack,
- word identities of the nodes on the stack,
- labels of the arcs originating from the nodes on the stack.

For example, suppose we extract a sample $((q, c), t)$ where

  $x^{(q)}$ = I see .
  $c = ([0, 2, 3]_S,\; []_Q,\; \{(2, 1, \text{SBJ})\})$
  $t$ = right-arc(PU)

We can use a binary vector $v = \phi(x^{(q)}, c) \in \mathbb{R}^d$ to encode the following information:

  ROOT = True
  POS(3) = SYM      POS(2) = VB
  WORD(3) = .       WORD(2) = see
  ARC-L(3) = (none) ARC-R(3) = (none)
  ARC-L(2) = SBJ    ARC-R(2) = (none)

For notational cleanness, we will use $E = \{(v^{(z)}, t^{(z)})\}_{z=1}^{Z} = \{(\phi(x^{(q^{(z)})}, c^{(z)}), t^{(z)})\}_{z=1}^{Z}$ to denote the set of feature-transformed samples.
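Continuing the sketch above, sample extraction and a toy feature map might look as follows; representing $\phi(x, c)$ as the set of active binary feature names is an implementation choice, and `extract_samples` / `featurize` are our own names.

```python
def extract_samples(treebank):
    """treebank: list of (words, gold_arcs); returns ((q, config), transition) samples."""
    samples = []
    for q, (words, gold) in enumerate(treebank):
        oracle = lambda c, g=gold: next_transition(g, c)
        arcs, history = shift_reduce_parse(len(words), oracle)
        assert arcs == gold                      # y_q = y^(q) must hold
        samples += [((q, c), t) for (c, t) in history]
    return samples

def featurize(words, postags, config):
    """Map a sentence-configuration pair to the set of its active binary features."""
    stack, _, arcs = config
    feats = set()
    if 0 in stack:                               # artificial root still on the stack
        feats.add("ROOT=True")
    # Top two stack nodes (s0 = top, s1 = second): word, POS tag, arc labels so far.
    for k, node in enumerate(reversed(stack[-2:])):
        if node == 0:
            continue
        feats.add(f"WORD(s{k})={words[node - 1]}")
        feats.add(f"POS(s{k})={postags[node - 1]}")
        for (h, d, l) in arcs:
            if h == node:
                feats.add(f"ARC-{'L' if d < h else 'R'}(s{k})={l}")
    return feats
```

With this sparse representation, the dot products $w_t \cdot v$ in the next section become sums over the active feature names.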

4.3 Averaged Perceptron

A linear classifier keeps a weight vector $w_t \in \mathbb{R}^d$ for each $t \in T$ and defines a score function $f(w_t, \phi(x, c)) \in \mathbb{R}$. Given a sentence $x$ and a parser configuration $c$ with respect to $x$, an oracle approximator $\hat{o}$ using this classifier will predict

  $\hat{o}(x, c) = \arg\max_{t \in T} f(w_t, \phi(x, c))$

We will choose the averaged perceptron as our classifier, which defines $f(w_t, \phi(x, c)) = w_t \cdot \phi(x, c)$. The weight vector $w_t \in \mathbb{R}^d$ is learned from the feature-transformed samples $E = \{(v^{(z)}, t^{(z)})\}_{z=1}^{Z}$ using the algorithm TrainAveragedPerceptron. Two remarks on this particular version of the algorithm:

Averaging: Instead of storing a vector $w_t^{r,z} \in \mathbb{R}^d$ for all $t \in T$, $r \in [R]$, $z \in [Z]$ and then averaging

  $w_t = \frac{1}{RZ} \sum_{r=1}^{R} \sum_{z=1}^{Z} w_t^{r,z}$

we keep each distinct weight vector only once and record, in a dictionary $s_t$, the number of examples it survives without making a mistake. The final weights are then given by

  $w_t \leftarrow \frac{\sum_{w'_t \in s_t} s_t(w'_t)\, w'_t}{\sum_{w'_t \in s_t} s_t(w'_t)}$

Update: The update scheme here is called "ultraconservative":

  $w_t \leftarrow w_t + \gamma_t v^{(z)}$

where $\gamma_{t^{(z)}} = 1$, $\sum_{t \neq t^{(z)}} \gamma_t = -1$, and $\gamma_t = 0$ for every $t \in T$ on which no mistake is made. The normalization constraint is necessary for the perceptron convergence guarantee.

TrainAveragedPerceptron
Input: $E = \{(v^{(z)}, t^{(z)})\}_{z=1}^{Z}$, number of rounds $R \in \mathbb{N}$
Data structure: a dictionary $s_t$ for each $t \in T$
Output: $w_t \in \mathbb{R}^d$ for each $t \in T$

  $w_t \leftarrow (0, \ldots, 0) \in \mathbb{R}^d$ for all $t \in T$
  For $r = 1 \ldots R$, for $z = 1 \ldots Z$,
    For $t \in T$: $s_t(w_t) \leftarrow s_t(w_t) + 1$ if $w_t \in s_t$, else $s_t(w_t) \leftarrow 1$.
    Find the set of transitions that incorrectly scored at least as high as the true transition:
      $\Psi = \{t \in T \setminus \{t^{(z)}\} : w_t \cdot v^{(z)} \geq w_{t^{(z)}} \cdot v^{(z)}\}$
    If $|\Psi| > 0$,
      Update $w_{t^{(z)}} \leftarrow w_{t^{(z)}} + v^{(z)}$
      For $t \in \Psi$, update $w_t \leftarrow w_t - \frac{1}{|\Psi|} v^{(z)}$
  Return $w_t \leftarrow \frac{\sum_{w'_t \in s_t} s_t(w'_t)\, w'_t}{\sum_{w'_t \in s_t} s_t(w'_t)}$ for all $t \in T$.

4.4 Summary

Here we summarize the procedure for estimating the oracle developed in this section. The inputs are training data of dependency trees, a feature function $\phi$ to represent any sentence-configuration pair $(x, c)$, and the number of training rounds $R$.
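A compact Python sketch of this training loop, continuing the sparse feature-set representation from above (names are our own): weight vectors are dictionaries from feature names to values, and a running sum of the weights after each example plays the role of the survival counts $s_t$.

```python
from collections import defaultdict

def train_averaged_perceptron(samples, transitions, rounds):
    """samples: list of (feature_set, gold_transition) pairs; returns averaged weights."""
    w = {t: defaultdict(float) for t in transitions}      # current weights
    total = {t: defaultdict(float) for t in transitions}  # running sums for averaging
    seen = 0

    def score(t, feats):
        return sum(w[t][f] for f in feats)

    for _ in range(rounds):
        for feats, gold_t in samples:
            # Credit the current weights with this example (the averaging step).
            seen += 1
            for t in transitions:
                for f, v in w[t].items():
                    total[t][f] += v
            # Transitions scoring at least as high as the truth count as mistakes.
            mistakes = [t for t in transitions
                        if t != gold_t and score(t, feats) >= score(gold_t, feats)]
            if mistakes:
                for f in feats:
                    w[gold_t][f] += 1.0                    # promote the true transition
                    for t in mistakes:
                        w[t][f] -= 1.0 / len(mistakes)     # ultraconservative demotion
    return {t: {f: v / seen for f, v in total[t].items()} for t in transitions}
```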

EstimateOracle
Input: $(x^{(1)}, y^{(1)}) \ldots (x^{(Q)}, y^{(Q)})$, feature function $\phi$, number of rounds $R$
Output: an oracle approximator $\hat{o}$

  $E \leftarrow \text{ExtractSamples}((x^{(1)}, y^{(1)}) \ldots (x^{(Q)}, y^{(Q)}))$
  $E \leftarrow \{(\phi(x^{(q)}, c), t) : ((q, c), t) \in E\}$
  $\{w_t\}_{t \in T} \leftarrow \text{TrainAveragedPerceptron}(E, R)$
  Return an oracle approximator $\hat{o}$ that predicts, for any sentence $x$ and configuration $c$ with respect to $x$,
    $\hat{o}(x, c) = \arg\max_{t \in T} w_t \cdot \phi(x, c)$

(A short end-to-end sketch in Python is given after the references.)

References

Sandra Kübler, Ryan McDonald, and Joakim Nivre (2009). Dependency Parsing.
Joakim Nivre (2005). Inductive Dependency Parsing of Natural Language Text.
Massimiliano Ciaramita and Giuseppe Attardi (2007). Dependency Parsing with Second-Order Feature Maps and Annotated Semantic Information.
Michael Collins (2002). Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms.
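Tying the pieces together, here is how EstimateOracle and the final parser might be composed from the sketches above; all function names come from those sketches (not from the original text), the POS tags would come from the treebank in practice, and the approximated oracle is restricted to transitions that are legal in the current configuration so that the demo cannot derail.

```python
def estimate_oracle(treebank, postags, rounds):
    """treebank: list of (words, gold_arcs); postags: parallel list of tag lists."""
    raw = extract_samples(treebank)
    feat_samples = [(featurize(treebank[q][0], postags[q], c), t)
                    for ((q, c), t) in raw]
    transitions = sorted({t for _, t in feat_samples})
    weights = train_averaged_perceptron(feat_samples, transitions, rounds)

    def oracle_hat(words, tags, config):
        stack, queue, _ = config
        # Score only transitions that are legal in this configuration.
        legal = [t for t in transitions
                 if (t[0] == "shift" and queue) or (t[0] != "shift" and len(stack) > 1)]
        feats = featurize(words, tags, config)
        return max(legal, key=lambda t: sum(weights[t].get(f, 0.0) for f in feats))

    return oracle_hat

# Toy usage: train on a single tree and parse its sentence back.
treebank = [(["I", "see", "."], {(0, 2, "ROOT"), (2, 1, "SBJ"), (2, 3, "PU")})]
postags = [["PRP", "VB", "SYM"]]
o_hat = estimate_oracle(treebank, postags, rounds=5)
arcs, _ = shift_reduce_parse(3, lambda c: o_hat(treebank[0][0], postags[0], c))
print(arcs)
```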