Graph-based Dependency Parsing. Ryan McDonald, Google Research


Graph-based Dependency Parsing. Ryan McDonald, Google Research. ryanmcd@google.com

Reader's Digest: Graph-based Dependency Parsing. Ryan McDonald, Google Research. ryanmcd@google.com

Dependency Parsing. [Figure: labelled dependency tree for the sentence "Mr Tomash will remain as a director emeritus", with arc labels ROOT, SBJ, VC, PP, NP, and NMOD.]

Definitions.
L = {l_1, l_2, ..., l_m}   the arc label set
X = x_0 x_1 ... x_n        the input sentence (x_0 is the artificial root)
Y                          a dependency graph/tree
(i, j, k) ∈ Y indicates an arc x_i → x_j with label l_k

Graph-based Parsing. Factor the weight/score of graphs by subgraphs: w(Y) = Π_{τ ∈ Y} w_τ, where τ ranges over a set of subgraphs of interest, e.g., arcs or adjacent arcs. Product vs. sum: Y* = argmax_Y Π_{τ ∈ Y} w_τ = argmax_Y Σ_{τ ∈ Y} log w_τ.

Arc-factored Graph-based Parsing. [Figure: complete weighted digraph over root, John, saw, Mary, with example arc weights 10, 9, 30, 30, 20, 0, 11, 3, 9.] Learn to weight arcs: w(Y) = Π_{a ∈ Y} w_a. Inference/Parsing/Argmax: Y* = argmax_Y Π_{a ∈ Y} w_a. [Figure: the resulting highest-weight spanning tree over root, John, saw, Mary.]
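To make the arc factorization concrete, here is a minimal sketch in Python of scoring a fixed tree under w(Y) = Π_{a ∈ Y} w_a, together with the equivalent log-space sum. The weight dictionary and example tree below are illustrative, not the exact numbers from the figure.

```python
import math

def tree_weight(arcs, w):
    """Arc-factored weight of a dependency tree: the product of its arc weights.
    arcs: set of (head, modifier, label) triples (i, j, k); w[(i, j, k)] = w^k_ij."""
    return math.prod(w[a] for a in arcs)

def tree_log_weight(arcs, w):
    """The equivalent objective in log space: the sum of log arc weights."""
    return sum(math.log(w[a]) for a in arcs)

# Illustrative weights over {0: root, 1: John, 2: saw, 3: Mary}, single label 0.
w = {(0, 2, 0): 10.0, (0, 1, 0): 9.0, (2, 1, 0): 30.0,
     (2, 3, 0): 30.0, (1, 2, 0): 30.0, (3, 2, 0): 0.5}
tree = {(0, 2, 0), (2, 1, 0), (2, 3, 0)}   # root -> saw, saw -> John, saw -> Mary
print(tree_weight(tree, w), math.exp(tree_log_weight(tree, w)))  # equal up to rounding
```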

Arc-factored Projective Parsing. W[i][j][h] = weight of the best tree spanning words i to j rooted at word h. Recurrence: combine adjacent subtrees A (spanning i..l, rooted at h) and B (spanning l+1..j, rooted at h') by adding an arc between their heads with label l_k: W[i][j][h] = max over l, h', k of w(A) + w(B) + w^k_{hh'}. Naively this is O(|L| n^5); Eisner 96 brings it down to O(n^3 + |L| n^2).
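A minimal sketch of the Eisner-style O(n^3) dynamic program for the unlabelled arc-factored case (labels can be pre-maximized per arc, which is where the |L| n^2 term comes from). It returns only the weight of the best projective tree; backpointers, omitted here for brevity, would recover the tree itself. The function name and the dense score-matrix input are assumptions, not McDonald's code.

```python
def eisner_best_score(score):
    """score[h][m]: weight of the arc from head h to modifier m; index 0 is the
    artificial root. Returns the weight of the best projective dependency tree
    (first-order Eisner-style dynamic program)."""
    n = len(score)
    NEG = float("-inf")
    # C[i][j][d] / I[i][j][d]: best complete / incomplete span over words i..j,
    # d = 1 if headed at i (right-pointing), d = 0 if headed at j (left-pointing).
    C = [[[NEG, NEG] for _ in range(n)] for _ in range(n)]
    I = [[[NEG, NEG] for _ in range(n)] for _ in range(n)]
    for i in range(n):
        C[i][i][0] = C[i][i][1] = 0.0
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            # Incomplete spans: pick the best split, then add the arc i -> j or j -> i.
            best_split = max(C[i][r][1] + C[r + 1][j][0] for r in range(i, j))
            I[i][j][1] = best_split + score[i][j]   # arc i -> j
            I[i][j][0] = best_split + score[j][i]   # arc j -> i
            # Complete spans: attach a finished subtree under the open arc.
            C[i][j][1] = max(I[i][r][1] + C[r][j][1] for r in range(i + 1, j + 1))
            C[i][j][0] = max(C[i][r][0] + I[r][j][0] for r in range(i, j))
    return C[0][n - 1][1]   # the root (word 0) heads the whole sentence

# Example: root plus three words with made-up arc scores.
scores = [[0, 9, 10, 3], [0, 0, 30, 11], [0, 30, 0, 30], [0, 9, 0, 0]]
print(eisner_best_score(scores))
```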

Arc-factored Non-projective Parsing (McDonald et al. 05). Inference: O(|L| n^2) with the Chu-Liu-Edmonds maximum spanning tree (MST) algorithm, a greedy-recursive algorithm. Spanning trees of the dense graph over the words plus root = valid dependency graphs. [Figure: weighted digraph over root, John, saw, Mary.] We win with non-projective algorithms!... err... a greedy/recursive algorithm is not the kind of parsing algorithm we are used to.
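A minimal sketch of the greedy-recursive Chu-Liu-Edmonds procedure for the unlabelled case (labels again pre-maximized per arc). This is the plain recursive formulation; the O(n^2) bound quoted above needs a more careful implementation (e.g., Tarjan's). The function names and the dense score-matrix input are assumptions for illustration.

```python
def chu_liu_edmonds(score):
    """Maximum spanning arborescence rooted at node 0 (greedy-recursive CLE).
    score[h][m] is the weight of the arc h -> m; returns head[m] for every
    node m >= 1 (head[0] is unused)."""
    n = len(score)
    # Greedy step: every non-root node picks its highest-scoring incoming arc.
    head = [0] * n
    for m in range(1, n):
        head[m] = max((h for h in range(n) if h != m), key=lambda h: score[h][m])
    cycle = _find_cycle(head)
    if cycle is None:
        return head
    # Contract the cycle into one new node c and recurse on the smaller graph.
    in_cycle = set(cycle)
    rest = [v for v in range(n) if v not in in_cycle]        # node 0 is always kept
    c = len(rest)
    NEG = float("-inf")
    new_score = [[NEG] * (c + 1) for _ in range(c + 1)]
    enter = {}   # contracted-graph node -> (external head, cycle node it attaches to)
    leave = {}   # contracted-graph node -> best cycle-internal head for that node
    for a, u in enumerate(rest):
        for b, v in enumerate(rest):
            if u != v:
                new_score[a][b] = score[u][v]
        # Entering the cycle from u: score relative to the cycle arc it would break.
        m = max(cycle, key=lambda x: score[u][x] - score[head[x]][x])
        new_score[a][c] = score[u][m] - score[head[m]][m]
        enter[a] = (u, m)
    for b, v in enumerate(rest[1:], start=1):
        # Leaving the cycle towards v: take the best cycle-internal head.
        h = max(cycle, key=lambda x: score[x][v])
        new_score[c][b] = score[h][v]
        leave[b] = h
    sub = chu_liu_edmonds(new_score)
    # Expand: keep the cycle arcs except where the chosen external arc enters it.
    result = head[:]
    for b in range(1, c):
        result[rest[b]] = leave[b] if sub[b] == c else rest[sub[b]]
    ext_head, entered = enter[sub[c]]
    result[entered] = ext_head
    return result

def _find_cycle(head):
    """Return one cycle in the greedy head graph (as a list of nodes), or None."""
    n = len(head)
    color = [0] * n                      # 0 = unvisited, 1 = on current path, 2 = done
    for start in range(1, n):
        if color[start]:
            continue
        path, v = [], start
        while v != 0 and color[v] == 0:
            color[v] = 1
            path.append(v)
            v = head[v]
        if v != 0 and color[v] == 1:     # walked back onto the current path: a cycle
            return path[path.index(v):]
        for u in path:
            color[u] = 2
    return None
```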

Beyond Arc-factored Models. Arc-factored models can be powerful, but they do not model linguistic reality: syntax is not context independent. Arity: the arity of a word = the number of modifiers it has in the graph; model arity through preference parameters. Markovization: vertical/horizontal context, e.g., adjacent arcs. [Figure: vertical and horizontal neighbourhoods of an arc.]

Projective -- Easy. W[i][j][h][a] = weight of the best tree spanning words i to j rooted at word h with arity a. The recurrence is as before but carries the head's arity: combine A (spanning i..l, rooted at h with arity a-1) and B (spanning l+1..j, rooted at h') with the arc h → h' (label l_k), maximizing over l, h', and k, scoring the arc weight w^k_{hh'} together with arity preference terms for h (w^{a-1}_h, w^a_h).

Non-projective -- Hard (McDonald and Satta 07). Arity (even just modified vs. not modified) is NP-hard. Markovization is NP-hard. This basically generalizes to any non-local information, and it generalizes Neuhaus and Bröker 97. So: arc-factored, non-projective is easier; beyond arc-factored, non-projective is harder.

Non-projective Solutions. In all cases we augment w(Y): w(Y) = Π_{(i,j,k) ∈ Y} w^k_ij × β, where β covers the arity/Markovization/etc. terms. Calculate the argmax of w(Y) using: approximations (Jason's talk!), exact ILP methods, chart-parsing algorithms, re-ranking, MCMC.

Annealing Approximations (McDonald & Pereira 06). Start with an initial guess, then make small changes to increase w(Y) = Π_{(i,j,k) ∈ Y} w^k_ij × β. Initial guess: the arc-factored argmax, argmax_Y Π_{(i,j,k) ∈ Y} w^k_ij. Until convergence: find the single-arc change that maximizes w(Y) and make that change to the guess. Good in practice, but suffers from local maxima.
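A minimal sketch of the hill-climbing loop, assuming a caller-supplied score_tree function that implements the augmented w(Y) (arc weights times the β terms) and an initial guess from the arc-factored argmax; the names are illustrative. Because it greedily takes the best single-arc change, it can stop at a local maximum, as noted above.

```python
def hill_climb(init_heads, score_tree):
    """Greedy single-arc hill climbing over dependency trees.
    init_heads: initial tree as head[m] for m = 1..n-1 (index 0 is the root, unused).
    score_tree: scores a whole tree; may include non-local terms (arity, adjacent arcs)."""
    heads = list(init_heads)
    n = len(heads)
    current = score_tree(heads)
    while True:
        best_change, best_score = None, current
        for m in range(1, n):                    # try re-attaching every word ...
            for h in range(n):                   # ... to every other possible head
                if h == m or h == heads[m] or _creates_cycle(heads, h, m):
                    continue
                candidate = heads[:]
                candidate[m] = h
                s = score_tree(candidate)
                if s > best_score:
                    best_change, best_score = (m, h), s
        if best_change is None:                  # no single-arc change helps: converged
            return heads
        heads[best_change[0]] = best_change[1]
        current = best_score

def _creates_cycle(heads, h, m):
    """Re-attaching m under h creates a cycle exactly when m is an ancestor of h."""
    v = h
    while v != 0:
        if v == m:
            return True
        v = heads[v]
    return False
```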

Integer Linear Programming (ILP) (Riedel and Clarke 06, Kübler et al. 09, Martins, Smith and Xing 09). An ILP is an optimization problem with a linear objective function and a set of linear constraints. ILPs are NP-hard in the worst case, but well understood, with fast algorithms in practice. Dependency parsing can be cast as an ILP. Note: we will work in log space: Y* = argmax_{Y ∈ Y(G_X)} Σ_{(i,j,k) ∈ Y} log w^k_ij.

Arc-factored Dependency Parsing as an ILP (from Kübler, McDonald and Nivre 2009). Define integer variables: a^k_ij ∈ {0, 1}, with a^k_ij = 1 iff (i, j, k) ∈ Y; and b_ij ∈ {0, 1}, with b_ij = 1 iff there is a directed path from x_i to x_j in Y.

maximize over a:  Σ_{i,j,k} a^k_ij log w^k_ij
such that (constraining arc assignments to produce a tree):
  Σ_{i,k} a^k_{i0} = 0                 (the root x_0 has no head)
  ∀j ≥ 1: Σ_{i,k} a^k_{ij} = 1         (every other word has exactly one head)
  ∀i,j,k: b_ij − a^k_ij ≥ 0            (an arc implies a path)
  ∀i,j,k: 2b_ik − b_ij − b_jk ≥ −1     (paths are transitive)
  ∀i: b_ii = 0                         (no cycles)

Non-local constraints and preference parameters can be added (Riedel & Clarke 06, Martins et al. 09).
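For concreteness, here is a minimal sketch of this ILP, assuming the open-source PuLP modelling library (any ILP solver would do); the function name, the weights-dictionary input, and the variable naming are illustrative, and the non-local extensions mentioned above are not included.

```python
import math
from itertools import product

import pulp  # assumption: the PuLP ILP modelling library is available

def parse_ilp(weights, n):
    """Arc-factored parsing as an ILP, following the formulation above.
    weights[(i, j, k)] = w^k_ij > 0 for an arc x_i -> x_j with label l_k;
    n = number of tokens including the artificial root x_0."""
    prob = pulp.LpProblem("arc_factored_parsing", pulp.LpMaximize)
    a = {(i, j, k): pulp.LpVariable(f"a_{i}_{j}_{k}", cat="Binary")
         for (i, j, k) in weights}
    b = {(i, j): pulp.LpVariable(f"b_{i}_{j}", cat="Binary")
         for i, j in product(range(n), repeat=2)}
    # Objective: total log weight of the selected arcs.
    prob += pulp.lpSum(var * math.log(weights[arc]) for arc, var in a.items())
    # Tree constraints: the root has no head; every other word has exactly one.
    prob += pulp.lpSum(var for (i, j, k), var in a.items() if j == 0) == 0
    for j in range(1, n):
        prob += pulp.lpSum(var for (i, jj, k), var in a.items() if jj == j) == 1
    # b_ij tracks paths: an arc implies a path, paths are transitive,
    # and no node may reach itself (which forbids cycles).
    for (i, j, k), var in a.items():
        prob += b[(i, j)] - var >= 0
    for i, j, m in product(range(n), repeat=3):
        prob += 2 * b[(i, m)] - b[(i, j)] - b[(j, m)] >= -1
    for i in range(n):
        prob += b[(i, i)] == 0
    prob.solve()
    return [arc for arc, var in a.items() if var.value() > 0.5]
```

With only the hard arc-factored constraints this is of course slower than running CLE or Eisner directly; the point of the ILP route is that non-local constraints and preferences can be added as extra rows.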

Dynamic Programming / Chart-based Methods. Question: are there efficient non-projective chart-parsing algorithms for unrestricted trees? Most likely not: otherwise we could just augment them to get tractable non-local non-projective models, contradicting the hardness results. Gómez-Rodríguez et al. 09, Kuhlmann 09: for well-nested dependency trees of gap degree 1 there are O(n^7) deductive/chart-parsing algorithms, and Kuhlmann & Nivre show this class accounts for well over 99% of treebank trees. Chart parsing == easy to extend beyond arc-factored assumptions.

What is next? Getting back to grammars? Non-projective unsupervised parsing? Efficiency?

Getting Back to Grammars. Almost all graph-based research has been grammar-less: all possible structures are permissible, and we just learn to discriminate good from bad. This is unlike SOTA phrase-based methods, which all explicitly use a (derived) grammar.

Getting Back to Grammars.
Projective == context-free (CF) dependency grammars: Gaifman (65), Eisner & Blatz (07), Johnson (07).
Mildly context-sensitive dependency grammars: restricted chart parsing for well-nested, gap-degree-1 trees; Bodirsky et al. (05) capture LTAG derivations.
ILP == Constraint Dependency Grammars (Maruyama 1990): CDG constraints can be added to the ILP (hard or soft); both just put constraints on the output; the annealing algorithms correspond to repair algorithms in CDGs.
Questions: 1. Can we flesh out these connections further? 2. Can we use grammars to improve accuracy and parsing speeds?

Non-projective Unsupervised Parsing. The dependency model without valence (arity) is tractable non-projectively (McDonald and Satta 07), but this is no longer true with valence. Valence and richer features are required for good performance (Klein & Manning 04, Smith 06, Headden et al. 09), and those systems are all projective. Non-projective unsupervised systems?

Efficiency / Resources (Swedish).

System                                     | Complexity      | LAS  | Parse time | Model size | # features
Malt (joint)                               | O(n L)          | 84.6 | -          | -          | -
MST (pipeline)                             | O(n^3 + n L)    | 82.0 | 1.00       | 88 Mb      | 16 M
MST (joint)                                | O(n^3 L^2)      | 83.9 | ~125.00    | 200 Mb     | 30 M
MST (joint, feature hashing)               | O(n^3 L^2)      | 84.3 | ~30.00     | 11 Mb      | 30 M
MST (joint, feature hashing, coarse-to-fine) | O(n^2 k L^2)  | 84.1 | 4.50       | 15 Mb      | 30 M

Pretty good, but still not there! -- A*? More pruning?

Summary.
Where we've been: arc-factored parsing (Eisner / MST); beyond arc-factored is NP-hard; approximations; ILP; chart parsing on a defined subset of structures.
What's next: the return of grammars? Non-projective unsupervised parsing. Making models practical on web-scale data.