Structured Prediction Models via the Matrix-Tree Theorem


Structured Prediction Models via the Matrix-Tree Theorem
Terry Koo (maestro@csail.mit.edu), Amir Globerson (gamir@csail.mit.edu), Xavier Carreras (carreras@csail.mit.edu), Michael Collins (mcollins@csail.mit.edu)
MIT Computer Science and Artificial Intelligence Laboratory

Dependency parsing. Example: * John saw Mary. Syntactic structure is represented by head-modifier dependencies.

Projective vs. non-projective structures. Example: * John saw a movie today that he liked. Non-projective structures allow crossing dependencies, which are frequent in languages like Czech, Dutch, etc. Non-projective parsing is maximum-spanning-tree parsing (McDonald et al., 2005).

Contributions of this work. Fundamental inference algorithms that sum over possible structures:

Model type                  Inference algorithm
HMM                         Forward-Backward
Graphical model             Belief Propagation
PCFG                        Inside-Outside
Projective dep. trees       Inside-Outside
Non-projective dep. trees   ?? (this talk)

This talk: inside-outside-style algorithms for non-projective dependency structures. An application: training log-linear and max-margin parsers. Independently-developed work: Smith and Smith (2007), McDonald and Satta (2007).

Overview Background Matrix-Tree-based inference Experiments

Edge-factored structured prediction. Example sentence with token indices: *(0) John(1) saw(2) Mary(3). A dependency tree y is a set of head-modifier dependencies (McDonald et al., 2005; Eisner, 1996); (h, m) is a dependency with feature vector f(x, h, m); Y(x) is the set of all possible trees for sentence x.

y^* = \arg\max_{y \in Y(x)} \sum_{(h,m) \in y} w \cdot f(x, h, m)
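As an illustration of the edge-factored setup, here is a minimal Python sketch assuming a hypothetical feature function feat(x, h, m) and weight vector w (neither is specified on the slides): it computes edge scores theta[h][m] = w · f(x, h, m) and scores a candidate tree as the sum of its edge scores.

```python
import numpy as np

def edge_scores(x, w, feat):
    """theta[h, m] = w . f(x, h, m) for heads h in 0..n and modifiers m in 1..n.

    Words are indexed 1..n; index 0 is the root symbol *.
    `feat` is a hypothetical stand-in for the paper's feature function."""
    n = len(x)
    theta = np.full((n + 1, n + 1), -np.inf)
    for h in range(n + 1):
        for m in range(1, n + 1):
            if h != m:
                theta[h, m] = w @ feat(x, h, m)
    return theta

def tree_score(tree, theta):
    """Score of a dependency tree, given as a set of (head, modifier) pairs."""
    return sum(theta[h, m] for (h, m) in tree)

# Toy usage with random stand-in features, just to exercise the functions.
rng = np.random.default_rng(0)
x = ["John", "saw", "Mary"]
w = rng.normal(size=5)
feat = lambda x, h, m: rng.normal(size=5)           # stand-in feature function
theta = edge_scores(x, w, feat)
print(tree_score({(0, 2), (2, 1), (2, 3)}, theta))  # * -> saw, saw -> John, saw -> Mary
```

Decoding the argmax over non-projective trees is then a maximum-spanning-tree problem, as noted earlier (McDonald et al., 2005).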

Training log-linear dependency parsers. Given a training set \{(x_i, y_i)\}_{i=1}^{N}, minimize

L(w) = \frac{C}{2} \|w\|^2 - \sum_{i=1}^{N} \log P(y_i \mid x_i; w)

Training log-linear dependency parsers. Given a training set \{(x_i, y_i)\}_{i=1}^{N}, minimize

L(w) = \frac{C}{2} \|w\|^2 - \sum_{i=1}^{N} \log P(y_i \mid x_i; w)

Log-linear distribution over trees:

P(y \mid x; w) = \frac{1}{Z(x; w)} \exp\Big\{ \sum_{(h,m) \in y} w \cdot f(x, h, m) \Big\}

Z(x; w) = \sum_{y' \in Y(x)} \exp\Big\{ \sum_{(h,m) \in y'} w \cdot f(x, h, m) \Big\}
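To make these quantities concrete, here is a brute-force sketch under assumed toy scores (not the paper's implementation): it enumerates all single-root non-projective trees of a short sentence and computes Z(x; w) and P(y | x; w) directly from edge scores theta[h][m] = w · f(x, h, m). The enumeration is exponential in sentence length, which is exactly what the matrix-tree computation later in the talk avoids.

```python
import itertools
import math

def all_single_root_trees(n):
    """Yield dependency trees over words 1..n with exactly one edge from root 0.

    Each tree is a set of (head, modifier) pairs."""
    for heads in itertools.product(range(n + 1), repeat=n):  # heads[m-1] = head of word m
        if sum(h == 0 for h in heads) != 1:
            continue                      # single-root: exactly one word attaches to *
        ok = True
        for m in range(1, n + 1):         # acyclicity: every word must reach the root
            seen, cur = set(), m
            while cur != 0:
                if cur in seen:
                    ok = False
                    break
                seen.add(cur)
                cur = heads[cur - 1]
            if not ok:
                break
        if ok:
            yield {(heads[m - 1], m) for m in range(1, n + 1)}

def partition_function(theta, n):
    """Z = sum over trees of exp(total edge score)."""
    return sum(math.exp(sum(theta[h][m] for (h, m) in tree))
               for tree in all_single_root_trees(n))

# Toy scores for "* John saw Mary" (n = 3); the numbers are arbitrary.
n = 3
theta = [[0.0] * (n + 1) for _ in range(n + 1)]
theta[0][2], theta[2][1], theta[2][3] = 1.5, 0.7, 0.9
Z = partition_function(theta, n)
gold = {(0, 2), (2, 1), (2, 3)}
print(Z, math.exp(sum(theta[h][m] for (h, m) in gold)) / Z)   # Z and P(gold | x)
```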

Training log-linear dependency parsers. Gradient-based optimizers evaluate L(w) and \partial L / \partial w.

L(w) = \frac{C}{2} \|w\|^2 - \sum_{i=1}^{N} \sum_{(h,m) \in y_i} w \cdot f(x_i, h, m) + \sum_{i=1}^{N} \log Z(x_i; w)

Main difficulty: computation of the partition functions Z(x_i; w).

Training log-linear dependency parsers. Gradient-based optimizers evaluate L(w) and \partial L / \partial w.

\frac{\partial L}{\partial w} = C w - \sum_{i=1}^{N} \sum_{(h,m) \in y_i} f(x_i, h, m) + \sum_{i=1}^{N} \sum_{h',m'} P(h' \rightarrow m' \mid x_i; w) \, f(x_i, h', m')

The marginals are edge-appearance probabilities:

P(h \rightarrow m \mid x; w) = \sum_{y \in Y(x) : (h,m) \in y} P(y \mid x; w)
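A minimal sketch of the gradient formula above, assuming the edge marginals P(h → m | x_i; w) have already been computed (e.g., by the matrix-tree inference described later) and using a hypothetical feature function feat:

```python
import numpy as np

def loglinear_gradient(w, data, marginals, feat, C):
    """Gradient of L(w) = (C/2)||w||^2 - sum_i log P(y_i | x_i; w).

    data:      list of (x_i, y_i) pairs, with y_i a set of (head, modifier) edges
    marginals: marginals[i][h][m] = P(h -> m | x_i; w), assumed precomputed
    feat:      hypothetical feature function returning a numpy vector"""
    grad = C * w
    for i, (x, y) in enumerate(data):
        n = len(x)
        for (h, m) in y:                       # empirical (gold-tree) feature counts
            grad -= feat(x, h, m)
        for h in range(n + 1):                 # expected feature counts under the model
            for m in range(1, n + 1):
                if h != m:
                    grad += marginals[i][h][m] * feat(x, h, m)
    return grad
```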

Generalized log-linear inference. Vector θ with a parameter θ_{h,m} for each dependency:

P(y \mid x; \theta) = \frac{1}{Z(x; \theta)} \exp\Big\{ \sum_{(h,m) \in y} \theta_{h,m} \Big\}

Z(x; \theta) = \sum_{y \in Y(x)} \exp\Big\{ \sum_{(h,m) \in y} \theta_{h,m} \Big\}

P(h \rightarrow m \mid x; \theta) = \frac{1}{Z(x; \theta)} \sum_{y \in Y(x) : (h,m) \in y} \exp\Big\{ \sum_{(h',m') \in y} \theta_{h',m'} \Big\}

E.g., \theta_{h,m} = w \cdot f(x, h, m).

Applications of log-linear inference. A generalized inference engine takes θ as input; different definitions of θ can be used for log-linear or max-margin training.

w_{LL} = \arg\min_w \Big[ \frac{C}{2} \|w\|^2 - \sum_{i=1}^{N} \log P(y_i \mid x_i; w) \Big]

w_{MM} = \arg\min_w \Big[ \frac{C}{2} \|w\|^2 + \sum_{i=1}^{N} \max_y \big( E_{i,y} - m_{i,y}(w) \big) \Big]

Exponentiated-gradient updates for max-margin models: Bartlett, Collins, Taskar and McAllester (2004); Globerson, Koo, Carreras and Collins (2007).

Overview Background Matrix-Tree-based inference Experiments

Single-root vs. multi-root structures (two analyses of "* John saw Mary"). Multi-root structures allow multiple edges from *; single-root structures have exactly one edge from *. Independent adaptations of the Matrix-Tree Theorem: Smith and Smith (2007), McDonald and Satta (2007).

Matrix-Tree Theorem (Tutte, 1984). Given: 1. a directed graph G, 2. edge weights θ, 3. a node r in G, a matrix L^{(r)} can be constructed whose determinant is the sum over weighted spanning trees of G rooted at r. In the example graph on the slide (a 3-node graph with edge weights 1, 2, 3, 4), the two spanning trees rooted at node 1 give

\det(L^{(1)}) = \exp\{2 + 4\} + \exp\{1 + 3\}
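For reference, here is a standard statement of the weighted, directed Matrix-Tree Theorem in LaTeX; this is the textbook form of the result, added because the slide conveys it only through the figure (the Laplacian below is not text recovered from the slide):

```latex
% Weighted, directed Matrix-Tree Theorem (Tutte, 1984).
% G has nodes 1..n and weights w_{h,m} > 0 on directed edges h -> m. Define
\[
  L_{m,m} = \sum_{h \neq m} w_{h,m}, \qquad
  L_{h,m} = -\,w_{h,m} \quad (h \neq m).
\]
% Let L^{(r)} be L with row r and column r deleted. Then
\[
  \det\!\big(L^{(r)}\big)
    = \sum_{T \in \mathcal{T}_r(G)} \; \prod_{(h,m) \in T} w_{h,m},
\]
% where \mathcal{T}_r(G) is the set of spanning arborescences of G rooted at r.
% With w_{h,m} = \exp\{\theta_{h,m}\}, each product is \exp of a tree's total score,
% which is exactly the kind of sum needed for the partition function Z(x; \theta).
```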


Multi-root partition function. Example: *(0) John(1) saw(2) Mary(3). With edge weights θ and root r = 0,

\det(L^{(0)}) = the non-projective multi-root partition function.

Construction of L^{(0)}. L^{(0)} has a simple construction:

off-diagonal: L^{(0)}_{h,m} = -\exp\{\theta_{h,m}\}

on-diagonal: L^{(0)}_{m,m} = \sum_{h'=0,\, h' \neq m}^{n} \exp\{\theta_{h',m}\}

E.g., L^{(0)}_{3,3} sums the exponentiated weights of all edges entering Mary(3) in "* John saw Mary". The determinant of L^{(0)} can be evaluated in O(n^3) time.
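A small numpy sketch with assumed toy scores (not the authors' code): it builds L^{(0)} exactly as above and checks det(L^{(0)}) against a brute-force sum over all multi-root non-projective trees of a 3-word sentence.

```python
import itertools
import math
import numpy as np

def build_L0(theta):
    """Multi-root Laplacian L0[h-1, m-1] over heads/modifiers h, m in 1..n.

    theta[h][m] is the score of dependency h -> m, with h = 0 the root symbol *."""
    n = len(theta) - 1
    L0 = np.zeros((n, n))
    for m in range(1, n + 1):
        for h in range(1, n + 1):
            if h != m:
                L0[h - 1, m - 1] = -math.exp(theta[h][m])            # off-diagonal
        L0[m - 1, m - 1] = sum(math.exp(theta[h][m])                 # on-diagonal
                               for h in range(0, n + 1) if h != m)
    return L0

def brute_force_multiroot_Z(theta):
    """Sum of exp(tree score) over all multi-root non-projective trees."""
    n = len(theta) - 1
    Z = 0.0
    for heads in itertools.product(range(n + 1), repeat=n):   # heads[m-1] = head of m
        ok = True
        for m in range(1, n + 1):       # acyclicity: every word must reach the root
            seen, cur = set(), m
            while cur != 0:
                if cur in seen:
                    ok = False
                    break
                seen.add(cur)
                cur = heads[cur - 1]
            if not ok:
                break
        if ok:
            Z += math.exp(sum(theta[heads[m - 1]][m] for m in range(1, n + 1)))
    return Z

# Toy scores for "* John saw Mary" (n = 3); values are arbitrary.
n = 3
theta = np.random.default_rng(0).normal(size=(n + 1, n + 1)).tolist()
print(np.linalg.det(build_L0(theta)))   # matrix-tree partition function, O(n^3)
print(brute_force_multiroot_Z(theta))   # should agree up to floating point
```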


Single-root partition function. Naïve method: for each m = 1, ..., n, exclude all root edges except (0, m) and compute the resulting determinant; summing the n determinants gives the single-root non-projective partition function, but computing n determinants requires O(n^4) time.

Single-root partition function. An alternate matrix \hat{L} can be constructed such that \det(\hat{L}) is the single-root partition function:

first row: \hat{L}_{1,m} = \exp\{\theta_{0,m}\}

other rows, on-diagonal: \hat{L}_{m,m} = \sum_{h'=1,\, h' \neq m}^{n} \exp\{\theta_{h',m}\}

other rows, off-diagonal: \hat{L}_{h,m} = -\exp\{\theta_{h,m}\}

The single-root partition function then requires only O(n^3) time.
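A companion sketch, again with assumed toy scores: it builds \hat{L} following the construction above and checks det(\hat{L}) against a brute-force sum over single-root trees.

```python
import itertools
import math
import numpy as np

def build_Lhat(theta):
    """Single-root matrix: row 1 holds the root-edge weights exp(theta[0][m]);
    the other rows are the Laplacian restricted to heads 1..n."""
    n = len(theta) - 1
    Lhat = np.zeros((n, n))
    for m in range(1, n + 1):
        Lhat[0, m - 1] = math.exp(theta[0][m])                  # first row
        if m > 1:
            Lhat[m - 1, m - 1] = sum(math.exp(theta[h][m])      # other rows, on-diagonal
                                     for h in range(1, n + 1) if h != m)
    for h in range(2, n + 1):                                   # other rows, off-diagonal
        for m in range(1, n + 1):
            if h != m:
                Lhat[h - 1, m - 1] = -math.exp(theta[h][m])
    return Lhat

def brute_force_singleroot_Z(theta):
    """Sum of exp(tree score) over trees with exactly one edge from the root."""
    n = len(theta) - 1
    Z = 0.0
    for heads in itertools.product(range(n + 1), repeat=n):
        if sum(h == 0 for h in heads) != 1:
            continue
        ok = True
        for m in range(1, n + 1):
            seen, cur = set(), m
            while cur != 0:
                if cur in seen:
                    ok = False
                    break
                seen.add(cur)
                cur = heads[cur - 1]
            if not ok:
                break
        if ok:
            Z += math.exp(sum(theta[heads[m - 1]][m] for m in range(1, n + 1)))
    return Z

n = 3
theta = np.random.default_rng(1).normal(size=(n + 1, n + 1)).tolist()
print(np.linalg.det(build_Lhat(theta)))  # single-root partition function, O(n^3)
print(brute_force_singleroot_Z(theta))   # should agree up to floating point
```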

Non-projective marginals. The log-partition function generates the marginals via the chain rule:

P(h \rightarrow m \mid x; \theta) = \frac{\partial \log Z(x; \theta)}{\partial \theta_{h,m}} = \sum_{h',m'} \frac{\partial \log\det(\hat{L})}{\partial \hat{L}_{h',m'}} \cdot \frac{\partial \hat{L}_{h',m'}}{\partial \theta_{h,m}}

Derivative of the log-determinant:

\frac{\partial \log\det(\hat{L})}{\partial \hat{L}} = (\hat{L}^{-1})^{T}

Complexity is dominated by the matrix inverse, O(n^3).
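A sketch of the marginal computation under the same toy setup. Because \hat{L} is linear in the exponentiated weights exp{θ_{h,m}}, the inner derivative ∂\hat{L}/∂θ_{h,m} can be obtained by running the construction on an indicator weight matrix; that linearity trick is just an implementation convenience of this sketch, not something stated on the slide, but the result is a direct transcription of the chain rule above.

```python
import math
import numpy as np

def build_Lhat(theta):
    """Single-root matrix from the previous sketch (repeated here to stay self-contained)."""
    n = len(theta) - 1
    Lhat = np.zeros((n, n))
    for m in range(1, n + 1):
        Lhat[0, m - 1] = math.exp(theta[0][m])
        if m > 1:
            Lhat[m - 1, m - 1] = sum(math.exp(theta[h][m])
                                     for h in range(1, n + 1) if h != m)
    for h in range(2, n + 1):
        for m in range(1, n + 1):
            if h != m:
                Lhat[h - 1, m - 1] = -math.exp(theta[h][m])
    return Lhat

def edge_marginal(theta, h, m):
    """P(h -> m | x; theta) = d log Z / d theta_{h,m}, via the chain rule.

    Every entry of Lhat is a signed sum of terms exp(theta[h'][m']), so
    d Lhat / d exp(theta[h][m]) is the construction applied to an indicator
    weight matrix, and d exp(theta[h][m]) / d theta[h][m] = exp(theta[h][m])."""
    n = len(theta) - 1
    grad_wrt_Lhat = np.linalg.inv(build_Lhat(theta)).T          # (Lhat^{-1})^T
    indicator = [[-math.inf] * (n + 1) for _ in range(n + 1)]   # exp(-inf) = 0
    indicator[h][m] = 0.0                                       # exp(0)    = 1
    dLhat_dw = build_Lhat(indicator)
    return math.exp(theta[h][m]) * float(np.sum(grad_wrt_Lhat * dLhat_dw))

# Sanity check: the marginals over all candidate edges sum to n,
# since every word has exactly one head in any tree.
n = 3
theta = np.random.default_rng(1).normal(size=(n + 1, n + 1)).tolist()
print(sum(edge_marginal(theta, h, m)
          for h in range(n + 1) for m in range(1, n + 1) if h != m))   # ~3.0
```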

Summary of non-projective inference. Partition function: matrix determinant, O(n^3). Marginals: matrix inverse, O(n^3). Single-root inference uses \hat{L}; multi-root inference uses L^{(0)}.

Overview Background Matrix-Tree-based inference Experiments

Log-linear and max-margin training.

Log-linear training:

w_{LL} = \arg\min_w \Big[ \frac{C}{2} \|w\|^2 - \sum_{i=1}^{N} \log P(y_i \mid x_i; w) \Big]

Max-margin training:

w_{MM} = \arg\min_w \Big[ \frac{C}{2} \|w\|^2 + \sum_{i=1}^{N} \max_y \big( E_{i,y} - m_{i,y}(w) \big) \Big]

Multilingual parsing experiments Six languages from CoNLL 2006 shared task Training algorithms: averaged perceptron, log-linear models, max-margin models Projective models vs. non-projective models Single-root models vs. multi-root models

Multilingual parsing experiments: Dutch (4.93% crossing dependencies)

              Projective Training   Non-Projective Training
Perceptron    77.17                 78.83
Log-Linear    76.23                 79.55
Max-Margin    76.53                 79.69

Non-projective training helps on non-projective languages.

Multilingual parsing experiments: Spanish (0.06% crossing dependencies)

              Projective Training   Non-Projective Training
Perceptron    81.19                 80.02
Log-Linear    81.75                 81.57
Max-Margin    81.71                 81.93

Non-projective training doesn't hurt on projective languages.

Multilingual parsing experiments: results across all 6 languages (Arabic, Dutch, Japanese, Slovene, Spanish, Turkish)

Perceptron    79.05
Log-Linear    79.71
Max-Margin    79.82

Log-linear and max-margin parsers show improvement over perceptron-trained parsers; the improvements are statistically significant (sign test).

Summary. Inside-outside-style inference algorithms for non-projective structures: an application of the Matrix-Tree Theorem, with inference for both multi-root and single-root structures. Empirical results: non-projective training is good for non-projective languages, and log-linear and max-margin parsers outperform perceptron parsers.

Thanks for listening!

Challenges for future research. State-of-the-art performance is obtained by higher-order models (McDonald and Pereira, 2006; Carreras, 2007), but higher-order non-projective inference is nontrivial (McDonald and Pereira, 2006; McDonald and Satta, 2007). Approximate inference may work well in practice, e.g., reranking of k-best spanning trees (Hall, 2007).