Spectral Unsupervised Parsing with Additive Tree Metrics

Spectral Unsupervised Parsing with Additive Tree Metrics. Ankur Parikh, Shay Cohen, Eric P. Xing. Carnegie Mellon University, University of Edinburgh.

Overview
Model: We present a novel approach to unsupervised parsing via latent tree structure learning.
Algorithm: Unlike existing methods, our algorithm is free of local optima and has theoretical guarantees of statistical consistency.
Key ideas: additive tree metrics from phylogenetics; spectral decomposition of the cross-covariance matrix of word embeddings; kernel smoothing.
Empirical: Our method performs favorably compared to the Constituent Context Model [Klein and Manning 2002].

Outline: Motivation; Intuition and Model; Learning Algorithm; Experimental Results.

Supervised Parsing
Training set: given sentences with parse trees.
Test set: find the parse tree for each sentence, e.g. NN ADV VB NN CONJ NN / Lions quickly chase deer and antelope.

Supervised Parsing
Modeling: assume the tag sequence is generated by a set of rules:
P(tree) = P(S -> NP VP) * P(NP -> DT NN | NP) * P(VP -> VB NN | VP)
Learning: rule probabilities are easy to estimate directly from the training data. This is the foundation of modern supervised parsing systems.
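To make the supervised learning step concrete, here is a minimal sketch (not from the talk) of maximum-likelihood rule estimation from a treebank; the input format and function name are hypothetical.

```python
from collections import Counter

def estimate_pcfg(treebank):
    """MLE rule probabilities: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs).

    Each tree is given as a list of rules, e.g. ("S", ("NP", "VP")).
    This input format is an illustrative assumption, not the talk's notation.
    """
    rule_counts, lhs_counts = Counter(), Counter()
    for tree in treebank:
        for lhs, rhs in tree:
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}

# Two toy trees written as rule lists
treebank = [
    [("S", ("NP", "VP")), ("NP", ("DT", "NN")), ("VP", ("VB", "NN"))],
    [("S", ("NP", "VP")), ("NP", ("NN",)), ("VP", ("VB", "NP")), ("NP", ("DT", "NN"))],
]
probs = estimate_pcfg(treebank)
print(probs[("S", ("NP", "VP"))])   # 1.0
print(probs[("NP", ("DT", "NN"))])  # 2/3 ~= 0.667
```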

Annotated Training Data Is Difficult to Obtain
Annotating parse structure requires domain expertise and is not easily crowdsourced. But sentences (and part-of-speech tags) are abundant!

Unsupervised Parsing
Training set: given only sentences and their part-of-speech tags, e.g. DT NN VB NN / The bear likes fish, and DT NN VB DT NN / The llama eats the grass.
Test set: find an (unlabeled) parse tree for each sentence, e.g. NN ADV VB NN CONJ NN / Lions quickly chase deer and antelope.
The parse tree structure is now a latent variable.

Unsupervised Parsing Is Much Harder
One attempt is to apply the context-free grammar strategy [Carroll and Charniak 1992, Pereira and Schabes 1992].
Modeling: some unknown set of rules generates the tree.
Learning: find the set of rules R and parameters θ that maximize the data likelihood.

Unsupervised Parsing Is Much Harder
Unsupervised PCFGs perform abysmally, worse than trivial baselines such as right-branching trees. Why?
Modeling: the solution that optimizes the likelihood is not unique (non-identifiability) [Hsu et al. 2013].
Learning: the likelihood function is highly non-convex and the search space contains severe local optima.

Existing Approaches
Other strategies outperform PCFGs but face similar challenges: the objectives are still NP-hard [Cohen & Smith 2012], and severe local optima mean accuracy can vary by 40 percentage points between random restarts.
Complicated techniques are needed to achieve good results: model/feature engineering [Klein & Manning 2002, Cohen & Smith 2009, Gillenwater et al. 2010], careful initialization [Klein & Manning 2002, Spitkovsky et al. 2010], and count transforms [Spitkovsky et al. 2013]. These generally lack theoretical justification and their effectiveness can vary across languages.

Existing Approaches
Spectral techniques have led to theoretical insights for unsupervised parsing: a restriction of the PCFG model [Hsu et al. 2013] and weighted matrix completion [Bailly et al. 2013]. But these algorithms were not designed for good empirical performance. Our goal is to take a first step toward bridging this theory-experiment gap.

Our Approach
Formulate a new model in which unsupervised parsing corresponds to a latent tree structure learning problem. Derive a local-optima-free learning algorithm with theoretical guarantees of statistical consistency. This is part of a broader research theme of exploiting linear algebra for probabilistic modeling.

Outline: Motivation; Intuition and Model; Learning Algorithm; Experimental Results.

Intuition
Consider the following part-of-speech tag sequence: VBD DT NN (verb, article, noun). There are two possible binary (unlabeled) parses.

Intuition
Consider sentences with this tag sequence (VBD DT NN): ate an apple, baked a cake, hit the ball, ran the race. Can we uncover the parse structure based on these sentences?

Intuition
The article (DT) and the noun (NN) are dependent: "an" means the noun is singular and starts with a vowel; "a" means the noun is singular and starts with a consonant; "the" means the noun could be anything. The verb (VBD) and the article (DT) are not very dependent: the choice of article does not depend on the choice of verb. (Example sentences: ate an apple, baked a cake, hit the balls, ran the race.)

Intuition
The article (DT) and the noun (NN) are more dependent than the verb (VBD) and the article (DT).

Latent Variable Intuition
Introduce a latent variable z (capturing plurality / whether the noun starts with a vowel) with children w_2 and w_3. Given z and the part-of-speech tags x, the article and noun become conditionally independent:
P(w_2, w_3 | z, x) = P(w_2 | z, x) P(w_3 | z, x)
(Example sentences w_1 w_2 w_3: ate an apple, baked a cake, hit the balls, ran the race.)
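A tiny numeric illustration (with invented distributions) of the factorization on this slide: conditioned on z the two words are independent, yet marginally they remain dependent.

```python
import numpy as np

# Invented toy distributions: z, w2, w3 each take two values
p_z = np.array([0.5, 0.5])
p_w2_given_z = np.array([[0.9, 0.1],   # rows indexed by z, columns by w2
                         [0.2, 0.8]])
p_w3_given_z = np.array([[0.8, 0.2],   # rows indexed by z, columns by w3
                         [0.3, 0.7]])

# Conditional independence: P(w2, w3 | z) = P(w2 | z) * P(w3 | z)
p_w23_given_z = p_w2_given_z[:, :, None] * p_w3_given_z[:, None, :]

# Marginalizing out z leaves w2 and w3 dependent
p_w23 = (p_z[:, None, None] * p_w23_given_z).sum(axis=0)
p_w2, p_w3 = p_w23.sum(axis=1), p_w23.sum(axis=0)
print(np.allclose(p_w23, np.outer(p_w2, p_w3)))  # False: dependence remains once z is hidden
```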

Latent Variable Intuition
Now add a second latent variable z_1 (a verb/noun semantic class) whose children are w_1 and z_2, where z_2 (plurality plus noun topic) is the parent of w_2 and w_3. The resulting structure over w_1, w_2, w_3 looks a lot like a constituent parse tree!
(Example sentences w_1 w_2 w_3: ate an apple, baked a cake, hit the balls, ran the race.)

Our Conditional Latent Tree Model
Each tag sequence x, e.g. x = (DT, NN, VBD, DT, NN), is associated with a latent tree over hidden nodes z_1, ..., z_H and observed words w_1, ..., w_{l(x)}:
p(w, z | x) = ∏_{i=1}^{H} p(z_i | π_x(z_i)) * ∏_{i=1}^{l(x)} p(w_i | π_x(w_i))
where π_x(·) denotes a node's parent in the tree for x. (Example: z_1, z_2, z_3 over w_1 ... w_5 = The bear ate the fish.)
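As a sketch of what this factorization buys: once the tree for a tag sequence is fixed, the log-likelihood is simply a sum over tree edges. The parent map and CPT layout below are hypothetical stand-ins for the model's actual parameterization.

```python
import math

def log_prob(values, parent, cpt):
    """Log p(w, z | x) for a fixed latent tree.

    values: node name -> observed word or latent state value
    parent: node name -> parent node name (None for the root)
    cpt[(node, parent_value)][value]: conditional probability
    Names and the CPT layout are hypothetical, for illustration only.
    """
    logp = 0.0
    for node, val in values.items():
        par = parent[node]
        par_val = values[par] if par is not None else None
        logp += math.log(cpt[(node, par_val)][val])
    return logp
```

The key design point, echoed on the final slide, is that there is no grammar to search over: the structure for a given tag sequence is fixed, so evaluation decomposes edge by edge.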

Different Tag Sequences Have Different Trees
x_1 = (DT, NN, VBD, DT, NN): a latent tree with hidden nodes z_1, z_2, z_3 over w_1 ... w_5 (e.g. The bear ate the fish, A moose ran the race).
x_2 = (DT, NN, VBD, DT, ADJ, NN): a different latent tree with an extra hidden node z_4 over w_1 ... w_6 (e.g. The bear ate the big fish, The moose ran the tiring race).

Mapping the Latent Tree to a Parse Tree
The latent tree is undirected. We direct it by choosing a split point; the result is an (unlabeled) parse tree. (Diagram: the undirected tree over z_1, z_2, z_3 and w_1 ... w_5 becomes a rooted tree.)

Model Summary
Each tag sequence x is associated with a latent tree u(x); u(x) generates the sentences with these tags; and u(x) can be deterministically mapped to a parse tree given a split point. (Example: x = (DT, NN, VBD, DT, NN), with sentences The bear ate the fish and A moose ran the race.)

Outline: Motivation; Intuition and Model; Learning Algorithm; Experimental Results.

A Structure Learning Problem
The goal is to learn the most likely undirected latent tree u(x) for each tag sequence x, given the sentences with that tag sequence (e.g. DT NN VB DT NN: The llama eats the grass, A bug likes the flower, An orca chases the fish). Assume for now that there are many sentences for each x (the paper handles sparse tag sequences using kernel smoothing).

Observed Case: Chow-Liu Algorithm
If all variables were observed, we could compute the distance matrix d(w_i, w_j) between variables and find a minimum spanning tree. This is provably optimal.
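A brief sketch of the fully observed case using scipy's minimum spanning tree routine; the distance matrix below is made up.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

# Made-up pairwise distance matrix d(w_i, w_j) for four observed variables
D = np.array([
    [0.0, 1.0, 2.5, 3.0],
    [1.0, 0.0, 1.8, 2.2],
    [2.5, 1.8, 0.0, 1.1],
    [3.0, 2.2, 1.1, 0.0],
])

mst = minimum_spanning_tree(D)   # sparse matrix whose nonzeros are the chosen edges
edges = [(int(i), int(j)) for i, j in zip(*mst.nonzero())]
print(edges)                     # [(0, 1), (1, 2), (2, 3)]
```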

Latent Case
Not all distances can be computed from data: distances involving the latent nodes, e.g. d(w_2, z_1) or d(w_3, z_1), are unobserved. We need a distance function such that the observed distances can be used to recover the latent distances.

The Problem Traces Back to Phylogenetics
Existing species are like words; latent ancestors are like bracketing states. (Diagram: latent nodes z_1, z_2, z_3 over observed words w_1 ... w_5.)

Additive Tree Metrics [Buneman 1974]
d(i, j) = Σ_{(a,b) ∈ path(i,j)} d(a, b)
Example: d(w_1, w_3) = d(w_1, z_2) + d(z_1, z_2) + d(w_3, z_1). The left-hand side is computable from data; the terms on the right, which involve latent nodes, are not.
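A small sketch of the additive property on the tree from this slide, with invented edge lengths; the adjacency-list format is an illustrative choice.

```python
def tree_distance(adj, i, j):
    """Sum of edge lengths along the unique path from i to j in a tree.

    adj: dict mapping each node to a list of (neighbor, edge_length) pairs.
    The tree structure matches the slide; the edge lengths are invented.
    """
    stack = [(i, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == j:
            return dist
        for nbr, length in adj[node]:
            if nbr != parent:
                stack.append((nbr, node, dist + length))
    raise ValueError("nodes are not connected")

adj = {
    "w1": [("z2", 0.5)],  "w2": [("z2", 0.25)], "w3": [("z1", 0.5)],
    "w4": [("z3", 0.5)],  "w5": [("z3", 0.25)],
    "z2": [("w1", 0.5), ("w2", 0.25), ("z1", 0.75)],
    "z1": [("z2", 0.75), ("w3", 0.5), ("z3", 1.0)],
    "z3": [("z1", 1.0), ("w4", 0.5), ("w5", 0.25)],
}
# d(w1, w3) = d(w1, z2) + d(z2, z1) + d(z1, w3) = 0.5 + 0.75 + 0.5
print(tree_distance(adj, "w1", "w3"))  # 1.75
```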

Why Additive Metrics Are Useful
Given the tree structure, we can compute latent distances as a function of observed distances. For an edge e(i, j) between latent nodes z_i and z_j, with observed leaves a, b on z_i's side and g, h on z_j's side:
d(i, j) = 1/2 ( d(g, b) + d(h, a) - d(g, h) - d(a, b) )
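A quick numeric check of this identity with invented edge lengths: the hidden edge length is recovered from observable leaf-to-leaf distances alone.

```python
# Toy additive tree (numbers invented): edge (z_i, z_j) has true length 0.75.
# Leaves a, b attach to z_i; leaves g, h attach to z_j.
d_a_zi, d_b_zi = 0.5, 0.25
d_g_zj, d_h_zj = 0.5, 0.75
d_zi_zj = 0.75

# Leaf-to-leaf distances implied by additivity (these are the observable ones)
d_g_b = d_g_zj + d_zi_zj + d_b_zi   # 1.5
d_h_a = d_h_zj + d_zi_zj + d_a_zi   # 2.0
d_g_h = d_g_zj + d_h_zj             # 1.25
d_a_b = d_a_zi + d_b_zi             # 0.75

# Recover the hidden edge length from observed distances only
print(0.5 * (d_g_b + d_h_a - d_g_h - d_a_b))  # 0.75
```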

Find the Minimum Cost Tree
û = argmin_u Σ_{(i,j) ∈ E(u)} d(i, j)
This strategy recovers the correct tree [Rzhetsky and Nei, 1993]. The objective is NP-hard in general, but for the special case of projective parse trees we show that a tractable dynamic programming algorithm exists [Eisner and Satta 1999].

Spectral Additive Metric for Our Model
The following distance function is an additive tree metric for our model (adapted from Anandkumar et al. 2011):
d_x^spectral(i, j) = -log Λ_m( E[w_i w_j^T | x] ), where Λ_m(A) = ∏_{k=1}^{m} σ_k(A)
i.e. Λ_m(A) is the product of the top m singular values of A. Each w_i is represented by a p-dimensional word embedding.
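A rough sketch (simplified relative to the paper's full metric, which also involves per-position covariance terms) of how a spectral distance of this form could be estimated from data: form the empirical cross-covariance of the embeddings at positions i and j across sentences sharing the tag sequence, and take the negative log of the product of its top m singular values. The array shapes and toy data are assumptions.

```python
import numpy as np

def spectral_distance(Wi, Wj, m):
    """Sketch of a spectral distance: -log of the product of the top-m singular
    values of the empirical cross-covariance E[w_i w_j^T | x].

    Wi, Wj: (n_sentences, p) arrays of embeddings for word positions i and j
    across sentences sharing the tag sequence x. Simplified relative to the
    paper's full metric.
    """
    n = Wi.shape[0]
    cross_cov = Wi.T @ Wj / n                    # p x p cross-covariance estimate
    sigma = np.linalg.svd(cross_cov, compute_uv=False)
    return -np.sum(np.log(sigma[:m]))            # -log prod_{k<=m} sigma_k

# Toy usage with random embeddings (illustrative only)
rng = np.random.default_rng(0)
Wi = rng.normal(size=(200, 8))
Wj = 0.5 * Wi + 0.1 * rng.normal(size=(200, 8))  # correlated word positions
print(spectral_distance(Wi, Wj, m=4))
```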

Complete Algorithm Summary
(1) For each tag sequence x, estimate the distances d_x^spectral(i, j) for all word pairs w_i, w_j.
(2) Use dynamic programming to recover the minimum cost undirected latent tree.
(3) Transform it into a parse tree by directing it at the split point R.

Theoretical Guarantees
Our learning algorithm is statistically consistent: if sentences are generated according to our model, then as the number of sentences goes to infinity, û(x) = u(x) with high probability, for all x.

Outline: Motivation; Intuition and Model; Learning Algorithm; Experimental Results.

Experiments
The primary comparison is the Constituent Context Model (CCM) [Klein and Manning 2002]. We evaluate on three languages: English (Penn Treebank), German (NEGRA corpus), and Chinese (Chinese Treebank). We use a heuristic to find the split point R that directs our latent trees.

English Results [chart comparing Spectral, Spectral-Oracle, and CCM]

German Results [chart comparing Spectral, Spectral-Oracle, and CCM]

Chinese Results [chart comparing Spectral, Spectral-Oracle, and CCM]

Across Languages [results chart across the three languages]

CCM Random Restarts [chart; legend: Random, PCFG, Right branching]

Conclusion
We approach unsupervised parsing as a structure learning problem. This enables us to develop a local-optima-free learning algorithm with theoretical guarantees. This work is part of a broader research theme that aims to exploit linear algebra perspectives for probabilistic modeling.

Thanks!

Differences
Unsupervised PCFGs: trees are generated by probabilistically combining rules; the set of rules and rule probabilities (the grammar) must be learned from data; the problem is not only NP-hard but also severely non-identifiable.
Our model: there is no grammar; each tag sequence deterministically maps to a latent tree; the intuition is that word correlations can help us uncover the latent tree for each tag sequence; the model is identifiable and a provable learning algorithm exists.