Spectral Learning for Non-Deterministic Dependency Parsing


Spectral Learning for Non-Deterministic Dependency Parsing. Franco M. Luque (1), Ariadna Quattoni (2), Borja Balle (2), Xavier Carreras (2). (1) Universidad Nacional de Córdoba and CONICET; (2) Universitat Politècnica de Catalunya. WPLN, Montevideo 2012. November 9, 2012.

Non-local Phenomena in Dependency Structures
[Dependency diagram for the sentence "I travel from Argentina to Avignon".]
Higher-order models: sparsity issues; increased parsing complexity.
Hidden variable models: expensive parameter estimation (e.g. EM).
This work: dependency parsing with non-deterministic SHAGs; a fast spectral learning algorithm.

Outline: SHAGs and PNFAs, Spectral Learning, Experiments


Split Head-Automata Grammars (SHAG)
[Dependency diagram for the sentence "John saw a new movie today."]
SHAG: a popular context-free grammatical formalism whose derivations are dependency trees (Eisner & Satta, 1999).
Each symbol in the grammar has two automata (Left/Right) that generate the modifiers to each side of it.

Probabilistic Split Head-Automata Grammars
[Dependency diagram for "John saw a new movie today.", built up head by head.]
Pr[tree] = Pr[saw | ⋆, right]
         × Pr[John | saw, left]
         × Pr[movie, today, . | saw, right]
         × Pr[a, new | movie, left]
         × Pr[ε | movie, right] × Pr[ε | John, right] × ...

Probabilistic Dependency Parsing
In a probabilistic SHAG, dependency trees factor into head-modifier sequences:
Pr[tree] = ∏_{⟨h, d, x_{1:T}⟩ ∈ y} Pr[x_{1:T} | h, d]
In this work:
We model the dynamics of modifier sequences with hidden structure, using PNFAs.
We use spectral methods to induce the hidden structure.
It is a direct application of the method: each PNFA of the grammar is learned independently of the rest.

Probabilistic Non-deterministic Finite Automata (PNFA)
Example with X = {a, b} and two states q_0, q_1. [Automaton diagram: transitions labeled with symbols and probabilities; q_1 stops with probability 0.6.]
α_1 = [1.0, 0.0]ᵀ    α_∞ = [0.0, 0.6]ᵀ
A_a = [0.4 0.1; 0.2 0.1]    A_b = [0.1 0.1; 0.3 0.1]
Summing over the two paths that generate ab:
P(ab) = 0.4 · 0.3 · 0.6 + 0.2 · 0.1 · 0.6 = 0.084
Equivalently, in operator form:
P(ab) = α_∞ᵀ A_b A_a α_1 = α_∞ᵀ A_b A_a [1.0, 0.0]ᵀ = α_∞ᵀ A_b [0.4, 0.2]ᵀ = 0.084
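
As a sanity check of the operator-form computation, here is a minimal numpy sketch using exactly the vectors and matrices of this example:

import numpy as np

# Example PNFA from the slide
alpha_1   = np.array([1.0, 0.0])            # initial vector
alpha_inf = np.array([0.0, 0.6])            # final (stopping) vector
A = {"a": np.array([[0.4, 0.1],
                    [0.2, 0.1]]),
     "b": np.array([[0.1, 0.1],
                    [0.3, 0.1]])}

def prob(word):
    """P(x_1..x_T) = alpha_inf^T A_{x_T} ... A_{x_1} alpha_1."""
    v = alpha_1
    for sym in word:                        # apply operators left to right on the forward state
        v = A[sym] @ v
    return float(alpha_inf @ v)

print(prob("ab"))                           # 0.084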

Operator Models
X: an alphabet of symbols.
An operator model A with n states is a tuple ⟨α_1, α_∞, {A_a}_{a ∈ X}⟩ where α_1, α_∞ ∈ Rⁿ are vectors and each A_a ∈ Rⁿˣⁿ is an operator matrix.
A computes a probability distribution over strings in X* as follows:
P(x_{1:T}) = α_∞ᵀ A_{x_T} ··· A_{x_2} A_{x_1} α_1
Change of basis: for any invertible M, the model B = ⟨M⁻¹ α_1, Mᵀ α_∞, {M⁻¹ A_a M}_{a ∈ X}⟩ satisfies P_B = P_A.
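
A small numpy sketch of the change-of-basis invariance on the same example; the matrix M below is an arbitrary invertible matrix chosen only for illustration:

import numpy as np

alpha_1   = np.array([1.0, 0.0])
alpha_inf = np.array([0.0, 0.6])
A = {"a": np.array([[0.4, 0.1], [0.2, 0.1]]),
     "b": np.array([[0.1, 0.1], [0.3, 0.1]])}

def prob(word, a1, ainf, ops):
    v = a1
    for sym in word:
        v = ops[sym] @ v
    return float(ainf @ v)

# Transformed model B = <M^-1 a1, M^T ainf, {M^-1 A_a M}>
M = np.array([[2.0, 1.0], [0.5, 1.0]])
Minv = np.linalg.inv(M)
b1   = Minv @ alpha_1
binf = M.T @ alpha_inf
B = {sym: Minv @ A[sym] @ M for sym in A}

for w in ["ab", "aab", "bba"]:
    print(w, prob(w, alpha_1, alpha_inf, A), prob(w, b1, binf, B))   # equal probabilities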

Outline: SHAGs and PNFAs, Spectral Learning, Experiments

Hankel Matrices
Consider a distribution P(·) over X*. The Hankel matrix H ∈ R^(X* × X*) is the string-indexed matrix with H(s, p) = P(ps).
For the example PNFA (rows indexed by suffixes s, columns by prefixes p):

        λ        a        b        aa       ab       ...
λ       0.0      0.12     0.18     0.06     0.084
a       0.12     0.06     0.03     0.0276   0.0156
b       0.18     0.084    0.036    0.0384   0.0192
aa      0.06     0.0276   0.0114   0.0126   0.00612
ab      0.084    0.0384   0.0156   0.01752  0.0084
...

H(λ, ab) = H(b, a) = H(ab, λ) = P(ab) = 0.084

Hankel Matrix Factorization
Assume P is generated by a PNFA with n states. Then there is a rank factorization H = B F where:
F ∈ R^(n × X*) is a forward matrix that summarizes P after generating any prefix into an n-dimensional state.
B ∈ R^(X* × n) is a backward matrix that generates suffixes w.r.t. P given an n-dimensional state.
Then P(ps) = H(s, p) = B(s, :) F(:, p).

Hankel Matrix Factorization (example)

F =        λ      a      b      aa     ab    ...
    q_0    1.0    0.4    0.1    0.18   0.06
    q_1    0.0    0.2    0.3    0.10   0.14

B =        q_0      q_1
    λ      0.0      0.6
    a      0.12     0.06
    b      0.18     0.06
    aa     0.06     0.018
    ab     0.084    0.024
    ...

P(ab) = H(b, a) = B(b, :) F(:, a) = [0.18  0.06] · [0.4  0.2]ᵀ = 0.084
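
The forward and backward blocks above can be rebuilt directly from the example operators; a short numpy sketch (basis strings as in the tables):

import numpy as np
from functools import reduce

alpha_1   = np.array([1.0, 0.0])
alpha_inf = np.array([0.0, 0.6])
A = {"a": np.array([[0.4, 0.1], [0.2, 0.1]]),
     "b": np.array([[0.1, 0.1], [0.3, 0.1]])}

def fwd(p):   # F(:, p) = A_{p_k} ... A_{p_1} alpha_1
    return reduce(lambda v, sym: A[sym] @ v, p, alpha_1)

def bwd(s):   # B(s, :) = alpha_inf^T A_{s_k} ... A_{s_1}
    return reduce(lambda v, sym: A[sym].T @ v, reversed(s), alpha_inf)

prefixes = suffixes = ["", "a", "b", "aa", "ab"]
F = np.column_stack([fwd(p) for p in prefixes])       # 2 x 5 forward matrix
B = np.vstack([bwd(s) for s in suffixes])             # 5 x 2 backward matrix
H = B @ F                                             # H(s, p) = P(p + s)

print(F.round(4))
print(B.round(4))
print(H[suffixes.index("b"), prefixes.index("a")])    # P(ab) = 0.084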

Spectral Methods for PNFAs
H = B F. From the factorization we can recover the PNFA: for any symbol a,
P(pas) = H(s, pa) = B(s, :) A_a F(:, p).
Spectral method for PNFAs:
1. Collect statistics about H using training samples.
2. Obtain a rank-n factorization.
3. Obtain the operator model.

Substring-Expectation Hankel Matrices
Consider the expected number of substring occurrences:
f(x) = E[|x'|_x] = Σ_{p,s ∈ X*} P(pxs)
(|x'|_x is the number of times x appears as a substring of x').
The Hankel matrix of f is H_f(s, p) = f(ps).
We will work with H_f instead of H, and estimate it from samples of the target distribution.
From the factorization H_f = B F (and some more statistics) we can also recover the PNFA A.
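
A minimal sketch of estimating f from a sample by counting substring occurrences and averaging; the training strings below are made up for illustration and are not from the paper:

def occurrences(x, xp):
    """Number of (possibly overlapping) occurrences of x as a substring of xp."""
    if not x:
        return len(xp) + 1        # the empty string occurs |xp| + 1 times
    return sum(1 for i in range(len(xp) - len(x) + 1) if xp[i:i + len(x)] == x)

def estimate_f(train, candidates):
    """Empirical substring expectations f_hat(x), averaged over the sample."""
    return {x: sum(occurrences(x, xp) for xp in train) / len(train) for x in candidates}

# Hypothetical training sample over X = {a, b}
train = ["ab", "aab", "b", "abab", "a"]
print(estimate_f(train, ["", "a", "b", "aa", "ab"]))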

Hankel Sub-blocks Factorization
A finite sub-block P of H_f with the same rank can also be factorized: P = B F.
Given B and F, we can recover the operator model A.
Also, from any other rank-n factorization P = Q R we can recover an equivalent operator model A' (a projection of A, with Q = B M, R = M⁻¹ F).
In this work, we choose P ∈ R^(X̄ × X̄), where X̄ = X ∪ {λ}.

The SVD Factorization
We can recover valid operators for P from any rank factorization P = Q R.
Since P is estimated from training samples, a natural choice is a factorization that is robust to estimation errors.
Such a choice is the thin SVD: P = U (Σ Vᵀ), taking Q = U and R = Σ Vᵀ.
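
A tiny numpy illustration of this choice on the 3×3 sub-block of the running example (basis {λ, a, b}); since the true rank is 2, the thin SVD recovers P exactly:

import numpy as np

# Hankel sub-block over the basis {lambda, a, b} of the running example
P = np.array([[0.00, 0.12,  0.18],
              [0.12, 0.06,  0.03],
              [0.18, 0.084, 0.036]])

n = 2                                     # number of hidden states
U, S, Vt = np.linalg.svd(P)
Q = U[:, :n]                              # Q = U (top-n left singular vectors)
R = np.diag(S[:n]) @ Vt[:n]               # R = Sigma V^T

print(np.allclose(Q @ R, P, atol=1e-8))   # the rank-2 factorization reproduces P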

The Learning Algorithm
Inputs: an alphabet X; a training set train = {x^(i)_{1:T}}, i = 1..M; the number of hidden states n.
1: Compute empirical estimates from train of the statistics p̂_1, p̂_∞, P̂ and {P̂_a}_{a ∈ X}
2: Compute the SVD of P̂ and let Û be the matrix of its top n left singular vectors
3: Compute the observable operators for h and d:
4:   α̂_1 = Ûᵀ p̂_1
5:   α̂_∞ᵀ = p̂_∞ᵀ (Ûᵀ P̂)⁺
6:   Â_a = Ûᵀ P̂_a (Ûᵀ P̂)⁺ for each a ∈ X
7: return the operators ⟨α̂_1, α̂_∞, {Â_a}_{a ∈ X}⟩
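
The sketch below instantiates steps 1–6 in numpy on the running PNFA example. As a simplification, it plugs in exact probability-Hankel statistics over the basis X̄ = {λ, a, b} instead of the substring-expectation statistics estimated from training data in the paper; the recovered operators define the same distribution up to a change of basis:

import numpy as np

# Running example PNFA (ground truth, used here only to produce exact statistics)
alpha_1   = np.array([1.0, 0.0])
alpha_inf = np.array([0.0, 0.6])
A = {"a": np.array([[0.4, 0.1], [0.2, 0.1]]),
     "b": np.array([[0.1, 0.1], [0.3, 0.1]])}

def prob(word, a1=alpha_1, ainf=alpha_inf, ops=A):
    v = a1
    for sym in word:
        v = ops[sym] @ v
    return float(ainf @ v)

# Step 1: statistics over the basis X_bar = {lambda, a, b}: P_hat[s, p] = P(p + s)
basis = ["", "a", "b"]
n = 2
P_hat  = np.array([[prob(p + s) for p in basis] for s in basis])
Pa_hat = {x: np.array([[prob(p + x + s) for p in basis] for s in basis]) for x in A}
p1_hat   = P_hat[:, 0]     # column at prefix lambda: P(s) for each suffix s
pinf_hat = P_hat[0, :]     # row at suffix lambda:    P(p) for each prefix p

# Steps 2-6: SVD and observable operators
U = np.linalg.svd(P_hat)[0][:, :n]          # top-n left singular vectors
pinv = np.linalg.pinv(U.T @ P_hat)          # (U^T P)^+
a1_hat   = U.T @ p1_hat
ainf_hat = pinf_hat @ pinv
A_hat = {x: U.T @ Pa_hat[x] @ pinv for x in A}

# The learned model reproduces the target probabilities
for w in ["ab", "aab", "bba"]:
    print(w, round(prob(w), 6), round(prob(w, a1_hat, ainf_hat, A_hat), 6))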

Remarks
The hidden space is induced from P: with P ∈ R^(X̄ × X̄), its entries are statistics of bigrams of symbols. In general, P can be defined for any arbitrary set of prefixes and suffixes.
Our algorithm shares many features with previous spectral methods for FSMs: Hsu, Kakade and Zhang (2009) for HMMs; Bailly (2011) for PNFAs.
Two novelties are:
Our formulation is based on forward-backward recursions.
Our algorithm uses statistics from substrings of the training samples; previous work was restricted to prefixes only.

The Parsing Algorithm
Task: given a sentence x_{1:T}, recover the dependency tree with highest probability.
With a SHAG made of PNFAs, this problem is not tractable, so we employ MBR decoding, as follows:
Compute marginal dependency probabilities (O(T³) inside/outside):
Pr[x_i → x_j | x_{1:T}] = Σ_{y ∈ Y : (x_i → x_j) ∈ y} Pr[y]
Maximize the product of marginals (also O(T³)):
ŷ = argmax_{y ∈ Y} ∏_{(x_h → x_m) ∈ y} Pr[x_h → x_m | x_{1:T}]
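
For intuition only, a deliberately simplified sketch of the second step: given a (made-up) matrix of arc marginals, each modifier picks the head with the largest marginal. This greedy variant ignores the projectivity and tree constraints that the actual O(T³) MBR decoder enforces:

import numpy as np

# Hypothetical arc marginals for a short sentence:
# marg[h, m] = Pr[x_h -> x_m | x_{1:T}], with row 0 standing for the root.
tokens = ["<root>", "John", "saw", "movie"]
marg = np.array([
    [0.0, 0.05, 0.90, 0.05],   # arcs from <root>
    [0.0, 0.00, 0.10, 0.05],   # arcs from John
    [0.0, 0.90, 0.00, 0.85],   # arcs from saw
    [0.0, 0.05, 0.00, 0.00],   # arcs from movie
])

# Greedy MBR-style decoding: each modifier m picks argmax_h marg[h, m]
heads = {tokens[m]: tokens[int(np.argmax(marg[:, m]))] for m in range(1, len(tokens))}
print(heads)   # {'John': 'saw', 'saw': '<root>', 'movie': 'saw'}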

Outline: SHAGs and PNFAs, Spectral Learning, Experiments

Spectral vs. EM
We restrict parsing to PoS sequences, avoiding sparsity issues when estimating lexical operators.
English Penn Treebank data (45 PoS tags).
We compare to:
A simple deterministic baseline that estimates Pr(x | h, dir) from counts in the data.
A second deterministic baseline that keeps separate statistics for the first generated symbol in each automaton.
A non-deterministic SHAG trained with Expectation Maximization.

Spectral vs. EM
[Plot: unlabeled attachment score (68–82) vs. number of states (2–14), for Det, Det+F, Spectral, EM (5), EM (10), EM (25), EM (100).]
Training times: EM (25): > 50 min. (2 to 3 min. per iteration). Spectral: 30 sec.

Lexical Deterministic + PoS Spectral
We consider three types of lexicalized deterministic models:
Single statistics Pr[x | h, dir], where heads and modifiers are now lexical items (Lex).
Separate statistics for the first generated word (Lex+F).
Separate statistics for the first generated word and for words following coordinations and punctuation (Lex+FCP).
We combine these lexicalized baselines with our PNFA-based model in a log-linear fashion:
score(⟨h, d, x_{1:T}⟩) = log Pr_sp[x_{1:T} | h, d] + log Pr_det[x_{1:T} | h, d]

Lexicalized parsing (development set)
[Plot: unlabeled attachment score (72–86) vs. number of states (2–10), for Lex, Lex+F, Lex+FCP, Lex + Spectral, Lex+F + Spectral, Lex+FCP + Spectral.]

Summary and Future Work
Summary:
A new basic tool for inducing hidden structure in PNFAs.
Non-deterministic SHAGs as operator models.
A cubic-time inside/outside algorithm (see paper).
In experiments: much faster than EM, with comparable accuracy; largely improves several deterministic models.
Future work: lexicalized operator models; vertical hidden relations.

Thank you! Questions?