Spectral Learning for Non-Deterministic Dependency Parsing
Franco M. Luque (1), Ariadna Quattoni (2), Borja Balle (2), Xavier Carreras (2)
(1) Universidad Nacional de Córdoba and CONICET   (2) Universitat Politècnica de Catalunya
WPLN, Montevideo 2012. November 9, 2012
Non-local Phenomena in Dependency Structures
[Dependency tree with automaton states: "I travel from Argentina to Avignon"]
Higher-order models: sparsity issues; increased parsing complexity.
Hidden variable models: expensive parameter estimation (e.g. EM).
This work: dependency parsing with non-deterministic SHAGs, and a fast spectral learning algorithm.
Outline SHAGs and PNFAs Spectral Learning Experiments
Split Head-Automata Grammars (SHAG)
[Dependency tree with automaton states: "John saw a new movie today."]
SHAG: a popular context-free grammatical formalism whose derivations are dependency trees (Eisner and Satta, 1999).
Each symbol in the grammar has two automata (Left/Right) that generate the modifiers to each side of it.
Probabilistic Split Head-Automata Grammars
[Dependency tree: "John saw a new movie today."]
Pr[tree] = Pr[saw | ⋆, right] Pr[John | saw, left] Pr[movie, today, . | saw, right] Pr[a, new | movie, left] Pr[ε | movie, right] Pr[ε | John, right] ...
Probabilistic Dependency Parsing
In a probabilistic SHAG, dependency trees factor into head-modifier sequences:
Pr[tree] = ∏_{⟨h, d, x_{1:T}⟩ ∈ y} Pr[x_{1:T} | h, d]
In this work:
We model the dynamics of modifier sequences with hidden structure, using PNFAs.
We use spectral methods to induce the hidden structure.
It is a direct application of the method: each PNFA of the grammar is learned independently of the rest.
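To make the factorization concrete, here is a minimal Python sketch of how a tree's probability decomposes into one automaton term per head and direction. The triple representation of the tree and the function pnfa_prob are hypothetical illustrations, not part of the paper.

import math

def tree_log_prob(tree, pnfa_prob):
    # tree: list of (head, direction, modifier_sequence) triples,
    # one per head-and-direction pair in the derivation (hypothetical format).
    # pnfa_prob(h, d, mods): Pr[x_{1:T} | h, d] given by the corresponding automaton.
    return sum(math.log(pnfa_prob(h, d, mods)) for h, d, mods in tree)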
Probabilistic Non-deterministic Finite Automata
X = {a, b}.
[PNFA diagram: states q0 (initial) and q1, with transition and stopping weights matching the vectors and matrices below]
α_1 = [1.0, 0.0]^T    α_∞ = [0.0, 0.6]^T
A_a = [0.4 0.1; 0.2 0.1]    A_b = [0.1 0.1; 0.3 0.1]
P(ab) = α_∞^T A_b A_a α_1 = 0.4·0.3·0.6 + 0.2·0.1·0.6 = 0.084
Operator Models
X: an alphabet of symbols.
An operator model A with n states is a tuple ⟨α_1, α_∞, {A_a}_{a ∈ X}⟩ where α_1, α_∞ ∈ R^n are vectors and each A_a ∈ R^{n×n} is an operator matrix.
A computes a probability distribution over strings in X* as follows:
P(x_{1:T}) = α_∞^T A_{x_T} ··· A_{x_2} A_{x_1} α_1
Change of basis: B = ⟨M^{-1} α_1, α_∞^T M, {M^{-1} A_a M}_{a ∈ X}⟩ implies P_B = P_A.
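As a sanity check, the operator-model formula can be evaluated directly with the example automaton from the previous slide. This is a minimal numpy sketch for illustration, not code from the paper.

import numpy as np

alpha_1 = np.array([1.0, 0.0])            # initial-state vector
alpha_inf = np.array([0.0, 0.6])          # stopping vector
A = {"a": np.array([[0.4, 0.1], [0.2, 0.1]]),
     "b": np.array([[0.1, 0.1], [0.3, 0.1]])}

def string_prob(x):
    # P(x_1..x_T) = alpha_inf^T A_{x_T} ... A_{x_1} alpha_1
    state = alpha_1
    for symbol in x:                      # apply operators left to right
        state = A[symbol] @ state
    return float(alpha_inf @ state)

print(string_prob("ab"))                  # 0.084, as in the worked example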
Outline SHAGs and PNFAs Spectral Learning Experiments
Hankel Matrices
Consider a distribution P(·) over X*.
H ∈ R^{X* × X*} (string-indexed matrix), with H(s, p) = P(ps) (rows indexed by suffixes s, columns by prefixes p):
        λ       a       b       aa       ab       ...
  λ     0.0     0.12    0.18    0.06     0.084    ...
  a     0.12    0.06    0.03    0.0276   0.0156   ...
  b     0.18    0.084   0.036   0.0384   0.0192   ...
  aa    0.06    0.0276  0.0114  0.0126   0.00612  ...
  ab    0.084   0.0384  0.0156  0.01752  0.0084   ...
  ...
H(λ, ab) = H(b, a) = H(ab, λ) = P(ab) = 0.084
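A small sketch of how such a (sub-)block of the Hankel matrix could be filled, assuming access to a function prob that returns P(x) for a string x (for instance the string_prob defined above); rows are indexed by suffixes and columns by prefixes, following H(s, p) = P(ps).

import numpy as np

def hankel_block(prefixes, suffixes, prob):
    # H[i, j] = P(prefixes[j] + suffixes[i])
    H = np.zeros((len(suffixes), len(prefixes)))
    for i, s in enumerate(suffixes):
        for j, p in enumerate(prefixes):
            H[i, j] = prob(p + s)
    return H

# e.g. hankel_block(["", "a", "b"], ["", "a", "b"], string_prob)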
Hankel Matrix Factorization
Assume P is generated by a PNFA with n states. Then there is a rank factorization H = BF where:
F ∈ R^{n × X*} is a forward matrix that summarizes P after generating any prefix into an n-dimensional state.
B ∈ R^{X* × n} is a backward matrix that generates suffixes w.r.t. P given an n-dimensional state.
Then P(ps) = H(s, p) = B(s, :) F(:, p).
Hankel Matrix Factorization
         λ     a     b     aa    ab    ...
F = q0   1.0   0.4   0.1   0.18  0.06  ...
    q1   0.0   0.2   0.3   0.1   0.14  ...

         q0      q1
B = λ    0.0     0.6
    a    0.12    0.06
    b    0.18    0.06
    aa   0.06    0.018
    ab   0.084   0.024
    ...
P(ab) = H(b, a) = B(b, :) F(:, a) = [0.18 0.06] [0.4 0.2]^T = 0.084
Spectral Methods for PNFAs
H = BF. From the factorization we can recover the PNFA: for any symbol a, P(pas) = H(s, pa) = B(s, :) A_a F(:, p).
Spectral methods for PNFAs:
1. Collect statistics about H using training samples.
2. Obtain a rank-n factorization.
3. Obtain the operator model.
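In matrix terms, step 3 amounts to solving P(pas) = B(s, :) A_a F(:, p) for A_a. Below is a minimal sketch using pseudo-inverses, assuming H_a denotes the shifted block with H_a(s, p) = P(pas) and that B and F have rank n; it is an illustration, not the paper's procedure.

import numpy as np

def recover_operator(B, F, H_a):
    # From H_a = B A_a F, recover A_a = B^+ H_a F^+
    return np.linalg.pinv(B) @ H_a @ np.linalg.pinv(F)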
Substring Expectation Hankel Matrices
Consider the expected number of substring occurrences:
f(x) = E[|x'|_x] = ∑_{p,s ∈ X*} P(pxs)
(|x'|_x is the number of times x appears in x')
The Hankel matrix of f is H_f(s, p) = f(ps).
We will look at H_f instead of H, and estimate it from samples of the target distribution.
From the factorization H_f = BF (and some more statistics) we can also recover the PNFA A.
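The substring expectations f(x) can be estimated from a sample simply by counting occurrences. A sketch follows; the max_len cutoff (enough to fill bigram-sized blocks) is an implementation choice for illustration, not from the paper.

from collections import Counter

def substring_expectations(samples, max_len=2):
    # Empirical estimate of f(x) = E[number of occurrences of x in a sample],
    # for all substrings x of length 1..max_len.
    counts = Counter()
    for seq in samples:
        for i in range(len(seq)):
            for l in range(1, max_len + 1):
                if i + l <= len(seq):
                    counts[tuple(seq[i:i + l])] += 1
    M = len(samples)
    return {x: c / M for x, c in counts.items()}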
Hankel Sub-block Factorization
A finite sub-block P of H_f with the same rank can also be factorized: P = BF.
Given B and F, we can recover the operator model A.
Also, from any other rank-n factorization P = QR we can recover an equivalent operator model (a projection of A, with Q = BM, R = M^{-1} F).
In this work, we choose P ∈ R^{X' × X'}, where X' = X ∪ {λ}.
The SVD Factorization
We can recover valid operators for P from any rank factorization P = QR.
Since P is estimated from training samples, we want a factorization that is robust to estimation errors.
A natural choice is the thin SVD: P = U (Σ V^T), with Q = U and R = Σ V^T.
The Learning Algorithm
Inputs: an alphabet X; a training set train = {x^i_{1:T}}_{i=1}^M; the number of hidden states n.
1: Compute empirical estimates from train of the statistics p̂_1, p̂_∞, P̂, and {P̂_a}_{a ∈ X}
2: Compute the SVD of P̂ and let Û be the matrix of top-n left singular vectors of P̂
3: Compute the observable operators for h and d:
4:   α̂_1 = Û^T p̂_1
5:   α̂_∞^T = p̂_∞^T (Û^T P̂)^+
6:   Â_a = Û^T P̂_a (Û^T P̂)^+ for each a ∈ X
7: return operators ⟨α̂_1, α̂_∞, {Â_a}_{a ∈ X}⟩
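A compact numpy rendering of steps 2 to 6, assuming the statistics p_1, p_inf, P and {P_a} have already been estimated (step 1) and are passed in as arrays indexed by X' = X ∪ {λ}. This is a sketch of the pseudocode above, not the authors' implementation.

import numpy as np

def spectral_learn(p1, pinf, P, P_a, n):
    # p1, pinf: statistic vectors; P: X' x X' matrix; P_a: dict of per-symbol matrices.
    U, _, _ = np.linalg.svd(P, full_matrices=False)
    U = U[:, :n]                            # top-n left singular vectors of P
    UP_pinv = np.linalg.pinv(U.T @ P)       # (U^T P)^+
    alpha_1 = U.T @ p1                      # step 4
    alpha_inf = UP_pinv.T @ pinf            # step 5: alpha_inf^T = pinf^T (U^T P)^+
    A = {a: U.T @ Pa @ UP_pinv for a, Pa in P_a.items()}   # step 6
    return alpha_1, alpha_inf, A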
Remarks
The hidden space is induced from P: P ∈ R^{X' × X'} has statistics of bigrams of symbols.
In general, P can be defined for any arbitrary set of prefixes and suffixes.
Our algorithm shares many features with previous spectral methods for FSMs: Hsu, Kakade and Zhang (2009) for HMMs; Bailly (2011) for PNFAs.
Two novelties are:
Our formulation is based on forward-backward recursions.
Our algorithm uses statistics from substrings of the training samples; previous work was restricted to prefixes only.
The Parsing Algorithm
Task: given a sentence x_{1:T}, recover the dependency tree with the highest probability.
With a SHAG made of PNFAs, this problem is not tractable. We employ MBR decoding, as follows:
Compute marginal dependency probabilities (O(T^3) inside/outside):
Pr[x_i → x_j | x_{1:T}] = ∑_{y ∈ Y : (x_i → x_j) ∈ y} Pr[y]
Maximize the product of marginals (also O(T^3)):
ŷ = argmax_{y ∈ Y} ∏_{(x_h → x_m) ∈ y} Pr[x_h → x_m | x_{1:T}]
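The second MBR step is an arc-factored maximization, so any first-order projective parser applies. Below is a minimal Eisner-style O(T^3) sketch that computes only the optimal score from a matrix of log-marginals (recovering the tree itself would additionally require backpointers); it is an illustration under those assumptions, not the paper's parser.

import numpy as np

def mbr_best_score(logmarg):
    # logmarg[h, m] = log Pr[x_h -> x_m | x_{1:T}], with index 0 as the artificial root.
    T = logmarg.shape[0] - 1
    NEG = -np.inf
    # complete (C*) and incomplete (I*) spans, headed at the left (R) or right (L) end
    CR = np.full((T + 1, T + 1), NEG); CL = np.full((T + 1, T + 1), NEG)
    IR = np.full((T + 1, T + 1), NEG); IL = np.full((T + 1, T + 1), NEG)
    for i in range(T + 1):
        CR[i, i] = CL[i, i] = 0.0
    for length in range(1, T + 1):
        for i in range(T + 1 - length):
            j = i + length
            # attach an arc between i and j over the best split point k
            best = max(CR[i, k] + CL[k + 1, j] for k in range(i, j))
            IR[i, j] = best + logmarg[i, j]     # arc i -> j
            IL[i, j] = best + logmarg[j, i]     # arc j -> i
            CR[i, j] = max(IR[i, k] + CR[k, j] for k in range(i + 1, j + 1))
            CL[i, j] = max(CL[i, k] + IL[k, j] for k in range(i, j))
    return CR[0, T]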
Outline SHAGs and PNFAs Spectral Learning Experiments
Spectral vs. EM
We restrict to parsing PoS sequences, avoiding sparsity issues when estimating lexical operators.
English Penn Treebank data (45 PoS tags).
We compare to:
A simple deterministic baseline that estimates Pr(x | h, dir) from counts in the data.
A second deterministic baseline that keeps separate statistics for the first generated symbol in each automaton.
A non-deterministic SHAG trained with Expectation Maximization.
Spectral vs. EM
[Plot: unlabeled attachment score (68-82) vs. number of states (2-14), comparing Det, Det+F, Spectral, and EM with 5, 10, 25, and 100 states]
Training times: EM (25): over 50 min. (2 to 3 min. per iteration). Spectral: 30 sec.
Lexical Deterministic + PoS Spectral
We consider three types of lexicalized deterministic models:
Single statistics Pr[x | h, dir], where heads and modifiers are now lexical items (Lex).
Separate statistics for the first generated word (Lex+F).
Separate statistics for the first generated word, and for words following coordinations and punctuation (Lex+FCP).
We combine these lexicalized baselines with our PNFA-based model in a log-linear fashion:
score(⟨h, d, x_{1:T}⟩) = log Pr_sp(x_{1:T} | h, d) + log Pr_det(x_{1:T} | h, d)
Lexicalized Parsing (development set)
[Plot: unlabeled attachment score (72-86) vs. number of states (2-10) for Lex, Lex+F, Lex+FCP, and their combinations with the Spectral model]
Summary and Future Work
Summary:
A new basic tool for inducing hidden structure in PNFAs.
Non-deterministic SHAGs as operator models.
A cubic-time inside/outside algorithm (see paper).
In experiments:
Much faster than EM, comparable in accuracy.
Largely improves over several deterministic models.
Future work:
Lexicalized operator models.
Vertical hidden relations.
Thank you! Questions?