Prepositional Phrase Attachment over Word Embedding Products

1 Prepositional Phrase Attachment over Word Embedding Products. Pranava Madhyastha (1), Xavier Carreras (2), Ariadna Quattoni (2). (1) University of Sheffield, (2) Naver Labs Europe

2 Prepositional Phrases
I went to the restaurant by bike vs. I went to the restaurant by the bridge

3-5 Prepositional Phrases (builds)
The same pair of sentences with candidate prep? attachment arcs drawn over them: which word does by bike / by the bridge attach to? The final build adds s/bridge/arno/?

6-7 Local PP Attachment Decisions
Ratnaparkhi (1998) approached PP attachment as a local classification decision. This loses the potential advantage of inference over the whole structure.
Two settings:
Binary attachment problem [Ratnaparkhi; 1998]
Multi-class attachment problem [Belinkov+; 2015]

8 Word Embeddings
Previous approaches: features in a binary decision setting.
Word embeddings are a rich source of information.
How far can we go with only word embeddings?

9 Resolving Ternary Relations using Word Embeddings
Given embeddings φ(head), φ(prep) and φ(mod), the problem reduces to finding the compatibility between them.
One simple approach is: f(h, p, m) = w^T [v_h ⊗ v_p ⊗ v_m]
Exploit each dimension alone with a simple trick: append a 1 to the embeddings.

10 Our Proposal: Unfolding the Tensors
Unfolding the tensor W over its h, p, m modes results in: f(h, p, m) = v_h^T W [v_p ⊗ v_m]
We can further explore compositionality as: f(h, p, m) = α(h)^T W β(p, m)
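A minimal numpy sketch of the unfolded score, assuming v_h, v_p and v_m are embedding vectors of dimension n (with a constant 1 appended, as on the previous slide) and W is a learned n x n^2 matrix; all names are illustrative, not taken from the paper's code.

import numpy as np

def score(v_h, v_p, v_m, W):
    # f(h, p, m) = v_h^T W (v_p (x) v_m), with W the unfolded third-order tensor
    beta = np.kron(v_p, v_m)      # Kronecker/tensor product, length n*n
    return v_h @ W @ beta         # scalar compatibility score

# toy usage with the append-1 trick
n = 4
rng = np.random.default_rng(0)
v_h, v_p, v_m = (np.append(rng.normal(size=n - 1), 1.0) for _ in range(3))
W = rng.normal(size=(n, n * n))
print(score(v_h, v_p, v_m, W))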

11 Training the Model
We use the logistic loss (standard conditional maximum-likelihood optimization):
argmin_W  logistic(Tset, W) + λ ‖W‖
As a structural constraint, we use rank-constrained regularization.
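A sketch of the per-example logistic loss and its gradient with respect to the unfolded W, under the same illustrative setup (binary label y in {-1, +1}); the full objective sums this over the training set and adds the regularizer λ‖W‖, which is handled separately by the proximal step shown near the end of the deck.

import numpy as np

def logistic_loss_and_grad(v_h, v_p, v_m, y, W):
    # loss = log(1 + exp(-y * f(h, p, m))), gradient taken w.r.t. W
    beta = np.kron(v_p, v_m)
    f = v_h @ W @ beta
    loss = np.log1p(np.exp(-y * f))
    dloss_df = -y / (1.0 + np.exp(y * f))   # = -y * sigmoid(-y * f)
    grad = dloss_df * np.outer(v_h, beta)   # since df/dW = v_h beta^T
    return loss, grad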

12 Empirical Overview
Exploring word embeddings
Exploring compositionality
Binary attachment datasets
Multiple attachment datasets
Semi-supervised multiple attachment experiments with PTB and WTB

13 Exploring Word Embeddings
Word2vec skip-gram embeddings [Mikolov+; 2013]
Skip-dep [Bansal+; 2014]: uses word2vec with dependency contexts
Exploration over different dimensions and data.
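For concreteness, a sketch of looking up head, preposition and modifier vectors from pre-trained word2vec-format embeddings; gensim and the file path are assumptions for illustration, as the slides only name the embedding types.

import numpy as np
from gensim.models import KeyedVectors

# hypothetical path to skip-gram or skip-dep vectors in word2vec text format
vectors = KeyedVectors.load_word2vec_format("embeddings/skipdep_bllip_100.txt", binary=False)

def embed(word, dim=100):
    # return the embedding (zero vector for unknown words) with a constant 1 appended
    v = vectors[word] if word in vectors else np.zeros(dim)
    return np.append(v, 1.0)

v_h, v_p, v_m = embed("went"), embed("by"), embed("bike")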

14 Exploring Word Embeddings
Table: attachment accuracy over the RRR dev set, by embedding type and source data (Skip-gram and Skip-dep, each trained on BLLIP, Wikipedia and NYT, plus Skip-gram & Skip-dep on BLLIP) and by dimension (n = 50, 100, 300).

15-19 Exploring Compositionality (builds)
f(h, p, m) = α(h)^T W β(p, m)
Exploring the possibilities for β(p, m):
Sum: f(h, p, m) = α(h)^T W (w_p + w_m)
Concatenation: f(h, p, m) = α(h)^T W (w_p ; w_m)
Products: f(h, p, m) = α(h)^T W (w_p ⊗ w_m)
Prep. identities: f(h, p, m) = α(h)^T W_p w_m

20 Exploring Compositionality
Composition of p and m, size of the unfolded tensor W, and attachment accuracy:
Sum [v_p + v_m]: n × n
Concatenation [v_p ; v_m]: n × 2n
Prep. identities [i_p ⊗ v_m]: n × Pn
Product [v_p ⊗ v_m]: n × n·n
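A sketch of the candidate compositions β(p, m) and the matching shape of the unfolded W, mirroring the table above; the preposition list is an illustrative stand-in for the real preposition vocabulary.

import numpy as np

preps = ["by", "with", "to", "for"]   # illustrative closed list of prepositions
P = len(preps)

def beta_sum(v_p, v_m):               # W has shape n x n
    return v_p + v_m

def beta_concat(v_p, v_m):            # W has shape n x 2n
    return np.concatenate([v_p, v_m])

def beta_prep_identity(prep, v_m):    # W has shape n x P*n: one block per preposition
    i_p = np.eye(P)[preps.index(prep)]
    return np.kron(i_p, v_m)

def beta_product(v_p, v_m):           # W has shape n x n*n
    return np.kron(v_p, v_m)

def f(v_h, W, beta):                  # f(h, p, m) = α(h)^T W β(p, m), with α(h) = v_h
    return v_h @ W @ beta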

21 Empirical Results over Binary Attachment Datasets
Test accuracy on RRR, WIKI and NYT for the tensor product model with different embeddings (Skip-gram trained on Wikipedia and NYT; Skip-dep trained on BLLIP, Wikipedia and NYT), compared against [Stetina+; 1997 (*)], [Collins+; 1999], [Belinkov+; 2014 (*)] and [Nakashole+; 2015 (*)]. Scores refer to accuracy measures.

22 Multiple Heads and Positional Information
f(h, p, m) = α(h)^T W β(p, m)
In this case, we can modify α(h) to include the positional info: α(h) = δ_h ⊗ v_h, where δ_h ∈ R^H is the position vector.
This can also be written as: f(h, p, m) = v_h^T W_{δ_h} [v_p ⊗ v_m].
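A sketch of the multi-head setting: each candidate head position gets its own slice of the parameter tensor (the role of δ_h), and the PP is attached to the highest-scoring candidate. Shapes and names are illustrative.

import numpy as np

def attach(candidate_heads, v_p, v_m, W_pos):
    # candidate_heads: list of (position, v_h) pairs; W_pos: array of shape (H, n, n*n)
    # returns the index of the best-scoring candidate head
    beta = np.kron(v_p, v_m)
    scores = [v_h @ W_pos[pos] @ beta for pos, v_h in candidate_heads]
    return int(np.argmax(scores))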

23 Experiments with Multiple Attachment Setting
Test accuracy on Arabic and English for the tensor product model (n = 50 with ℓ2, n = 50 with ℓ*, and n = 100 with ℓ*), compared against Belinkov+; 2014 (basic, syn, feat, full) and Yu+; 2016 (full).

24 Semi-supervised Experiments
Embeddings trained on source and target raw data; model trained on the source data.
Compared against the Stanford neural-network parser and the Turbo parser.

25 Semi-supervised Experiments with PTB vs WTB
Accuracy on the PTB test set (2523 instances) and on the Web Treebank test domains A (868), E (936), N (839), R (902), W (788), plus the WTB average (4333), for Tensor BLLIP, Tensor BLLIP+WTB, Chen+, and Martins+; 2010 (2nd and 3rd order). Scores refer to accuracy measures.

26 What Seems to Work
Example: ... Came the disintegration of the Beatles' minds with LSD ..., with the PP attachments predicted by Turbo (3rd order), the Tensor model and the Stanford parser.

27 Where it has a Problem
Example: ... the return address for the letters to the Senators ...

28 Conclusions
We described a PP attachment model using simple tensor products.
Products of embeddings seem better at capturing compositions.
On out-of-domain datasets, the models show big promise.
The proposed models can serve as a building block for dependency parsing methods.

29 Thank You

30-34 Learning with Structural Constraints (builds)
Controlling the learned parameter matrix with regularization:
With ℓ2 regularization we obtain a dense, dispersed parameter matrix.
SVD(W) = U Σ V^T, with Σ = diag(σ_1, ..., σ_k).
With ℓ* regularization we obtain a low-rank matrix.
Bonus: everything can be optimized as a convex problem with FOBOS!

35 A Quick Look at Optimization (FOBOS)
Input: gradient function g. Output: W_{t+1}.
W_1 = 0
while t < T:
    η_t = c / √t
    W_{t+0.5} = W_t − η_t g(W_t)
    W_{t+1} = argmin_W  ½ ‖W − W_{t+0.5}‖² + η_t λ r(W):
        if ℓ2 regularizer:  W_{t+1} = W_{t+0.5} / (1 + η_t λ)
        else if ℓ* regularizer:  U Σ V^T = SVD(W_{t+0.5});  Σ'_{i,i} = max(Σ_{i,i} − η_t λ, 0);  W_{t+1} = U Σ' V^T
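A numpy sketch of the two proximal steps used in the FOBOS loop above: closed-form ℓ2 shrinkage, and singular-value soft-thresholding for the low-rank (ℓ*) regularizer. W_half stands for W_{t+0.5} after the gradient step; eta and lam are the step size and regularization strength.

import numpy as np

def prox_l2(W_half, eta, lam):
    # l2 proximal step: uniform shrinkage toward zero
    return W_half / (1.0 + eta * lam)

def prox_nuclear(W_half, eta, lam):
    # l* proximal step: soft-threshold the singular values, which zeroes out
    # the smallest ones and yields a low-rank W
    U, sigma, Vt = np.linalg.svd(W_half, full_matrices=False)
    sigma = np.maximum(sigma - eta * lam, 0.0)
    return (U * sigma) @ Vt           # equals U @ diag(sigma) @ Vt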

36 Data
BLLIP: 1.8 million sentences and 43 million tokens of Wall Street Journal text (excluding the PTB evaluation sets).
English Wikipedia: 13.1 million sentences and 129 million tokens.
The New York Times portion of the Gigaword corpus: 52 million sentences and 1,253 million tokens.
