A Supertag-Context Model for Weakly-Supervised CCG Parser Learning

Size: px

Start display at page:

Download "A Supertag-Context Model for Weakly-Supervised CCG Parser Learning"

Elizabeth Day
5 years ago
Views:

1 A Supertag-Context Model for Weakly-Supervised CCG Parser Learning Dan Garrette Chris Dyer Jason Baldridge Noah A. Smith U. Washington CMU UT-Austin CMU

2 Contributions 1. A new generative model for learning CCG parsers from weak supervision 2. A way to select Bayesian priors that capture properties of CCG 3. A Bayesian inference procedure to learn the parameters of our model

3 Type-Level Supervision Unannotated text Incomplete tag dictionary: word {tags}

4 Type-Level Supervision the lazy dogs np/n n/n n np np (s\np)/np wander

5 Type-Level Supervision the lazy dogs np/n n/n n np np (s\np)/np wander n n/n np/n s\np

6 Type-Level Supervision? the lazy dogs np/n n/n n np np (s\np)/np wander n n/n np/n s\np

7 PCFG: Local Decisions

8 PCFG: Local Decisions A

9 PCFG: Local Decisions A B C

10 PCFG: Local Decisions A B C

11 PCFG: Local Decisions A B C D E F G

12 PCFG: Local Decisions A B C D E F G P( B B ) D E P( C C ) F G

13 PCFG: Local Decisions A B C D E F G P( B B ) D E P( C C ) F G

14 A New Generative Model A B C D E F G P( B B ) D E

15 A New Generative Model A B C D E F G P( B B ) D E P( ) F B B R

16 A New Generative Model A B C <S> D E F G P( B B ) D E P( B F B ) R P( S B B ) L

17 A New Generative Model A B C <S> D E F G <E> (This makes inference tricky we ll come back to that)

18 Why CCG? The grammar formalism itself can be used to guide learning Given any two categories, we always know whether they are combinable. We can extract a priori context preferences, before we even look at the data Adjacent categories tend to be combinable.

19 Why CCG? s S np VP NP s/np np/n n buy the book VB DT NN? buy the book universal, intrinsic grammar properties all relationships must be learned

20 CCG Parsing s np n np/ n n/ n n s\ np FA the lazy dog sleeps

21 CCG Parsing s np np/n np/ n n/ n n s\ np FC the lazy dog sleeps

22 Supertag Context np/ n n/nn n s\ np the lazy dog sleeps

23 Supertag Context n np/ n n/nn n s\ np the lazy dog sleeps

24 Supertag Context n np/ n n/nn n s\ np the lazy dog sleeps

25 Supertag Context np/ n n s\ np the lazy dog sleeps

26 Supertag Context n np/ n n/n n s\ np the lazy dog sleeps

27 Constituent Context Klein & Manning showed the value of modeling context with the Constituent Context Model (CCM) the lazy dog sleeps [Klein & Manning 2002]

28 Constituent Context DT ( JJ NN ) VBZ [Klein & Manning 2002]

29 Constituent Context substitutability DT ( JJ NN ) VBZ lazy dog [Klein & Manning 2002]

30 Constituent Context substitutability DT ( NN ) dog VBZ [Klein & Manning 2002]

31 Constituent Context substitutability DT ( JJ JJ NN ) VBZ big lazy dog [Klein & Manning 2002]

32 Constituent Context substitutability DT ~Noun ( ) VBZ [Klein & Manning 2002]

33 Constituent Context substitutability DT ( ) VBZ [Klein & Manning 2002]

34 Supertag Context n np/ n ( n/n n ) s\ np the lazy dog sleeps

35 Supertag Context We know the constituent label We know if it s a fitting context, even before looking at the data n np/ n ( ) s\ np the sleeps

36 This Paper 1. A new generative model for learning CCG parsers from weak supervision 2. A way to select Bayesian priors that capture properties of CCG 3. A Bayesian inference procedure to learn the parameters of our model

37 Supertag-Context Parsing Standard PCFG A04 P(Aroot) A03 P(A Aleft Aright OR wi) A13 t1 t2 t3 t4 w1 w2 w3 w

38 Supertag-Context Parsing With Context A04 P(Aroot) A03 P(A Aleft Aright OR wi) A13 P(A tleft) <s> t1 t2 t3 t4 <e> P(A tright) w1 w2 w3 w

39 Prior on Categories np np np\(np/n) n np/n (np\(np/n))/n n np/n n/n n the lazy dog the lazy dog [Garrette, Dyer, Baldridge, and Smith, 2015]

40 Supertag-Context Prior PL-prior(tleft A) { if tleft can combine with A otherwise? A tleft tright the lazy dog sleeps

41 Supertag-Context Prior { 105 PR-prior(tright A) 1 if A can combine with tright otherwise n? tleft tright the lazy dog sleeps

42 This Paper 1. A new generative model for learning CCG parsers from weak supervision 2. A way to select Bayesian priors that capture properties of CCG 3. A Bayesian inference procedure to learn the parameters of our model

43 Type-Level Supervision? the lazy dogs np/n n/n n np np (s\np)/np wander

44 Type-Supervised Learning unlabeled corpus tag dictionary universal properties of the CCG formalism

45 Posterior Inference A Bayesian inference procedure will make use of our linguistically-informed priors But we can t do sampling like a PCFG Can t compute the inside chart, even with dynamic programming.

46 Sampling via Metropolis-Hastings Idea: Sample tree from an efficient proposal distribution (PCFG parameters) (Johnson et al. 2007) Accept according to the full distribution (Context parameters)

47 Posterior Inference Priors (prefer connections) Model the lazy dogs np/n n/n n np np (s\np)/np wander

48 Posterior Inference Priors (prefer connections) Model the lazy dogs np/n n/n n np np (s\np)/np wander

49 Posterior Inference Priors (prefer connections) Model the lazy dogs np/n n/n n np np (s\np)/np wander n n/n np/n s\np

$the lazy dogs np/n n/n n np np (s\np)/np wander n n/n np/n$

50 Posterior Inference Priors (prefer connections) Inside Model the lazy dogs np/n n/n n np np (s\np)/np wander n n/n np/n s\np

$the lazy dogs np/n n/n n np np (s\np)/np wander n n/n np/n$

51 Posterior Inference Priors (prefer connections) Sample Model the lazy dogs np/n n/n n np np (s\np)/np wander n n/n np/n s\np

52 Metropolis-Hastings Priors (prefer connections) Model

53 Metropolis-Hastings Priors (prefer connections) Existing Tree Model New Tree

54 Metropolis-Hastings Priors (prefer connections) Existing Tree Model New Tree

55 Metropolis-Hastings Priors (prefer connections) Existing Tree Model New Tree

56 Metropolis-Hastings Priors (prefer connections) Model

57 Posterior Inference Priors (prefer connections) Model

58 Metropolis-Hastings Sample tree based only on the pcfg parameters Accept based only on the context New worse than old => less likely to accept

59 Experimental Results

60 Experimental Question When supervision is incomplete, does modeling context, and biasing toward combining contexts, help learn better parsing models?

61 English Results 75 no context +context combinability parsing accuracy k 200k 150k 100k 50k 25k size of the corpus from which the tag dictionary is drawn

62 Experimental Results no context +context combinability parsing accuracy English Italian Chinese 25k token TD corpus

63 Conclusion Under weak supervision, we can use universal grammatical knowledge about context to find trees with a better global structure.

64 Deficiency Generative story has a throw away step if the context-generated nonterminals don t match the tree. We sample only over the space of valid trees (condition on well-formed structures). This is a benefit of the Bayesian formulation. See Smith 2011.

65 Metropolis-Hastings current tree new tree Pcontext(y) = Pfull(y) / Ppcfg(y) Pcontext(y ) = Pfull(y ) / Ppcfg(y ) z uniform(0,1) accept if z < Pfull(y ) Pfull(y) / / Ppcfg(y ) Ppcfg(y) = Pcontext(y ) Pcontext(y)

Parsing. Based on presentations from Chris Manning s course on Statistical Parsing (Stanford)

Parsing. Based on presentations from Chris Manning s course on Statistical Parsing (Stanford) Parsing Based on presentations from Chris Manning s course on Statistical Parsing (Stanford) S N VP V NP D N John hit the ball Levels of analysis Level Morphology/Lexical POS (morpho-synactic), WSD Elements