Semantic Parsing with Combinatory Categorial Grammars
Yoav Artzi, Nicholas FitzGerald and Luke Zettlemoyer
University of Washington
ACL 2013 Tutorial, Sofia, Bulgaria
Learning

Data → Learning Algorithm → CCG
- What kind of data/supervision can we use?
- What do we need to learn?
Supervised Data

show me := S/N : λf.f
flights := N : λx.flight(x)
to := PP/NP : λy.λx.to(x, y)
Boston := NP : BOSTON

to Boston ⇒ PP : λx.to(x, BOSTON)                              (>)
PP ⇒ N\N : λf.λx.f(x) ∧ to(x, BOSTON)                          (type shift)
flights to Boston ⇒ N : λx.flight(x) ∧ to(x, BOSTON)           (<)
show me flights to Boston ⇒ S : λx.flight(x) ∧ to(x, BOSTON)   (>)
Supervised Data

show me flights to Boston ⊢ S : λx.flight(x) ∧ to(x, BOSTON)

The full derivation above (the lexical entries and the sequence of combinator applications) is latent: only the sentence and its final logical form are labeled.
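To make the composition concrete, here is a minimal Python sketch of the derivation above (not from the tutorial; the toy model and entity names are illustrative assumptions). Each lexical entry is a function over a tiny domain, and the forward (>) and backward (<) combinators reduce to ordinary function application:

```python
# Toy model: two flights, one of which goes to Boston (illustrative).
FLIGHTS = {"AA101", "UA202"}
TO = {("AA101", "BOSTON"), ("UA202", "DENVER")}

flight = lambda x: x in FLIGHTS             # flights := N : λx.flight(x)
to = lambda y: lambda x: (x, y) in TO       # to := PP/NP : λy.λx.to(x, y)
show_me = lambda f: f                       # show me := S/N : λf.f
boston = "BOSTON"                           # Boston := NP : BOSTON

pp = to(boston)                             # > : PP : λx.to(x, BOSTON)
n_mod = lambda f: lambda x: f(x) and pp(x)  # type shift PP => N\N
n = n_mod(flight)                           # < : N : λx.flight(x) ∧ to(x, BOSTON)
s = show_me(n)                              # > : S : λx.flight(x) ∧ to(x, BOSTON)

print([x for x in FLIGHTS if s(x)])         # ['AA101']
```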
Supervised Data

Supervised learning is done from pairs of sentences and logical forms:

Show me flights to Boston
  λx.flight(x) ∧ to(x, BOSTON)

I need a flight from baltimore to seattle
  λx.flight(x) ∧ from(x, BALTIMORE) ∧ to(x, SEATTLE)

what ground transportation is available in san francisco
  λx.ground_transport(x) ∧ to_city(x, SF)

[Zettlemoyer and Collins 2005; 2007]
Weak Supervision
- Logical form is latent
- Labeling requires less expertise
- Labels don't uniquely determine correct logical forms
- Learning requires executing logical forms within a system and evaluating the result
Weak Supervision: Learning from Query Answers

What is the largest state that borders Texas? → New Mexico

[Clarke et al. 2010; Liang et al. 2011]
Weak Supervision: Learning from Query Answers

What is the largest state that borders Texas? → New Mexico

Candidate logical forms:
  argmax(λx.state(x) ∧ border(x, TX), λy.size(y))
  argmax(λx.river(x) ∧ in(x, TX), λy.size(y))

[Clarke et al. 2010; Liang et al. 2011]
Weak Supervision: Learning from Query Answers

What is the largest state that borders Texas? → New Mexico

  argmax(λx.state(x) ∧ border(x, TX), λy.size(y)) → New Mexico ✓
  argmax(λx.river(x) ∧ in(x, TX), λy.size(y)) → Rio Grande ✗

Executing each candidate and comparing the result against the labeled answer identifies which logical forms are correct.

[Clarke et al. 2010; Liang et al. 2011]
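A minimal sketch of this answer-driven validation (the toy database, executor, and names are assumptions for illustration): a candidate logical form is kept only if executing it yields the labeled answer.

```python
# Toy geography database (illustrative; areas in square miles).
STATES = {"New Mexico": 121590, "Oklahoma": 69898}
BORDERS_TX = {"New Mexico", "Oklahoma"}

def execute(logical_form):
    # Here logical forms are Python callables that query the database.
    return logical_form()

def validate(logical_form, labeled_answer):
    # Weak supervision: a parse is "correct" iff its answer matches.
    return execute(logical_form) == labeled_answer

# argmax(λx.state(x) ∧ border(x, TX), λy.size(y)) as a callable:
candidate = lambda: max(BORDERS_TX, key=lambda s: STATES[s])
print(validate(candidate, "New Mexico"))  # True
```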
Weak Supervision: Learning from Demonstrations

at the chair, move forward three steps past the sofa

[Chen and Mooney 2011; Kim and Mooney 2012; Artzi and Zettlemoyer 2013b]
Weak Supervision: Learning from Demonstrations

at the chair, move forward three steps past the sofa

Some examples from other domains:
- Sentences and labeled game states [Goldwasser and Roth 2011]
- Sentences and sets of physical objects [Matuszek et al. 2012]

[Chen and Mooney 2011; Kim and Mooney 2012; Artzi and Zettlemoyer 2013b]
Learning
- Structured perceptron
- A unified learning algorithm
- Supervised learning
- Weak supervision
Structured Perceptron
- Simple additive updates
  - Only requires efficient decoding (argmax)
  - Closely related to maxent and other feature-rich models
  - Provably finds a linear separator in a finite number of updates, if one exists
- Challenge: learning with hidden variables
Structured Perceptron

Data: {(x_i, y_i) : i = 1...n}
For t = 1...T:                                   [iterate epochs]
  For i = 1...n:                                 [iterate examples]
    y* ← argmax_y ⟨θ, φ(x_i, y)⟩                 [predict]
    If y* ≠ y_i:                                 [check]
      θ ← θ + φ(x_i, y_i) − φ(x_i, y*)           [update]

[Collins 2002]
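A runnable sketch of this loop (the feature function, toy data, and the explicit candidate set standing in for an efficient argmax decoder are all illustrative assumptions):

```python
from collections import defaultdict

def dot(theta, feats):
    return sum(theta[f] * v for f, v in feats.items())

def perceptron(data, feature_fn, candidates_fn, T=5):
    """Structured perceptron [Collins 2002]."""
    theta = defaultdict(float)
    for _ in range(T):                                # [iterate epochs]
        for x, y_gold in data:                        # [iterate examples]
            y_hat = max(candidates_fn(x),             # [predict]
                        key=lambda y: dot(theta, feature_fn(x, y)))
            if y_hat != y_gold:                       # [check]
                for f, v in feature_fn(x, y_gold).items():    # [update]
                    theta[f] += v
                for f, v in feature_fn(x, y_hat).items():
                    theta[f] -= v
    return theta

# Toy usage: word-label pair features over two labels.
data = [("flights to boston", "FLIGHT"), ("fare to denver", "FARE")]
feature_fn = lambda x, y: {(w, y): 1.0 for w in x.split()}
candidates_fn = lambda x: ["FLIGHT", "FARE"]
theta = perceptron(data, feature_fn, candidates_fn)
```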
One Derivation of the Perceptron

Log-linear model:
  p(y|x) = e^{w·f(x,y)} / Σ_{y'} e^{w·f(x,y')}

Step 1: Differentiate, to maximize data log-likelihood:
  update = Σ_i ( f(x_i, y_i) − E_{p(y|x_i)}[f(x_i, y)] )

Step 2: Use online, stochastic gradient updates, for example i:
  update_i = f(x_i, y_i) − E_{p(y|x_i)}[f(x_i, y)]

Step 3: Replace expectations with maxes (Viterbi approx.):
  update_i = f(x_i, y_i) − f(x_i, y*), where y* = argmax_y w·f(x_i, y)
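Filling in the differentiation step (a standard derivation, stated here for completeness): writing the per-example log-likelihood as w·f(x_i, y_i) minus the log normalizer,

```latex
\frac{\partial}{\partial w} \log p(y_i \mid x_i)
  = \frac{\partial}{\partial w} \left( w \cdot f(x_i, y_i)
      - \log \sum_{y'} e^{w \cdot f(x_i, y')} \right)
  = f(x_i, y_i) - \sum_{y'} p(y' \mid x_i) \, f(x_i, y')
  = f(x_i, y_i) - E_{p(y \mid x_i)}[f(x_i, y)]
```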
The Perceptron with Hidden Variables

Log-linear model:
  p(y|x) = Σ_h p(y, h|x),  where p(y, h|x) = e^{w·f(x,h,y)} / Σ_{y',h'} e^{w·f(x,h',y')}

Step 1: Differentiate the marginal, to maximize data log-likelihood:
  update = Σ_i ( E_{p(h|y_i,x_i)}[f(x_i, h, y_i)] − E_{p(y,h|x_i)}[f(x_i, h, y)] )

Step 2: Use online, stochastic gradient updates, for example i:
  update_i = E_{p(h|y_i,x_i)}[f(x_i, h, y_i)] − E_{p(y,h|x_i)}[f(x_i, h, y)]

Step 3: Replace expectations with maxes (Viterbi approx.):
  update_i = f(x_i, h', y_i) − f(x_i, h*, y*)
  where (y*, h*) = argmax_{y,h} w·f(x_i, h, y) and h' = argmax_h w·f(x_i, h, y_i)
Hidden Variable Perceptron

Data: {(x_i, y_i) : i = 1...n}
For t = 1...T:                                        [iterate epochs]
  For i = 1...n:                                      [iterate examples]
    (y*, h*) ← argmax_{y,h} ⟨θ, φ(x_i, h, y)⟩         [predict]
    If y* ≠ y_i:                                      [check]
      h' ← argmax_h ⟨θ, φ(x_i, h, y_i)⟩               [predict hidden]
      θ ← θ + φ(x_i, h', y_i) − φ(x_i, h*, y*)        [update]

[Liang et al. 2006; Zettlemoyer and Collins 2007]
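The same sketch extends to a latent variable h (again with placeholder candidate enumerators standing in for argmax decoding; in semantic parsing, y is the logical form and h the latent derivation):

```python
from collections import defaultdict

def dot(theta, feats):
    return sum(theta[f] * v for f, v in feats.items())

def hidden_variable_perceptron(data, feature_fn, y_cands, h_cands, T=5):
    """Hidden variable perceptron [Liang et al. 2006;
    Zettlemoyer and Collins 2007]."""
    theta = defaultdict(float)
    for _ in range(T):
        for x, y_gold in data:
            # [predict] best (y, h) pair under the current model
            y_hat, h_hat = max(
                ((y, h) for y in y_cands(x) for h in h_cands(x, y)),
                key=lambda yh: dot(theta, feature_fn(x, yh[1], yh[0])))
            if y_hat != y_gold:                       # [check]
                # [predict hidden] best h for the gold output
                h_good = max(h_cands(x, y_gold),
                             key=lambda h: dot(theta, feature_fn(x, h, y_gold)))
                for f, v in feature_fn(x, h_good, y_gold).items():  # [update]
                    theta[f] += v
                for f, v in feature_fn(x, h_hat, y_hat).items():
                    theta[f] -= v
    return theta
```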
Hidden Variable Perceptron
- No known convergence guarantees
  - Log-linear version is non-convex
- Simple and easy to implement
  - Works well with careful initialization
- Modifications for semantic parsing
  - Lots of different hidden information
  - Can add a margin constraint, do a probabilistic version, etc.
Learning Choices

Validation Function V : Y → {t, f}
- Indicates the correctness of a parse y
- Varying V allows for differing forms of supervision

Lexical Generation Procedure GENLEX(x, V; Λ, θ)
- Given: sentence x, validation function V, lexicon Λ, parameters θ
- Produces an overly general set of lexical entries
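Both supervision settings above fit this interface; a hedged sketch (the parse's logical_form accessor and the execute function are assumed placeholders):

```python
# Validation functions V : Y -> {t, f}; varying V changes the supervision.

def supervised_validation(labeled_lf):
    """Supervised: a parse is valid iff its logical form matches the label."""
    return lambda parse: parse.logical_form == labeled_lf

def answer_validation(execute, labeled_answer):
    """Weakly supervised: a parse is valid iff executing its logical form
    yields the labeled answer (e.g., a database query result)."""
    return lambda parse: execute(parse.logical_form) == labeled_answer
```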
A Unified Learning Algorithm

Initialize θ using Λ_0, Λ ← Λ_0
For t = 1...T, i = 1...n:
  Step 1: (Lexical generation)
    a. Set λ_G ← GENLEX(x_i, V_i; Λ, θ), Λ_G ← Λ ∪ λ_G
    b. Let Y be the k highest-scoring parses from GEN(x_i; Λ_G)
    c. Select lexical entries from the highest-scoring valid parses:
       λ_i ← ∪_{y ∈ maxV_i(Y; θ)} LEX(y)
    d. Update lexicon: Λ ← Λ ∪ λ_i
  Step 2: (Update parameters)
Output: Parameters θ and lexicon Λ
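A high-level Python skeleton of this loop (every component here, genlex, gen, score, lex, and update, is a placeholder passed in by the caller, not the tutorial's actual implementation):

```python
def unified_learning(data, genlex, gen, score, lex, update,
                     lexicon0, theta0, T=5, k=10):
    """Skeleton of the unified learning algorithm: alternate lexical
    generation with a parameter update, keeping only lexical entries
    that appear in the highest-scoring valid parses.
    data: list of (x_i, V_i) pairs, where V_i validates a parse."""
    lexicon, theta = set(lexicon0), dict(theta0)
    for _ in range(T):
        for x, validate in data:
            # Step 1: (Lexical generation)
            temp = lexicon | genlex(x, validate, lexicon, theta)       # (a)
            parses = sorted(gen(x, temp), key=lambda y: score(y, theta),
                            reverse=True)[:k]                          # (b)
            valid = [y for y in parses if validate(y)]
            if valid:
                top = max(score(y, theta) for y in valid)
                entries = set().union(*(lex(y) for y in valid
                                        if score(y, theta) == top))    # (c)
                lexicon |= entries                                     # (d)
            # Step 2: (Update parameters), e.g., perceptron-style
            theta = update(theta, x, validate, lexicon)
    return theta, lexicon
```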
Unification-based GENLEX(x, z; Λ, θ)

I want a flight to Boston ⊢ λx.flight(x) ∧ to(x, BOS)

1. Find the highest-scoring correct parse
2. Find the splits that most increase the score
3. Return new lexical entries

Iteration 2:
  I want a flight := S/(S\NP) : λf.λx.flight(x) ∧ f(x)
  to := (S\NP)/NP : λy.λx.to(x, y)
  Boston := NP : BOS
  to Boston ⇒ S\NP : λx.to(x, BOS)                             (>)
  I want a flight to Boston ⇒ S : λx.flight(x) ∧ to(x, BOS)    (>)
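A small sketch of the idea behind a split (purely illustrative; real unification-based GENLEX operates on typed lambda-calculus terms, not Python functions): splitting factors a lexical entry's logical form into a function and an argument whose application reproduces the original meaning.

```python
# Original entry for "to Boston": S\NP : λx.to(x, BOS)
to_pred = lambda x, y: (x, y) in {("AA101", "BOS")}   # toy denotation
original = lambda x: to_pred(x, "BOS")

# One candidate split:
#   "to"     := (S\NP)/NP : λy.λx.to(x, y)
#   "Boston" := NP : BOS
to_entry = lambda y: lambda x: to_pred(x, y)
boston_entry = "BOS"

# Forward application (>) of the split parts reconstructs the original:
recombined = to_entry(boston_entry)
assert recombined("AA101") == original("AA101")
```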