Semantic Parsing with Combinatory Categorial Grammars
Yoav Artzi, Nicholas FitzGerald and Luke Zettlemoyer
University of Washington
ACL 2013 Tutorial, Sofia, Bulgaria
Learning

Data → Learning Algorithm → CCG
- What kind of data/supervision can we use?
- What do we need to learn?
Supervised Data

show me := S/N : λf.f
flights := N : λx.flight(x)
to := PP/NP : λy.λx.to(x, y)
Boston := NP : BOSTON

to Boston ⇒ PP : λx.to(x, BOSTON)                              (>)
PP ⇒ N\N : λf.λx.f(x) ∧ to(x, BOSTON)                          (type shift)
flights to Boston ⇒ N : λx.flight(x) ∧ to(x, BOSTON)           (<)
show me flights to Boston ⇒ S : λx.flight(x) ∧ to(x, BOSTON)   (>)
Supervised Data

show me flights to Boston ⊢ S : λx.flight(x) ∧ to(x, BOSTON)

The full derivation above (the lexical entries and the sequence of combinator applications) is latent: only the sentence and its final logical form are labeled.
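To make the composition concrete, here is a minimal Python sketch of the derivation above (not from the tutorial; the toy model and entity names are illustrative assumptions). Each lexical entry is a function over a tiny domain, and the forward (>) and backward (<) combinators reduce to ordinary function application:

```python
# Toy model: two flights, one of which goes to Boston (illustrative).
FLIGHTS = {"AA101", "UA202"}
TO = {("AA101", "BOSTON"), ("UA202", "DENVER")}

flight = lambda x: x in FLIGHTS             # flights := N : λx.flight(x)
to = lambda y: lambda x: (x, y) in TO       # to := PP/NP : λy.λx.to(x, y)
show_me = lambda f: f                       # show me := S/N : λf.f
boston = "BOSTON"                           # Boston := NP : BOSTON

pp = to(boston)                             # > : PP : λx.to(x, BOSTON)
n_mod = lambda f: lambda x: f(x) and pp(x)  # type shift PP => N\N
n = n_mod(flight)                           # < : N : λx.flight(x) ∧ to(x, BOSTON)
s = show_me(n)                              # > : S : λx.flight(x) ∧ to(x, BOSTON)

print([x for x in FLIGHTS if s(x)])         # ['AA101']
```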
Supervised Data

Supervised learning is done from pairs of sentences and logical forms:

Show me flights to Boston
  λx.flight(x) ∧ to(x, BOSTON)

I need a flight from baltimore to seattle
  λx.flight(x) ∧ from(x, BALTIMORE) ∧ to(x, SEATTLE)

what ground transportation is available in san francisco
  λx.ground_transport(x) ∧ to_city(x, SF)

[Zettlemoyer and Collins 2005; 2007]
Weak Supervision
- Logical form is latent
- Labeling requires less expertise
- Labels don't uniquely determine correct logical forms
- Learning requires executing logical forms within a system and evaluating the result
Weak Supervision: Learning from Query Answers

What is the largest state that borders Texas? → New Mexico

[Clarke et al. 2010; Liang et al. 2011]
Weak Supervision: Learning from Query Answers

What is the largest state that borders Texas? → New Mexico

Candidate logical forms:
  argmax(λx.state(x) ∧ border(x, TX), λy.size(y))
  argmax(λx.river(x) ∧ in(x, TX), λy.size(y))

[Clarke et al. 2010; Liang et al. 2011]
Weak Supervision: Learning from Query Answers

What is the largest state that borders Texas? → New Mexico

  argmax(λx.state(x) ∧ border(x, TX), λy.size(y)) → New Mexico ✓
  argmax(λx.river(x) ∧ in(x, TX), λy.size(y)) → Rio Grande ✗

Executing each candidate and comparing the result against the labeled answer identifies which logical forms are correct.

[Clarke et al. 2010; Liang et al. 2011]
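A minimal sketch of this answer-driven validation (the toy database, executor, and names are assumptions for illustration): a candidate logical form is kept only if executing it yields the labeled answer.

```python
# Toy geography database (illustrative; areas in square miles).
STATES = {"New Mexico": 121590, "Oklahoma": 69898}
BORDERS_TX = {"New Mexico", "Oklahoma"}

def execute(logical_form):
    # Here logical forms are Python callables that query the database.
    return logical_form()

def validate(logical_form, labeled_answer):
    # Weak supervision: a parse is "correct" iff its answer matches.
    return execute(logical_form) == labeled_answer

# argmax(λx.state(x) ∧ border(x, TX), λy.size(y)) as a callable:
candidate = lambda: max(BORDERS_TX, key=lambda s: STATES[s])
print(validate(candidate, "New Mexico"))  # True
```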
Weak Supervision: Learning from Demonstrations

at the chair, move forward three steps past the sofa

[Chen and Mooney 2011; Kim and Mooney 2012; Artzi and Zettlemoyer 2013b]
Weak Supervision: Learning from Demonstrations

at the chair, move forward three steps past the sofa

Some examples from other domains:
- Sentences and labeled game states [Goldwasser and Roth 2011]
- Sentences and sets of physical objects [Matuszek et al. 2012]

[Chen and Mooney 2011; Kim and Mooney 2012; Artzi and Zettlemoyer 2013b]
Learning
- Structured perceptron
- A unified learning algorithm
- Supervised learning
- Weak supervision
Structured Perceptron
- Simple additive updates
  - Only requires efficient decoding (argmax)
  - Closely related to maxent and other feature-rich models
  - Provably finds a linear separator in a finite number of updates, if one exists
- Challenge: learning with hidden variables
Structured Perceptron

Data: {(x_i, y_i) : i = 1...n}
For t = 1...T:                                   [iterate epochs]
  For i = 1...n:                                 [iterate examples]
    y* ← argmax_y ⟨θ, φ(x_i, y)⟩                 [predict]
    If y* ≠ y_i:                                 [check]
      θ ← θ + φ(x_i, y_i) − φ(x_i, y*)           [update]

[Collins 2002]
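A runnable sketch of this loop (the feature function, toy data, and the explicit candidate set standing in for an efficient argmax decoder are all illustrative assumptions):

```python
from collections import defaultdict

def dot(theta, feats):
    return sum(theta[f] * v for f, v in feats.items())

def perceptron(data, feature_fn, candidates_fn, T=5):
    """Structured perceptron [Collins 2002]."""
    theta = defaultdict(float)
    for _ in range(T):                                # [iterate epochs]
        for x, y_gold in data:                        # [iterate examples]
            y_hat = max(candidates_fn(x),             # [predict]
                        key=lambda y: dot(theta, feature_fn(x, y)))
            if y_hat != y_gold:                       # [check]
                for f, v in feature_fn(x, y_gold).items():    # [update]
                    theta[f] += v
                for f, v in feature_fn(x, y_hat).items():
                    theta[f] -= v
    return theta

# Toy usage: word-label pair features over two labels.
data = [("flights to boston", "FLIGHT"), ("fare to denver", "FARE")]
feature_fn = lambda x, y: {(w, y): 1.0 for w in x.split()}
candidates_fn = lambda x: ["FLIGHT", "FARE"]
theta = perceptron(data, feature_fn, candidates_fn)
```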
One Derivation of the Perceptron

Log-linear model:
  p(y|x) = e^{w·f(x,y)} / Σ_{y'} e^{w·f(x,y')}

Step 1: Differentiate, to maximize data log-likelihood:
  update = Σ_i ( f(x_i, y_i) − E_{p(y|x_i)}[f(x_i, y)] )

Step 2: Use online, stochastic gradient updates, for example i:
  update_i = f(x_i, y_i) − E_{p(y|x_i)}[f(x_i, y)]

Step 3: Replace expectations with maxes (Viterbi approx.):
  update_i = f(x_i, y_i) − f(x_i, y*), where y* = argmax_y w·f(x_i, y)
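Filling in the differentiation step (a standard derivation, stated here for completeness): writing the per-example log-likelihood as w·f(x_i, y_i) minus the log normalizer,

```latex
\frac{\partial}{\partial w} \log p(y_i \mid x_i)
  = \frac{\partial}{\partial w} \left( w \cdot f(x_i, y_i)
      - \log \sum_{y'} e^{w \cdot f(x_i, y')} \right)
  = f(x_i, y_i) - \sum_{y'} p(y' \mid x_i) \, f(x_i, y')
  = f(x_i, y_i) - E_{p(y \mid x_i)}[f(x_i, y)]
```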
The Perceptron with Hidden Variables

Log-linear model:
  p(y|x) = Σ_h p(y, h|x),  where p(y, h|x) = e^{w·f(x,h,y)} / Σ_{y',h'} e^{w·f(x,h',y')}

Step 1: Differentiate the marginal, to maximize data log-likelihood:
  update = Σ_i ( E_{p(h|y_i,x_i)}[f(x_i, h, y_i)] − E_{p(y,h|x_i)}[f(x_i, h, y)] )

Step 2: Use online, stochastic gradient updates, for example i:
  update_i = E_{p(h|y_i,x_i)}[f(x_i, h, y_i)] − E_{p(y,h|x_i)}[f(x_i, h, y)]

Step 3: Replace expectations with maxes (Viterbi approx.):
  update_i = f(x_i, h', y_i) − f(x_i, h*, y*)
  where (y*, h*) = argmax_{y,h} w·f(x_i, h, y) and h' = argmax_h w·f(x_i, h, y_i)
Hidden Variable Perceptron

Data: {(x_i, y_i) : i = 1...n}
For t = 1...T:                                        [iterate epochs]
  For i = 1...n:                                      [iterate examples]
    (y*, h*) ← argmax_{y,h} ⟨θ, φ(x_i, h, y)⟩         [predict]
    If y* ≠ y_i:                                      [check]
      h' ← argmax_h ⟨θ, φ(x_i, h, y_i)⟩               [predict hidden]
      θ ← θ + φ(x_i, h', y_i) − φ(x_i, h*, y*)        [update]

[Liang et al. 2006; Zettlemoyer and Collins 2007]
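The same sketch extends to a latent variable h (again with placeholder candidate enumerators standing in for argmax decoding; in semantic parsing, y is the logical form and h the latent derivation):

```python
from collections import defaultdict

def dot(theta, feats):
    return sum(theta[f] * v for f, v in feats.items())

def hidden_variable_perceptron(data, feature_fn, y_cands, h_cands, T=5):
    """Hidden variable perceptron [Liang et al. 2006;
    Zettlemoyer and Collins 2007]."""
    theta = defaultdict(float)
    for _ in range(T):
        for x, y_gold in data:
            # [predict] best (y, h) pair under the current model
            y_hat, h_hat = max(
                ((y, h) for y in y_cands(x) for h in h_cands(x, y)),
                key=lambda yh: dot(theta, feature_fn(x, yh[1], yh[0])))
            if y_hat != y_gold:                       # [check]
                # [predict hidden] best h for the gold output
                h_good = max(h_cands(x, y_gold),
                             key=lambda h: dot(theta, feature_fn(x, h, y_gold)))
                for f, v in feature_fn(x, h_good, y_gold).items():  # [update]
                    theta[f] += v
                for f, v in feature_fn(x, h_hat, y_hat).items():
                    theta[f] -= v
    return theta
```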
Hidden Variable Perceptron
- No known convergence guarantees
  - Log-linear version is non-convex
- Simple and easy to implement
  - Works well with careful initialization
- Modifications for semantic parsing
  - Lots of different hidden information
  - Can add a margin constraint, do a probabilistic version, etc.
Learning Choices

Validation Function V : Y → {t, f}
- Indicates the correctness of a parse y
- Varying V allows for differing forms of supervision

Lexical Generation Procedure GENLEX(x, V; Λ, θ)
- Given: sentence x, validation function V, lexicon Λ, parameters θ
- Produces an overly general set of lexical entries
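Both supervision settings above fit this interface; a hedged sketch (the parse's logical_form accessor and the execute function are assumed placeholders):

```python
# Validation functions V : Y -> {t, f}; varying V changes the supervision.

def supervised_validation(labeled_lf):
    """Supervised: a parse is valid iff its logical form matches the label."""
    return lambda parse: parse.logical_form == labeled_lf

def answer_validation(execute, labeled_answer):
    """Weakly supervised: a parse is valid iff executing its logical form
    yields the labeled answer (e.g., a database query result)."""
    return lambda parse: execute(parse.logical_form) == labeled_answer
```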
A Unified Learning Algorithm

Initialize θ using Λ_0, Λ ← Λ_0
For t = 1...T, i = 1...n:
  Step 1: (Lexical generation)
    a. Set λ_G ← GENLEX(x_i, V_i; Λ, θ), Λ_G ← Λ ∪ λ_G
    b. Let Y be the k highest-scoring parses from GEN(x_i; Λ_G)
    c. Select lexical entries from the highest-scoring valid parses:
       λ_i ← ∪_{y ∈ maxV_i(Y; θ)} LEX(y)
    d. Update lexicon: Λ ← Λ ∪ λ_i
  Step 2: (Update parameters)
Output: Parameters θ and lexicon Λ
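A high-level Python skeleton of this loop (every component here, genlex, gen, score, lex, and update, is a placeholder passed in by the caller, not the tutorial's actual implementation):

```python
def unified_learning(data, genlex, gen, score, lex, update,
                     lexicon0, theta0, T=5, k=10):
    """Skeleton of the unified learning algorithm: alternate lexical
    generation with a parameter update, keeping only lexical entries
    that appear in the highest-scoring valid parses.
    data: list of (x_i, V_i) pairs, where V_i validates a parse."""
    lexicon, theta = set(lexicon0), dict(theta0)
    for _ in range(T):
        for x, validate in data:
            # Step 1: (Lexical generation)
            temp = lexicon | genlex(x, validate, lexicon, theta)       # (a)
            parses = sorted(gen(x, temp), key=lambda y: score(y, theta),
                            reverse=True)[:k]                          # (b)
            valid = [y for y in parses if validate(y)]
            if valid:
                top = max(score(y, theta) for y in valid)
                entries = set().union(*(lex(y) for y in valid
                                        if score(y, theta) == top))    # (c)
                lexicon |= entries                                     # (d)
            # Step 2: (Update parameters), e.g., perceptron-style
            theta = update(theta, x, validate, lexicon)
    return theta, lexicon
```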
Unification-based GENLEX(x, z; Λ, θ)

I want a flight to Boston ⊢ λx.flight(x) ∧ to(x, BOS)

1. Find the highest-scoring correct parse
2. Find the splits that most increase the score
3. Return new lexical entries

Iteration 2:
  I want a flight := S/(S\NP) : λf.λx.flight(x) ∧ f(x)
  to := (S\NP)/NP : λy.λx.to(x, y)
  Boston := NP : BOS
  to Boston ⇒ S\NP : λx.to(x, BOS)                             (>)
  I want a flight to Boston ⇒ S : λx.flight(x) ∧ to(x, BOS)    (>)
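A small sketch of the idea behind a split (purely illustrative; real unification-based GENLEX operates on typed lambda-calculus terms, not Python functions): splitting factors a lexical entry's logical form into a function and an argument whose application reproduces the original meaning.

```python
# Original entry for "to Boston": S\NP : λx.to(x, BOS)
to_pred = lambda x, y: (x, y) in {("AA101", "BOS")}   # toy denotation
original = lambda x: to_pred(x, "BOS")

# One candidate split:
#   "to"     := (S\NP)/NP : λy.λx.to(x, y)
#   "Boston" := NP : BOS
to_entry = lambda y: lambda x: to_pred(x, y)
boston_entry = "BOS"

# Forward application (>) of the split parts reconstructs the original:
recombined = to_entry(boston_entry)
assert recombined("AA101") == original("AA101")
```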