Language to Logical Form with Neural Attention
Mirella Lapata and Li Dong
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
mlap@inf.ed.ac.uk
Semantic Parsing
Natural Language (NL) -> Parser -> Machine Executable Language
Semantic Parsing: Querying a Database
NL: What are the capitals of states bordering Texas?
Parser -> DB
LF: λx. ∃y. capital(y, x) ∧ state(y) ∧ next_to(y, Texas)
Semantic Parsing: Instructing a Robot [1]
NL: at the chair, move forward three steps past the sofa
Parser: sentence -> logical form
LF: λa. pre(a, ιx.chair(x)) ∧ move(a) ∧ len(a, 3) ∧ dir(a, forward) ∧ past(a, ιy.sofa(y))
[1] Courtesy: Artzi et al., 2013
Semantic Parsing: Answering Questions with Freebase
NL: Who are the male actors in Titanic?
Parser + KB
LF: λx. ∃y. gender(male, x) ∧ cast(titanic, x, y)
Supervised Approaches
Induce parsers from sentences paired with logical forms
Question: Who are the male actors in Titanic?
Logical Form: λx. ∃y. gender(male, x) ∧ cast(titanic, x, y)
- Parsing (Ge and Mooney, 2005; Lu et al., 2008; Zhao and Huang, 2015)
- Inductive logic programming (Zelle and Mooney, 1996; Tang and Mooney, 2000; Thompson and Mooney, 2003)
- Machine translation (Wong and Mooney, 2006; Wong and Mooney, 2007; Andreas et al., 2013)
- CCG grammar induction (Zettlemoyer and Collins, 2005; Zettlemoyer and Collins, 2007; Kwiatkowski et al., 2010; Kwiatkowski et al., 2011)
Indirect Supervision
Induce parsers from questions paired with side information
Question: Who are the male actors in Titanic?
Logical Form (latent): λx. ∃y. gender(male, x) ∧ cast(titanic, x, y)
Answer: {DICAPRIO, BILLYZANE, ...}
- Answers to questions (Clarke et al., 2010; Liang et al., 2013)
- System demonstrations (Chen and Mooney, 2011; Goldwasser and Roth, 2011; Artzi and Zettlemoyer, 2013)
- Distant supervision (Cai and Yates, 2013; Reddy et al., 2014)
In General
Developing semantic parsers requires linguistic expertise!
- high-quality lexicons based on the underlying grammar formalism
- manually built templates based on the underlying grammar formalism
- grammar-based features pertaining to the logical form and the natural language
Domain- and representation-specific!
Our Goal: All-Purpose Semantic Parsing
Question: Who are the male actors in Titanic?
Logical Form: λx. ∃y. gender(male, x) ∧ cast(titanic, x, y)
- Learn from NL descriptions paired with meaning representations
- Use minimal domain (and grammar) knowledge
- Model is general and can be used across meaning representations
Problem Formulation
The model maps a natural language input q = x_1 ... x_|q| to a logical form representation of its meaning a = y_1 ... y_|a|:

p(a | q) = ∏_{t=1}^{|a|} p(y_t | y_<t, q), where y_<t = y_1 ... y_{t-1}

- Encoder: encodes the natural language input q into a vector representation
- Decoder: generates y_1, ..., y_|a| conditioned on the encoding vector
Vinyals et al. (2015a,b), Kalchbrenner and Blunsom (2013), Cho et al. (2014), Sutskever et al. (2014), Karpathy and Fei-Fei (2015)
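The factorization above can be made concrete with a toy sketch: given one logits row per decoding step (standing in for the decoder network, which is an assumption here), the sequence log-probability is the sum of per-step log-probabilities, i.e. the product of the conditionals.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logits vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sequence_log_prob(step_logits, target_ids):
    """log p(a|q) = sum_t log p(y_t | y_<t, q); one logits row per step."""
    total = 0.0
    for logits, y in zip(step_logits, target_ids):
        total += np.log(softmax(logits)[y])
    return total

# toy example: 3 decoding steps over a 4-token vocabulary
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4))
lp = sequence_log_prob(logits, [1, 0, 2])
```

Exponentiating the summed log-probabilities recovers the product ∏_t p(y_t | y_<t, q).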
Encoder-Decoder Framework
Input Utterance -> Sequence Encoder -> Attention Layer -> Sequence/Tree Decoder -> Logical Form
what microsoft jobs do not require a bscs? -> answer(j,(company(j,'microsoft'),job(j),not((req_deg(j,'bscs')))))
SEQ2SEQ Model
Encoder and decoder are L-layer recurrent neural networks with LSTM units:
h_t^l = LSTM(h_{t-1}^l, h_t^{l-1})
Encoder input: h_t^0 = W_q e(x_t)
Decoder input: h_t^0 = W_a e(y_{t-1})
p(y_t | y_<t, q) = softmax(W_o h_t^L)^T e(y_t)
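The layered recurrence h_t^l = f(h_{t-1}^l, h_t^{l-1}) can be sketched with a plain tanh cell standing in for the LSTM unit (a simplification; the weight shapes and cell are illustrative, not the talk's exact parameterization):

```python
import numpy as np

def stacked_rnn_step(h_prev, x_t, Ws, Us):
    """One time step of an L-layer recurrence: layer l combines its own
    previous state h_{t-1}^l with the output of the layer below h_t^{l-1}."""
    h_t = []
    inp = x_t  # layer 0 input is the (embedded) token, W_q e(x_t) or W_a e(y_{t-1})
    for l, (W, U) in enumerate(zip(Ws, Us)):
        h = np.tanh(W @ inp + U @ h_prev[l])
        h_t.append(h)
        inp = h  # layer l feeds layer l+1
    return h_t
```

Running this over the input tokens yields the encoder states; the decoder uses the same recurrence over previously generated tokens.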
SEQ2TREE Model
(lambda $0 e (and (> (departure_time $0) 1600:ti) (from $0 dallas:ci)))

The logical form is generated top-down as a tree:
root -> lambda $0 e <n>
<n> -> and <n> <n>
first <n> -> > <n> 1600:ti ; second <n> -> from $0 dallas:ci
<n> under > -> departure_time $0
SEQ2TREE Model
Decoding proceeds level by level; each nonterminal <n> triggers a new sequence-decoding call:
lambda $0 e <n> </s>
and <n> <n> </s>
> <n> 1600:ti </s>   from $0 dallas:ci </s>
departure_time $0 </s>
[Figure legend: <n> nonterminal; start decoding; parent feeding; encoder unit; decoder unit]
Parent-Feeding Connections
A SEQ2TREE decoding example for the logical form "A B (C)":
steps t_1..t_4 generate y_1=A, y_2=B, y_3=<n>, y_4=</s>; steps t_5, t_6 generate y_5=C, y_6=</s>
The hidden vector of the parent nonterminal is concatenated with the inputs and fed to the decoder.
p(a | q) = p(y_1 y_2 y_3 y_4 | q) p(y_5 y_6 | y_≤3, q)
Attention Mechanism
- Attention scores are computed from the current decoder hidden vector and all hidden vectors of the encoder
- The encoder-side context vector c_t is obtained as a weighted sum of the encoder hidden vectors, and is further used to predict y_t
Bahdanau et al. (2015), Luong et al. (2015b), Xu et al. (2015)
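The two bullets above amount to a score-normalize-sum computation; a minimal sketch with dot-product scoring (an illustrative choice of scoring function, not necessarily the one used in the talk):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a score vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_context(dec_h, enc_hs):
    """Score the decoder state against every encoder state, normalize
    the scores into weights, and return the weighted sum c_t."""
    scores = enc_hs @ dec_h      # one score per encoder position
    weights = softmax(scores)    # attention distribution over the input
    c_t = weights @ enc_hs       # context vector used to predict y_t
    return c_t, weights
```

The weights form a distribution over input positions, so c_t emphasizes the encoder states most relevant to the current decoding step.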
Training and Inference
Our goal is to maximize the likelihood of the generated logical forms given natural language utterances as input:

minimize −∑_{(q,a)∈D} log p(a | q)

where D is the set of all natural language-logical form training pairs.
At test time, we predict the logical form for an input utterance q by:

â = argmax_{a'} p(a' | q)

- Iterating over all possible logical forms to obtain the optimal prediction is impractical
- p(a | q) is decomposed so that we can use greedy search or beam search
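Because p(a | q) factorizes over time steps, greedy search simply takes the most likely token at each step. A sketch, where `step_logits_fn` is a hypothetical stand-in for the trained decoder:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logits vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def greedy_decode(step_logits_fn, start_id, end_id, max_len=20):
    """Greedy approximation of argmax_a p(a|q): at each step pick the
    highest-probability token instead of searching over all sequences."""
    y, out = start_id, []
    for _ in range(max_len):
        probs = softmax(step_logits_fn(y, out))
        y = int(np.argmax(probs))
        if y == end_id:
            break
        out.append(y)
    return out
```

Beam search generalizes this by keeping the k best partial sequences at each step instead of only one.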
Argument Identification
- Many logical forms contain named entities and numbers, i.e., rare words. It does not make sense to replace them with a special unknown-word symbol.
- Instead, identify entities and numbers in input questions and replace them with their type names and unique IDs:
  jobs with a salary of 40000 -> jobs with a salary of num0
  job(ans), salary_greater_than(ans, 40000, year) -> job(ans), salary_greater_than(ans, num0, year)
- Pre-processed examples are used as training data.
- After decoding, a post-processing step recovers all type_i markers to the corresponding logical constants.
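The pre- and post-processing round trip can be sketched as follows; matching numbers with a digit regex is a simplification of the identification step described in the talk (which also handles named entities):

```python
import re

def preprocess(question, logical_form):
    """Replace each number with a typed placeholder (num0, num1, ...) in
    both the question and its logical form; keep the mapping around so the
    constants can be restored after decoding."""
    mapping = {}
    def repl(match):
        key = "num%d" % len(mapping)
        mapping[key] = match.group(0)
        return key
    q = re.sub(r"\d+", repl, question)
    lf = logical_form
    for key, value in mapping.items():
        lf = lf.replace(value, key)
    return q, lf, mapping

def postprocess(decoded_lf, mapping):
    """Recover the original constants from the placeholder markers."""
    for key, value in mapping.items():
        decoded_lf = decoded_lf.replace(key, value)
    return decoded_lf
```

Training pairs are built from the placeholder versions, and `postprocess` is applied to the decoder's output.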
Semantic Parsing Datasets

JOBS (avg. NL length 9.80, avg. LF length 22.90)
  what microsoft jobs do not require a bscs?
  answer(company(j,'microsoft'),job(j),not((req_deg(j,'bscs'))))
GEO (avg. NL length 7.60, avg. LF length 19.10)
  what is the population of the state with the largest area?
  (population:i (argmax $0 (state:t $0) (area:i $0)))
ATIS (avg. NL length 11.10, avg. LF length 28.10)
  dallas to san francisco leaving after 4 in the afternoon please
  (lambda $0 e (and (> (departure_time $0) 1600:ti) (from $0 dallas:ci) (to $0 san_francisco:ci)))

- JOBS: queries to job listings; 500 training and 140 test instances
- GEO: queries to a US geography database; 680 training and 200 test instances
- ATIS: queries to a flight booking system; 4,480 training, 480 development, and 450 test instances
Semantic Parsing Datasets

IFTTT (avg. NL length 6.95, avg. LF length 21.80)
  turn on heater when temperature drops below 58 degree
  TRIGGER: Weather - Current_temperature_drops_below - ((Temperature (58)) (Degrees_in (f)))
  ACTION: WeMo_Insight_Switch - Turn_on - ((Which_switch? ("")))

- if-this-then-that recipes from the IFTTT website (Quirk et al., 2015)
- Recipes are simple programs with exactly one trigger and one action
- 77,495 training, 5,171 development, and 4,294 test examples
- IFTTT programs are represented as abstract syntax trees; NL descriptions are provided by users
Evaluation Results on JOBS
[Bar chart: accuracy (%) of COCKTAIL01, PRECISE03, ZC05, DCS+L13, TISP15, and SEQ2SEQ / SEQ2TREE (with and without attention and argument identification); reported values range from 71% to 91%]
Evaluation Results on GEO
[Bar chart: accuracy (%) of SCISSOR05, KRISP06, WASP06, λ-WASP06, LNLZ08, ZC05, ZC07, UBL10, FUBL11, KCAZ13, DCS+L13, TISP15, and SEQ2SEQ / SEQ2TREE variants; reported values range from 69% to 89%]
Evaluation Results on ATIS
[Bar chart: accuracy (%) of ZC07, UBL10, FUBL11, GUSP-FULL13, GUSP++13, TISP15, and SEQ2SEQ / SEQ2TREE variants; reported values range from 71% to 85%]
Results on IFTTT
[Bar chart (a), omit non-English: accuracy (%) of the retrieval, phrasal, sync, classifier, and posclass baselines and SEQ2SEQ / SEQ2TREE variants; reported values range from 35% to 50%]
Results on IFTTT
[Bar chart (b), omit non-English & unintelligible: accuracy (%) of the same baselines and SEQ2SEQ / SEQ2TREE variants; reported values range from 38% to 60%]
Results on IFTTT
[Bar chart (c), ≥3 turkers agree with gold: accuracy (%) of the same baselines and SEQ2SEQ / SEQ2TREE variants; reported values range from 43% to 74%]
Example Output: SEQ2SEQ
which jobs pay num0 that do not require a degid0
-> job(ANS), salary_greater_than(ANS, num0, year), \+((req_deg(ANS, degid0)))
what's first class fare round trip from ci0 to ci1
-> (lambda $0 e (exists $1 (and (round_trip $1) (class_type $1 first:cl) (from $1 ci0) (to $1 ci1) (= (fare $1) $0))))
[Attention alignment matrices between input words and output tokens]
Example Output: SEQ2TREE
what is the earliest flight from ci0 to ci1 tomorrow
-> (argmin $0 (and (flight $0) (from $0 ci0) (to $0 ci1) (tomorrow $0)) (departure_time $0))
what is the highest elevation in the co0
-> (argmax $0 (and (place:t $0) (loc:t $0 co0)) (elevation:i $0))
[Attention alignment matrices between input words and output tokens]
What Have We Learned?
- An encoder-decoder neural network model for mapping natural language descriptions to meaning representations
- SEQ2TREE and attention improve performance
- The model is general and could transfer to other tasks/architectures
- Future work: learn a model from question-answer pairs without access to target logical forms
Summary
For more details see Dong and Lapata (ACL, 2016)
Data and code: http://homepages.inf.ed.ac.uk/mlap/index.php?page=code
Training Details
- Arguments were identified with plain string matching
- Parameters were updated with the RMSProp algorithm (batch size set to 20)
- Gradients were clipped, dropout was applied, and parameters were randomly initialized from a uniform distribution
- A two-layer LSTM was used for IFTTT
- Dimensions of the hidden vector and word embedding were chosen from {150, 200, 250}
- Early stopping was used to determine the number of epochs
- Input sentences were reversed (Sutskever et al., 2014)
- All experiments were conducted on a single GTX 980 GPU
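Two of the ingredients above, gradient clipping and the RMSProp update, can be sketched in a few lines; the clipping threshold and hyperparameter values here are illustrative, as the talk does not state the exact settings:

```python
import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    """Rescale the gradient if its L2 norm exceeds max_norm, leaving its
    direction unchanged (the threshold value is an assumption)."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

def rmsprop_update(param, grad, cache, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp step: divide the learning rate by the root of a running
    average of squared gradients, so frequently large gradients are damped."""
    cache = decay * cache + (1 - decay) * grad ** 2
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache
```

In a training loop, each minibatch gradient would be clipped first and then fed to the RMSProp update.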
Decoding for SEQ2TREE

Algorithm 1: Decoding for SEQ2TREE
Input: q: natural language utterance
Output: â: decoding result
 1: // Push the encoding result to a queue
 2: Q.init({hid : SeqEnc(q)})
 3: // Decode until no more nonterminals
 4: while (c <- Q.pop()) do
 5:     // Call the sequence decoder
 6:     c.child <- SeqDec(c.hid)
 7:     // Push new nonterminals onto the queue
 8:     for each nonterminal n in c.child do
 9:         Q.push({hid : HidVec(n)})
10: â <- convert decoding tree to output sequence
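Algorithm 1 can be sketched as a breadth-first loop over a queue of nonterminals. Here `seq_dec` is a hypothetical stand-in for the trained sequence decoder, and hidden states are plain handles rather than real vectors:

```python
from collections import deque

NONTERMINAL = "<n>"

def seq2tree_decode(seq_dec, enc_hid):
    """Breadth-first SEQ2TREE decoding: pop a node, run the sequence
    decoder once for its children, and enqueue any new nonterminals."""
    root = {"hid": enc_hid, "tok": None, "children": []}
    queue = deque([root])
    while queue:  # decode until no more nonterminals
        node = queue.popleft()
        for tok in seq_dec(node["hid"]):  # one sequence-decoder call
            child = {"hid": (node["hid"], len(node["children"])),
                     "tok": tok, "children": []}
            node["children"].append(child)
            if tok == NONTERMINAL:  # expand this subtree later
                queue.append(child)
    return root

def linearize(node):
    """Convert the decoding tree back to an output token sequence,
    bracketing each expanded nonterminal."""
    out = []
    for child in node["children"]:
        if child["tok"] == NONTERMINAL:
            out.append("(")
            out.extend(linearize(child))
            out.append(")")
        else:
            out.append(child["tok"])
    return out
```

A fake decoder keyed on the hidden-state handles is enough to exercise the control flow without any trained model.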