Language to Logical Form with Neural Attention

Language to Logical Form with Neural Attention Mirella Lapata and Li Dong Institute for Language, Cognition and Computation School of Informatics University of Edinburgh mlap@inf.ed.ac.uk 1 / 32

Semantic Parsing
[Diagram: Natural Language (NL) → Parser → Machine Executable Language] 2 / 32

Semantic Parsing: Querying a Database
NL: What are the capitals of states bordering Texas?
Parser → DB query: λx. ∃y. capital(y, x) ∧ state(y) ∧ next_to(y, Texas) 3 / 32

Semantic Parsing: Instructing a Robot
NL: at the chair, move forward three steps past the sofa
LF: λa. pre(a, ιx.chair(x)) ∧ move(a) ∧ len(a, 3) ∧ dir(a, forward) ∧ past(a, ιy.sofa(y))
Parser: sentence → logical form (Courtesy: Artzi et al., 2013) 4 / 32

Semantic Parsing: Answering Questions with Freebase
NL: Who are the male actors in Titanic?
Parser + KB: λx. ∃y. gender(male, x) ∧ cast(titanic, x, y)
[Screenshot: Google search results and knowledge panel for "titanic 1997", showing the 1997 film and its cast] 5 / 32

Supervised Approaches Induce parsers from sentences paired with logical forms
Question: Who are the male actors in Titanic?
Logical Form: λx. ∃y. gender(male, x) ∧ cast(titanic, x, y)
Parsing (Ge and Mooney, 2005; Lu et al., 2008; Zhao and Huang, 2015)
Inductive logic programming (Zelle and Mooney, 1996; Tang and Mooney, 2000; Thompson and Mooney, 2003)
Machine translation (Wong and Mooney, 2006; Wong and Mooney, 2007; Andreas et al., 2013)
CCG grammar induction (Zettlemoyer and Collins, 2005; Zettlemoyer and Collins, 2007; Kwiatkowski et al., 2010; Kwiatkowski et al., 2011) 6 / 32

Indirect Supervision Induce parsers from questions paired with side information
Question: Who are the male actors in Titanic?
Logical Form (latent): λx. ∃y. gender(male, x) ∧ cast(titanic, x, y)
Answer: {DICAPRIO, BILLYZANE, ...}
Answers to questions (Clarke et al., 2010; Liang et al., 2013)
System demonstrations (Chen and Mooney, 2011; Goldwasser and Roth, 2011; Artzi and Zettlemoyer, 2013)
Distant supervision (Cai and Yates, 2013; Reddy et al., 2014) 7 / 32

In General Developing semantic parsers requires linguistic expertise:
high-quality lexicons based on the underlying grammar formalism
manually built templates based on the underlying grammar formalism
grammar-based features pertaining to the logical form and the natural language
Domain- and representation-specific! 8 / 32

Our Goal: All Purpose Semantic Parsing
Question: Who are the male actors in Titanic?
Logical Form: λx. ∃y. gender(male, x) ∧ cast(titanic, x, y)
Learn from NL descriptions paired with meaning representations
Use minimal domain (and grammar) knowledge
Model is general and can be used across meaning representations 9 / 32

Problem Formulation
The model maps a natural language input q = x_1 … x_{|q|} to a logical form representation of its meaning a = y_1 … y_{|a|}.
p(a | q) = ∏_{t=1}^{|a|} p(y_t | y_{<t}, q), where y_{<t} = y_1 … y_{t−1}
Encoder: encodes the natural language input q into a vector representation
Decoder: generates y_1, …, y_{|a|} conditioned on the encoding vector
Vinyals et al. (2015a,b), Kalchbrenner and Blunsom (2013), Cho et al. (2014), Sutskever et al. (2014), Karpathy and Fei-Fei (2015) 10 / 32

Encoder-Decoder Framework
[Diagram: Input Utterance → Sequence Encoder → Attention Layer → Sequence/Tree Decoder → Logical Form]
Example: "what microsoft jobs do not require a bscs?" → answer(j,(company(j,'microsoft'),job(j),not((req_deg(j,'bscs'))))) 11 / 32

SEQ2SEQ Model
h_t^l = LSTM(h_{t−1}^l, h_t^{l−1})
Encoder input: h_t^0 = W_q e(x_t)
Decoder input: h_t^0 = W_a e(y_{t−1})
p(y_t | y_{<t}, q) = softmax(W_o h_t^L)^⊤ e(y_t) 12 / 32
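
A minimal sketch of such a stacked LSTM encoder-decoder in PyTorch (the class name Seq2SeqParser, the layer sizes, and the use of a linear output layer in place of the softmax(W_o h_t^L)^⊤ e(y_t) scoring are assumptions for illustration, not the authors' released code):

# Minimal sketch of the SEQ2SEQ parser described above (assumed names and sizes).
import torch
import torch.nn as nn

class Seq2SeqParser(nn.Module):
    def __init__(self, vocab_q, vocab_a, dim=200, layers=1):
        super().__init__()
        self.embed_q = nn.Embedding(vocab_q, dim)   # W_q e(x_t)
        self.embed_a = nn.Embedding(vocab_a, dim)   # W_a e(y_{t-1})
        self.encoder = nn.LSTM(dim, dim, num_layers=layers, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, num_layers=layers, batch_first=True)
        self.out = nn.Linear(dim, vocab_a)          # scores over logical-form tokens

    def forward(self, q_tokens, a_tokens):
        # a_tokens is assumed to start with <s> and end with </s>.
        # Encode the utterance; the final hidden state initialises the decoder.
        _, state = self.encoder(self.embed_q(q_tokens))
        # Teacher forcing: feed y_{t-1} at step t.
        dec_out, _ = self.decoder(self.embed_a(a_tokens[:, :-1]), state)
        logits = self.out(dec_out)
        # Cross-entropy = negative log-likelihood of p(y_t | y_{<t}, q).
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            a_tokens[:, 1:].reshape(-1))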

SEQ2TREE MODEL
Logical form: (lambda $0 e (and (> (departure_time $0) 1600:ti) (from $0 dallas:ci)))
[Tree: root → lambda $0 e <n>; that <n> → and <n> <n>; the first <n> → > <n> 1600:ti, whose <n> → departure_time $0; the second <n> → from $0 dallas:ci] 13 / 32

SEQ2TREE MODEL
[Diagram: hierarchical tree decoding. Level 1: lambda $0 e <n> </s>; the <n> expands to: and <n> <n> </s>; these expand to: > <n> 1600:ti </s> and from $0 dallas:ci </s>; the remaining <n> expands to: departure_time $0 </s>. Legend: <n> nonterminal, start decoding, parent feeding, encoder unit, decoder unit] 14 / 32
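
One way the per-level token sequences above could be produced from a logical form, as a hedged Python sketch (the function names tokenize/parse/linearize and the breadth-first ordering are assumptions for illustration; the paper's preprocessing may differ in detail):

# Sketch: split an s-expression into SEQ2TREE-style subtree token sequences.
def tokenize(lf):
    return lf.replace('(', ' ( ').replace(')', ' ) ').split()

def parse(tokens):
    # Recursive-descent parse into nested Python lists.
    tok = tokens.pop(0)
    if tok == '(':
        node = []
        while tokens[0] != ')':
            node.append(parse(tokens))
        tokens.pop(0)  # drop ')'
        return node
    return tok

def linearize(tree):
    # Breadth-first: each nested subtree becomes a nonterminal <n>; its own
    # tokens are emitted as a separate sequence decoded later.
    sequences, queue = [], [tree]
    while queue:
        node = queue.pop(0)
        seq = []
        for child in node:
            if isinstance(child, list):
                seq.append('<n>')
                queue.append(child)
            else:
                seq.append(child)
        sequences.append(seq + ['</s>'])
    return sequences

lf = "(lambda $0 e (and (> (departure_time $0) 1600:ti) (from $0 dallas:ci)))"
print(linearize(parse(tokenize(lf))))
# -> [['lambda', '$0', 'e', '<n>', '</s>'], ['and', '<n>', '<n>', '</s>'],
#     ['>', '<n>', '1600:ti', '</s>'], ['from', '$0', 'dallas:ci', '</s>'],
#     ['departure_time', '$0', '</s>']]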

Parent Feeding Connections
[Diagram: a SEQ2TREE decoding example for the logical form "A B (C)": the top-level decoder emits y_1=A, y_2=B, y_3=<n>, y_4=</s> at time steps t_1 to t_4, and the subtree decoder emits y_5=C, y_6=</s> at t_5 and t_6]
The hidden vector of the parent nonterminal is concatenated with the inputs and fed to the LSTM.
p(a | q) = p(y_1 y_2 y_3 y_4 | q) p(y_5 y_6 | y_3, q) 15 / 32
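
A hedged sketch of a single parent-feeding decoder step in PyTorch (the dimensions, names, and use of LSTMCell are assumptions; the point is only the concatenation of the previous token's embedding with the parent nonterminal's hidden vector):

# Sketch of one parent-feeding decoder step (dimensions and names assumed).
import torch
import torch.nn as nn

dim = 200
embed = nn.Embedding(50, dim)                     # logical-form token embeddings
cell = nn.LSTMCell(2 * dim, dim)                  # input = [e(y_{t-1}); parent hidden]

def decode_step(prev_token, parent_hidden, state):
    # Concatenate the previous token's embedding with the hidden vector produced
    # when the parent <n> nonterminal was generated, then run one LSTM step.
    x = torch.cat([embed(prev_token), parent_hidden], dim=-1)
    h, c = cell(x, state)
    return h, (h, c)

prev = torch.tensor([3])                          # e.g. index of the subtree start symbol
parent_hidden = torch.zeros(1, dim)               # hidden state of the parent nonterminal
h, state = decode_step(prev, parent_hidden, (torch.zeros(1, dim), torch.zeros(1, dim)))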

Attention Mechanism
Attention scores are computed from the current decoder hidden vector and all hidden vectors of the encoder.
The encoder-side context vector c_t is the attention-weighted sum of the encoder hidden vectors and is used, together with the decoder state, to predict y_t.
Bahdanau et al. (2015), Luong et al. (2015b), Xu et al. (2015) 16 / 32
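
A small sketch of this attention computation in PyTorch (the dot-product scoring and the tensor shapes are assumptions consistent with the description above, not the exact released implementation):

# Sketch of dot-product attention over encoder hidden states.
import torch
import torch.nn.functional as F

def attend(dec_hidden, enc_hiddens):
    # dec_hidden: (batch, dim); enc_hiddens: (batch, src_len, dim)
    scores = torch.bmm(enc_hiddens, dec_hidden.unsqueeze(-1)).squeeze(-1)  # (batch, src_len)
    weights = F.softmax(scores, dim=-1)                                    # attention scores
    context = torch.bmm(weights.unsqueeze(1), enc_hiddens).squeeze(1)      # weighted sum c_t
    return context, weights

dec = torch.randn(2, 200)
enc = torch.randn(2, 7, 200)
c_t, a_t = attend(dec, enc)   # c_t is combined with the decoder state to predict y_t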

Training and Inference
Our goal is to maximize the likelihood of the gold logical forms given the natural language utterances as input:
minimize −∑_{(q,a)∈D} log p(a | q), where D is the set of all natural language-logical form training pairs
At test time, we predict the logical form for an input utterance q by â = argmax_a p(a | q).
Iterating over all possible a's to obtain the optimal prediction is impractical.
Since p(a | q) is decomposed token by token, we can use greedy or beam search instead. 17 / 32
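
A hedged sketch of greedy decoding under this factorization (beam search would keep the k best prefixes at each step instead of one; step_fn and the token ids are placeholders assumed for illustration):

# Greedy decoding sketch: pick argmax p(y_t | y_<t, q) at each step.
import torch

def greedy_decode(step_fn, init_state, sos_id, eos_id, max_len=100):
    # step_fn(prev_token, state) -> (log-probabilities over output vocab, new state)
    tokens, state, prev = [], init_state, torch.tensor([sos_id])
    for _ in range(max_len):
        log_probs, state = step_fn(prev, state)
        prev = log_probs.argmax(dim=-1)       # most likely next logical-form token
        if prev.item() == eos_id:
            break
        tokens.append(prev.item())
    return tokens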

Argument Identification
Many LFs contain named entities and numbers, i.e., rare words.
It does not make sense to replace them with a special unknown-word symbol.
Instead, identify entities and numbers in input questions and replace them with their type names and unique IDs:
jobs with a salary of 40000 → job(ans), salary_greater_than(ans, 40000, year)
becomes: jobs with a salary of num0 → job(ans), salary_greater_than(ans, num0, year)
Pre-processed examples are used as training data.
After decoding, a post-processing step recovers all type_i markers to the corresponding logical constants. 18 / 32
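
A rough Python sketch of this preprocessing (the regular expression for numbers and the small entity lexicon are illustrative assumptions; the authors report using plain string matching):

# Sketch of argument identification: replace numbers/entities with typed markers.
import re

ENTITIES = {"dallas": "ci", "san francisco": "ci", "microsoft": "co"}  # assumed lexicon

def preprocess(question, logical_form):
    mapping, counters = {}, {}
    def replace(value, typ, q, lf):
        idx = counters.get(typ, 0)
        counters[typ] = idx + 1
        marker = f"{typ}{idx}"
        mapping[marker] = value
        return q.replace(value, marker), lf.replace(value, marker)
    # Numbers.
    for num in re.findall(r"\b\d+\b", question):
        question, logical_form = replace(num, "num", question, logical_form)
    # Known entities.
    for ent, typ in ENTITIES.items():
        if ent in question:
            question, logical_form = replace(ent, typ, question, logical_form)
    return question, logical_form, mapping  # mapping restores constants after decoding

q, lf, m = preprocess("jobs with a salary of 40000",
                      "job(ans),salary_greater_than(ans,40000,year)")
# q  -> "jobs with a salary of num0"
# lf -> "job(ans),salary_greater_than(ans,num0,year)"
# m  -> {"num0": "40000"}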

Semantic Parsing Datasets
JOBS (avg. lengths: NL 9.80, LF 22.90): "what microsoft jobs do not require a bscs?" → answer(company(j,'microsoft'),job(j),not((req_deg(j,'bscs'))))
GEO (avg. lengths: NL 7.60, LF 19.10): "what is the population of the state with the largest area?" → (population:i (argmax $0 (state:t $0) (area:i $0)))
ATIS (avg. lengths: NL 11.10, LF 28.10): "dallas to san francisco leaving after 4 in the afternoon please" → (lambda $0 e (and (> (departure_time $0) 1600:ti) (from $0 dallas:ci) (to $0 san_francisco:ci)))
JOBS: queries to job listings; 500 training, 140 test instances.
GEO: queries to a US geography database; 680 training, 200 test instances.
ATIS: queries to a flight booking system; 4,480 training, 480 development, 450 test instances. 19 / 32

Semantic Parsing Datasets
IFTTT (avg. lengths: NL 6.95, LF 21.80): "turn on heater when temperature drops below 58 degree" → TRIGGER: Weather - Current_temperature_drops_below - ((Temperature (58)) (Degrees_in (f))); ACTION: WeMo_Insight_Switch - Turn_on - ((Which_switch? ("")))
If-this-then-that recipes from the IFTTT website (Quirk et al., 2015)
Recipes are simple programs with exactly one trigger and one action
77,495 training, 5,171 development, and 4,294 test examples
IFTTT programs are represented as abstract syntax trees; NL descriptions are provided by users. 20 / 32

Evaluation Results on JOBS
[Bar chart, accuracy (%): COCKTAIL01, PRECISE03, ZC05, DCS+L13, TISP15, SEQ2SEQ (with -attention and -argument ablations), and SEQ2TREE (with -attention ablation); accuracies range from about 71% to 91%] 21 / 32

Evaluation Results on GEO
[Bar chart, accuracy (%): SCISSOR05, KRISP06, WASP06, λ-WASP06, LNLZ08, ZC05, ZC07, UBL10, FUBL11, KCAZ13, DCS+L13, TISP15, SEQ2SEQ (with -attention and -argument ablations), and SEQ2TREE (with -attention ablation); accuracies range from about 69% to 89%] 22 / 32

Evaluation Results on ATIS
[Bar chart, accuracy (%): ZC07, UBL10, FUBL11, GUSP-FULL13, GUSP++13, TISP15, SEQ2SEQ (with -attention and -argument ablations), and SEQ2TREE (with -attention ablation); accuracies range from about 71% to 85%] 23 / 32

Results on IFTTT
[Bar chart, accuracy (%): retrieval, phrasal, sync, classifier, posclass baselines and SEQ2SEQ (with -attention and -argument ablations) and SEQ2TREE (with -attention ablation); accuracies range from about 35% to 50%] (a) Omit non-English 24 / 32

Results on IFTTT
[Bar chart, accuracy (%): retrieval, phrasal, sync, classifier, posclass baselines and SEQ2SEQ (with -attention and -argument ablations) and SEQ2TREE (with -attention ablation); accuracies range from about 38% to 60%] (b) Omit non-English & unintelligible 25 / 32

Results on IFTTT
[Bar chart, accuracy (%): retrieval, phrasal, sync, classifier, posclass baselines and SEQ2SEQ (with -attention and -argument ablations) and SEQ2TREE (with -attention ablation); accuracies range from about 43% to 74%] (c) ≥3 turkers agree with gold 26 / 32

Example Output: SEQ2SEQ
[Attention heatmaps for two examples]
Input: "which jobs pay num0 that do not require a degid0" → job(ANS), salary_greater_than(ANS, num0, year), \+((req_deg(ANS, degid0)))
Input: "what's first class fare round trip from ci0 to ci1" → (lambda $0 e (exists $1 (and (round_trip $1) (class_type $1 first:cl) (from $1 ci0) (to $1 ci1) (= (fare $1) $0)))) 27 / 32

Example Output: SEQ2TREE
[Attention heatmaps for two examples]
Input: "what is the earliest flight from ci0 to ci1" → (argmin $0 (and (flight $0) (from $0 ci0) (to $0 ci1) (tomorrow $0)) (departure_time $0))
Input: "what is the highest elevation in the co0" → (argmax $0 (and (place:t $0) (loc:t $0 co0)) (elevation:i $0)) 28 / 32

What Have We Learned?
An encoder-decoder neural network model for mapping natural language descriptions to meaning representations
SEQ2TREE and attention improve performance
The model is general and could transfer to other tasks/architectures
Future work: learn a model from question-answer pairs without access to target logical forms. 29 / 32

Summary
For more details see Dong and Lapata (ACL, 2016)
Data and code from http://homepages.inf.ed.ac.uk/mlap/index.php?page=code 30 / 32

Training Details
Arguments were identified with plain string matching.
Parameters were updated with the RMSProp algorithm (batch size set to 20).
Gradients were clipped, dropout was applied, and parameters were randomly initialized from a uniform distribution.
A two-layer LSTM was used for IFTTT.
Dimensions of the hidden vectors and word embeddings were selected from {150, 200, 250}.
Early stopping was used to determine the number of epochs.
Input sentences were reversed (Sutskever et al., 2014).
All experiments were conducted on a single GTX 980 GPU. 31 / 32
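
A rough PyTorch sketch of such a training setup (the learning rate, clipping threshold, dropout rate, and initialization range below are assumed values, not the reported hyperparameters):

# Sketch of the optimisation setup described above (assumed hyperparameter values).
import torch
import torch.nn as nn

model = nn.LSTM(200, 200, num_layers=2, dropout=0.5)   # stand-in for the full parser
for p in model.parameters():
    nn.init.uniform_(p, -0.08, 0.08)                    # uniform random initialisation

optimizer = torch.optim.RMSprop(model.parameters(), lr=0.002)

def train_step(loss):
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), 5.0)   # gradient clipping
    optimizer.step()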

Decoding for SEQ2TREE
Algorithm 1: Decoding for SEQ2TREE
Input: q: natural language utterance
Output: â: decoding result
1: // Push the encoding result to a queue
2: Q.init({hid: SeqEnc(q)})
3: // Decode until there are no more nonterminals
4: while c ← Q.pop() do
5:   // Call the sequence decoder
6:   c.child ← SeqDec(c.hid)
7:   // Push new nonterminals to the queue
8:   for nonterminal n in c.child do
9:     Q.push({hid: HidVec(n)})
10: â ← convert the decoding tree to an output sequence 32 / 32
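
A runnable Python sketch of Algorithm 1 (SeqEnc, SeqDec, and HidVec are treated as placeholder callables passed in by the caller; in the real system SeqDec is the attention-based LSTM decoder):

# Sketch of Algorithm 1: breadth-first decoding of subtree token sequences.
from collections import deque

def seq2tree_decode(q, seq_enc, seq_dec, hid_vec):
    # seq_enc(q)                         -> encoder hidden state for the utterance
    # seq_dec(hid)                       -> token list for one subtree (may contain '<n>')
    # hid_vec(pos, tokens, hid)          -> hidden vector produced at a nonterminal
    root = {"hid": seq_enc(q), "child": None, "subtrees": []}
    queue = deque([root])
    while queue:                                   # decode until no more nonterminals
        node = queue.popleft()
        node["child"] = seq_dec(node["hid"])       # call the sequence decoder
        for i, tok in enumerate(node["child"]):    # push new nonterminals to the queue
            if tok == "<n>":
                sub = {"hid": hid_vec(i, node["child"], node["hid"]),
                       "child": None, "subtrees": []}
                node["subtrees"].append(sub)
                queue.append(sub)
    return to_sequence(root)                       # convert decoding tree to output

def to_sequence(node):
    # Replace each '<n>' with the parenthesised sequence of its decoded subtree.
    out, subs = [], iter(node["subtrees"])
    for tok in node["child"]:
        out.append("( " + to_sequence(next(subs)) + " )" if tok == "<n>" else tok)
    return " ".join(out)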