Semantic Parsing with Combinatory Categorial Grammars


Semantic Parsing with Combinatory Categorial Grammars Yoav Artzi, Nicholas FitzGerald and Luke Zettlemoyer University of Washington ACL 2013 Tutorial Sofia, Bulgaria

Learning: Data → Learning Algorithm → CCG. What kind of data/supervision can we use? What do we need to learn?

Supervised Data: "show me flights to Boston"

show me ⊢ S/N : λf.f
flights ⊢ N : λx.flight(x)
to ⊢ PP/NP : λy.λx.to(x, y)
Boston ⊢ NP : BOSTON

to Boston ⇒ PP : λx.to(x, BOSTON)   (>)
to Boston ⇒ N\N : λf.λx.f(x) ∧ to(x, BOSTON)
flights to Boston ⇒ N : λx.flight(x) ∧ to(x, BOSTON)   (<)
show me flights to Boston ⇒ S : λx.flight(x) ∧ to(x, BOSTON)   (>)

Supervised Data: "show me flights to Boston"

Only the sentence and the final logical form are observed; the lexical entries and the derivation are latent:

show me ⊢ S/N : λf.f
flights ⊢ N : λx.flight(x)
to ⊢ PP/NP : λy.λx.to(x, y)
Boston ⊢ NP : BOSTON

to Boston ⇒ PP : λx.to(x, BOSTON)   (>)
to Boston ⇒ N\N : λf.λx.f(x) ∧ to(x, BOSTON)
flights to Boston ⇒ N : λx.flight(x) ∧ to(x, BOSTON)   (<)
show me flights to Boston ⇒ S : λx.flight(x) ∧ to(x, BOSTON)   (>)

Supervised Data

Supervised learning is done from pairs of sentences and logical forms:

Show me flights to Boston
  λx.flight(x) ∧ to(x, BOSTON)
I need a flight from baltimore to seattle
  λx.flight(x) ∧ from(x, BALTIMORE) ∧ to(x, SEATTLE)
what ground transportation is available in san francisco
  λx.ground_transport(x) ∧ to_city(x, SF)

[Zettlemoyer and Collins 2005; 2007]
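As a concrete illustration, here is one way such supervised pairs might be held as data; this is only a sketch, the container and field names are invented, and the logical forms are kept as plain strings rather than typed lambda-calculus terms.

```python
from dataclasses import dataclass

@dataclass
class LabeledExample:
    """Hypothetical container for one supervised training pair."""
    sentence: str
    logical_form: str  # a real system would use a structured lambda-calculus term

# The three ATIS-style examples from the slide, written as data.
DATA = [
    LabeledExample("show me flights to Boston",
                   "lambda x. flight(x) & to(x, BOSTON)"),
    LabeledExample("I need a flight from baltimore to seattle",
                   "lambda x. flight(x) & from(x, BALTIMORE) & to(x, SEATTLE)"),
    LabeledExample("what ground transportation is available in san francisco",
                   "lambda x. ground_transport(x) & to_city(x, SF)"),
]
```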

Weak Supervision

- Logical form is latent
- Labeling requires less expertise
- Labels don't uniquely determine correct logical forms
- Learning requires executing logical forms within a system and evaluating the result

Weak Supervision: Learning from Query Answers

What is the largest state that borders Texas?
New Mexico

[Clarke et al. 2010; Liang et al. 2011]

Weak Supervision: Learning from Query Answers

What is the largest state that borders Texas?
New Mexico

Candidate logical forms:
argmax(λx.state(x) ∧ border(x, TX), λy.size(y))
argmax(λx.river(x) ∧ in(x, TX), λy.size(y))

[Clarke et al. 2010; Liang et al. 2011]

Weak Supervision: Learning from Query Answers

What is the largest state that borders Texas?
New Mexico

argmax(λx.state(x) ∧ border(x, TX), λy.size(y)) → New Mexico
argmax(λx.river(x) ∧ in(x, TX), λy.size(y)) → Rio Grande

[Clarke et al. 2010; Liang et al. 2011]

Weak Supervision: Learning from Query Answers

What is the largest state that borders Texas?
New Mexico

argmax(λx.state(x) ∧ border(x, TX), λy.size(y)) → New Mexico ✓
argmax(λx.river(x) ∧ in(x, TX), λy.size(y)) → Rio Grande ✗

Only the logical form whose denotation matches the labeled answer is treated as correct.

[Clarke et al. 2010; Liang et al. 2011]
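A toy sketch of what "executing logical forms and evaluating the result" can look like for this example: each candidate reading is run against a small hand-built database and its denotation is compared to the labeled answer. The database, the hard-coded readings, and all names here are invented for illustration and are not part of the cited systems.

```python
# Toy database; sizes are illustrative only.
STATES = {"New Mexico": {"borders": {"TX"}, "size": 121590},
          "Oklahoma":   {"borders": {"TX"}, "size": 69898}}
RIVERS = {"Rio Grande": {"in": {"TX"}, "size": 1896}}

def largest_state_bordering_tx():
    # denotation of argmax(lambda x. state(x) & border(x, TX), lambda y. size(y))
    cands = [s for s, v in STATES.items() if "TX" in v["borders"]]
    return max(cands, key=lambda s: STATES[s]["size"])

def largest_river_in_tx():
    # denotation of argmax(lambda x. river(x) & in(x, TX), lambda y. size(y))
    cands = [r for r, v in RIVERS.items() if "TX" in v["in"]]
    return max(cands, key=lambda r: RIVERS[r]["size"])

answer = "New Mexico"
candidates = {"state reading": largest_state_bordering_tx,
              "river reading": largest_river_in_tx}

for name, execute in candidates.items():
    denotation = execute()
    print(name, denotation, "correct" if denotation == answer else "incorrect")
```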

Weak Supervision: Learning from Demonstrations

"at the chair, move forward three steps past the sofa"

[Chen and Mooney 2011; Kim and Mooney 2012; Artzi and Zettlemoyer 2013b]

Weak Supervision: Learning from Demonstrations

"at the chair, move forward three steps past the sofa"

Some examples from other domains:
- Sentences and labeled game states [Goldwasser and Roth 2011]
- Sentences and sets of physical objects [Matuszek et al. 2012]

[Chen and Mooney 2011; Kim and Mooney 2012; Artzi and Zettlemoyer 2013b]

Outline: Parsing, Learning, Modeling

Learning:
- Structured perceptron
- A unified learning algorithm
- Supervised learning
- Weak supervision

Structured Perceptron

Simple additive updates:
- Only requires efficient decoding (argmax)
- Closely related to maxent and other feature-rich models
- Provably finds a linear separator in a finite number of updates, if one exists

Challenge: learning with hidden variables

Structured Perceptron

Data: {(x_i, y_i) : i = 1...n}

For t = 1...T:                                  [iterate epochs]
  For i = 1...n:                                [iterate examples]
    y* ← argmax_y ⟨θ, φ(x_i, y)⟩                [predict]
    If y* ≠ y_i:                                [check]
      θ ← θ + φ(x_i, y_i) − φ(x_i, y*)          [update]

[Collins 2002]
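A minimal Python sketch of this update loop, assuming the caller supplies the feature function φ and an efficient argmax decoder; everything beyond the slide's pseudocode (function names, numpy vectors) is an implementation choice of this sketch.

```python
import numpy as np

def structured_perceptron(data, phi, decode, dim, T=5):
    """Structured perceptron update loop as on the slide [Collins 2002].

    data   -- list of (x_i, y_i) pairs
    phi    -- feature function: phi(x, y) -> np.ndarray of length dim
    decode -- efficient decoder: decode(x, theta) = argmax_y theta . phi(x, y)
    """
    theta = np.zeros(dim)
    for _ in range(T):                       # iterate epochs
        for x_i, y_i in data:                # iterate examples
            y_hat = decode(x_i, theta)       # predict
            if y_hat != y_i:                 # check
                # simple additive update toward the gold structure
                theta += phi(x_i, y_i) - phi(x_i, y_hat)
    return theta
```

Only `decode` needs to change across tasks (tagging, parsing, semantic parsing), which is the sense in which the algorithm only requires efficient decoding.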

One Derivation of the Perceptron

Log-linear model:
  p(y|x) = exp(w · f(x, y)) / Σ_{y'} exp(w · f(x, y'))

Step 1: Differentiate, to maximize data log-likelihood
  update = Σ_i [ f(x_i, y_i) − E_{p(y|x_i)} f(x_i, y) ]

Step 2: Use online, stochastic gradient updates, for example i:
  update_i = f(x_i, y_i) − E_{p(y|x_i)} f(x_i, y)

Step 3: Replace expectations with maxes (Viterbi approx.)
  update_i = f(x_i, y_i) − f(x_i, y*)   where y* = argmax_y w · f(x_i, y)

The Perceptron with Hidden Variables

Log-linear model:
  p(y|x) = Σ_h p(y, h|x)
  p(y, h|x) = exp(w · f(x, h, y)) / Σ_{y', h'} exp(w · f(x, h', y'))

Step 1: Differentiate marginal, to maximize data log-likelihood
  update = Σ_i [ E_{p(h|y_i, x_i)}[f(x_i, h, y_i)] − E_{p(y, h|x_i)}[f(x_i, h, y)] ]

Step 2: Use online, stochastic gradient updates, for example i:
  update_i = E_{p(h|y_i, x_i)}[f(x_i, h, y_i)] − E_{p(y, h|x_i)}[f(x_i, h, y)]

Step 3: Replace expectations with maxes (Viterbi approx.)
  update_i = f(x_i, h', y_i) − f(x_i, h*, y*)
  where y*, h* = argmax_{y,h} w · f(x_i, h, y) and h' = argmax_h w · f(x_i, h, y_i)

Hidden Variable Perceptron

Data: {(x_i, y_i) : i = 1...n}

For t = 1...T:                                       [iterate epochs]
  For i = 1...n:                                     [iterate examples]
    y*, h* ← argmax_{y,h} ⟨θ, φ(x_i, h, y)⟩          [predict]
    If y* ≠ y_i:                                     [check]
      h' ← argmax_h ⟨θ, φ(x_i, h, y_i)⟩              [predict hidden]
      θ ← θ + φ(x_i, h', y_i) − φ(x_i, h*, y*)       [update]

[Liang et al. 2006; Zettlemoyer and Collins 2007]
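The corresponding sketch with hidden variables, again assuming externally supplied feature and decoding routines; the two decoders implement the two argmaxes on the slide, and all names are placeholders.

```python
import numpy as np

def hidden_variable_perceptron(data, phi, decode, decode_hidden, dim, T=5):
    """Hidden-variable perceptron update loop as on the slide
    [Liang et al. 2006; Zettlemoyer and Collins 2007].

    data          -- list of (x_i, y_i) pairs
    phi           -- joint feature function: phi(x, h, y) -> np.ndarray
    decode        -- (x, theta) -> (y*, h*) maximizing theta . phi(x, h, y)
    decode_hidden -- (x, y, theta) -> h' maximizing theta . phi(x, h, y) with y fixed
    """
    theta = np.zeros(dim)
    for _ in range(T):                                    # iterate epochs
        for x_i, y_i in data:                             # iterate examples
            y_hat, h_hat = decode(x_i, theta)             # predict
            if y_hat != y_i:                              # check
                h_prime = decode_hidden(x_i, y_i, theta)  # predict hidden
                theta += phi(x_i, h_prime, y_i) - phi(x_i, h_hat, y_hat)  # update
    return theta
```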

Hidden Variable Perceptron

No known convergence guarantees
- Log-linear version is non-convex

Simple and easy to implement
- Works well with careful initialization

Modifications for semantic parsing
- Lots of different hidden information
- Can add a margin constraint, do a probabilistic version, etc.

Learning Choices

Validation Function   V : Y → {t, f}
- Indicates correctness of a parse y
- Varying V allows for differing forms of supervision

Lexical Generation Procedure   GENLEX(x, V; Λ, θ)
- Given: sentence x, validation function V, lexicon Λ, parameters θ
- Produces an overly general set of lexical entries
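Two hedged examples of what V might look like, one per supervision setting; the parse attribute and the execute callback are assumptions of this sketch, not the tutorial's actual interfaces.

```python
def make_supervised_validator(gold_logical_form):
    """Supervised setting: a parse is valid iff its logical form matches the label."""
    def V(parse):
        return parse.logical_form == gold_logical_form
    return V

def make_denotation_validator(execute, gold_answer):
    """Weak supervision: a parse is valid iff executing its logical form
    yields the labeled result (e.g. a query answer or a demonstrated outcome)."""
    def V(parse):
        return execute(parse.logical_form) == gold_answer
    return V
```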

Unified Learning Algorithm

Initialize Λ and θ using Λ_0, θ_0

For t = 1...T, i = 1...n:

Step 1: (Lexical generation)
  a. Set G ← GENLEX(x_i, V_i; Λ, θ), λ ← Λ ∪ G
  b. Let Y be the k highest-scoring parses from GEN(x_i; λ)
  c. Select lexical entries from the highest-scoring valid parses:
     λ_i ← ∪_{y ∈ MAXV_i(Y; θ)} LEX(y)
  d. Update lexicon: Λ ← Λ ∪ λ_i

Step 2: (Update parameters)

Output: Parameters θ and lexicon Λ
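A rough sketch of this loop in Python, with every component (GENLEX, GEN, the parse scorer, LEX, and the Step 2 parameter update) passed in as an abstract callable, since the slide leaves their details to other parts of the tutorial; names and signatures here are invented.

```python
def unified_learning(data, genlex, gen, score, lex_of, update_params,
                     lexicon0, theta0, T=5, k=100):
    """Sketch of the unified learning loop from the slide.

    data -- list of (x_i, V_i) pairs: sentence plus validation function
    All helpers are stand-ins: genlex proposes entries, gen returns ranked
    parses, score scores a parse, lex_of returns a parse's lexical entries,
    and update_params performs the Step 2 parameter update.
    """
    lexicon, theta = set(lexicon0), theta0
    for _ in range(T):
        for x_i, V_i in data:
            # Step 1: lexical generation
            G = genlex(x_i, V_i, lexicon, theta)       # overly general entry set
            Y = gen(x_i, lexicon | G, theta)[:k]       # k highest-scoring parses
            valid = [y for y in Y if V_i(y)]
            if valid:
                best = max(score(y, theta) for y in valid)
                for y in valid:                        # entries of max-scoring valid parses
                    if score(y, theta) == best:
                        lexicon |= lex_of(y)
            # Step 2: update parameters (e.g. a perceptron-style update)
            theta = update_params(x_i, V_i, lexicon, theta)
    return theta, lexicon
```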

Unification-based GENLEX(x, z; Λ, θ)

"I want a flight to Boston"   λx.flight(x) ∧ to(x, BOS)

1. Find the highest-scoring correct parse
2. Find the split that most increases the score
3. Return new lexical entries

Iteration 2:

I want a flight ⊢ S/(S\NP) : λf.λx.flight(x) ∧ f(x)
to ⊢ (S\NP)/NP : λy.λx.to(x, y)
Boston ⊢ NP : BOS

to Boston ⇒ S\NP : λx.to(x, BOS)   (>)
I want a flight to Boston ⇒ S : λx.flight(x) ∧ to(x, BOS)   (>)
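A toy illustration of the splitting idea on this example: the seed logical form is factored into a function part and an argument part whose forward application reproduces it. Python lambdas stand in for lambda-calculus terms, and the toy predicates and flight identifiers are invented for this sketch.

```python
BOS = "BOS"
flight = lambda x: x in {"AA12", "UA7"}          # hypothetical toy predicates
to = lambda x, y: (x, y) in {("AA12", BOS)}

seed = lambda x: flight(x) and to(x, BOS)        # full logical form for the sentence

# Proposed split:
func_part = lambda f: (lambda x: flight(x) and f(x))   # "I want a flight" : S/(S\NP)
arg_part  = lambda x: to(x, BOS)                        # "to Boston"       : S\NP

# Forward application (>) recombines the split into the seed logical form.
recombined = func_part(arg_part)
assert all(recombined(x) == seed(x) for x in ["AA12", "UA7", "DL3"])
```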