Outline: Learning. Overview, Details, Example, Lexicon learning, Supervision signals

Outline: Learning. Overview, Details, Example, Lexicon learning, Supervision signals

Supervision in syntactic parsing
Input: sentences annotated with full parse trees (example tree shown for "ESSLLI 2016, the known summer school, is located in Bolzano")
Output: a parse tree for a new sentence, e.g. S -> NP (They) VP (play football) for "They play football"

Supervision in semantic parsing [Zelle & Mooney, 1996; Zettlemoyer & Collins, 2005; Clarke et al., 2010; Liang et al., 2011]
Input, heavy supervision (utterances paired with logical forms):
How tall is Lebron James? → HeightOf.LebronJames
What is Steph Curry's daughter called? → ChildrenOf.StephCurry ⊓ Gender.Female
Youngest player of the Cavaliers → argmin(PlayerOf.Cavaliers, BirthDateOf)
...
Input, light supervision (utterances paired with denotations):
How tall is Lebron James? → 203cm
What is Steph Curry's daughter called? → Riley Curry
Youngest player of the Cavaliers → Kyrie Irving
...
Output: for a new utterance such as "Clay Thompson's weight" or "Clay Thompson weight", a logical form (WeightOf.ClayThompson, Weight.ClayThompson) and/or its denotation, 205 lbs

Learning in a nutshell
utterance → Parsing → candidate derivations → Label → Update model
0. Define model for derivations
1. Generate candidate derivations (later)
2. Label derivations as correct and incorrect
3. Update model to favor correct trees

Training intuition
Where did Mozart tupress? → gold answer: Vienna
Candidate logical forms and their denotations:
PlaceOfBirth.WolfgangMozart → Salzburg
PlaceOfDeath.WolfgangMozart → Vienna
PlaceOfMarriage.WolfgangMozart → Vienna

Where did Hogarth tupress? → gold answer: London
PlaceOfBirth.WilliamHogarth → London
PlaceOfDeath.WilliamHogarth → London
PlaceOfMarriage.WilliamHogarth → Paddington

The made-up verb "tupress" is ambiguous from either example alone, but only PlaceOfDeath yields the gold answer in both: the Mozart example rules out PlaceOfBirth (Salzburg ≠ Vienna) and the Hogarth example rules out PlaceOfMarriage (Paddington ≠ London).

Outline: Learning. Overview, Details, Example, Lexicon learning, Supervision signals

Constructing derivations (derivation tree for "people who lived in Chicago"):
Type.Person ⊓ PlaceLived.Chicago        [intersect]
    Type.Person                         [lexicon: "people"]
    PlaceLived.Chicago                  [join]
        PlaceLived                      [lexicon: "lived in"]
        Chicago                         [lexicon: "Chicago"]
(the word "who" is not mapped to any predicate)

Many possible derivations! x = "people who have lived in Chicago"; the set of candidate derivations D(x) includes, e.g.:
Derivation 1: Type.Person ⊓ PlaceLived.Chicago   [intersect]
    Type.Person                 [lexicon: "people"]
    PlaceLived.Chicago          [join], from PlaceLived [lexicon: "lived in"] and Chicago [lexicon: "Chicago"]
Derivation 2: Type.Org ⊓ PresentIn.ChicagoMusical   [intersect]
    Type.Org                    [lexicon: "people"]
    PresentIn.ChicagoMusical    [join], from PresentIn [lexicon: "lived in"] and ChicagoMusical [lexicon: "Chicago"]

x: utterance, d: derivation (e.g., the tree above whose root is Type.Person ⊓ PlaceLived.Chicago)
Feature vector φ(x, d) and parameters θ, both in R^F:

feature                        φ(x, d)   θ (learned)
apply join                     1         1.2
apply intersect                1         0.6
apply lexicon                  3         2.1
lived maps to PlacesLived      1         3.1
lived maps to PlaceOfBirth     0         -0.4
born maps to PlaceOfBirth      0         2.7
...                            ...       ...

score_θ(x, d) = φ(x, d) · θ = 1.2·1 + 0.6·1 + 2.1·3 + 3.1·1 + (-0.4)·0 + 2.7·0 + ...
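Below is a minimal sketch of this scoring step (illustrative only, not SEMPRE's implementation): features form a sparse dictionary, weights a learned vector, and the score is their dot product. The feature names and numbers mirror the table above.

```python
# Minimal sketch: scoring a derivation with sparse features and learned weights.
# Feature names and values follow the table above; illustrative, not SEMPRE code.

theta = {
    "apply_join": 1.2,
    "apply_intersect": 0.6,
    "apply_lexicon": 2.1,
    "lived->PlacesLived": 3.1,
    "lived->PlaceOfBirth": -0.4,
    "born->PlaceOfBirth": 2.7,
}

def score(phi, theta):
    """Dot product between a sparse feature vector phi(x, d) and weights theta."""
    return sum(value * theta.get(name, 0.0) for name, value in phi.items())

# Features fired by the derivation for "people who lived in Chicago"
phi = {"apply_join": 1, "apply_intersect": 1, "apply_lexicon": 3, "lived->PlacesLived": 1}
print(score(phi, theta))  # 1.2*1 + 0.6*1 + 2.1*3 + 3.1*1 = 11.2
```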

Deep learning alert!
The feature vector φ(x, d) is constructed by hand
Constructing good features is hard
Algorithms are likely to do it better
Perhaps we can train φ(x, d): φ(x, d) = F_ψ(x, d), where ψ are the parameters

Log-linear model
Candidate derivations: D(x)
Model: distribution over derivations d given utterance x
p_θ(d | x) = exp(score_θ(x, d)) / Σ_{d' ∈ D(x)} exp(score_θ(x, d'))
Example: if score_θ(x, ·) = [1, 2, 3, 4] over four candidates, then
p_θ(· | x) = [e¹, e², e³, e⁴] / (e¹ + e² + e³ + e⁴)
Parsing: find the top-K derivation trees D_θ(x)
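A minimal sketch of the log-linear normalization, reproducing the [1, 2, 3, 4] example above; it assumes the scores of all candidate derivations in D(x) have already been computed.

```python
import math

# Minimal sketch: the log-linear distribution over candidate derivations.
# Reproduces the example above: scores [1, 2, 3, 4] -> normalized probabilities.

def log_linear(scores):
    """p_theta(d | x) = exp(score(x, d)) / sum_d' exp(score(x, d'))."""
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

print(log_linear([1, 2, 3, 4]))  # ~[0.032, 0.087, 0.237, 0.644]
```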

Features
Dense features: intersection=0.67, ent-popularity:high, denotation-size:1
Sparse features: bridge-binary:study, born:PlaceOfBirth, city:Type.Location
Syntactic features: ent-pos:NNP NNP, join-pos:V NN, skip-pos:IN
Grammar features: Binary->Verb

Learning θ: maximum likelihood
Training data, denotations:
What's Bulgaria's capital? → Sofia
What movies has Tom Cruise been in? → TopGun, VanillaSky, ...
Training data, logical forms:
What's Bulgaria's capital? → CapitalOf.Bulgaria
What movies has Tom Cruise been in? → Type.Movie ⊓ HasPlayed.TomCruise
Objective:
argmax_θ Σ_{i=1}^n log p_θ(y^(i) | x^(i)) = argmax_θ Σ_{i=1}^n log Σ_{d^(i)} p_θ(d^(i) | x^(i)) · R(d^(i))
Reward choices:
R(d) = 1 if d.z = z^(i), 0 otherwise            (logical-form supervision)
R(d) = 1 if [d.z]_K = y^(i), 0 otherwise        ([·]_K: execute against the knowledge base K)
R(d) = F1([d.z]_K, y^(i))                       (partial credit)
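A minimal sketch of the three reward choices above; execute (the [·]_K operation) and f1 are hypothetical stand-ins for executing a logical form against the knowledge base and computing F1 between answer sets.

```python
# Minimal sketch of the reward functions R(d). `execute` and `f1` are
# hypothetical stand-ins; d.z denotes the logical form of derivation d.

def reward_logical_form(d, z_gold):
    return 1.0 if d.z == z_gold else 0.0

def reward_denotation(d, y_gold, execute):
    return 1.0 if execute(d.z) == y_gold else 0.0

def reward_f1(d, y_gold, execute, f1):
    return f1(execute(d.z), y_gold)  # partial credit for set-valued answers
```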

Optimization: stochastic gradient descent
For every example:
O(θ) = log Σ_d p_θ(d | x) · R(d)
∇O(θ) = E_{q_θ(d|x)}[φ(x, d)] − E_{p_θ(d|x)}[φ(x, d)]
where p_θ(d | x) ∝ exp(φ(x, d) · θ) and q_θ(d | x) ∝ exp(φ(x, d) · θ) · R(d)
Example:
p_θ(D(x)) = [0.2, 0.1, 0.1, 0.6], R(D(x)) = [1, 0, 0, 1]
q_θ(D(x)) = [0.25, 0, 0, 0.75]   (q_θ ∝ p_θ · R)
Gradient: 0.05·φ(x, d_1) − 0.1·φ(x, d_2) − 0.1·φ(x, d_3) + 0.15·φ(x, d_4)
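A minimal sketch of this gradient step: q is p reweighted by the rewards and renormalized, and each derivation's features are weighted by q − p. The numbers reproduce the example above.

```python
# Minimal sketch of the gradient computation above: q is p reweighted by the
# reward R and renormalized; the gradient weights each derivation's features
# by (q - p). Assumes at least one derivation has nonzero reward.

def reward_weighted(p, rewards):
    """q_theta(d|x) proportional to p_theta(d|x) * R(d)."""
    unnorm = [pi * ri for pi, ri in zip(p, rewards)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

p = [0.2, 0.1, 0.1, 0.6]
R = [1, 0, 0, 1]
q = reward_weighted(p, R)                          # [0.25, 0.0, 0.0, 0.75]
coeffs = [qi - pi for qi, pi in zip(q, p)]
print(q, coeffs)                                   # coeffs = [0.05, -0.1, -0.1, 0.15]
# gradient = sum_i coeffs[i] * phi(x, d_i)
```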

Training
Input: {(x_i, y_i)}_{i=1}^n; Output: θ
θ ← 0
for iteration τ and example i:
    D(x_i) ← argmax-K(p_θ(d | x_i))   (keep the top-K derivations)
    θ ← θ + η_{τ,i} (E_{q_θ(d|x_i)}[φ(x_i, d)] − E_{p_θ(d|x_i)}[φ(x_i, d)])
η_{τ,i}: learning rate
Regularization often added (L2, L1, ...)
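A minimal sketch of this training loop, reusing score, log_linear, and reward_weighted from the sketches above; parse_topk, phi, and reward are hypothetical stand-ins for the beam parser, the feature function, and R(d).

```python
# Minimal sketch of the training loop above. `parse_topk`, `phi`, and `reward`
# are hypothetical stand-ins; `score`, `log_linear`, and `reward_weighted` are
# the helper functions sketched earlier.

def sgd_train(examples, parse_topk, phi, reward, epochs=5, lr=0.1, K=100):
    theta = {}                                   # sparse weight vector, starts at 0
    for _ in range(epochs):
        for x, y in examples:
            beam = parse_topk(x, theta, K)       # candidate derivations D(x)
            if not beam:
                continue
            p = log_linear([score(phi(x, d), theta) for d in beam])
            rewards = [reward(d, y) for d in beam]
            if sum(pi * ri for pi, ri in zip(p, rewards)) == 0:
                continue                         # no correct derivation on the beam
            q = reward_weighted(p, rewards)
            for d, qi, pi in zip(beam, q, p):    # theta += lr * (E_q[phi] - E_p[phi])
                for name, value in phi(x, d).items():
                    theta[name] = theta.get(name, 0.0) + lr * (qi - pi) * value
    return theta
```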

Training (structured perceptron)
Input: {(x_i, y_i)}_{i=1}^n; Output: θ
θ ← 0
for iteration τ and example i:
    d̂ ← argmax_d p_θ(d | x_i)    (model's best derivation)
    d* ← argmax_d q_θ(d | x_i)   (best correct derivation)
    if [d*]_K ≠ [d̂]_K:
        θ ← θ + φ(x_i, d*) − φ(x_i, d̂)
Regularization often added with weight averaging
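A minimal sketch of the perceptron variant; it assumes the beam is sorted by model score, and parse_topk, phi, denote, and is_correct are hypothetical stand-ins.

```python
# Minimal sketch of the structured-perceptron update above: if the model's best
# derivation and the best correct derivation have different denotations, move
# theta toward the correct one. All helpers are hypothetical stand-ins.

def perceptron_train(examples, parse_topk, phi, denote, is_correct, epochs=5, K=100):
    theta = {}
    for _ in range(epochs):
        for x, y in examples:
            beam = parse_topk(x, theta, K)                    # sorted by score, best first
            if not beam:
                continue
            d_hat = beam[0]                                   # highest-scoring derivation
            correct = [d for d in beam if is_correct(d, y)]   # derivations with R(d) = 1
            if not correct:
                continue
            d_star = correct[0]
            if denote(d_star) != denote(d_hat):               # compare denotations [d]_K
                for name, value in phi(x, d_star).items():
                    theta[name] = theta.get(name, 0.0) + value
                for name, value in phi(x, d_hat).items():
                    theta[name] = theta.get(name, 0.0) - value
    return theta
```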

Training: other simple variants exist, e.g., cost-sensitive max-margin training. That is, find pairs of good and bad derivations that look different but have similar scores, and update on those.

Outline: Learning. Overview, Details, Example, Lexicon learning, Supervision signals

Example
./run @mode=simple-lambdadcs \
    -Grammar.inPaths esslli 2016/class3 demo.grammar \
    -SimpleLexicon.inPaths esslli 2016/class3 demo.lexicon
(loadgraph geo880/geo880.kg)
Utterances to try: size of california; size capital california; size of capital of california; california size

Exercise
Find a pair of natural-language utterances that cannot be distinguished using the current feature representation.
The utterances need not be fully grammatical English.
You can ignore the denotation feature if that helps.
Verify this in SEMPRE (ask me how to disable features).
Design a feature that would solve this problem.

Outline: Learning. Overview, Details, Example, Lexicon learning, Supervision signals

The lexicon problem
How is the lexicon generated? Annotation, exhaustive search, string matching, supervised alignment, unsupervised alignment, learning

Training (with lexicon induction)
Input: {(x_i, y_i)}_{i=1}^n; Output: θ
θ ← 0
for iteration τ and example i:
    Add lexicon entries
    D(x_i) ← argmax-K(p_θ(d | x_i))
    θ ← θ + η_{τ,i} (E_{q_θ(d|x_i)}[φ(x_i, d)] − E_{p_θ(d|x_i)}[φ(x_i, d)])
η_{τ,i}: learning rate
Regularization often added (L2, L1, ...)

Adding lexicon entries [adapted from the semantic parsing tutorial, Artzi et al.]
Input: training example (x_i, y_i), current lexicon Λ, model θ
Λ_temp ← Λ ∪ GENLEX(x_i, y_i)                         Create an expanded temporary lexicon
D(x_i) ← argmax-K(p_{θ,Λ_temp}(d | x_i))              Parse with the temporary lexicon
Λ ← Λ ∪ {l : l ∈ d̂, d̂ ∈ D(x_i), R(d̂) = 1}            Add entries from correct trees
Overgenerate lexical entries and add promising ones
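A minimal sketch of this lexicon-expansion step; genlex, parse_topk_with_lexicon, reward, and the lexical_entries attribute of a derivation are hypothetical stand-ins for the procedure described above.

```python
# Minimal sketch of the GENLEX-style lexicon-expansion step. All names below
# are hypothetical stand-ins for the procedure described in the slide.

def expand_lexicon(x, y, lexicon, theta, genlex, parse_topk_with_lexicon, reward, K=100):
    temp_lexicon = lexicon | genlex(x, y)                  # overgenerate candidate entries
    beam = parse_topk_with_lexicon(x, theta, temp_lexicon, K)
    for d in beam:
        if reward(d, y) == 1:                              # keep entries used in correct trees
            lexicon |= set(d.lexical_entries)
    return lexicon
```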

Lexicon generation [Zettlemoyer and Collins, 2005]
Logical-form supervision:
Largest state bordering California → argmax(Type(State) ⊓ Border(California), Area)
Enumerate spans: "Largest state bordering California", "Largest state", ...
Use rules to extract sub-formulas: California, Border(California), λf.f(California), Area, λx.argmax(x, Area), ...
Add the cross-product of spans and sub-formulas to the lexicon
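A minimal sketch of the span × sub-formula cross-product; extract_subformulas is a hypothetical stand-in for the rule-based extraction from the labeled logical form.

```python
# Minimal sketch of the cross-product idea above: pair every span of the
# utterance with every sub-formula extracted from the labeled logical form.

def genlex_cross_product(utterance, logical_form, extract_subformulas, max_span_len=4):
    tokens = utterance.split()
    spans = {
        " ".join(tokens[i:j])
        for i in range(len(tokens))
        for j in range(i + 1, min(i + 1 + max_span_len, len(tokens) + 1))
    }
    subformulas = extract_subformulas(logical_form)   # e.g. California, Border(California), ...
    return {(span, sub) for span in spans for sub in subformulas}
```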

Lexicon generation
Denotation supervision: Largest state bordering California → Arizona
Enumerate spans: "Largest state bordering California", "Largest state", ...
Generate sub-formulas from the KB: California, Border(California), Traverse, Type.Mountain, λx.argmax(x, Elevation), ...
Restrict candidates with alignment, string matching, ...
Fancier methods exist (coarse-to-fine)

Unification [Kwiatkowski et al., 2010]
Logical-form supervision. Initialize the lexicon with (x_i, z_i):
States bordering California → Type(State) ⊓ Border(California)
Split each lexical entry in all possible ways:
Enumerate span splits: (states, bordering california), (states bordering, california), ...
Enumerate logical-form splits: (Type(State), λx.x ⊓ Border(California)), (λx.Type(State) ⊓ x, Border(California)), (λf.Type(State) ⊓ f(California), California), ...

Unification [Kwiatkowski et al., 2010]
For each example (x_i, z_i):
Find the highest-scoring correct parse d
Split all lexical entries in d in all possible ways
Add to the lexicon the lexical entry that improves the parse score the most

Do we need a lexicon?
(Figure: two derivations of Type(State) ⊓ Border(California) for the utterance "California neighbors"; in the second, the predicates Type, State, Border, California float rather than being anchored to specific words.)
Floating parse tree: a generalization of bridging
Perhaps with better learning and search a lexicon is not necessary?

Outline: Learning. Overview, Details, Example, Lexicon learning, Supervision signals

Supervision signals
We discussed training from logical forms and denotations
Other forms of supervision have been proposed: demonstrations, distant supervision, conversations, unsupervised learning, paraphrasing

Training from demonstrations [Artzi and Zettlemoyer, 2013]
Input: (x_i, s_i, t_i); x_i: utterance, s_i: start state, t_i: end state
move forward until you reach the intersection → λa.move(a) ∧ dir(a, forward) ∧ ...
An instance of learning from denotations

Distant supervision [Reddy et al., 2014]
Data generation: decompose declarative text into questions and answers
James Cameron is the director of Titanic → Q: X is the director of Titanic; A: James Cameron
Declarative text is cheap!

Distant supervision [Reddy et al., 2014]
Training: use existing non-executable semantic parsers
X is the director of Titanic → λx.director(x) ∧ director.of.arg1(e, x) ∧ director.of.arg2(e, Titanic)
Grounded candidates and their denotations:
λx.director(x) ∧ FilmDirectedBy(e, x) ∧ FilmDirected(e, Titanic) → James Cameron → true
λx.producer(x) ∧ FilmProducedBy(e, x) ∧ FilmProduced(e, Titanic) → James Cameron, Jon Landau → false

Training from conversations
Candidate logical forms:
z_1: From(Atlanta) ∧ To(London)
z_2: From(Atlanta) ∧ From(London)
z_3: To(Atlanta) ∧ To(London)
z_4: To(Atlanta) ∧ From(London)
Define a loss: Does z align with the conversation? Does z obey domain constraints?

Unsupervised learning [Goldwasser et al., 2011]
Intuition: assume repeating patterns are correct
Input: {x_i}_{i=1}^n; Output: θ
θ initialized manually, S = ∅
Until a stopping criterion is met:
    for each example x_i: S ← S ∪ {(x_i, argmax_d p_θ(d | x_i))}
    Compute statistics of S
    S_conf ← find a confident subset
    Train on S_conf
Substantially lower performance
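A minimal sketch of this self-training loop; parse_best, confidence, and train_on are hypothetical stand-ins, and the confidence threshold is an assumption.

```python
# Minimal sketch of the self-training loop above. "Confident subset" is whatever
# statistic of S (e.g. repeated lexical patterns) the method trusts.

def unsupervised_train(utterances, theta, parse_best, confidence, train_on,
                       rounds=5, threshold=0.8):
    for _ in range(rounds):
        S = [(x, parse_best(x, theta)) for x in utterances]        # self-label with current model
        S_conf = [(x, d) for x, d in S if confidence(x, d, S) >= threshold]
        theta = train_on(S_conf, theta)                            # treat confident parses as gold
    return theta
```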

Paraphrasing [B. and Liang, 2014]
What languages do people in Brazil use?
Candidate logical forms: Type.HumanLanguage ⊓ LanguagesSpoken.Brazil, ..., CapitalOf.Brazil
Canonical utterances generated for the candidates: What language is the language of Brazil?, ..., What city is the capital of Brazil?
A paraphrase model scores each canonical utterance against the input; executing the best candidate gives the answer: Portuguese, ...
Model: p_θ(c, z | x) = p(z | x) · p_θ(c | x)
Idea: train a large paraphrase model p_θ(c | x)
More later
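A minimal sketch of answering via paraphrasing; canonical, paraphrase_score, and execute are hypothetical stand-ins for generating a canonical utterance from a logical form, scoring the paraphrase against the input, and executing against the KB.

```python
# Minimal sketch of the paraphrasing idea above: each candidate logical form z
# comes with a canonical utterance c(z); a paraphrase model scores (x, c) and
# the best-scoring candidate is executed. All helpers are hypothetical.

def answer_by_paraphrasing(x, candidate_logical_forms, canonical, paraphrase_score, execute):
    best_z = max(candidate_logical_forms,
                 key=lambda z: paraphrase_score(x, canonical(z)))
    return execute(best_z)
```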

Summary
We saw how to train from denotations and logical forms
We saw methods for inducing lexicons during training
We reviewed work on using even weaker forms of supervision
Still an open problem