Lecture 5: Semantic Parsing, CCGs, and Structured Classification

Lecture 5: Semantic Parsing, CCGs, and Structured Classification Kyle Richardson kyle@ims.uni-stuttgart.de May 12, 2016

Lecture Plan paper: Zettlemoyer and Collins (2012) general topics: (P)CCGs, compositional semantic models, log-linear models, (stochastic) gradient descent. 2

The Big Picture (reminder) Standard processing pipeline: input "List samples that contain every major element" → Semantic Parsing → sem: (FOR EVERY X / MAJORELT : T; (FOR EVERY Y / SAMPLE : (CONTAINS Y X); (PRINTOUT Y))) → Knowledge Representation / Interpretation → world sem = {S10019, S10059, ...} Lunar QA system (Woods (1973)) 3

Semantic Parsing: Basic Ingredients Compositional Semantic Model: Generates compositional meaning representations (e.g., logical forms, functional representations,...) Translation Model: Maps text input to representations (tools already discussed: string rewriting, CFGs, SCFGs). Rule Extraction: Finds the candidate translation rules (e.g., SILT, word-alignments,...) Probabilistic Model: Used for learning and finding the best translations (PCFGs,PSCFGs, EM). 4

Semantic Parsing: Basic Ingredients Compositional Semantic Model: Generates compositional meaning representations (e.g., logical forms, functional representations,...) Translation Model: Maps text input to representations (tools already discussed: string rewriting, CFGs, SCFGs). Rule Extraction: Finds the candidate translation rules (e.g., SILT, word-alignments,...) Probabilistic Model: Used for learning and finding the best translations (PCFGs,PSCFGs, EM). This lecture: Introduce a new model based on (Combinatory) Categorial Grammar and log-linear models. 4

Classical Categorial Grammar (CG) Lexicalism: lexical entries encode nearly all information about how words are combined, no separate syntactic component. An example (syntactic) lexicon Λ John := NP (basic category) Mary := NP (basic category) sleeps := S\NP (derived category) loves := (S\NP)/NP - quietly := (S\NP)/(S\NP) - 5

Classical Categorial Grammar (CG) An example (syntactic) lexicon Λ John := NP (basic category) Mary := NP (basic category) sleeps := S\NP (derived category) loves := (S\NP)/NP - quietly := (S\NP)/(S\NP) - >>> john = NP ; mary = NP >>> sleeps = lambda x : S if x == NP else None 6

Classical Categorial Grammar (CG) An example (syntactic) lexicon Λ John := NP (basic category) Mary := NP (basic category) sleeps := S\NP (derived category) loves := (S\NP)/NP - quietly := (S\NP)/(S\NP) - >>> john = NP ; mary = NP >>> sleeps = lambda x : S if x == NP else None >>> sleeps(john) => S 7

Classical Categorial Grammar (CG) An example (syntactic) lexicon Λ John := NP (basic category) Mary := NP (basic category) sleeps := S\NP (derived category) loves := (S\NP)/NP - quietly := (S\NP)/(S\NP) - >>> john = NP ; mary = NP >>> loves = lambda x : (lambda y : S if y == NP else None) \ if x == NP else None >>> loves(mary) 8

Classical Categorial Grammar (CG) An example (syntactic) lexicon Λ John := NP (basic category) Mary := NP (basic category) sleeps := S\NP (derived category) loves := (S\NP)/NP - quietly := (S\NP)/(S\NP) - >>> john = NP ; mary = NP >>> loves = lambda x : (lambda y : S if y == NP else None) \ if x == NP else None >>> loves(mary) => <function __main__.<lambda>> 9

Classical Categorial Grammar (CG) An example (syntactic) lexicon Λ John := NP (basic category) Mary := NP (basic category) sleeps := S\NP (derived category) loves := (S\NP)/NP - quietly := (S\NP)/(S\NP) - >>> john = NP ; mary = NP >>> loves = lambda x : (lambda y : S if y == NP else None) \ if x == NP else None >>> loves(mary)(john) => S 10

CG Derivations Often shown in a tabular proof form, as a series of cancellation steps. John := NP, loves := (S\NP)/NP, Mary := NP; loves Mary ⇒ S\NP (>); John loves Mary ⇒ S (<) 11

CG Derivations Often shown in a tabular proof form, as a series of cancellation steps. John := NP, loves := (S\NP)/NP, Mary := NP; loves Mary ⇒ S\NP (>); John loves Mary ⇒ S (<) Function application: (>) A/B B ⇒ A; (<) B A\B ⇒ A 11

CG Derivations Often shown in a tabular proof form, as a series of cancellation steps. John := NP, loves := (S\NP)/NP, Mary := NP; loves Mary ⇒ S\NP (>); John loves Mary ⇒ S (<) >>> apply_right = lambda fun, arg: fun(arg) >>> apply_left = lambda arg, fun: fun(arg) >>> apply_right(loves, mary) => <function __main__.<lambda>> 12

CG Derivations Often shown in a tabular proof form, as a series of cancellation steps. John := NP, loves := (S\NP)/NP, Mary := NP; loves Mary ⇒ S\NP (>); John loves Mary ⇒ S (<) >>> apply_right = lambda fun, arg: fun(arg) >>> apply_left = lambda arg, fun: fun(arg) >>> apply_left(john, apply_right(loves, mary)) => S 13

CG and Semantics Lexical rules can be extended to have a compositional semantics. John := NP : john Mary := NP : mary sleeps := S\NP : λx.sleep(x) loves := (S\NP)/NP : λy, λx.loves(x, y) quietly := (S\NP)/(S\NP) : λf.λx. f(x) ∧ quiet(f, x) 14
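To make the lexical semantics concrete, here is a small Python sketch in the spirit of the REPL snippets above. The nested-tuple encoding of logical forms is an assumption of this sketch, not something from the slides.

# Semantic side of a tiny lexicon; logical forms are nested tuples.
john = 'john'
mary = 'mary'
sleeps = lambda x: ('sleep', x)                 # S\NP : λx.sleep(x)
loves = lambda y: (lambda x: ('love', x, y))    # (S\NP)/NP : λy, λx.love(x, y)

# Function application, mirroring the (>) and (<) cancellation steps.
apply_right = lambda fun, arg: fun(arg)         # A/B  B  =>  A
apply_left = lambda arg, fun: fun(arg)          # B  A\B  =>  A

vp = apply_right(loves, mary)   # S\NP : λx.love(x, mary)
s = apply_left(john, vp)        # S : love(john, mary)
print(s)                        # ('love', 'john', 'mary')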

CG Derivations Often shown in a tabular proof form, as a series of cancellation steps. John := NP : john, loves := (S\NP)/NP : λy, λx.love(x, y), Mary := NP : mary; loves Mary ⇒ S\NP : λx.love(x, mary) (>); John loves Mary ⇒ S : love(john, mary) (<) Function application with semantics: (>) A/B : f B : g ⇒ A : f(g); (<) B : g A\B : f ⇒ A : f(g) 15

CG Derivations Often shown in a tabular proof form, as a series of cancellation steps. John := NP : john, loves := (S\NP)/NP : λy, λx.love(x, y), Mary := NP : mary; loves Mary ⇒ S\NP : λx.love(x, mary) (>); John loves Mary ⇒ S : love(john, mary) (<) Derivation: (L, T), where L is a logical form (top), and T is the derivation steps (parse tree). 16

Combinatory Categorial Grammar (CCG) A particular theory of categorial grammar, which adds combination rules beyond function application (see paper for pointers or Steedman (2000)). e.g., composition: A/B : f B/C : g ⇒ A/C : λx.f(g(x)) Benefits: Is linguistically motivated, more expressive than context-free grammars (mildly context-sensitive), and allows polynomial-time parsing. 17

Combinatory Categorial Grammar (CCG) A particular theory of categorial grammar, which adds combination rules beyond function application (see paper for pointers or Steedman (2000)). e.g., composition: A/B : f B/C : g ⇒ A/C : λx.f(g(x)) Benefits: Is linguistically motivated, more expressive than context-free grammars (mildly context-sensitive), and allows polynomial-time parsing. Parsing: Extended version of the CKY algorithm (Steedman (2000)) 17
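A minimal Python sketch of the composition combinator, using string-valued toy categories in the style of the earlier lambda examples (the category names here are illustrative, not from the paper):

# Forward composition: A/B : f combines with B/C : g to give A/C : λx.f(g(x)).
compose = lambda f, g: (lambda x: f(g(x)))

# f behaves like a B -> A function, g like a C -> B function,
# so compose(f, g) behaves like a C -> A function.
f = lambda b: 'A' if b == 'B' else None
g = lambda c: 'B' if c == 'C' else None
f_of_g = compose(f, g)
print(f_of_g('C'))    # A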

A note about mild context-sensitivity Language classes (with example languages): regular (a^n b^m) ⊂ context-free (a^n b^n) ⊂ mildly context-sensitive (a^n b^n c^n d^n) ⊂ context-sensitive (a^n b^n c^n d^n e^n). CCGs and other mildly context-sensitive formalisms (TAGs, LIGs, ...) allow derivations/graphs more general than trees. 18

A note about mild context-sensitivity Language classes (with example languages): regular (a^n b^m) ⊂ context-free (a^n b^n) ⊂ mildly context-sensitive (a^n b^n c^n d^n) ⊂ context-sensitive (a^n b^n c^n d^n e^n). CCGs and other mildly context-sensitive formalisms (TAGs, LIGs, ...) allow derivations/graphs more general than trees. Theoretical question: what types of grammar formalisms are best suited for semantic parsing? 18

Earlier Lecture (Liang and Potts 2015) We have already seen something like categorial grammar. Rules are divided between syntactic and semantic rules. 19

Compositional Semantics (past lecture) Principle of Compositionality: The meaning of a complex expression is a function of the meaning of its parts and the rules that combine them. Example: John studies. John ↦ john, studies ↦ (λx.(study x)); function application gives (λx.(study x))(john) ⇒ (study john), a value in {True, False}. 20

Compositional Semantics (past lecture) Principle of Compositionality: The meaning of a complex expression is a function of the meaning of its parts and the rules that combine them. Example: John studies. John ↦ john, studies ↦ (λx.(study x)) >>> students_studying = set(['john', 'mary']) >>> study = lambda x: x in students_studying >>> fun_application = lambda fun, val: fun(val) >>> fun_application(study, 'bill') => False 21

Geoquery Logical forms Zettlemoyer and Collins (2012) use a conventional logical language. constants: entities, numbers, functions; logical connectives: conjunction (∧), disjunction (∨), negation (¬), implication (→); quantifiers: universal (∀) and existential (∃); lambda expressions: anonymous functions (λx.f(x)); other quantifiers/functions: arg max, definite descriptions (ι), ... Example: What states border Texas? λx.state(x) ∧ borders(x, texas) 22

Geoquery Logical forms Zettlemoyer and Collins (2012) use a conventional logical language. constants: entities, numbers, functions; logical connectives: conjunction (∧), disjunction (∨), negation (¬), implication (→); quantifiers: universal (∀) and existential (∃); lambda expressions: anonymous functions (λx.f(x)); other quantifiers/functions: arg max, definite descriptions (ι), ... Example: What is the largest state? arg max(λx.state(x), λx.size(x)) 23
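As a toy illustration of how such logical forms can be evaluated against a model, here is a Python sketch with an invented miniature geography domain (the entities, relations, and size numbers below are made up for illustration; this is not the real Geoquery database):

entities = ['texas', 'oklahoma', 'new_mexico', 'dallas']
states = {'texas', 'oklahoma', 'new_mexico'}
borders_rel = {('oklahoma', 'texas'), ('new_mexico', 'texas')}
size = {'texas': 695662, 'oklahoma': 181037, 'new_mexico': 314917, 'dallas': 999}

state = lambda x: x in states
borders = lambda x, y: (x, y) in borders_rel

# λx.state(x) ∧ borders(x, texas): the states bordering Texas
print([x for x in entities if state(x) and borders(x, 'texas')])
# ['oklahoma', 'new_mexico']

# arg max(λx.state(x), λx.size(x)): the largest entity satisfying state(x)
print(max((x for x in entities if state(x)), key=lambda x: size[x]))
# texas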

A mini functional interpreter: Constants (Haskell) 1 Example: What states border Texas? λx.state(x) ∧ borders(x, texas) Texas := NP : texas border := (S\NP)/NP : λyλx.borders(x, y) states := S\N : λx.state(x) which := (S/(S\NP))/N : λf, λg, λx.f(x) ∧ g(x) Prelude> let state a = (elem a ["nh","ma","vt"]) Prelude> state "nh" => True 1 you can try out these examples using https://tryhaskell.org/ 24

A mini functional interpreter: Constants (Haskell) 1 Example: What states border Texas? λx.state(x) ∧ borders(x, texas) Texas := NP : texas border := (S\NP)/NP : λyλx.borders(x, y) states := S\N : λx.state(x) which := (S/(S\NP))/N : λf, λg, λx.f(x) ∧ g(x) Prelude> let state a = (elem a ["nh","ma","vt"]) Prelude> state "nh" => True Prelude> :type state => state :: [Char] -> Bool 1 you can try out these examples using https://tryhaskell.org/ 24

A mini functional interpreter: Constants (Haskell) 1 Example: What states border Texas? λx.state(x) ∧ borders(x, texas) Texas := NP : texas border := (S\NP)/NP : λyλx.borders(x, y) states := S\N : λx.state(x) which := (S/(S\NP))/N : λf, λg, λx.f(x) ∧ g(x) Prelude> let state a = (elem a ["nh","ma","vt"]) Prelude> state "nh" => True Prelude> :type state => state :: [Char] -> Bool semantic type: e → t 1 you can try out these examples using https://tryhaskell.org/ 24

A mini functional interpreter: Constants (Haskell) Example: What states border Texas? λx.state(x) ∧ borders(x, texas) Texas := NP : texas border := (S\NP)/NP : λyλx.borders(x, y) states := S\N : λx.state(x) which := (S/(S\NP))/N : λf, λg, λx.f(x) ∧ g(x) Prelude> let borders :: ([Char],[Char]) -> Bool; borders a = (elem a [("oklahoma","texas")]) 25

A mini functional interpreter: Constants (Haskell) Example: What states border Texas? λx.state(x) ∧ borders(x, texas) Texas := NP : texas border := (S\NP)/NP : λyλx.borders(x, y) states := S\N : λx.state(x) which := (S/(S\NP))/N : λf, λg, λx.f(x) ∧ g(x) Prelude> let borders :: ([Char],[Char]) -> Bool; borders a = (elem a [("oklahoma","texas")]) Prelude> borders ("nh","texas") => False 25

CG and Binary Rules CGs typically assume binary rules/functions (or curried functions), so that function application applies to one argument at a time. Example: What states border Texas? λx.state(x) ∧ borders(x, texas) Texas := NP : texas border := (S\NP)/NP : λyλx.borders(x, y) states := S\N : λx.state(x) which := (S/(S\NP))/N : λf, λg, λx.f(x) ∧ g(x) Prelude> :type borders => borders :: ([Char], [Char]) -> Bool 26

CG and Binary Rules CGs typically assume binary rules/functions (or curried functions), so that function application applies to one argument at a time. Example: What states border Texas? λx.state(x) ∧ borders(x, texas) Texas := NP : texas border := (S\NP)/NP : λyλx.borders(x, y) states := S\N : λx.state(x) which := (S/(S\NP))/N : λf, λg, λx.f(x) ∧ g(x) Prelude> :type borders => borders :: ([Char], [Char]) -> Bool Prelude> let borders_c = curry borders 26

CG and Binary Rules CGs typically assume binary rules/functions (or curried functions), so that function application applies to one argument at a time. Example: What states border Texas? λx.state(x) ∧ borders(x, texas) Texas := NP : texas border := (S\NP)/NP : λyλx.borders(x, y) states := S\N : λx.state(x) which := (S/(S\NP))/N : λf, λg, λx.f(x) ∧ g(x) Prelude> :type borders => borders :: ([Char], [Char]) -> Bool Prelude> let borders_c = curry borders Prelude> :type borders_c => borders_c :: [Char] -> [Char] -> Bool 26

CG and Binary Rules CGs typically assume binary rules/functions (or curried functions), so that function application applies to one argument at a time. Example: What states border Texas? λx.state(x) ∧ borders(x, texas) Texas := NP : texas border := (S\NP)/NP : λyλx.borders(x, y) states := S\N : λx.state(x) which := (S/(S\NP))/N : λf, λg, λx.f(x) ∧ g(x) Prelude> :type borders => borders :: ([Char], [Char]) -> Bool Prelude> let borders_c = curry borders Prelude> :type borders_c => borders_c :: [Char] -> [Char] -> Bool semantic type: e → e → t 26

CG and Binary Rules CGs typically assume binary rules/functions (or curried functions), so that function application applies to one argument at a time. Example: What states border Texas? λx.state(x) ∧ borders(x, texas) Texas := NP : texas border := (S\NP)/NP : λyλx.borders(x, y) states := S\N : λx.state(x) which := (S/(S\NP))/N : λf, λg, λx.f(x) ∧ g(x) Prelude> let borders_c = curry borders Prelude> borders_c "oklahoma" "texas" => True 27

CG and Binary Rules CGs typically assume binary rules/functions (or curried functions), so that function application applies to one argument at a time. Example: What states border Texas? λx.state(x) ∧ borders(x, texas) Texas := NP : texas border := (S\NP)/NP : λyλx.borders(x, y) states := S\N : λx.state(x) which := (S/(S\NP))/N : λf, λg, λx.f(x) ∧ g(x) Prelude> let borders_c = curry borders Prelude> borders_c "oklahoma" "texas" => True Prelude> borders_c 10 "texas" 27

CG and Binary Rules CGs typically assume binary rules/functions (or curried functions), so that function application applies to one argument at a time. Example: What states border Texas? λx.state(x) ∧ borders(x, texas) Texas := NP : texas border := (S\NP)/NP : λyλx.borders(x, y) states := S\N : λx.state(x) which := (S/(S\NP))/N : λf, λg, λx.f(x) ∧ g(x) Prelude> let borders_c = curry borders Prelude> borders_c "oklahoma" "texas" => True Prelude> borders_c 10 "texas" => type error 27

CG and Partial Function Application Oklahoma := NP : oklahoma, borders := (S\NP)/NP : λy, λx.borders(x, y), Texas := NP : texas; borders Texas ⇒ S\NP : λx.borders(x, texas) (>); Oklahoma borders Texas ⇒ S : borders(oklahoma, texas) (<) Prelude> let borders_texas x = borders_c x "texas" Prelude> :type borders_texas 28

CG and Partial Function Application Oklahoma := NP : oklahoma, borders := (S\NP)/NP : λy, λx.borders(x, y), Texas := NP : texas; borders Texas ⇒ S\NP : λx.borders(x, texas) (>); Oklahoma borders Texas ⇒ S : borders(oklahoma, texas) (<) Prelude> let borders_texas x = borders_c x "texas" Prelude> :type borders_texas => borders_texas :: [Char] -> Bool 28

CG and Partial Function Application Oklahoma := NP : oklahoma, borders := (S\NP)/NP : λy, λx.borders(x, y), Texas := NP : texas; borders Texas ⇒ S\NP : λx.borders(x, texas) (>); Oklahoma borders Texas ⇒ S : borders(oklahoma, texas) (<) Prelude> let borders_texas x = borders_c x "texas" Prelude> borders_texas "oklahoma" => True 29

CG and Partial Function Application Oklahoma := NP : oklahoma, borders := (S\NP)/NP : λy, λx.borders(x, y), Texas := NP : texas; borders Texas ⇒ S\NP : λx.borders(x, texas) (>); Oklahoma borders Texas ⇒ S : borders(oklahoma, texas) (<) Prelude> let right_apply f x = (f x) Prelude> right_apply borders_c "texas" 30
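The same currying and partial-application idea can be sketched in Python. The borders_pairs set below is an invented stand-in for the Geoquery facts, and functools.partial plays the role of fixing one argument:

from functools import partial

borders_pairs = {('oklahoma', 'texas')}

# Curried in the order of the lexical entry λy, λx.borders(x, y):
# first fix the object y, then take the subject x.
borders_c = lambda y: (lambda x: (x, y) in borders_pairs)
borders_texas = borders_c('texas')        # S\NP : λx.borders(x, texas)
print(borders_texas('oklahoma'))          # True

# functools.partial gives the same effect for an uncurried function.
borders_flat = lambda x, y: (x, y) in borders_pairs
borders_texas2 = partial(borders_flat, y='texas')
print(borders_texas2('oklahoma'))         # True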

Overall Model: Zettlemoyer and Collins (2012) Compositional Semantic Model: assumes the Geoquery representations and semantics we've discussed. Probabilistic Model: Deciding between different analyses, handling spurious ambiguity. Lexical (Rule) Extraction: Finding the set of CCG lexical entries in Λ. 31

Probabilistic CCG Model Assumption: Let's say (for now) we have a crude CCG lexicon Λ that over-generates for any given input. Derivation: a pair (L, T), where L is the final logical form and T is the derivation tree. Example: Oklahoma borders Texas. Oklahoma := NP : oklahoma, borders := (S\NP)/NP : λy, λx.borders(x, y), Texas := NP : texas; borders Texas ⇒ S\NP : λx.borders(x, texas) (>); Oklahoma borders Texas ⇒ S : borders(oklahoma, texas) (<) 32

Probabilistic CCG Model Assumption: Let's say (for now) we have a crude CCG lexicon Λ that over-generates for any given input. Derivation: a pair (L, T), where L is the final logical form and T is the derivation tree. Example: Oklahoma borders Texas (an alternative derivation). Texas := NP : texas, borders := (S\NP)/NP : λy, λx.borders(x, y), Oklahoma := NP : oklahoma; S\NP : λx.borders(x, texas) (<); S : borders(oklahoma, texas) (<) 33

Probabilistic CCG Model Assumption: Let's say (for now) we have a crude CCG lexicon Λ that over-generates for any given input. Derivation: a pair (L, T), where L is the final logical form and T is the derivation tree. Example: Oklahoma borders Texas (note the incorrect lexical entry for Oklahoma). Oklahoma := NP : ohio, borders := (S\NP)/NP : λy, λx.borders(x, y), Texas := NP : texas; borders Texas ⇒ S\NP : λx.borders(x, texas) (>); Oklahoma borders Texas ⇒ S : borders(ohio, texas) (<) 34

Probabilistic CCG Model Use a log-linear formulation of CCG (Clark and Curran (2003)): p(L, T | S; θ) = e^{f(L,T,S)·θ} / Σ_{(L',T')} e^{f(L',T',S)·θ} Parsing Problem: argmax_L p(L | S; θ), where p(L | S; θ) = Σ_T p(L, T | S; θ) 35

Probabilistic CCG Model Use a log-linear formulation of CCG (Clark and Curran (2003)): p(L, T | S; θ) = e^{f(L,T,S)·θ} / Σ_{(L',T')} e^{f(L',T',S)·θ} Parsing Problem: argmax_L p(L | S; θ), where p(L | S; θ) = Σ_T p(L, T | S; θ) Note: the number of derivations T might be very large; use dynamic programming. 35

Probabilistic CCG Model Use a log-linear formulation of CCG (Clark and Curran (2003)): p(L, T | S; θ) = e^{f(L,T,S)·θ} / Σ_{(L',T')} e^{f(L',T',S)·θ} Parsing Problem: argmax_L p(L | S; θ), where p(L | S; θ) = Σ_T p(L, T | S; θ) Note: the number of derivations T might be very large; use dynamic programming. Learning Setting: The correct derivations are not annotated (latent), no further supervision is provided (weak). Lexicon is learned from data. 35

Log-linear Model: Basics Log-Linear Model: 2 A set X of inputs (e.g. sentences) A set Y of labels/structures. A feature function f : X × Y → R^d for any pair (x, y) A weight vector θ ∈ R^d Conditional Model: for x ∈ X, y ∈ Y p(y | x; θ) = e^{f(x,y)·θ} / Z(x, θ) 2 Examples and ideas from Michael Collins's and Charles Elkan's tutorials (see syllabus). 36

Log-linear Model: Basics Conditional Model: for x ∈ X, y ∈ Y p(y | x; θ) = e^{f(x,y)·θ} / Z(x, θ) e^x: the exponential function exp(x) (keeps scores positive) inner product (sum of features f_k times feature weights θ_k): f(x, y)·θ = Σ_{k=1}^{d} θ_k f_k(x, y) normalization term or partition function: Z(x, θ) = Σ_{y'∈Y} e^{f(x,y')·θ} 37
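To make these definitions concrete, here is a minimal Python sketch of a conditional log-linear model over a toy label set. The labels, features, and weights below are invented for illustration, not taken from the paper.

import math

LABELS = ['N', 'V', 'D']

def features(x, y):
    # sparse indicator features: (word, tag) and (suffix, tag)
    return {('word=%s' % x, y): 1.0, ('suffix=%s' % x[-1:], y): 1.0}

def score(x, y, theta):
    # inner product f(x, y) . theta over the sparse feature dict
    return sum(v * theta.get(k, 0.0) for k, v in features(x, y).items())

def p_y_given_x(x, theta):
    # exp(score) / Z(x, theta)
    exp_scores = {y: math.exp(score(x, y, theta)) for y in LABELS}
    z = sum(exp_scores.values())
    return {y: s / z for y, s in exp_scores.items()}

theta = {('word=dog', 'N'): 2.0, ('suffix=s', 'V'): 1.0}
print(p_y_given_x('dog', theta))   # probability mass shifted toward 'N'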

Structured Classification and Features Structured classification: We assume that the labels in Y have a rich internal structure (e.g., parse trees, POS tag sequences). Individual feature functions: f_i(x, y) ∈ R for i = 1, ..., d 38

Structured Classification and Features Structured classification: We assume that the labels in Y have a rich internal structure (e.g., parse trees, POS tag sequences). Individual feature functions: f_i(x, y) ∈ R for i = 1, ..., d In the general case for log-linear models, there is no restriction on the types of features you can define. 38

Structured Classification and Features Structured classification: We assume that the labels in Y have a rich internal structure (e.g., parse trees, POS tag sequences). Individual feature functions: f_i(x, y) ∈ R for i = 1, ..., d In the general case for log-linear models, there is no restriction on the types of features you can define. Feature Templates: features are not usually specified individually, but in terms of more general classes. 38

Features: Part of speech tagging Goal: Assign to a given input word w_j in a sentence a part-of-speech tag from a set of tags Y = {N, ADJ, PN, ...} Example: The_D dog_N sleeps 39

Features: Part of speech tagging Goal: Assign to a given input word w_j in a sentence a part-of-speech tag from a set of tags Y = {N, ADJ, PN, ...} Example: The_D dog_N sleeps f_id(w_j, y)(x, y) = 1 if word w_j in x has tag y, 0 otherwise f_id(w_{j-1}, w_j, y)(x, y) = 1 if words w_{j-1} and w_j have tag y, 0 otherwise 39

Features: Part of speech tagging Goal: Assign to a given input word w_j in a sentence a part-of-speech tag from a set of tags Y = {N, ADJ, PN, ...} Example: The_D dog_N sleeps f_id(w_j, y)(x, y) = 1 if word w_j in x has tag y, 0 otherwise f_id(w_{j-1}, w_j, y)(x, y) = 1 if words w_{j-1} and w_j have tag y, 0 otherwise f_id(mother feature)(x, y) = 100 if my mother likes the tag y, 0 otherwise 39

Features: Part of speech tagging Goal: Assign to a given input word w_j in a sentence a part-of-speech tag from a set of tags Y = {N, ADJ, PN, ...} Example: The_D dog_N sleeps f_id(w_j, y)(x, y) = 1 if word w_j in x has tag y, 0 otherwise f_id(w_{j-1}, w_j, y)(x, y) = 1 if words w_{j-1} and w_j have tag y, 0 otherwise f_id(mother feature)(x, y) = 100 if my mother likes the tag y, 0 otherwise At the end of the day... the important factor is the features used. Domingos (2012) 39
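A small Python sketch of the first two feature templates above (a sentence is assumed to be a plain list of word strings; the feature-name format is arbitrary):

def template_features(sentence, j, y):
    """Indicator features for assigning tag y to the word at position j."""
    w_j = sentence[j]
    w_prev = sentence[j - 1] if j > 0 else '<s>'
    feats = {}
    feats['id(%s,%s)' % (w_j, y)] = 1              # word/tag template
    feats['id(%s,%s,%s)' % (w_prev, w_j, y)] = 1   # previous-word/word/tag template
    return feats

print(template_features(['The', 'dog', 'sleeps'], 1, 'N'))
# {'id(dog,N)': 1, 'id(The,dog,N)': 1}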

Local CCG features: Zettlemoyer and Collins (2012) Output labels: Y is the set of (structured) CCG derivations. Local features: Limit features to lexical rules in derivations 40

Local CCG features: Zettlemoyer and Collins (2012) Output labels: Y is the set of (structured) CCG derivations. Local features: Limit features to lexical rules in derivations. Oklahoma := NP : oklahoma, borders := (S\NP)/NP : λy, λx.borders(x, y), Texas := NP : texas; borders Texas ⇒ S\NP : λx.borders(x, texas) (>); Oklahoma borders Texas ⇒ S : borders(oklahoma, texas) (<) 40

Local CCG features: Zettlemoyer and Collins (2012) Output labels: Y is the set of (structured) CCG derivations. Local features: Limit features to lexical rules in derivations. Oklahoma := NP : oklahoma, borders := (S\NP)/NP : λy, λx.borders(x, y), Texas := NP : texas; borders Texas ⇒ S\NP : λx.borders(x, texas) (>); Oklahoma borders Texas ⇒ S : borders(oklahoma, texas) (<) f_id(NP : texas)(x, y) = count(NP : texas) = 1 40

Local versus non-local features Why only local? Local features can be efficiently and easily extracted using our normal parsing algorithms and dynamic programming. The chart data-structure (e.g., in CKY) is a flat structure of cell entries. Oklahoma := NP : oklahoma, borders := (S\NP)/NP : λy, λx.borders(x, y), Texas := NP : texas; borders Texas ⇒ S\NP : λx.borders(x, texas) (>); Oklahoma borders Texas ⇒ S : borders(oklahoma, texas) (<) f_j(x, y) = ⟨Oklahoma, borders, NP : texas⟩ 41
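A sketch of what "local" means in practice: with an invented nested-tuple encoding of a derivation, the feature vector is simply a count of the lexical entries it uses, so it can be accumulated cell-by-cell during chart parsing:

from collections import Counter

# Internal nodes are (category, semantics, children); leaves are (word, category, semantics).
derivation = ('S', 'borders(oklahoma,texas)', [
    ('Oklahoma', 'NP', 'oklahoma'),
    ('S\\NP', 'lam x.borders(x,texas)', [
        ('borders', '(S\\NP)/NP', 'lam y.lam x.borders(x,y)'),
        ('Texas', 'NP', 'texas'),
    ]),
])

def lexical_features(node, counts=None):
    counts = Counter() if counts is None else counts
    if isinstance(node[2], list):          # internal node: recurse into children
        for child in node[2]:
            lexical_features(child, counts)
    else:                                  # lexical leaf: count the lexical entry
        word, cat, sem = node
        counts['%s := %s : %s' % (word, cat, sem)] += 1
    return counts

print(lexical_features(derivation))        # each lexical entry occurs once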

Local versus non-local features Why only local? Local features can be efficiently and easily extracted using our normal parsing algorithms and dynamic programming. The chart data-structure (e.g., in CKY) is a flat structure of cell entries. Non-local features: getting around these issues. k-best parsing: train a model on k-best trees (more on this later). forest-reranking: Huang (2008). 42

Parameter Estimation in Log-Linear Models: Basics Learning task: choose values for the feature weights that solve some objective. Training Data: D = {(x_i, y_i)}_{i=1}^{n} Maximum Likelihood: find a model θ* that maximizes the probability of the training data (or the logarithm of the conditional likelihood (LCL)): θ* = argmax_θ Π_{i=1}^{n} p(y_i | x_i; θ) = argmax_θ Σ_{i=1}^{n} log p(y_i | x_i; θ) 43
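A self-contained Python sketch of the LCL objective on a toy dataset (the data, labels, and features are invented; this is not the paper's model):

import math

LABELS = ['N', 'V']
DATA = [('dog', 'N'), ('runs', 'V'), ('cat', 'N')]

def features(x, y):
    return {('word=%s' % x, y): 1.0}

def p_y_given_x(x, theta):
    scores = {y: math.exp(sum(v * theta.get(k, 0.0)
                              for k, v in features(x, y).items())) for y in LABELS}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

def lcl(data, theta):
    # O(theta) = sum_i log p(y_i | x_i; theta)
    return sum(math.log(p_y_given_x(x, theta)[y]) for x, y in data)

print(lcl(DATA, {}))                            # 3 * log(0.5) with all-zero weights
print(lcl(DATA, {('word=dog', 'N'): 1.0}))      # higher value: a better fit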

Optimization and Gradient methods Optimization: method for solving your objective. Intuitively: assigning numbers to the feature weights θ_1, ..., θ_d 44

Optimization and Gradient methods Optimization: method for solving your objective. Intuitively: assigning numbers to the feature weights θ_1, ..., θ_d Gradient-based optimization: Uses a tool from calculus, the gradient 44

Optimization and Gradient methods Optimization: method for solving your objective. Intuitively: assigning numbers to the feature weights θ_1, ..., θ_d Gradient-based optimization: Uses a tool from calculus, the gradient Gradient: A type of derivative, or measure of the rate at which a function changes Tells you the direction to move in order to get closer to the objective. 44

Gradient Methods: Intuitive explanation 3 Imagine your hobby is blindfolded mountain climbing, i.e., someone blindfolds you and puts you on the side of a mountain. Your goal is to reach the peak of the mountain by feeling things out, e.g., moving in the direction that feels upward until you reach the peak. If the mountain is concave, you will eventually reach your goal. 3 Example taken from Hal Daumé III's book: http://ciml.info/ 45

Gradient Methods: Intuitive explanation 3 Imagine your hobby is blindfolded mountain climbing, i.e., someone blindfolds you and puts you on the side of a mountain. Your goal is to reach the peak of the mountain by feeling things out, e.g., moving in the direction that feels upward until you reach the peak. If the mountain is concave, you will eventually reach your goal. Objective: Reach the maximum point (the peak) of the mountain. 3 Example taken from Hal Daumé III's book: http://ciml.info/ 45

Gradient Methods: Intuitive explanation 3 Imagine your hobby is blindfolded mountain climbing, i.e., someone blindfolds you and puts you on the side of a mountain. Your goal is to reach the peak of the mountain by feeling things out, e.g., moving in the direction that feels upward until you reach the peak. If the mountain is concave, you will eventually reach your goal. Objective: Reach the maximum point (the peak) of the mountain. Gradient: The direction to move at each point to get closer to the peak (your objective). 3 Example taken from Hal Daumé III's book: http://ciml.info/ 45

Gradient Methods: Intuitive explanation 3 Imagine your hobby is blindfolded mountain climbing, i.e., someone blindfolds you and puts you on the side of a mountain. Your goal is to reach the peak of the mountain by feeling things out, e.g., moving in the direction that feels upward until you reach the peak. If the mountain is concave, you will eventually reach your goal. Objective: Reach the maximum point (the peak) of the mountain. Gradient: The direction to move at each point to get closer to the peak (your objective). Step Size: How much to move at each step (parameter). 3 Example taken from Hal Daumé III's book: http://ciml.info/ 45

Gradient Ascent Gradient ascent algorithm (abstract form): 1: Initialize θ ∈ R^d to 0 2: While not converged do 3: Calculate δ_k = ∂O/∂θ_k for k = 1, ..., d 4: Set each θ_k = θ_k + α δ_k 5: Return θ Goal: To find some maximum of a function. Gradient δ_k: Direction to move each feature weight θ_k to get closer to your objective O (e.g., reaching the peak). line 3: in the batch case, uses the full dataset to estimate. α: learning rate or size of step to take (hyper-parameter). 46

Gradient Ascent: Computing the Gradients Dataset: Assume that we have our dataset D = {(x_i, y_i)}_{i=1}^{n} and a feature weight vector θ ∈ R^d. LCL Objective: O(θ) = Σ_{i=1}^{n} log p(y_i | x_i; θ) Computing the Gradient (using some calculus, full derivation not shown): ∂O(θ)/∂θ_j = Σ_{i=1}^{n} f_j(x_i, y_i) - Σ_{i=1}^{n} Σ_{y∈Y} p(y | x_i; θ) f_j(x_i, y) 47

Gradient Ascent: Computing the Gradients Dataset: Assume that we have our dataset D = {(x_i, y_i)}_{i=1}^{n} and a feature weight vector θ ∈ R^d. LCL Objective: O(θ) = Σ_{i=1}^{n} log p(y_i | x_i; θ) Computing the Gradient (using some calculus, full derivation not shown): ∂O(θ)/∂θ_j = Σ_{i=1}^{n} f_j(x_i, y_i) - Σ_{i=1}^{n} Σ_{y∈Y} p(y | x_i; θ) f_j(x_i, y) This is the main formula for computing line 3 (previous page): Empirical counts: the first summation above. Expected counts: the second summation. 47

Gradient Ascent: Computing the Gradients Computing the Gradient (using some calculus, full derivation not shown): ∂O(θ)/∂θ_j = Σ_{i=1}^{n} f_j(x_i, y_i) - Σ_{i=1}^{n} Σ_{y∈Y} p(y | x_i; θ) f_j(x_i, y) 1: Initialize θ ∈ R^d to 0 2: While not converged do 3: Calculate δ_k = ∂O/∂θ_k for k = 1, ..., d 4: Set each θ_k = θ_k + α δ_k 5: Return θ Note: Making updates (line 4) requires first iterating over our full training set (line 3); this is an instance of batch learning. 48
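A runnable sketch of batch gradient ascent on the LCL objective, implementing the "empirical counts minus expected counts" gradient above on an invented toy dataset (not the paper's data or features):

import math
from collections import defaultdict

LABELS = ['N', 'V']
DATA = [('dog', 'N'), ('runs', 'V'), ('cat', 'N'), ('dog', 'N')]

def features(x, y):
    return {('word=%s' % x, y): 1.0, ('bias', y): 1.0}

def p_y_given_x(x, theta):
    scores = {y: math.exp(sum(v * theta.get(k, 0.0)
                              for k, v in features(x, y).items())) for y in LABELS}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

def batch_gradient(data, theta):
    grad = defaultdict(float)
    for x, y_gold in data:
        for k, v in features(x, y_gold).items():     # empirical counts
            grad[k] += v
        probs = p_y_given_x(x, theta)
        for y in LABELS:                             # expected counts under the model
            for k, v in features(x, y).items():
                grad[k] -= probs[y] * v
    return grad

theta, alpha = defaultdict(float), 0.5
for epoch in range(50):                              # one update per pass over the data
    for k, g in batch_gradient(DATA, theta).items():
        theta[k] += alpha * g
print(p_y_given_x('dog', theta)['N'])                # close to 1.0 after training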

Gradients: Zettlemoyer and Collins (2012) Recall that a CCG derivation is a pair (L, T). LCL objective (same objective, but a slightly different computation): O(θ) = Σ_{i=1}^{n} log p(L_i | x_i; θ) = Σ_{i=1}^{n} log ( Σ_T p(L_i, T | x_i; θ) ) 49

Gradients: Zettlemoyer and Collins (2012) Recall that a CCG derivation is a pair (L, T). LCL objective (same objective, but a slightly different computation): O(θ) = Σ_{i=1}^{n} log p(L_i | x_i; θ) = Σ_{i=1}^{n} log ( Σ_T p(L_i, T | x_i; θ) ) Computing the Gradient: ∂O(θ)/∂θ_j = Σ_{i=1}^{n} Σ_T f_j(L_i, T, x_i) p(T | L_i, x_i; θ) - Σ_{i=1}^{n} Σ_{L,T} f_j(L, T, x_i) p(L, T | x_i; θ) 49

Gradients: Zettlemoyer and Collins (2012) Computing the Gradient: ∂O(θ)/∂θ_j = Σ_{i=1}^{n} Σ_T f_j(L_i, T, x_i) p(T | L_i, x_i; θ) - Σ_{i=1}^{n} Σ_{L,T} f_j(L, T, x_i) p(L, T | x_i; θ) Note: This involves finding the probability of all trees/derivations and their features given an input. 50

Gradients: Zettlemoyer and Collins (2012) Computing the Gradient: ∂O(θ)/∂θ_j = Σ_{i=1}^{n} Σ_T f_j(L_i, T, x_i) p(T | L_i, x_i; θ) - Σ_{i=1}^{n} Σ_{L,T} f_j(L, T, x_i) p(L, T | x_i; θ) Note: This involves finding the probability of all trees/derivations and their features given an input. Dynamic programming: Use a variant of the inside-outside probabilities covered in Lecture 3 (no big deal). 50
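A sketch of this latent-derivation gradient: here each candidate derivation of an input is given directly as a (logical form, feature dict) pair, standing in for what the parser plus inside-outside computation would produce; the example assumes at least one candidate derivation yields the gold logical form.

import math
from collections import defaultdict

def score(feats, theta):
    return sum(v * theta.get(k, 0.0) for k, v in feats.items())

def latent_gradient(derivations, gold_L, theta):
    # expectation over T given the gold L, minus expectation over all (L, T)
    exp_scores = [math.exp(score(f, theta)) for _, f in derivations]
    z_all = sum(exp_scores)
    good = [(s, f) for s, (L, f) in zip(exp_scores, derivations) if L == gold_L]
    z_good = sum(s for s, _ in good)
    grad = defaultdict(float)
    for s, f in good:
        for k, v in f.items():
            grad[k] += (s / z_good) * v
    for s, (_, f) in zip(exp_scores, derivations):
        for k, v in f.items():
            grad[k] -= (s / z_all) * v
    return grad

derivs = [('borders(oklahoma,texas)', {'Oklahoma:=NP:oklahoma': 1, 'Texas:=NP:texas': 1}),
          ('borders(ohio,texas)', {'Oklahoma:=NP:ohio': 1, 'Texas:=NP:texas': 1})]
print(dict(latent_gradient(derivs, 'borders(oklahoma,texas)', defaultdict(float))))
# the correct lexical entry gets a positive gradient, the spurious one a negative gradient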

Batch vs. Stochastic Gradient Ascent Batch Gradient: ∂O(θ)/∂θ_j = Σ_{i=1}^{n} f_j(x_i, y_i) - Σ_{i=1}^{n} Σ_{y∈Y} p(y | x_i; θ) f_j(x_i, y) 1: Initialize θ ∈ R^d to 0 2: While not converged do 3: Calculate δ_k = ∂O/∂θ_k for k = 1, ..., d 4: Set each θ_k = θ_k + α δ_k 5: Return θ Stochastic Gradient (single example (x_i, y_i)): ∂/∂θ_j log p(y_i | x_i; θ) = f_j(x_i, y_i) - Σ_{y∈Y} p(y | x_i; θ) f_j(x_i, y) 51

Batch vs. Stochastic Gradient Ascent Batch Gradient: ∂O(θ)/∂θ_j = Σ_{i=1}^{n} f_j(x_i, y_i) - Σ_{i=1}^{n} Σ_{y∈Y} p(y | x_i; θ) f_j(x_i, y) 1: Initialize θ ∈ R^d to 0 2: While not converged do 3: Calculate δ_k = ∂O/∂θ_k for k = 1, ..., d 4: Set each θ_k = θ_k + α δ_k 5: Return θ Stochastic Gradient (single example (x_i, y_i)): ∂/∂θ_j log p(y_i | x_i; θ) = f_j(x_i, y_i) - Σ_{y∈Y} p(y | x_i; θ) f_j(x_i, y) Online learning: Updates are made at each example. 51

Stochastic Gradient Ascent: Full Dataset: Assume that we have our dataset D = {(x_i, y_i)}_{i=1}^{n} and a feature weight vector θ ∈ R^d. 1: Initialize θ ∈ R^d to 0 2: While not converged do 3: Repeat for i = 1, ..., n 4: θ_k = θ_k + α (f_k(x_i, y_i) - Σ_{y∈Y} p(y | x_i; θ) f_k(x_i, y)) for each k 5: Return θ 52

Stochastic Gradient Ascent: Full Dataset: Assume that we have our dataset D = {(x_i, y_i)}_{i=1}^{n} and a feature weight vector θ ∈ R^d. 1: Initialize θ ∈ R^d to 0 2: While not converged do 3: Repeat for i = 1, ..., n 4: θ_k = θ_k + α (f_k(x_i, y_i) - Σ_{y∈Y} p(y | x_i; θ) f_k(x_i, y)) for each k 5: Return θ line 3: Start iterating through the dataset. line 4: Update at each example for the LCL objective. 52

Stochastic Gradient Ascent: Full Dataset: Assume that we have our dataset D = {(x_i, y_i)}_{i=1}^{n} and a feature weight vector θ ∈ R^d. 1: Initialize θ ∈ R^d to 0 2: While not converged do 3: Repeat for i = 1, ..., n 4: θ_k = θ_k + α (f_k(x_i, y_i) - Σ_{y∈Y} p(y | x_i; θ) f_k(x_i, y)) for each k 5: Return θ line 3: Start iterating through the dataset. line 4: Update at each example for the LCL objective. This is the simplest form: vanilla gradient ascent. 52
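The same kind of toy log-linear model trained with vanilla stochastic gradient ascent, making one update per example as in lines 3-4 above (data and features are invented for illustration):

import math
from collections import defaultdict

LABELS = ['N', 'V']
DATA = [('dog', 'N'), ('runs', 'V'), ('cat', 'N')]

def features(x, y):
    return {('word=%s' % x, y): 1.0}

def p_y_given_x(x, theta):
    scores = {y: math.exp(sum(v * theta.get(k, 0.0)
                              for k, v in features(x, y).items())) for y in LABELS}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

theta, alpha = defaultdict(float), 0.5
for epoch in range(20):
    for x_i, y_i in DATA:                       # line 3: iterate through the dataset
        probs = p_y_given_x(x_i, theta)
        update = defaultdict(float)
        for k, v in features(x_i, y_i).items():
            update[k] += v                      # f_k(x_i, y_i)
        for y in LABELS:
            for k, v in features(x_i, y).items():
                update[k] -= probs[y] * v       # expected counts under the model
        for k, g in update.items():             # line 4: per-example update
            theta[k] += alpha * g

print(round(p_y_given_x('dog', theta)['N'], 3))  # close to 1.0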

Overall Model: Zettlemoyer and Collins (2012) Compositional Semantic Model: assumes the Geoquery representations and semantics we've discussed. Probabilistic Model: Deciding between different analyses, handling spurious ambiguity. Lexical (Rule) Extraction: Finding the set of CCG lexical entries in Λ. 53

GenLex: Lexical rule extraction So far we have assumed the existence of a CCG lexicon Λ. GenLex: Takes a sentence and logical form and generates lexical items. GenLex(S, L) = {x := y | x ∈ W(S), y ∈ C(L)} W(S): the set of substrings of the input S C(L): CCG rule templates or triggers applied to L 54

Lexical rule templates (Triggers) Templates specify patterns in logical forms (input triggers) and their mapping to CCG lexical entries (output categories). They are hand-engineered (a downside), which was subsequently improved on in Kwiatkowski et al. (2010). 55

Lexical rule templates: Example Example: Oklahoma borders Texas. borders(oklahoma, texas ) 56

Lexical rule templates: Example Example: Oklahoma borders Texas. borders(oklahoma, texas ) W (Oklahoma borders Texas) = { Oklahoma, Texas, Oklahoma borders,...} 56

Lexical rule templates: Example Example: Oklahoma borders Texas. borders(oklahoma, texas) W(Oklahoma borders Texas) = { Oklahoma, Texas, Oklahoma borders, ...} C(borders(oklahoma, texas)) = { borders(·,·) → (S\NP)/NP : λy, λx.borders(x, y); texas → NP : texas, ...} GenLex: takes the combination (cross product) of these. 56
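A simplified Python sketch of GenLex as a cross product of substrings and trigger outputs. Here C(L) is given by hand rather than computed from the logical form, which is an assumption of this sketch:

def substrings(sentence):
    """W(S): all contiguous substrings of the input sentence."""
    words = sentence.split()
    return [' '.join(words[i:j]) for i in range(len(words))
            for j in range(i + 1, len(words) + 1)]

def genlex(sentence, categories):
    """GenLex(S, L) = {x := y | x in W(S), y in C(L)}."""
    return [(x, y) for x in substrings(sentence) for y in categories]

C_L = ['(S\\NP)/NP : lam y.lam x.borders(x,y)', 'NP : texas']   # C(L), given by hand
lexicon = genlex('Oklahoma borders Texas', C_L)
print(len(lexicon))   # 6 substrings x 2 categories = 12 candidate lexical entries
print(lexicon[0])     # ('Oklahoma', '(S\\NP)/NP : lam y.lam x.borders(x,y)')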

Synthesis: Lexical Learning + Parameter Estimation Learning: The complete learning algorithm joins lexical learning with log-linear parameter estimation (via stochastic gradient ascent). Big Idea: Learn compact lexicons via a greedy iterative method that works with high-probability rules/derivations. 57

Synthesis: Lexical Learning + Parameter Estimation Step 1: Search for a small set of lexical entries that can parse the data, then parse and find the most probable rules. Step 2: Re-estimate the log-linear model based on these compact lexical entries. 58

Results (brief) Two benchmark datasets (still being used today). Highest results reported at the time of publishing. Quite impressive increases in precision (though not so impressive recall). 59

Conclusions and Take-aways Introduced (C)CG, a new formalism for semantic parsing. Lexicalism: lexical entries describe combination rules. Nice formalism for jointly modeling syntax-semantics. 60

Conclusions and Take-aways Introduced (C)CG, a new formalism for semantic parsing. Lexicalism: lexical entries describe combination rules. Nice formalism for jointly modeling syntax-semantics. Log-linear CCG model for parsing from Zettlemoyer and Collins (2012) Log-linear models: In particular, conditional log-linear model. Gradient methods: gradient-based optimization and (stochastic) gradient ascent 60

Roadmap Next session: start of student presentations! 30 minutes each, plus 10-15 minutes for questions. Due date: slides (or draft slides) must be submitted one week in advance for approval. Questions: I will send specific questions that I expect you to address in your talk (not exam questions, only meant to help). Schedule update: I will give another lecture on the final class session (another opportunity for writing a reading summary). 61

References I

Clark, S. and Curran, J. R. (2003). Log-linear models for wide-coverage CCG parsing. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 97-104. Association for Computational Linguistics.

Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55(10):78-87.

Huang, L. (2008). Forest reranking: Discriminative parsing with non-local features. In ACL, pages 586-594.

Kwiatkowski, T., Zettlemoyer, L., Goldwater, S., and Steedman, M. (2010). Inducing probabilistic CCG grammars from logical form with higher-order unification. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1223-1233. Association for Computational Linguistics. http://www.aclweb.org/anthology/d/d10/d10-1119.pdf.

Steedman, M. (2000). The Syntactic Process, volume 24. MIT Press.

Woods, W. A. (1973). Progress in natural language understanding: an application to lunar geology. In Proceedings of the June 4-8, 1973, National Computer Conference and Exposition, pages 441-450.

Zettlemoyer, L. S. and Collins, M. (2012). Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. arXiv preprint arXiv:1207.1420. http://arxiv.org/abs/1207.1420. 62