Outline: Learning. Overview, Details, Example, Lexicon learning, Supervision signals

Outline: Learning. Overview, Details, Example, Lexicon learning, Supervision signals

Supervision in syntactic parsing
Input: sentences annotated with full parse trees (example tree shown for "ESSLLI 2016, the known summer school, is located in Bolzano")
Output: a parse tree for a new sentence, e.g. S -> NP (They) VP (play football) for "They play football"

Supervision in semantic parsing [Zelle & Mooney, 1996; Zettlemoyer & Collins, 2005; Clarke et al., 2010; Liang et al., 2011]
Input, heavy supervision (utterances paired with logical forms):
How tall is Lebron James? → HeightOf.LebronJames
What is Steph Curry's daughter called? → ChildrenOf.StephCurry ⊓ Gender.Female
Youngest player of the Cavaliers → argmin(PlayerOf.Cavaliers, BirthDateOf)
...
Input, light supervision (utterances paired with denotations):
How tall is Lebron James? → 203cm
What is Steph Curry's daughter called? → Riley Curry
Youngest player of the Cavaliers → Kyrie Irving
...
Output: for a new utterance such as "Clay Thompson's weight" or "Clay Thompson weight", a logical form (WeightOf.ClayThompson, Weight.ClayThompson) and/or its denotation, 205 lbs

Learning in a nutshell
utterance → Parsing → candidate derivations → Label → Update model
0. Define model for derivations
1. Generate candidate derivations (later)
2. Label derivations as correct and incorrect
3. Update model to favor correct trees

Training intuition
Where did Mozart tupress? → gold answer: Vienna
Candidate logical forms and their denotations:
PlaceOfBirth.WolfgangMozart → Salzburg
PlaceOfDeath.WolfgangMozart → Vienna
PlaceOfMarriage.WolfgangMozart → Vienna

Where did Hogarth tupress? → gold answer: London
PlaceOfBirth.WilliamHogarth → London
PlaceOfDeath.WilliamHogarth → London
PlaceOfMarriage.WilliamHogarth → Paddington

The made-up verb "tupress" is ambiguous from either example alone, but only PlaceOfDeath yields the gold answer in both: the Mozart example rules out PlaceOfBirth (Salzburg ≠ Vienna) and the Hogarth example rules out PlaceOfMarriage (Paddington ≠ London).

Outline: Learning. Overview, Details, Example, Lexicon learning, Supervision signals

Constructing derivations (derivation tree for "people who lived in Chicago"):
Type.Person ⊓ PlaceLived.Chicago        [intersect]
    Type.Person                         [lexicon: "people"]
    PlaceLived.Chicago                  [join]
        PlaceLived                      [lexicon: "lived in"]
        Chicago                         [lexicon: "Chicago"]
(the word "who" is not mapped to any predicate)

Many possible derivations! x = "people who have lived in Chicago"; the set of candidate derivations D(x) includes, e.g.:
Derivation 1: Type.Person ⊓ PlaceLived.Chicago   [intersect]
    Type.Person                 [lexicon: "people"]
    PlaceLived.Chicago          [join], from PlaceLived [lexicon: "lived in"] and Chicago [lexicon: "Chicago"]
Derivation 2: Type.Org ⊓ PresentIn.ChicagoMusical   [intersect]
    Type.Org                    [lexicon: "people"]
    PresentIn.ChicagoMusical    [join], from PresentIn [lexicon: "lived in"] and ChicagoMusical [lexicon: "Chicago"]

x: utterance, d: derivation (e.g., the tree above whose root is Type.Person ⊓ PlaceLived.Chicago)
Feature vector φ(x, d) and parameters θ, both in R^F:

feature                        φ(x, d)   θ (learned)
apply join                     1         1.2
apply intersect                1         0.6
apply lexicon                  3         2.1
lived maps to PlacesLived      1         3.1
lived maps to PlaceOfBirth     0         -0.4
born maps to PlaceOfBirth      0         2.7
...                            ...       ...

score_θ(x, d) = φ(x, d) · θ = 1.2·1 + 0.6·1 + 2.1·3 + 3.1·1 + (-0.4)·0 + 2.7·0 + ...
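Below is a minimal sketch of this scoring step (illustrative only, not SEMPRE's implementation): features form a sparse dictionary, weights a learned vector, and the score is their dot product. The feature names and numbers mirror the table above.

```python
# Minimal sketch: scoring a derivation with sparse features and learned weights.
# Feature names and values follow the table above; illustrative, not SEMPRE code.

theta = {
    "apply_join": 1.2,
    "apply_intersect": 0.6,
    "apply_lexicon": 2.1,
    "lived->PlacesLived": 3.1,
    "lived->PlaceOfBirth": -0.4,
    "born->PlaceOfBirth": 2.7,
}

def score(phi, theta):
    """Dot product between a sparse feature vector phi(x, d) and weights theta."""
    return sum(value * theta.get(name, 0.0) for name, value in phi.items())

# Features fired by the derivation for "people who lived in Chicago"
phi = {"apply_join": 1, "apply_intersect": 1, "apply_lexicon": 3, "lived->PlacesLived": 1}
print(score(phi, theta))  # 1.2*1 + 0.6*1 + 2.1*3 + 3.1*1 = 11.2
```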

Deep learning alert!
The feature vector φ(x, d) is constructed by hand
Constructing good features is hard
Algorithms are likely to do it better
Perhaps we can train φ(x, d): φ(x, d) = F_ψ(x, d), where ψ are the parameters

Log-linear model
Candidate derivations: D(x)
Model: distribution over derivations d given utterance x
p_θ(d | x) = exp(score_θ(x, d)) / Σ_{d' ∈ D(x)} exp(score_θ(x, d'))
Example: if score_θ(x, ·) = [1, 2, 3, 4] over four candidates, then
p_θ(· | x) = [e¹, e², e³, e⁴] / (e¹ + e² + e³ + e⁴)
Parsing: find the top-K derivation trees D_θ(x)
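A minimal sketch of the log-linear normalization, reproducing the [1, 2, 3, 4] example above; it assumes the scores of all candidate derivations in D(x) have already been computed.

```python
import math

# Minimal sketch: the log-linear distribution over candidate derivations.
# Reproduces the example above: scores [1, 2, 3, 4] -> normalized probabilities.

def log_linear(scores):
    """p_theta(d | x) = exp(score(x, d)) / sum_d' exp(score(x, d'))."""
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

print(log_linear([1, 2, 3, 4]))  # ~[0.032, 0.087, 0.237, 0.644]
```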

Features
Dense features: intersection=0.67, ent-popularity:high, denotation-size:1
Sparse features: bridge-binary:study, born:PlaceOfBirth, city:Type.Location
Syntactic features: ent-pos:NNP NNP, join-pos:V NN, skip-pos:IN
Grammar features: Binary->Verb

Learning θ: maximum likelihood
Training data, denotations:
What's Bulgaria's capital? → Sofia
What movies has Tom Cruise been in? → TopGun, VanillaSky, ...
Training data, logical forms:
What's Bulgaria's capital? → CapitalOf.Bulgaria
What movies has Tom Cruise been in? → Type.Movie ⊓ HasPlayed.TomCruise
Objective:
argmax_θ Σ_{i=1}^n log p_θ(y^(i) | x^(i)) = argmax_θ Σ_{i=1}^n log Σ_{d^(i)} p_θ(d^(i) | x^(i)) · R(d^(i))
Reward choices:
R(d) = 1 if d.z = z^(i), 0 otherwise            (logical-form supervision)
R(d) = 1 if [d.z]_K = y^(i), 0 otherwise        ([·]_K: execute against the knowledge base K)
R(d) = F1([d.z]_K, y^(i))                       (partial credit)
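A minimal sketch of the three reward choices above; execute (the [·]_K operation) and f1 are hypothetical stand-ins for executing a logical form against the knowledge base and computing F1 between answer sets.

```python
# Minimal sketch of the reward functions R(d). `execute` and `f1` are
# hypothetical stand-ins; d.z denotes the logical form of derivation d.

def reward_logical_form(d, z_gold):
    return 1.0 if d.z == z_gold else 0.0

def reward_denotation(d, y_gold, execute):
    return 1.0 if execute(d.z) == y_gold else 0.0

def reward_f1(d, y_gold, execute, f1):
    return f1(execute(d.z), y_gold)  # partial credit for set-valued answers
```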

Optimization: stochastic gradient descent
For every example:
O(θ) = log Σ_d p_θ(d | x) · R(d)
∇O(θ) = E_{q_θ(d|x)}[φ(x, d)] − E_{p_θ(d|x)}[φ(x, d)]
where p_θ(d | x) ∝ exp(φ(x, d) · θ) and q_θ(d | x) ∝ exp(φ(x, d) · θ) · R(d)
Example:
p_θ(D(x)) = [0.2, 0.1, 0.1, 0.6], R(D(x)) = [1, 0, 0, 1]
q_θ(D(x)) = [0.25, 0, 0, 0.75]   (q_θ ∝ p_θ · R)
Gradient: 0.05·φ(x, d_1) − 0.1·φ(x, d_2) − 0.1·φ(x, d_3) + 0.15·φ(x, d_4)
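A minimal sketch of this gradient step: q is p reweighted by the rewards and renormalized, and each derivation's features are weighted by q − p. The numbers reproduce the example above.

```python
# Minimal sketch of the gradient computation above: q is p reweighted by the
# reward R and renormalized; the gradient weights each derivation's features
# by (q - p). Assumes at least one derivation has nonzero reward.

def reward_weighted(p, rewards):
    """q_theta(d|x) proportional to p_theta(d|x) * R(d)."""
    unnorm = [pi * ri for pi, ri in zip(p, rewards)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

p = [0.2, 0.1, 0.1, 0.6]
R = [1, 0, 0, 1]
q = reward_weighted(p, R)                          # [0.25, 0.0, 0.0, 0.75]
coeffs = [qi - pi for qi, pi in zip(q, p)]
print(q, coeffs)                                   # coeffs = [0.05, -0.1, -0.1, 0.15]
# gradient = sum_i coeffs[i] * phi(x, d_i)
```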

Training
Input: {(x_i, y_i)}_{i=1}^n; Output: θ
θ ← 0
for iteration τ and example i:
    D(x_i) ← argmax-K(p_θ(d | x_i))   (keep the top-K derivations)
    θ ← θ + η_{τ,i} (E_{q_θ(d|x_i)}[φ(x_i, d)] − E_{p_θ(d|x_i)}[φ(x_i, d)])
η_{τ,i}: learning rate
Regularization often added (L2, L1, ...)
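A minimal sketch of this training loop, reusing score, log_linear, and reward_weighted from the sketches above; parse_topk, phi, and reward are hypothetical stand-ins for the beam parser, the feature function, and R(d).

```python
# Minimal sketch of the training loop above. `parse_topk`, `phi`, and `reward`
# are hypothetical stand-ins; `score`, `log_linear`, and `reward_weighted` are
# the helper functions sketched earlier.

def sgd_train(examples, parse_topk, phi, reward, epochs=5, lr=0.1, K=100):
    theta = {}                                   # sparse weight vector, starts at 0
    for _ in range(epochs):
        for x, y in examples:
            beam = parse_topk(x, theta, K)       # candidate derivations D(x)
            if not beam:
                continue
            p = log_linear([score(phi(x, d), theta) for d in beam])
            rewards = [reward(d, y) for d in beam]
            if sum(pi * ri for pi, ri in zip(p, rewards)) == 0:
                continue                         # no correct derivation on the beam
            q = reward_weighted(p, rewards)
            for d, qi, pi in zip(beam, q, p):    # theta += lr * (E_q[phi] - E_p[phi])
                for name, value in phi(x, d).items():
                    theta[name] = theta.get(name, 0.0) + lr * (qi - pi) * value
    return theta
```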

Training (structured perceptron)
Input: {(x_i, y_i)}_{i=1}^n; Output: θ
θ ← 0
for iteration τ and example i:
    d̂ ← argmax_d p_θ(d | x_i)    (model's best derivation)
    d* ← argmax_d q_θ(d | x_i)   (best correct derivation)
    if [d*]_K ≠ [d̂]_K:
        θ ← θ + φ(x_i, d*) − φ(x_i, d̂)
Regularization often added with weight averaging
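A minimal sketch of the perceptron variant; it assumes the beam is sorted by model score, and parse_topk, phi, denote, and is_correct are hypothetical stand-ins.

```python
# Minimal sketch of the structured-perceptron update above: if the model's best
# derivation and the best correct derivation have different denotations, move
# theta toward the correct one. All helpers are hypothetical stand-ins.

def perceptron_train(examples, parse_topk, phi, denote, is_correct, epochs=5, K=100):
    theta = {}
    for _ in range(epochs):
        for x, y in examples:
            beam = parse_topk(x, theta, K)                    # sorted by score, best first
            if not beam:
                continue
            d_hat = beam[0]                                   # highest-scoring derivation
            correct = [d for d in beam if is_correct(d, y)]   # derivations with R(d) = 1
            if not correct:
                continue
            d_star = correct[0]
            if denote(d_star) != denote(d_hat):               # compare denotations [d]_K
                for name, value in phi(x, d_star).items():
                    theta[name] = theta.get(name, 0.0) + value
                for name, value in phi(x, d_hat).items():
                    theta[name] = theta.get(name, 0.0) - value
    return theta
```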

Training: other simple variants exist, e.g., cost-sensitive max-margin training. That is, find pairs of good and bad derivations that look different but have similar scores, and update on those.

Outline: Learning. Overview, Details, Example, Lexicon learning, Supervision signals

Example
./run @mode=simple-lambdadcs \
    -Grammar.inPaths esslli 2016/class3 demo.grammar \
    -SimpleLexicon.inPaths esslli 2016/class3 demo.lexicon
(loadgraph geo880/geo880.kg)
Utterances to try: size of california; size capital california; size of capital of california; california size

Exercise
Find a pair of natural-language utterances that cannot be distinguished using the current feature representation.
The utterances need not be fully grammatical English.
You can ignore the denotation feature if that helps.
Verify this in SEMPRE (ask me how to disable features).
Design a feature that would solve this problem.

Outline: Learning. Overview, Details, Example, Lexicon learning, Supervision signals

The lexicon problem
How is the lexicon generated? Annotation, exhaustive search, string matching, supervised alignment, unsupervised alignment, learning

Training (with lexicon induction)
Input: {(x_i, y_i)}_{i=1}^n; Output: θ
θ ← 0
for iteration τ and example i:
    Add lexicon entries
    D(x_i) ← argmax-K(p_θ(d | x_i))
    θ ← θ + η_{τ,i} (E_{q_θ(d|x_i)}[φ(x_i, d)] − E_{p_θ(d|x_i)}[φ(x_i, d)])
η_{τ,i}: learning rate
Regularization often added (L2, L1, ...)

Adding lexicon entries [adapted from the semantic parsing tutorial, Artzi et al.]
Input: training example (x_i, y_i), current lexicon Λ, model θ
Λ_temp ← Λ ∪ GENLEX(x_i, y_i)                         Create an expanded temporary lexicon
D(x_i) ← argmax-K(p_{θ,Λ_temp}(d | x_i))              Parse with the temporary lexicon
Λ ← Λ ∪ {l : l ∈ d̂, d̂ ∈ D(x_i), R(d̂) = 1}            Add entries from correct trees
Overgenerate lexical entries and add promising ones
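A minimal sketch of this lexicon-expansion step; genlex, parse_topk_with_lexicon, reward, and the lexical_entries attribute of a derivation are hypothetical stand-ins for the procedure described above.

```python
# Minimal sketch of the GENLEX-style lexicon-expansion step. All names below
# are hypothetical stand-ins for the procedure described in the slide.

def expand_lexicon(x, y, lexicon, theta, genlex, parse_topk_with_lexicon, reward, K=100):
    temp_lexicon = lexicon | genlex(x, y)                  # overgenerate candidate entries
    beam = parse_topk_with_lexicon(x, theta, temp_lexicon, K)
    for d in beam:
        if reward(d, y) == 1:                              # keep entries used in correct trees
            lexicon |= set(d.lexical_entries)
    return lexicon
```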

Lexicon generation [Zettlemoyer and Collins, 2005]
Logical-form supervision:
Largest state bordering California → argmax(Type(State) ⊓ Border(California), Area)
Enumerate spans: "Largest state bordering California", "Largest state", ...
Use rules to extract sub-formulas: California, Border(California), λf.f(California), Area, λx.argmax(x, Area), ...
Add the cross-product of spans and sub-formulas to the lexicon
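A minimal sketch of the span × sub-formula cross-product; extract_subformulas is a hypothetical stand-in for the rule-based extraction from the labeled logical form.

```python
# Minimal sketch of the cross-product idea above: pair every span of the
# utterance with every sub-formula extracted from the labeled logical form.

def genlex_cross_product(utterance, logical_form, extract_subformulas, max_span_len=4):
    tokens = utterance.split()
    spans = {
        " ".join(tokens[i:j])
        for i in range(len(tokens))
        for j in range(i + 1, min(i + 1 + max_span_len, len(tokens) + 1))
    }
    subformulas = extract_subformulas(logical_form)   # e.g. California, Border(California), ...
    return {(span, sub) for span in spans for sub in subformulas}
```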

Lexicon generation
Denotation supervision: Largest state bordering California → Arizona
Enumerate spans: "Largest state bordering California", "Largest state", ...
Generate sub-formulas from the KB: California, Border(California), Traverse, Type.Mountain, λx.argmax(x, Elevation), ...
Restrict candidates with alignment, string matching, ...
Fancier methods exist (coarse-to-fine)

Unification [Kwiatkowski et al., 2010]
Logical-form supervision. Initialize the lexicon with (x_i, z_i):
States bordering California → Type(State) ⊓ Border(California)
Split each lexical entry in all possible ways:
Enumerate span splits: (states, bordering california), (states bordering, california), ...
Enumerate logical-form splits: (Type(State), λx.x ⊓ Border(California)), (λx.Type(State) ⊓ x, Border(California)), (λf.Type(State) ⊓ f(California), California), ...

Unification [Kwiatkowski et al., 2010]
For each example (x_i, z_i):
Find the highest-scoring correct parse d
Split all lexical entries in d in all possible ways
Add to the lexicon the lexical entry that improves the parse score the most

Do we need a lexicon?
(Figure: two derivations of Type(State) ⊓ Border(California) for the utterance "California neighbors"; in the second, the predicates Type, State, Border, California float rather than being anchored to specific words.)
Floating parse tree: a generalization of bridging
Perhaps with better learning and search a lexicon is not necessary?

Outline: Learning. Overview, Details, Example, Lexicon learning, Supervision signals

Supervision signals
We discussed training from logical forms and denotations
Other forms of supervision have been proposed: demonstrations, distant supervision, conversations, unsupervised learning, paraphrasing

Training from demonstrations [Artzi and Zettlemoyer, 2013]
Input: (x_i, s_i, t_i); x_i: utterance, s_i: start state, t_i: end state
move forward until you reach the intersection → λa.move(a) ∧ dir(a, forward) ∧ ...
An instance of learning from denotations

Distant supervision [Reddy et al., 2014]
Data generation: decompose declarative text into questions and answers
James Cameron is the director of Titanic → Q: X is the director of Titanic; A: James Cameron
Declarative text is cheap!

Distant supervision [Reddy et al., 2014]
Training: use existing non-executable semantic parsers
X is the director of Titanic → λx.director(x) ∧ director.of.arg1(e, x) ∧ director.of.arg2(e, Titanic)
Grounded candidates and their denotations:
λx.director(x) ∧ FilmDirectedBy(e, x) ∧ FilmDirected(e, Titanic) → James Cameron → true
λx.producer(x) ∧ FilmProducedBy(e, x) ∧ FilmProduced(e, Titanic) → James Cameron, Jon Landau → false

Training from conversations
Candidate logical forms:
z_1: From(Atlanta) ∧ To(London)
z_2: From(Atlanta) ∧ From(London)
z_3: To(Atlanta) ∧ To(London)
z_4: To(Atlanta) ∧ From(London)
Define a loss: Does z align with the conversation? Does z obey domain constraints?

Unsupervised learning [Goldwasser et al., 2011]
Intuition: assume repeating patterns are correct
Input: {x_i}_{i=1}^n; Output: θ
θ initialized manually, S = ∅
Until a stopping criterion is met:
    for each example x_i: S ← S ∪ {(x_i, argmax_d p_θ(d | x_i))}
    Compute statistics of S
    S_conf ← find a confident subset
    Train on S_conf
Substantially lower performance
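A minimal sketch of this self-training loop; parse_best, confidence, and train_on are hypothetical stand-ins, and the confidence threshold is an assumption.

```python
# Minimal sketch of the self-training loop above. "Confident subset" is whatever
# statistic of S (e.g. repeated lexical patterns) the method trusts.

def unsupervised_train(utterances, theta, parse_best, confidence, train_on,
                       rounds=5, threshold=0.8):
    for _ in range(rounds):
        S = [(x, parse_best(x, theta)) for x in utterances]        # self-label with current model
        S_conf = [(x, d) for x, d in S if confidence(x, d, S) >= threshold]
        theta = train_on(S_conf, theta)                            # treat confident parses as gold
    return theta
```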

Paraphrasing [B. and Liang, 2014]
What languages do people in Brazil use?
Candidate logical forms: Type.HumanLanguage ⊓ LanguagesSpoken.Brazil, ..., CapitalOf.Brazil
Canonical utterances generated for the candidates: What language is the language of Brazil?, ..., What city is the capital of Brazil?
A paraphrase model scores each canonical utterance against the input; executing the best candidate gives the answer: Portuguese, ...
Model: p_θ(c, z | x) = p(z | x) · p_θ(c | x)
Idea: train a large paraphrase model p_θ(c | x)
More later
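A minimal sketch of answering via paraphrasing; canonical, paraphrase_score, and execute are hypothetical stand-ins for generating a canonical utterance from a logical form, scoring the paraphrase against the input, and executing against the KB.

```python
# Minimal sketch of the paraphrasing idea above: each candidate logical form z
# comes with a canonical utterance c(z); a paraphrase model scores (x, c) and
# the best-scoring candidate is executed. All helpers are hypothetical.

def answer_by_paraphrasing(x, candidate_logical_forms, canonical, paraphrase_score, execute):
    best_z = max(candidate_logical_forms,
                 key=lambda z: paraphrase_score(x, canonical(z)))
    return execute(best_z)
```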

Summary
We saw how to train from denotations and logical forms
We saw methods for inducing lexicons during training
We reviewed work on using even weaker forms of supervision
Still an open problem