Outline. Learning. Overview Details


Outline: Learning. Overview. Details.

Supervision in syntactic parsing. Input: sentences annotated with full parse trees, e.g. a tree (S, NP, VP, V, PP, ...) over "ESSLLI 2016, the known summer school, is located in Bolzano". Output: a parse tree for a new sentence, e.g. (S (NP They) (VP (V play) (NP football))) for "They play football".

[Zelle & Mooney, 1996; Zettlemoyer & Collins, 2005; Clarke et al., 2010; Liang et al., 2011]

Supervision in semantic parsing

Input, heavy supervision (utterance paired with a logical form):
  How tall is LeBron James? -> HeightOf.LebronJames
  What is Steph Curry's daughter called? -> ChildrenOf.StephCurry ⊓ Gender.Female
  Youngest player of the Cavaliers -> arg min(PlayerOf.Cavaliers, BirthDateOf)
  ...

Input, light supervision (utterance paired with its denotation):
  How tall is LeBron James? -> 203 cm
  What is Steph Curry's daughter called? -> Riley Curry
  Youngest player of the Cavaliers -> Kyrie Irving
  ...

Output, for a new utterance such as "Clay Thompson's weight": a logical form (WeightOf.ClayThompson) and its denotation (205 lbs).

Learning in a nutshell: utterance -> parsing -> candidate derivations -> label -> update model

0. Define model for derivations
1. Generate candidate derivations (later)
2. Label as correct and incorrect
3. Update model to favor correct trees

Training intuition

Where did Mozart tupress?  Answer: Vienna
  PlaceOfBirth.WolfgangMozart -> Salzburg
  PlaceOfDeath.WolfgangMozart -> Vienna
  PlaceOfMarriage.WolfgangMozart -> Vienna

Where did Hogarth tupress?  Answer: London
  PlaceOfBirth.WilliamHogarth -> London
  PlaceOfDeath.WilliamHogarth -> London
  PlaceOfMarriage.WilliamHogarth -> Paddington

Across the two examples, only PlaceOfDeath is consistent with the observed answers, so training should shift weight toward mapping the unknown word "tupress" to PlaceOfDeath.
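A minimal sketch of this labeling step in Python: each candidate logical form is marked correct exactly when its denotation matches the observed answer. The data structures and names here are illustrative, not from the original lecture.

def label_candidates(candidates, answer):
    # candidates: list of (logical_form, denotation) pairs.
    # A candidate counts as correct iff its denotation equals the gold answer.
    return [(lf, denot, denot == answer) for lf, denot in candidates]

mozart = [("PlaceOfBirth.WolfgangMozart", "Salzburg"),
          ("PlaceOfDeath.WolfgangMozart", "Vienna"),
          ("PlaceOfMarriage.WolfgangMozart", "Vienna")]
hogarth = [("PlaceOfBirth.WilliamHogarth", "London"),
           ("PlaceOfDeath.WilliamHogarth", "London"),
           ("PlaceOfMarriage.WilliamHogarth", "Paddington")]

for example, answer in [(mozart, "Vienna"), (hogarth, "London")]:
    for lf, denot, correct in label_candidates(example, answer):
        print(lf, denot, "correct" if correct else "incorrect")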

Outline: Learning. Overview. Details.

Constructing derivations

Example derivation for "people who have lived in Chicago":
  lexicon: people -> Type.Person; lived in -> PlaceLived; Chicago -> Chicago
  join: PlaceLived + Chicago -> PlaceLived.Chicago
  intersect: Type.Person + PlaceLived.Chicago -> Type.Person ⊓ PlaceLived.Chicago

Many possible derivations!

x = people who have lived in Chicago?   D(x) = set of candidate derivations, e.g.:
  Type.Person ⊓ PlaceLived.Chicago
    (people -> Type.Person, lived in -> PlaceLived, Chicago -> Chicago, then join and intersect)
  Type.Org ⊓ PresentIn.ChicagoMusical
    (people -> Type.Org, lived in -> PresentIn, Chicago -> ChicagoMusical, then join and intersect)

x: utterance,  d: derivation (e.g. Type.Person ⊓ PlaceLived.Chicago, built with lexicon, join, and intersect as above)

Feature vector and parameters in R^F:

  feature                        φ(x, d)    θ (learned)
  apply join                     1          1.2
  apply intersect                1          0.6
  apply lexicon                  3          2.1
  lived maps to PlaceLived       1          3.1
  lived maps to PlaceOfBirth     0          -0.4
  born maps to PlaceOfBirth      0          2.7
  ...                            ...        ...

score_θ(x, d) = φ(x, d) · θ = 1.2·1 + 0.6·1 + 2.1·3 + 3.1·1 + (-0.4)·0 + 2.7·0 + ...
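A minimal sketch of this scoring function, assuming the feature vector is stored as a sparse dictionary of counts and the weights as a dictionary of learned values (illustrative names, not the authors' code):

def score(features, theta):
    # Dot product between a sparse feature vector (name -> count)
    # and the weight vector theta (name -> learned weight).
    return sum(count * theta.get(name, 0.0) for name, count in features.items())

phi = {"apply join": 1, "apply intersect": 1, "apply lexicon": 3,
       "lived maps to PlaceLived": 1}
theta = {"apply join": 1.2, "apply intersect": 0.6, "apply lexicon": 2.1,
         "lived maps to PlaceLived": 3.1, "lived maps to PlaceOfBirth": -0.4,
         "born maps to PlaceOfBirth": 2.7}
print(score(phi, theta))   # 1.2 + 0.6 + 6.3 + 3.1 = 11.2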

Deep learning alert!

The feature vector φ(x, d) is constructed by hand.
Constructing good features is hard, and algorithms are likely to do it better.
Perhaps we can train φ(x, d) itself: φ(x, d) = F_ψ(x, d), where ψ are the parameters.

Log-linear model

Candidate derivations: D(x). Model: distribution over derivations d given utterance x,

  p_θ(d | x) = exp(score_θ(x, d)) / Σ_{d' ∈ D(x)} exp(score_θ(x, d'))

Example: score_θ(x, d) = [1, 2, 3, 4] gives
  p_θ(d | x) = [e/Z, e^2/Z, e^3/Z, e^4/Z],  where Z = e + e^2 + e^3 + e^4

Parsing: find the top-K derivation trees D_θ(x)
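A minimal sketch of this distribution, assuming the candidate derivations in D(x) have already been scored (illustrative code, not tied to any particular framework):

import math

def log_linear(scores):
    # Softmax over the scores of the candidate derivations in D(x).
    # Subtracting the max keeps exp() numerically stable.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

print(log_linear([1, 2, 3, 4]))  # ~[0.032, 0.087, 0.237, 0.644]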

Features

Dense features: intersection=0.67, ent-popularity:high, denotation-size:1
Sparse features: bridge-binary:study, born:PlaceOfBirth, city:Type.Location
Syntactic features: ent-pos:NNP NNP, join-pos:V NN, skip-pos:IN
Grammar features: Binary->Verb
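A toy sketch of how such a sparse feature map might be assembled for one derivation. The Derivation representation (a dict of applied rules and word-to-predicate alignments) and the feature names are purely illustrative assumptions:

def extract_features(utterance, derivation):
    # Collect sparse indicator/count features for one derivation.
    phi = {}
    for rule in derivation["rules"]:                  # e.g. "join", "intersect", "lexicon"
        name = "apply " + rule
        phi[name] = phi.get(name, 0) + 1
    for word, predicate in derivation["alignments"]:  # e.g. ("lived", "PlaceLived")
        phi[word + ":" + predicate] = 1
    return phi

d = {"rules": ["lexicon", "lexicon", "lexicon", "join", "intersect"],
     "alignments": [("lived", "PlaceLived"), ("Chicago", "Chicago")]}
print(extract_features("people who have lived in Chicago", d))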

Learning θ: marginal maximum likelihood

Training data, either answers (denotations):
  What's Bulgaria's capital? -> Sofia
  What movies has Tom Cruise been in? -> TopGun, VanillaSky, ...
  ...
or logical forms:
  What's Bulgaria's capital? -> CapitalOf.Bulgaria
  What movies has Tom Cruise been in? -> Type.Movie ⊓ HasPlayed.TomCruise
  ...

Objective:
  arg max_θ Σ_{i=1}^n log p_θ(y^(i) | x^(i)) = arg max_θ Σ_{i=1}^n log Σ_d p_θ(d | x^(i)) R(d)

Reward R(d):
  R(d) = 1 if d.z = z^(i), else 0        (logical form matches the annotation)
  R(d) = 1 if [d.z]_K = y^(i), else 0    (denotation against knowledge base K matches the answer)
  R(d) = F1([d.z]_K, y^(i))              (partial credit for set-valued answers)
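A minimal sketch of the three reward variants, assuming each derivation carries its logical form and the result of executing it against the knowledge base (field names are illustrative):

def reward_exact_lf(d, gold_lf):
    # 1 iff the derivation's logical form matches the annotated one.
    return 1.0 if d["z"] == gold_lf else 0.0

def reward_exact_denotation(d, gold_answer):
    # 1 iff executing the logical form yields exactly the annotated answer.
    return 1.0 if d["denotation"] == gold_answer else 0.0

def reward_f1(d, gold_answer):
    # Partial credit: F1 between predicted and gold answer sets.
    pred, gold = set(d["denotation"]), set(gold_answer)
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)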

Optimization: stochastic gradient descent

For every example:
  O(θ) = log Σ_d p_θ(d | x) R(d)
  ∇O(θ) = E_{q_θ(d|x)}[φ(x, d)] - E_{p_θ(d|x)}[φ(x, d)]
where
  p_θ(d | x) ∝ exp(φ(x, d) · θ)
  q_θ(d | x) ∝ exp(φ(x, d) · θ) · R(d)

Example:
  p_θ(D(x)) = [0.2, 0.1, 0.1, 0.6],  R(D(x)) = [1, 0, 0, 1]
  q_θ(D(x)) = [0.25, 0, 0, 0.75]     (q_θ ∝ p_θ · R)
  Gradient: 0.05 φ(x, d_1) - 0.1 φ(x, d_2) - 0.1 φ(x, d_3) + 0.15 φ(x, d_4)
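A minimal sketch of this gradient for one example. It reuses the illustrative log_linear helper from above and represents feature vectors as sparse dictionaries; none of this is the original implementation.

def expected_features(probs, feature_vectors):
    # Weighted sum of sparse feature dictionaries: E[phi].
    exp_phi = {}
    for p, phi in zip(probs, feature_vectors):
        for name, value in phi.items():
            exp_phi[name] = exp_phi.get(name, 0.0) + p * value
    return exp_phi

def gradient(scores, rewards, feature_vectors):
    # grad O(theta) = E_q[phi] - E_p[phi], where q is proportional to p * R.
    p = log_linear(scores)
    unnorm_q = [pi * ri for pi, ri in zip(p, rewards)]
    z = sum(unnorm_q)
    if z == 0.0:
        return {}                      # no rewarded derivation among the candidates
    q = [u / z for u in unnorm_q]
    e_q = expected_features(q, feature_vectors)
    e_p = expected_features(p, feature_vectors)
    return {k: e_q.get(k, 0.0) - e_p.get(k, 0.0) for k in set(e_q) | set(e_p)}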

Training

Input: {x_i, y_i}_{i=1}^n    Output: θ

  θ <- 0
  for iteration τ and example i:
    D(x_i) <- arg max-K of p_θ(d | x_i)    (top-K candidate derivations)
    θ <- θ + η_{τ,i} (E_{q_θ(d | x_i)}[φ(x_i, d)] - E_{p_θ(d | x_i)}[φ(x_i, d)])

η_{τ,i}: learning rate. Regularization is often added (L2, L1, ...).
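Putting the pieces together, a minimal SGD training loop under the same assumptions as the sketches above; score, extract_features, and gradient are the illustrative helpers defined earlier, and parse_top_k(x, theta, k) stands in for the parser that produces the top-K candidate derivations:

def train(examples, parse_top_k, reward_fn, epochs=5, eta=0.1, beam=100):
    # examples: list of (utterance x_i, annotation y_i) pairs.
    theta = {}
    for _ in range(epochs):
        for x, y in examples:
            derivations = parse_top_k(x, theta, beam)
            if not derivations:
                continue
            phis = [extract_features(x, d) for d in derivations]
            scores = [score(phi, theta) for phi in phis]
            rewards = [reward_fn(d, y) for d in derivations]
            grad = gradient(scores, rewards, phis)
            for name, g in grad.items():
                # Fixed learning rate here; a decaying eta_{tau,i} and
                # regularization could be added as on the slide.
                theta[name] = theta.get(name, 0.0) + eta * g
    return theta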

Training (structured perceptron)

Input: {x_i, y_i}_{i=1}^n    Output: θ

  θ <- 0
  for iteration τ and example i:
    d̂ <- arg max_d p_θ(d | x_i)       (highest-scoring derivation)
    d′ <- arg max_d q_θ(d | x_i)       (highest-scoring correct derivation)
    if [d′]_K ≠ [d̂]_K:
      θ <- θ + φ(x_i, d′) - φ(x_i, d̂)

Regularization is often added with weight averaging.
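A minimal sketch of this perceptron-style update, again reusing the illustrative score and extract_features helpers and assuming each derivation exposes its denotation:

def perceptron_update(x, y, derivations, theta, reward_fn):
    # Model prediction d_hat: highest-scoring derivation overall.
    # Target d_prime: highest-scoring derivation with nonzero reward.
    scored = [(score(extract_features(x, d), theta), d) for d in derivations]
    _, d_hat = max(scored, key=lambda t: t[0])
    correct = [(s, d) for s, d in scored if reward_fn(d, y) > 0]
    if not correct:
        return theta                      # nothing correct to update towards
    _, d_prime = max(correct, key=lambda t: t[0])
    if d_prime["denotation"] != d_hat["denotation"]:
        phi_prime = extract_features(x, d_prime)
        phi_hat = extract_features(x, d_hat)
        for name in set(phi_prime) | set(phi_hat):
            theta[name] = (theta.get(name, 0.0)
                           + phi_prime.get(name, 0.0) - phi_hat.get(name, 0.0))
    return theta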

Training: other simple variants exist, e.g. cost-sensitive max-margin training. That is, find pairs of good and bad derivations that look different but have similar scores, and update on those.