Discriminative Training

Discriminative Training February 19, 2013

Noisy Channels Again

[Noisy-channel diagram: a source generates English e with probability p(e); the channel renders it as German g with probability p(g | e); the decoder recovers the most probable English given the German.]

e^* = \arg\max_e p(e \mid g)
    = \arg\max_e \frac{p(g \mid e)\, p(e)}{p(g)}
    = \arg\max_e p(g \mid e)\, p(e)
    = \arg\max_e \log p(g \mid e) + \log p(e)
    = \arg\max_e \underbrace{\begin{bmatrix} 1 \\ 1 \end{bmatrix}^\top}_{w^\top} \underbrace{\begin{bmatrix} \log p(g \mid e) \\ \log p(e) \end{bmatrix}}_{h(g, e)}

Does this look familiar?

The Noisy Channel / As a Linear Model

[Figure: candidate translations plotted as points in the plane with axes -log p(g | e) and -log p(e); the weight vector ~w defines the direction along which hypotheses are scored.]

Improvement 1: change ~w to find better translations.

Improvement 2: add dimensions to make the points separable.

Linear Models

e^* = \arg\max_e w^\top h(g, e)

Improve the modeling capacity of the noisy channel in two ways:
- Reorient the weight vector
- Add new dimensions (new features)

Questions:
- What features h(g, e) do we use?
- How do we set the weights w?

(A small decoding-as-argmax sketch follows below.)
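To make the linear-model view concrete, here is a minimal sketch in Python (an illustration, not code from the lecture): the candidate strings and feature values are invented, and the only features are the two noisy-channel ones, log p(g | e) and log p(e), with uniform weights w = (1, 1).

```python
# A minimal sketch (illustrative, not from the slides): decoding as
# e* = argmax_e w . h(g, e) over a small set of hypothetical candidates.

def dot(w, h):
    """Inner product of a weight dict and a feature dict."""
    return sum(w.get(name, 0.0) * value for name, value in h.items())

def best_translation(candidates, w):
    """Return the candidate whose feature vector h(g, e) scores highest."""
    return max(candidates, key=lambda c: dot(w, c["features"]))

# The noisy channel as a two-feature linear model: h = (log p(g|e), log p(e)), w = (1, 1).
w = {"log_tm": 1.0, "log_lm": 1.0}
candidates = [  # made-up feature values for two competing hypotheses
    {"e": "man bites dog", "features": {"log_tm": -2.1, "log_lm": -4.0}},
    {"e": "dog bites man", "features": {"log_tm": -2.1, "log_lm": -3.2}},
]
print(best_translation(candidates, w)["e"])  # -> "dog bites man" under these invented scores
```

Under these invented scores the language model pulls the decision toward the wrong hypothesis; reorienting w (Improvement 1) or adding features (Improvement 2) is exactly what the rest of the lecture is about.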

Mann beißt Hund (x BITES y)

Candidate translations:
- man bites cat
- dog man chase
- man bite cat
- man bite dog
- dog bites man
- man bites dog

Feature Classes

- Lexical: Are lexical choices appropriate? (bank = river bank vs. financial institution)
- Configurational: Are semantic/syntactic relations preserved? (Dog bites man vs. Man bites dog)
- Grammatical: Is the output fluent / well-formed? (Man bites dog vs. Man bite dog)

What do lexical features look like?

Mann beißt Hund → man bites cat

First attempt: score(g, e) = w^\top h(g, e), with indicator features such as

h_{15,342}(g, e) = \begin{cases} 1, & \exists i, j : g_i = \text{Hund},\ e_j = \text{cat} \\ 0, & \text{otherwise} \end{cases}

But what if a cat is being chased by a Hund?

Latent variables enable more precise features: score(g, e, a) = w^\top h(g, e, a), with

h_{15,342}(g, e, a) = \sum_{(i,j) \in a} \begin{cases} 1, & \text{if } g_i = \text{Hund},\ e_j = \text{cat} \\ 0, & \text{otherwise} \end{cases}

(A small sketch of such an alignment-indexed feature follows below.)
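Here is a minimal Python sketch of the alignment-indexed lexical feature above; the tokenization and alignment are invented for illustration.

```python
# A minimal sketch (illustrative): a lexical indicator feature that fires once
# for every aligned pair (i, j) with g[i] == "Hund" and e[j] == "cat",
# i.e. the summed indicator h_15342(g, e, a) described above.

def lexical_pair_feature(g, e, alignment, src_word="Hund", tgt_word="cat"):
    """Count aligned pairs (i, j) where g[i] == src_word and e[j] == tgt_word."""
    return sum(1 for i, j in alignment if g[i] == src_word and e[j] == tgt_word)

g = "Mann beißt Hund".split()
e = "man bites cat".split()
a = {(0, 0), (1, 1), (2, 2)}          # one hypothesized word alignment
print(lexical_pair_feature(g, e, a))  # 1: "Hund" is aligned to "cat" in this derivation
```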

Standard Features

Target-side features:
- log p(e) [n-gram language model]
- Number of words in the hypothesis
- Non-English character count

Source + target features:
- Log relative frequency e|f of each rule [log #(e,f) - log #(f)]
- Log relative frequency f|e of each rule [log #(e,f) - log #(e)]
- Lexical translation log probability e|f of each rule [log p_Model1(e|f)]
- Lexical translation log probability f|e of each rule [log p_Model1(f|e)]

Other features:
- Count of rules/phrases used
- Reordering pattern probabilities

Parameter Learning

Hypothesis Space

[Figure: hypotheses and reference translations plotted as points in feature space, with axes h_1 and h_2.]

Preliminaries

We assume a decoder that computes

\langle e^*, a^* \rangle = \arg\max_{\langle e, a \rangle} w^\top h(g, e, a)

and k-best lists of the same quantity, that is,

\{ \langle e_i, a_i \rangle \}_{i=1}^{K} = \arg i\text{-th max}_{\langle e, a \rangle}\; w^\top h(g, e, a)

Standard, efficient algorithms exist for this. (A sketch of the k-best representation assumed in later examples follows below.)
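For concreteness, the later sketches assume a k-best list stored roughly like this; the class and field names are illustrative, not any particular decoder's API.

```python
# A minimal sketch (assumed representation, not a real decoder API): each
# k-best entry carries its derivation/alignment, its feature vector
# h(g, e, a), and its model score w . h(g, e, a).

from dataclasses import dataclass

@dataclass
class Hypothesis:
    e: str               # target-language string
    a: frozenset         # latent alignment / derivation
    features: dict       # feature name -> value, i.e. h(g, e, a)
    model_score: float   # w . h(g, e, a)

def top_k(hypotheses, k):
    """Keep the k highest-scoring hypotheses."""
    return sorted(hypotheses, key=lambda h: h.model_score, reverse=True)[:k]
```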

Learning Weights

Try to match the reference translation exactly:

- Conditional random field: maximize the conditional probability of the reference translations, averaging over the different latent variables.
- Max-margin: find the weight vector that separates the reference translation from the others by the maximal margin, using the maximal setting of the latent variables.

Problems

These methods give full credit when the model exactly produces the reference and no credit otherwise. What is the problem with this?

- There are many ways to translate a sentence
- What if we have multiple reference translations?
- What about partial credit?

Cost-Sensitive Training

Assume we have a cost function \Delta(\hat{e}, E) \in [0, 1] that scores how good or bad a translation \hat{e} is relative to the reference(s) E.

Optimize the weight vector by making reference to this function.

We will talk about two ways to do this. (A toy cost function is sketched below.)
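As a stand-in for whatever metric-derived cost the lecture has in mind, here is a toy sentence-level cost in [0, 1]: one minus an add-one-smoothed, BLEU-like score. It is only an illustration; real systems use corpus-level BLEU or a more careful sentence-level approximation.

```python
# A toy cost Delta(hyp, refs) in [0, 1]: 1 minus a smoothed, sentence-level
# BLEU-like score.  Illustrative only; not the exact cost from the lecture.

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def cost(hyp, refs, max_n=4):
    """hyp: token list; refs: list of token lists.  Lower is better."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        matches = sum(min(c, max(ngrams(r, n).get(ng, 0) for r in refs))
                      for ng, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        log_prec += math.log((matches + 1.0) / (total + 1.0))   # add-one smoothing
    closest_ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    brevity = min(1.0, math.exp(1.0 - closest_ref_len / max(len(hyp), 1)))
    bleu = brevity * math.exp(log_prec / max_n)
    return 1.0 - bleu

ref = ["man bites dog".split()]
print(cost("man bites dog".split(), ref))   # low cost (near-exact match)
print(cost("dog man chase".split(), ref))   # higher cost
```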

K-Best List Example

[Figure: the k-best hypotheses #1-#10 plotted in feature space (h_1, h_2) together with the current weight vector ~w; each hypothesis falls into a cost band: 0.8 ≤ Δ < 1.0, 0.6 ≤ Δ < 0.8, 0.4 ≤ Δ < 0.6, 0.2 ≤ Δ < 0.4, or 0.0 ≤ Δ < 0.2.]

Training as Classification

Pairwise Ranking Optimization: reduce the training problem to binary classification with a linear model.

Algorithm (a sketch follows below):
For i = 1 to N:
- Pick a random pair of hypotheses (A, B) from the k-best list
- Use the cost function to determine whether A or B is better
- Create the i-th training instance from the pair
Then train a binary linear classifier on the collected instances.
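A minimal PRO-style sketch under the assumptions above: each k-best entry is a (features, cost) pair, and the binary classifier is a plain perceptron on feature-difference vectors, folded into an online loop. Hopkins and May's PRO additionally keeps only pairs with large cost differences and trains an off-the-shelf binary classifier; this is just the skeleton.

```python
# A minimal PRO-style sketch (illustrative): sample hypothesis pairs from a
# k-best list, label each pair by which member has lower cost, and update a
# linear model on the feature-difference vector with a perceptron step.

import random

def dot(w, h):
    return sum(w.get(k, 0.0) * v for k, v in h.items())

def diff(h_a, h_b):
    keys = set(h_a) | set(h_b)
    return {k: h_a.get(k, 0.0) - h_b.get(k, 0.0) for k in keys}

def pro_update(kbest, w, num_pairs=5000, lr=0.1):
    """kbest: list of (features, cost) pairs for one sentence.  Updates w in place."""
    for _ in range(num_pairs):
        (h_a, c_a), (h_b, c_b) = random.sample(kbest, 2)
        if c_a == c_b:
            continue                        # no preference between A and B: skip
        x = diff(h_a, h_b)                  # feature difference A - B
        y = 1.0 if c_a < c_b else -1.0      # +1 if A is the better (lower-cost) hypothesis
        if y * dot(w, x) <= 0:              # the pair is misranked: perceptron update
            for k, v in x.items():
                w[k] = w.get(k, 0.0) + lr * y * v
    return w
```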

[Figure sequence: pairs of hypotheses are sampled from the k-best list; the cost bands determine which member of each pair is better and which is worse. The labeled difference vectors are collected in (h_1, h_2) space, a linear model is fit to separate "better" from "worse", giving a new ~w, and the k-best list is re-scored under the new weights.]

MERT: Minimum Error Rate Training

- Directly targets an automatic evaluation metric
- BLEU is defined at the corpus level, and MERT optimizes at the corpus level
- Downside: does not deal well with more than roughly 20 features

MERT

Given a weight vector w, any hypothesis \langle e, a \rangle has a (scalar) score

m = w^\top h(g, e, a)

Now pick a search direction v and consider how the score of this hypothesis changes as we move along it, where the step size \gamma is the scalar being searched over:

w_{\text{new}} = w + \gamma v
m' = (w + \gamma v)^\top h(g, e, a) = \underbrace{w^\top h(g, e, a)}_{b} + \gamma\, \underbrace{v^\top h(g, e, a)}_{a}
m' = a\gamma + b

A linear function of \gamma: a line in 2D!

MERT

[Figure sequence: recall our k-best set \{ \langle e_i, a_i \rangle \}_{i=1}^{K}. Each hypothesis traces a line m = a\gamma + b as \gamma varies; the upper envelope of these lines (here formed by \langle e_{162}, a_{162} \rangle, \langle e_{28}, a_{28} \rangle, and \langle e_{73}, a_{73} \rangle) determines which hypothesis is 1-best on each interval of \gamma, and each interval has an associated error count.]

Let w_{\text{new}} = w + \gamma v, with \gamma chosen from the interval with the fewest errors.

MERT: In Practice

- In practice, the "errors" are sufficient statistics for the evaluation metric (e.g., BLEU)
- Can maximize or minimize!
- The envelope can also be computed using dynamic programming
- Interesting complexity bounds
- How do you pick the search direction? (A simplified line-search sketch follows below.)
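Here is a simplified sketch of one MERT line search along a direction v. Instead of the exact line-intersection/envelope computation described above, it brute-forces a grid of gamma values; the per-hypothesis "error" stands in for the metric's sufficient statistics, and all names are illustrative.

```python
# A simplified MERT-style line search (illustrative): sweep a grid of gamma
# values along direction v, re-rank each sentence's k-best list under
# w + gamma * v, and keep the gamma with the lowest total error.

def dot(w, h):
    return sum(w.get(k, 0.0) * v for k, v in h.items())

def line_search(kbests, w, v, gammas):
    """kbests: one list per sentence of (features, error) pairs.  Returns w + gamma* v."""
    best_gamma, best_err = None, float("inf")
    for gamma in gammas:
        w_new = {k: w.get(k, 0.0) + gamma * v.get(k, 0.0) for k in set(w) | set(v)}
        total_err = 0.0
        for kbest in kbests:
            # 1-best hypothesis of this sentence under the shifted weights
            best_hyp = max(kbest, key=lambda fe: dot(w_new, fe[0]))
            total_err += best_hyp[1]
        if total_err < best_err:
            best_gamma, best_err = gamma, total_err
    return {k: w.get(k, 0.0) + best_gamma * v.get(k, 0.0) for k in set(w) | set(v)}
```

Real MERT computes the exact upper envelope instead, so the error only has to be evaluated once per envelope segment rather than once per grid point.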

Summary

Evaluation metrics:
- Figure out how well we're doing
- Figure out whether a feature helps
- But ALSO: train your system!

What's a great way to improve translation? Improve evaluation!

Thank You!