Discriminative Training. March 4, 2014


Noisy Channels Again
[diagram: a source produces English e, modeled by p(e)]

Noisy Channels Again
[diagram: the source produces English e with p(e); the channel turns it into German g with p(g | e)]

Noisy Channels Again
[diagram: source -> English -> German -> decoder]
The decoder recovers
    e^* = \arg\max_e p(e \mid g) = \arg\max_e \frac{p(g \mid e)\, p(e)}{p(g)} = \arg\max_e p(g \mid e)\, p(e)

Noisy Channels Again
    e^* = \arg\max_e \log p(g \mid e) + \log p(e)
        = \arg\max_e \underbrace{\begin{bmatrix} 1 \\ 1 \end{bmatrix}^{\top}}_{w^\top} \underbrace{\begin{bmatrix} \log p(g \mid e) \\ \log p(e) \end{bmatrix}}_{h(g,e)}
        = \arg\max_e w^\top h(g, e)
This is a log-linear model.

Discriminative modeling
- Depart from generative modeling
- Goal: directly optimize for translation performance by discriminating between good and bad translations, and adjusting our model to give preference to good translations

Discriminative modeling
- Possible translations of a sentence are represented using a set of features h
- Each feature h_i derives from one property of the translation
- Its feature weight w_i indicates its relative importance
- The feature weights and feature values are combined into an overall score
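As a minimal sketch of this scoring (the feature names and values below are invented for illustration), the score is just a weighted sum of feature values:

    # Hypothetical feature vector h and weight vector w for one candidate translation
    h = {"log_p_lm": -12.4, "log_p_tm": -8.1, "word_count": 6.0}
    w = {"log_p_lm": 1.0, "log_p_tm": 1.0, "word_count": -0.5}

    # Overall score: the weighted sum w . h
    score = sum(w[name] * value for name, value in h.items())
    print(score)  # -23.5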

Discriminative modeling
Re-ranking is a two-stage process:
1. Generate a candidate set of translations
2. Add additional features and re-score the candidates according to the discriminative model

Discriminative modeling
- Optimize the features used in decoding
- Use more features than just the language model and translation model
- Tune their parameters and use the weights during decoding
- We can optimize a small handful of features, or we can use large-scale discriminative training for millions of features

n-best translations
- Discriminative training operates on candidate translations of a sentence
- In theory we could enumerate all possible translations, but in practice there are too many
- Typically we operate on the 1,000-best or 10,000-best translations: the n-best list
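Re-scoring such a list is then just an argmax over the candidates; a toy sketch with made-up features:

    def rerank(nbest, w):
        # nbest: list of (translation, feature_dict) pairs produced by the decoder
        return max(nbest, key=lambda cand: sum(w[k] * v for k, v in cand[1].items()))

    nbest = [("man bites dog", {"log_p_lm": -9.2, "log_p_tm": -7.5}),
             ("man bite dog",  {"log_p_lm": -13.0, "log_p_tm": -7.2})]
    w = {"log_p_lm": 1.0, "log_p_tm": 1.0}
    best, _ = rerank(nbest, w)  # "man bites dog"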

n-best translations
[figure slides: example n-best lists; content not transcribed]

The Noisy Channel
[figure: axes -log p(g | e) and -log p(e)]

As a Linear Model
[figure: the same axes -log p(g | e) and -log p(e), now with a weight vector w]

As a Linear Model
Improvement 1: change w to find better translations
[figure: the weight vector w rotated in the same space]

As a Linear Model
Improvement 2: add dimensions to make points separable

Linear Models
    e^* = \arg\max_e w^\top h(g, e)
Improve the modeling capacity of the noisy channel in two ways:
- Reorient the weight vector
- Add new dimensions (new features)
Questions:
- What features h(g, e)?
- How do we set the weights w?

Mann beißt Hund (German: "man bites dog") -> x BITES y

Mann beißt Hund (x BITES y): candidate translations
- man bites cat
- dog man chase
- man bite cat
- man bite dog
- dog bites man
- man bites dog

Feature Classes
- Lexical: Are lexical choices appropriate? (bank = river bank vs. financial institution)
- Configurational: Are semantic/syntactic relations preserved? (Dog bites man vs. Man bites dog)
- Grammatical: Is the output fluent / well-formed? (Man bites dog vs. Man bite dog)

What do lexical features look like?
Mann beißt Hund -> man bites cat
First attempt:
    score(g, e) = w^\top h(g, e)
    h_{15342}(g, e) = \begin{cases} 1 & \text{if } \exists\, i, j : g_i = \text{Hund},\; e_j = \text{cat} \\ 0 & \text{otherwise} \end{cases}
But what if a cat is being chased by a Hund?

What do lexical features look like?
Mann beißt Hund -> man bites cat
Latent variables enable more precise features:
    score(g, e, a) = w^\top h(g, e, a)
    h_{15342}(g, e, a) = \sum_{(i,j) \in a} \begin{cases} 1 & \text{if } g_i = \text{Hund},\; e_j = \text{cat} \\ 0 & \text{otherwise} \end{cases}
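A toy Python rendering of these two feature definitions (the index 15342 and the Hund/cat pair are just the slides' running example):

    def h_lexical_naive(g, e):
        # First attempt: fires if "Hund" occurs anywhere in g and "cat" anywhere in e
        return 1 if "Hund" in g and "cat" in e else 0

    def h_lexical_aligned(g, e, a):
        # With a latent alignment a: fires once per link (i, j) that pairs "Hund" with "cat"
        return sum(1 for (i, j) in a if g[i] == "Hund" and e[j] == "cat")

    g = ["Mann", "beißt", "Hund"]
    e = ["man", "bites", "cat"]
    a = [(0, 0), (1, 1), (2, 2)]
    print(h_lexical_naive(g, e), h_lexical_aligned(g, e, a))  # 1 1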

Standard Features
Target-side features:
- log p(e) [n-gram language model]
- Number of words in hypothesis
- Non-English character count
Source + target features:
- log relative frequency e|f of each rule [log #(e,f) - log #(f)]
- log relative frequency f|e of each rule [log #(e,f) - log #(e)]
- lexical translation log probability e|f of each rule [log p_model1(e|f)]
- lexical translation log probability f|e of each rule [log p_model1(f|e)]
Other features:
- Count of rules/phrases used
- Reordering pattern probabilities
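For example, the relative-frequency features come straight from phrase-pair counts; a rough sketch with invented counts:

    import math

    # Hypothetical counts #(e, f) and #(f) extracted from word-aligned training data
    pair_counts = {("the house", "das Haus"): 40, ("the building", "das Haus"): 10}
    f_counts = {"das Haus": 50}

    def log_rel_freq_e_given_f(e, f):
        # log #(e, f) - log #(f), i.e. the log relative frequency of e given f
        return math.log(pair_counts[(e, f)]) - math.log(f_counts[f])

    print(log_rel_freq_e_given_f("the house", "das Haus"))  # log(40/50) ≈ -0.223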

Parameter Learning

Hypothesis Space
[figure: hypotheses plotted in feature space, axes h_1 and h_2]

Hypothesis Space
[figure: reference translations plotted in the same feature space, axes h_1 and h_2]

Preliminaries
We assume a decoder that computes
    \langle \hat{e}, \hat{a} \rangle = \arg\max_{\langle e, a \rangle} w^\top h(g, e, a)
and k-best lists of this, that is
    \{ \langle e_i, a_i \rangle \}_{i=1}^{K} = \arg\, i\text{-th-}\max_{\langle e, a \rangle} w^\top h(g, e, a)
Standard, efficient algorithms exist for this.

Learning Weights
Try to match the reference translation exactly:
- Conditional random field: maximize the conditional probability of the reference translations, averaging over the different latent variables
- Max-margin: find the weight vector that separates the reference translation from the others by the maximal margin, using the maximal setting of the latent variables
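Written out, the CRF variant described above would maximize something like the following (a sketch; the alignment a is the latent variable being averaged over):

    p(e \mid g; w) = \frac{\sum_{a} \exp\left( w^\top h(g, e, a) \right)}
                          {\sum_{e'} \sum_{a'} \exp\left( w^\top h(g, e', a') \right)}
    \qquad
    \mathcal{L}(w) = \sum_{n} \log p\left( e^{(n)} \mid g^{(n)}; w \right)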

Problems
- These methods give full credit when the model exactly produces the reference and no credit otherwise
- What is the problem with this?
- There are many ways to translate a sentence
- What if we have multiple reference translations?
- What about partial credit?

Cost-Sensitive Training
- Assume we have a cost function \Delta(\hat{e}, E) \in [0, 1] that scores how good or bad a translation \hat{e} is relative to the reference(s) E
- Optimize the weight vector by making reference to this function
- We will talk about two ways to do this
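One crude, hypothetical instantiation of such a cost (real systems typically use 1 minus a sentence-level BLEU-style score):

    def cost(hyp, ref):
        # Toy cost in [0, 1]: one minus unigram precision against a single reference
        hyp_tokens = hyp.split()
        if not hyp_tokens:
            return 1.0
        ref_tokens = ref.split()
        matches = sum(1 for t in hyp_tokens if t in ref_tokens)
        return 1.0 - matches / len(hyp_tokens)

    print(cost("man bite dog", "man bites dog"))  # ≈ 0.33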

K-Best List Example
[figure: the 10-best hypotheses plotted in feature space (axes h_1, h_2), ranked #1-#10 along the direction of the weight vector w]

K-Best List Example
[figure: the same 10-best hypotheses, shaded by cost bands from 0.0 ≤ Δ < 0.2 up to 0.8 ≤ Δ < 1.0]

Training as Classification
Pairwise Ranking Optimization: reduce the training problem to binary classification with a linear model (a sketch follows below)
Algorithm:
- For i = 1 to N:
  - Pick a random pair of hypotheses (A, B) from the k-best list
  - Use the cost function to determine whether A or B is better
  - Create the i-th training instance
- Train a binary linear classifier
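A compact sketch of this procedure (the slides leave the classifier unspecified; here a simple perceptron on feature differences stands in, and the k-best entries carry precomputed costs):

    import random

    def pro_training_data(kbest, num_pairs):
        # kbest: list of (feature_dict, cost) pairs for one sentence
        data = []
        for _ in range(num_pairs):
            (feats_a, cost_a), (feats_b, cost_b) = random.sample(kbest, 2)
            if cost_a == cost_b:
                continue  # no preference between A and B, skip the pair
            diff = {k: feats_a.get(k, 0.0) - feats_b.get(k, 0.0)
                    for k in set(feats_a) | set(feats_b)}
            label = 1 if cost_a < cost_b else -1  # +1 means A is the better translation
            data.append((diff, label))
        return data

    def train_binary_classifier(data, epochs=10):
        # Perceptron updates on the difference vectors
        w = {}
        for _ in range(epochs):
            for diff, label in data:
                score = sum(w.get(k, 0.0) * v for k, v in diff.items())
                if label * score <= 0:  # misclassified: move w toward the better side
                    for k, v in diff.items():
                        w[k] = w.get(k, 0.0) + label * v
        return w

    kbest = [({"lm": -9.0, "tm": -7.0}, 0.1),
             ({"lm": -13.0, "tm": -6.5}, 0.6),
             ({"lm": -11.0, "tm": -9.0}, 0.4)]
    w = train_binary_classifier(pro_training_data(kbest, num_pairs=50))

The learned weight vector w is then plugged back into the decoder as the new feature weights.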

Reading
Read Chapter 9 from the textbook

Announcements
- HW3 due on Thursday at 11:59pm
- Jonny has office hours tomorrow, 2-3pm
- 1st term project is due by the end of Spring break
- Upcoming language-in-10 Thursday: Anshul - Japanese