INF5820/INF9820 LANGUAGE TECHNOLOGICAL APPLICATIONS. Jan Tore Lønning, Lecture 3, 7 Sep. 2016. jtl@ifi.uio.no


Machine Translation Evaluation
1. Automatic MT evaluation:
   1. BLEU
   2. Alternatives
   3. Evaluating the evaluation
   4. Criticism
2. Starting SMT:
   1. The noisy channel model
   2. Language models (n-grams)

Last week:
- Human evaluation
- Machine evaluation
- Recall and precision
- Word error rate
- BLEU

BLEU - a Bilingual Evaluation Understudy score
Main ideas:
- Use several reference translations.
- Count precision of n-grams: for each n-gram in the output, does it occur in at least one reference?
- Don't count recall, but use a penalty for brevity.

BLEU modified n-gram precision:

$$p_n = \frac{\sum_{C \in \{Candidates\}} \; \sum_{n\text{-gram} \in C} Count_{clip}(n\text{-gram}, C, C.refs)}{\sum_{C \in \{Candidates\}} \; \sum_{n\text{-gram} \in C} Count(n\text{-gram}, C)}$$

where
- {Candidates} is the set of sentences output by the translation system,
- Count(n-gram, C) is the number of times the n-gram occurs in C,
- Count_clip(n-gram, C, C.refs) is the number of times the n-gram occurs in C, clipped at the largest number of occurrences of the same n-gram in any single reference translation for the same sentence.

Technicality: if the same n-gram occurs several times in a candidate translation sentence, it is not counted more times than its largest number of occurrences in any single reference sentence.
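
To make the clipping concrete, here is a minimal Python sketch of the modified n-gram precision (the helper names are mine, not from the BLEU paper):

from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidates, references, n):
    """p_n: clipped n-gram matches over total candidate n-grams.
    candidates: list of token lists; references: one list of
    reference token lists per candidate."""
    matched, total = 0, 0
    for cand, refs in zip(candidates, references):
        cand_counts = Counter(ngrams(cand, n))
        # Clip each candidate count at its maximum count in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, c in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], c)
        matched += sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total += sum(cand_counts.values())
    return matched / total

The example on the next slides works through these counts by hand.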

Example, p_1 and p_2
Hyp, C: "One of the girls gave one of the boys one of the boys."
C-Refs:
- "A girl gave a boy one of the toy cars."
- "One of the girls gave a boy one of the cars."
Count_clip("one", C, C-refs) = 2

Unigram   clipped (raw) count
one       2 (3)
of        2 (3)
the       2 (3)
girls     1
gave      1
boys      0 (2)

p_1 = 8/13

Bigram       clipped (raw) count
one of       2 (3)
of the       2 (3)
the girls    1
girls gave   1
gave one     0 (1)
the boys     0 (2)
boys one     0 (1)

p_2 = 6/12

Example, p_3
Hyp and references as on the previous slide.
Count_clip("one of the", C, C-refs) = 2

Trigram          clipped (raw) count
one of the       2 (3)
of the girls     1
the girls gave   1
girls gave one   0 (1)
gave one of      0 (1)
of the boys      0 (2)
the boys one     0 (1)
boys one of      0 (1)

p_3 = 4/11

Example continued

$$\prod_{i=1}^{4} p_i = p_1 \cdot p_2 \cdot p_3 \cdot p_4 = \frac{8}{13} \cdot \frac{6}{12} \cdot \frac{4}{11} \cdot \frac{2}{10} \approx 0.02238$$

$$\left(\prod_{i=1}^{4} p_i\right)^{1/4} \approx 0.02238^{1/4} \approx 0.39$$

(p_4 = 2/10: of the ten 4-grams in C, only "one of the girls" and "of the girls gave" occur in a reference.)
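
Using the modified_precision sketch above, these numbers can be reproduced (assuming lower-cased tokenization):

cand = "one of the girls gave one of the boys one of the boys".split()
refs = ["a girl gave a boy one of the toy cars".split(),
        "one of the girls gave a boy one of the cars".split()]
for n in range(1, 5):
    print(n, modified_precision([cand], [refs], n))
# 1 -> 8/13, 2 -> 6/12, 3 -> 4/11, 4 -> 2/10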

BLEU: how to combine the n-gram precisions?
Remember:

$$p = \prod_{i=1}^{n} p_i$$

$$\ln\left(\prod_{i=1}^{n} p_i\right) = \ln(p_1 p_2 \cdots p_n) = \ln(p_1) + \ln(p_2) + \cdots + \ln(p_n) = \sum_{i=1}^{n} \ln p_i$$

One can add weights, typically a_i = 1/n:

$$\ln(p_1^{a_1} p_2^{a_2} \cdots p_n^{a_n}) = a_1 \ln(p_1) + a_2 \ln(p_2) + \cdots + a_n \ln(p_n)$$

How long n-grams? Max 4-grams seems to work best.

Brevity penalty
c is the total length of the candidates; r is the total length of the reference translations: for each C, choose the reference R most similar in length. The penalty applies if c < r:

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{otherwise} \end{cases}$$

$$BLEU = BP \cdot \exp\left(\sum_{i=1}^{N} w_i \ln p_i\right)$$

$$\ln BLEU = \min\left(1 - \frac{r}{c},\, 0\right) + \sum_{i=1}^{N} w_i \ln p_i$$

$$c = \sum_{C \in Candidates} length(C), \qquad r = \sum_{C \in Candidates} length(R_{sim(C)})$$

This version is correct; there is an error in K:SMT. Use logarithms to avoid underflow!

BLEU-4

$$BLEU_4 = \exp\left(\min\left(1 - \frac{r}{c},\, 0\right) + \frac{1}{4} \sum_{i=1}^{4} \ln p_i\right)$$

$$BLEU_4 = \min\left(e^{1 - r/c},\, 1\right) \cdot \left(\prod_{i=1}^{4} p_i\right)^{1/4}$$
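
Putting the pieces together, a minimal corpus-level BLEU-4 sketch in log space, building on modified_precision above (my own helper; real evaluations use tools such as NIST's mteval or sacrebleu):

import math

def bleu4(candidates, references):
    """Corpus-level BLEU-4, computed in log space as on the slide."""
    log_sum = 0.0
    for n in range(1, 5):
        p_n = modified_precision(candidates, references, n)
        if p_n == 0:
            return 0.0  # one zero precision zeroes the geometric mean
        log_sum += 0.25 * math.log(p_n)
    c = sum(len(cand) for cand in candidates)
    r = 0
    for cand, refs in zip(candidates, references):
        # For each candidate, use the reference closest in length.
        r += min((len(ref) for ref in refs), key=lambda L: abs(L - len(cand)))
    log_bp = min(1.0 - r / c, 0.0)  # log of the brevity penalty
    return math.exp(log_bp + log_sum)

For the worked example, c = 13 > r = 11, so BP = 1 and bleu4 returns about 0.39.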

NIST score
- National Institute of Standards and Technology
- Evaluated the BLEU score further
- Proposed an alternative formula:
  - n-grams are weighted by their inverse frequency
  - sums (instead of products) of counts over n-grams
  - modified brevity penalty
- Freely available software

Evaluating the automatic evaluation
Is the automatic evaluation correct? Yes, if it gives the same results as human evaluators.
Best measured as a ranking of MT systems: does BLEU rank a set of MT systems in the same order as human evaluators?
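
Agreement between the two rankings can be quantified with a rank correlation such as Spearman's rho; a small illustration with invented scores (not from the lecture):

from scipy.stats import spearmanr

# Hypothetical scores for five MT systems, for illustration only.
human_scores = [4.1, 3.8, 3.0, 2.5, 2.2]   # e.g. mean adequacy judgments
bleu_scores = [0.34, 0.35, 0.27, 0.22, 0.18]
rho, p_value = spearmanr(human_scores, bleu_scores)
print(rho)  # 1.0 would mean identical rankings; here 0.9 (one pair swapped)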

BLEU original paper
[Figure from the original BLEU paper: BLEU scores for H1, H2 (two different human translations) and S1, S2, S3 (three different MT systems), compared with human judgments.]

Shortcomings of automatic MT evaluation
"Re-evaluating the Role of BLEU in Machine Translation Research" (2006), Chris Callison-Burch, Miles Osborne, Philipp Koehn
Theoretically: from a reference translation one may construct a string of words which:
- gets a high BLEU score
- is gibberish
Empirically: (next slides)

Automatic evaluation
Pros:
- Cheap
- Reusable in the development phase
- A touch of objectivity
- A useful tool for machine learning, e.g. reranking
Cons:
- Does not measure MT quality; only (more or less) correlates with MT quality
- Favors statistical approaches, disfavors humans
- The numbers don't say anything across different evaluations: they depend on the number and type of reference translations
- Danger of tuning the system towards BLEU at the cost of quality, in particular in machine learning

Hypothesis testing
You may skip sec. 8.3, though:
- 8.3.1 is relevant for those who have taken INF5830
- 8.3.2 applies when you have 2 different systems:
  - You might evaluate first one system, then the other, on the whole material and compare the results.
  - Often better: compare item by item which system is better, and do statistics on the results (see the sketch below).
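
As an illustration of the item-by-item idea (my own sketch, not from the lecture), a paired sign test over per-sentence wins:

from scipy.stats import binomtest

# Hypothetical per-sentence outcomes: +1 if system A beats system B,
# -1 if B beats A (ties are usually dropped).
outcomes = [+1, +1, -1, +1, +1, +1, -1, +1, +1, +1]
wins_a = sum(1 for o in outcomes if o > 0)
result = binomtest(wins_a, n=len(outcomes), p=0.5)
print(result.pvalue)  # about 0.11 here: not significant at the 5% level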

SMT example
En kokk lagde en rett med bygg.

Word-by-word translation options with probabilities:
en:    a 0.9
kokk:  chef 0.6, cook 0.3
lagde: made 0.3, created 0.25, prepared 0.15, constructed 0.12, cooked 0.05
en:    a 0.9
rett:  right 0.19, straight 0.17, court 0.12, dish 0.11, course 0.07
med:   with 0.4, by 0.3, of 0.2
bygg:  building 0.45, construction 0.33, barley 0.11

Similarly for: pos 0-2 (2x3), pos 1-3, pos 2-4, pos 3-5 (4x5), pos 6-8; pos 4-6 (1x3x3, many), pos 5-7 (5x3x3, many).

Scores for some three-word spans:
a right with  2.7x10^-12    right with building      1.7x10^-18
a right of    1.5x10^-10    right with construction  5.4x10^-18
a right by    9.7x10^-12    right with barley        8.7x10^-19
a course of   1.5x10^-14    course of barley         1.5x10^-16
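
A sketch of how the word-by-word options combine, using only the lexicon probabilities above (the much smaller scores on the slide evidently include further factors, e.g. a language model, which this sketch omits):

from itertools import product

# Translation options t(e|f) for a three-word span, taken from the slide.
lexicon = {
    "en":   [("a", 0.9)],
    "rett": [("right", 0.19), ("straight", 0.17), ("court", 0.12),
             ("dish", 0.11), ("course", 0.07)],
    "med":  [("with", 0.4), ("by", 0.3), ("of", 0.2)],
}

source = ["en", "rett", "med"]
candidates = []
for choice in product(*(lexicon[w] for w in source)):
    words, probs = zip(*choice)
    score = 1.0
    for p in probs:
        score *= p
    candidates.append((" ".join(words), score))

for words, score in sorted(candidates, key=lambda x: -x[1])[:3]:
    print(f"{words}\t{score:.4f}")
# Highest lexicon-only score: "a right with" (0.9 * 0.19 * 0.4 = 0.0684).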
