CS230: Lecture 10 Sequence models II



Today's outline. We will learn how to: I. BLEU score - automatically score an NLP model; II. Beam Search - improve Machine Translation results with beam search; III. Speech Recognition - build a speech recognition application.

BLEU score. [Figure: an encoder-decoder Neural Machine Translation model. The encoder reads "Brian is in the kitchen <eos>" into hidden states h_1 ... h_6; the decoder states s_1 ... s_6 each apply a softmax over the vocabulary to produce the translation "Brian est dans la cuisine <eos>".]

BLEU score. Example: candidate translation "The baby to be walking by him" vs. human translation "The baby walks by himself". Motivation: human evaluations of MT are extensive but expensive. Goal: construct a quick, inexpensive, language-independent method that correlates highly with human evaluation to automatically evaluate Machine Translation models. Central idea: the closer a machine translation is to a professional human translation, the better it is. Kishore Papineni et al., BLEU: a Method for Automatic Evaluation of Machine Translation, 2002

BLEU score. Needs two ingredients: a numerical translation closeness metric, and a corpus of good-quality human reference translations. In speech recognition, a successful metric is the word error rate; BLEU's closeness metric was modeled after it. Kishore Papineni et al., BLEU: a Method for Automatic Evaluation of Machine Translation, 2002

BLEU score. Machine translations: Candidate 1: "It is a guide to action which ensures that the military always obeys the commands of the party." Candidate 2: "It is to insure the troops forever hearing the activity guidebook that party directs." Human translations: Reference 1: "It is a guide to action that ensures that the military will forever heed Party commands." Reference 2: "It is the guiding principle which guarantees the military forces always being under the command of the Party." Reference 3: "It is the practical guide for the army always to heed the directions of the party." Kishore Papineni et al., BLEU: a Method for Automatic Evaluation of Machine Translation, 2002

BLEU score. Unigram count as a precision metric. Machine translation candidate: "the the the the the the the". Human translations: Reference 1: "The cat is on the mat." Reference 2: "There is a cat on the mat." Standard unigram precision = (# MT words occurring in any reference HT) / (# MT words) = 7/7 = 100%. Modified unigram precision = (# MT words occurring in any reference HT, clipped) / (# MT words) = 2/7 ≈ 29%. Kishore Papineni et al., BLEU: a Method for Automatic Evaluation of Machine Translation, 2002
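To make the clipping step concrete, here is a minimal sketch in plain Python (our own illustration, not code from the lecture); the function name modified_ngram_precision and the whitespace tokenization are our choices:

```python
from collections import Counter

def modified_ngram_precision(candidate, references, n=1):
    """Clip each candidate n-gram count by its maximum count in any single reference."""
    def ngrams(tokens):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    cand_counts = Counter(ngrams(candidate))
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
print(modified_ngram_precision(candidate, references, n=1))  # 2/7 ≈ 0.286
```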

BLEU score. Machine translations: Candidate 1: "It is a guide to action which ensures that the military always obeys the commands of the party." Candidate 2: "It is to insure the troops forever hearing the activity guidebook that party directs." Human translations: Reference 1: "It is a guide to action that ensures that the military will forever heed Party commands." Reference 2: "It is the guiding principle which guarantees the military forces always being under the command of the Party." Reference 3: "It is the practical guide for the army always to heed the directions of the party." Modified unigram precision (Candidate 1) = (# MT words occurring in any reference HT, clipped) / (# MT words) = 17/18 = 94%. Modified unigram precision (Candidate 2) = 8/14 = 57%. Kishore Papineni et al., BLEU: a Method for Automatic Evaluation of Machine Translation, 2002

BLEU score. Generalizing to a modified n-gram precision metric: Modified n-gram precision = (# MT n-grams occurring in any reference HT, clipped) / (# MT n-grams). With the same candidates and references as above: Modified bigram precision (Candidate 1) = 10/17. Modified bigram precision (Candidate 2) = 1/13. Kishore Papineni et al., BLEU: a Method for Automatic Evaluation of Machine Translation, 2002

BLEU score. Generalizing from sentence-level precision to corpus-level precision: p_n = \frac{\sum_{C \in \{\text{Candidates}\}} \sum_{\text{n-gram} \in C} \text{Count}_{\text{clip}}(\text{n-gram})}{\sum_{C' \in \{\text{Candidates}\}} \sum_{\text{n-gram}' \in C'} \text{Count}(\text{n-gram}')}. Kishore Papineni et al., BLEU: a Method for Automatic Evaluation of Machine Translation, 2002

BLEU Score. Geometric average of the modified n-gram precision measures: \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) = p_1^{w_1} p_2^{w_2} \cdots p_N^{w_N}. Kishore Papineni et al., BLEU: a Method for Automatic Evaluation of Machine Translation, 2002

BLEU Score. Sentence length. Too long translations (e.g. candidate 1 "I always invariably perpetually do" vs. candidate 2 "I always do", against references "I always do", "I invariably do", "I perpetually do") are already penalized by the modified n-gram precision measure. Too short translations (e.g. the candidate "of the" against the "guide to action" references above) are not. Conclusion: only overly short translations need an extra penalty: BP = brevity penalty = a decaying exponential in r/c, where r is the total length of the reference translation corpus and c is the total length of the candidate (machine) translation corpus. Kishore Papineni et al., BLEU: a Method for Automatic Evaluation of Machine Translation, 2002

BLEU Score. BLEU definition: BLEU = (p_1^{w_1} p_2^{w_2} \cdots p_N^{w_N}) \cdot BP, where BP = 1 if c > r and BP = e^{1 - r/c} if c \le r. In log space: \log(\text{BLEU}) = \min\left(1 - \frac{r}{c}, 0\right) + \sum_{n=1}^{N} w_n \log(p_n). Kishore Papineni et al., BLEU: a Method for Automatic Evaluation of Machine Translation, 2002
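Putting the pieces together, a minimal sketch (our own, with made-up numbers) of the BLEU formula given already-computed modified precisions p_n, the candidate-corpus length c, and the reference length r:

```python
import math

def bleu(precisions, weights, c, r):
    """BLEU = BP * p_1^w_1 * ... * p_N^w_N, with BP = 1 if c > r else exp(1 - r/c).
    c = total length of the candidate corpus, r = total reference length.
    Note: any p_n equal to 0 makes log() blow up; real implementations smooth this."""
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

# Made-up precisions, just to exercise the formula with uniform weights w_n = 1/4 (BLEU-4).
print(bleu(precisions=[0.94, 0.55, 0.40, 0.30], weights=[0.25] * 4, c=18, r=16))
```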

Beam search: Questions. What is the main advantage of beam search compared to other search algorithms? It is fast and requires fewer computations. What is the main disadvantage of beam search compared to other search algorithms? It may not find the optimal solution in terms of probability. What are the time and memory complexity of beam search? With beam width b and output length T_x, it is O(b·T_x) in memory and O(b·T_x) in time.
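For reference, a minimal generic beam-search sketch (our own, not the course's implementation); next_log_probs is a hypothetical callable returning the log-probability of each vocabulary token given the current prefix, and eos marks the end of a hypothesis:

```python
def beam_search(next_log_probs, vocab, eos, beam_width=3, max_len=20):
    """Keep only the beam_width highest-scoring partial hypotheses at each decoding step."""
    beams = [([], 0.0)]          # (token sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            step_scores = next_log_probs(seq)            # token -> log P(token | prefix)
            for token in vocab:
                candidates.append((seq + [token], score + step_scores[token]))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:       # prune to b hypotheses
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    return max(finished + beams, key=lambda item: item[1])
```

Keeping only b partial hypotheses of length at most T_x is what gives the O(b·T_x) memory footprint quoted above.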

Speech Recognition Pipeline. [Figure: audio data displayed as a spectrogram (frequency vs. time). Questions posed on the slide: which model? what is the input X? what is the output Y (e.g. "Hello")?]

Speech Recognition. [Figure: raw audio converted to a spectrogram and fed to hidden states h_1 ... h_5; each step outputs, through a softmax, a probability distribution of shape (28, 1) over characters, trained with a cross-entropy loss against the target "H E L L O".] This never happens in practice, because the input length and the output length are not equal: there are far more audio frames than output characters.

Speech Recognition. https://distill.pub/2017/ctc/ [Figure: raw audio converted to a spectrogram and fed to hidden states h_1 ... h_11; each step outputs, through a softmax, a probability distribution of shape (28, 1) over characters; the frame-level output "H H E E E L L _ L O O" is collapsed by β(...) to ŷ = "HELLO".] Graves et al., 2006, Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks

Speech Recognition. Examples: β(HH _ EEEE _ LL _ LOO) = "HELLO"; β(H _ E _ L _ LOO) = "HELLO"; β(H EE LLL _ OO) = "HELO"; β(BBAA _ NA _ NN AA _ A) = "BANANAA".
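A tiny Python sketch of the collapsing function β used in these examples (our own illustration; the blank symbol is written "_" here):

```python
def collapse(alignment, blank="_"):
    """The CTC collapsing function β: merge repeated characters, then drop blanks."""
    output, previous = [], None
    for char in alignment:
        if char != previous and char != blank:
            output.append(char)
        previous = char
    return "".join(output)

assert collapse("HH_EEEE_LL_LOO") == "HELLO"
assert collapse("H_E_L_LOO") == "HELLO"
assert collapse("HEELLL_OO") == "HELO"      # no blank between the L runs -> single L
assert collapse("BBAA_NA_NNAA_A") == "BANANAA"
```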

Speech Recognition. Independence assumption: P(c \mid x) = \prod_{t=1}^{T_x} P(c_t \mid x). [Table of example alignments c with their probabilities, e.g. P(c \mid x) = .13 for c = "HH _ E L _ LL _ OO", .04 for c = "H _ EE _ L _ LL _ HO", .01 for c = "H _ I _ ILL _ L _ OOOO", ...] Marginalizing over alignments: P(y \mid x) = \sum_{c : \beta(c) = y} P(c \mid x), e.g. P("HELLO" \mid x) = .13 + .03 + .01 + ...
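To make the marginalization concrete, a deliberately naive sketch (ours, not the lecture's) that enumerates every alignment; it assumes the collapse() function from the sketch above is in scope and uses made-up per-frame probabilities. Real CTC replaces this exponential enumeration with dynamic programming (the forward algorithm):

```python
import itertools

def prob_of_transcript(y, frame_probs, alphabet):
    """P(y|x) = sum over all alignments c with β(c) = y of prod_t P(c_t|x).
    Brute force for illustration only; assumes collapse() from the previous sketch."""
    total = 0.0
    for c in itertools.product(alphabet, repeat=len(frame_probs)):
        if collapse("".join(c)) == y:
            p = 1.0
            for t, char in enumerate(c):
                p *= frame_probs[t][char]   # conditional independence across frames
            total += p
    return total

# Hypothetical 3-frame example over a tiny alphabet {H, I, _}.
frame_probs = [{"H": 0.8, "I": 0.1, "_": 0.1},
               {"H": 0.2, "I": 0.6, "_": 0.2},
               {"H": 0.1, "I": 0.7, "_": 0.2}]
print(prob_of_transcript("HI", frame_probs, alphabet="HI_"))
```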

Speech Recognition. Loss? L = -\sum_i \log P(\hat{y}^{(i)} \mid x^{(i)}) = -\sum_i \log \sum_{c : \beta(c) = \hat{y}^{(i)}} P(c \mid x^{(i)}). [Figure: same network as before; hidden states h_1 ... h_11, per-frame softmax distributions of shape (28, 1), frame-level output "H H E E E L L - L O O" collapsed by β(...) to ŷ = "HELLO".] Graves et al., 2006, Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks

Speech Recognition. Inference? BEAM SEARCH > MAX DECODING :) Graves et al., 2006, Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks
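For contrast, a minimal sketch of max ("best-path") decoding, again our own illustration assuming the collapse() function from the earlier sketch is in scope:

```python
def max_decode(frame_probs, alphabet):
    """Best-path ('max') decoding: take the argmax character at every frame, then collapse.
    Beam search over collapsed prefixes usually finds higher-probability transcripts,
    because the probability of a transcript is spread over many alignments."""
    best_path = "".join(max(alphabet, key=lambda char: probs[char]) for probs in frame_probs)
    return collapse(best_path)
```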

Speech Recognition. L = -\sum_i \log P(\hat{y}^{(i)} \mid x^{(i)}) = -\sum_i \log \sum_{c : \beta(c) = \hat{y}^{(i)}} P(c \mid x^{(i)}). Implementations of the CTC loss: tf.nn.ctc_loss( ) in TensorFlow; in Keras, write a custom loss. Graves et al., 2006, Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks
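As a sketch of how the TensorFlow implementation might be called (the shapes, label indices, and zero-padding convention below are our own assumptions, not from the lecture):

```python
import tensorflow as tf

# Hypothetical batch: 2 utterances, 50 spectrogram frames, 28 output symbols (last one = blank).
batch_size, num_frames, num_symbols = 2, 50, 28
logits = tf.random.normal([batch_size, num_frames, num_symbols])  # unnormalised per-frame scores
labels = tf.constant([[8, 5, 12, 12, 15],                          # e.g. "HELLO" (1-based indices)
                      [8, 9,  0,  0,  0]])                         # e.g. "HI", zero-padded
label_length = tf.constant([5, 2])
logit_length = tf.fill([batch_size], num_frames)

per_example_loss = tf.nn.ctc_loss(
    labels=labels,
    logits=logits,
    label_length=label_length,
    logit_length=logit_length,
    logits_time_major=False,   # our logits are [batch, frames, symbols]
    blank_index=-1,            # use the last symbol as the CTC blank
)
loss = tf.reduce_mean(per_example_loss)  # the quantity to minimise
```

In Keras, one would wrap a computation like this in a custom loss (tf.keras.backend.ctc_batch_cost is another commonly used entry point).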

Speech Recognition. Closing questions: How do we incorporate information about the future? A good way to efficiently incorporate future information in speech recognition is still an open problem. What is the consequence of the conditional independence assumption P(c \mid x) = \prod_{t=1}^{T_x} P(c_t \mid x)? A model like CTC may have trouble producing diverse transcripts for the same utterance because of the conditional independence assumption between frames; on the other hand, it makes the model more robust to a change of settings. What is the problem with our output ŷ? A CTC model makes a lot of spelling and linguistic mistakes, because P(y \mid x) directly models the audio data and some words are hard to spell from their audio alone. Can you think of any practical applications leveraging this model? Lipreading. Assael et al., LipNet: end-to-end sentence-level lipreading, 2016. Hannun, "Sequence Modeling with CTC", Distill, 2017. Chan et al., 2015, Listen, Attend and Spell