Learning to translate with neural networks. Michael Auli


Neural networks for text processing: similar words lie near each other (dog ~ cat, France ~ Spain). Changing the model parameters for one example affects similar words in similar contexts. Traditional discrete models treat each word separately.

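A minimal sketch of the "similar words near each other" idea, assuming made-up 2-d word vectors: related words (dog/cat, France/Spain) end up with high cosine similarity, unrelated ones with low similarity.

```python
import numpy as np

# Toy 2-d word vectors, purely for illustration: similar words are nearby.
vecs = {
    "dog":    np.array([0.90, 0.10]),
    "cat":    np.array([0.85, 0.15]),
    "France": np.array([0.10, 0.95]),
    "Spain":  np.array([0.15, 0.90]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vecs["dog"], vecs["cat"]))      # high: similar words
print(cosine(vecs["dog"], vecs["France"]))   # low: unrelated words
```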
Neural networks: error 26% → 15% (Krizhevsky 2012); error 27% → 18% (Hinton 2012); language modeling PPLX 141 → 101 (Mikolov 2011); machine translation: this talk; Le (2012), Kalchbrenner (2013), Devlin (2014), Sutskever (2014), Cho (2014).

What happened in MT over the past 10 years? Learning simple models from large bi-texts is a solved problem! (Lopez & Post, 2013) WMT 2013: 9/10 times.

Machine translation of the running example "development and progress of the region.": translation modeling (Auli et al., EMNLP 2013; Hu et al., EACL 2014), language modeling, optimization (Auli & Gao, ACL 2014), reordering (Auli et al., EMNLP 2014).

Discrete phrase-based translation (Koehn et al. 2003): "development and progress of the region." is translated with phrases such as "of the region" and "development and progress".
[Histogram: phrase length (1-7) in train vs. test; average phrase length 2.3 words in train, 1.5 words in test.]

Discrete n-gram language modeling (Kneser & Ney, 1996): p(progress in the region) = p(progress) p(in) p(the) p(region | the). Train data: "development and progress of the region. in"
[Histogram: n-gram order (1-5) matched at test time, average 2.7 words; does not include out-of-vocabulary tokens.]

How can we improve this? Or: how to capture relationships beyond 1.5 to 2.7 words? Neural nets: distributional representations make it easier to capture relationships (dog ~ cat, France ~ Spain). Recurrent nets: easy to model variable-length sequences.

This talk Translation modeling Auli et al., EMNLP 2013; Hu et al., EACL 2014 Language modeling development and progress of the region. Optimization Auli & Gao, ACL 2014 Reordering Auli et al., EMNLP 2014

Feed-forward network (e.g. a bigram LM): one-hot input x_t (here the word "and"); feature learning h_t = σ(U x_t) with σ(z) = 1 / (1 + exp(−z)); classification y_t = softmax(V h_t) with softmax(z)_i = exp(z_i) / Σ_j exp(z_j); output p(progress | and). Still based on limited context!

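The following is a minimal sketch of the feed-forward bigram LM described above: one-hot input, sigmoid hidden layer (feature learning), softmax output layer (classification). The tiny vocabulary, layer sizes and random weights are illustrative assumptions, not trained parameters.

```python
import numpy as np

vocab = ["<s>", "development", "and", "progress", "of"]
V_size, H = len(vocab), 8

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(H, V_size))   # input -> hidden
V = rng.normal(scale=0.1, size=(V_size, H))   # hidden -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())                   # subtract max for stability
    return e / e.sum()

def bigram_lm_step(prev_word):
    """p(next word | prev_word) under the feed-forward bigram LM."""
    x = np.zeros(V_size)
    x[vocab.index(prev_word)] = 1.0           # one-hot encoding, e.g. "and"
    h = sigmoid(U @ x)                        # h_t = sigma(U x_t)
    y = softmax(V @ h)                        # y_t = softmax(V h_t)
    return dict(zip(vocab, y))

print(bigram_lm_step("and"))                  # e.g. p(progress | and)
```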
Recurrent network (e.g. a bigram LM): add a dependence on the previous time step, h_t = σ(U x_t + W h_{t−1}); the output layer is unchanged, y_t = softmax(V h_t), e.g. p(progress | and).

Recurrent network unrolled over time: <s> → p(development | <s>); development → p(and | development, <s>); and → p(progress | development, and, <s>); ... The hidden state carries the history of inputs up to the current time step. State of the art in language modeling (Mikolov 2011); more accurate than feed-forward nets (Sundermeyer 2013).

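A minimal sketch of the recurrent language model forward pass, h_t = σ(U x_t + W h_{t−1}), y_t = softmax(V h_t); the vocabulary, sizes and random weights are illustrative assumptions.

```python
import numpy as np

vocab = ["<s>", "development", "and", "progress", "of", "the", "region"]
V_size, H = len(vocab), 16
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(H, V_size))
W = rng.normal(scale=0.1, size=(H, H))
V = rng.normal(scale=0.1, size=(V_size, H))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_lm_logprob(sentence):
    """Sum of log p(w_t | w_<t) under the recurrent LM."""
    h = np.zeros(H)
    total = 0.0
    for prev, cur in zip(sentence[:-1], sentence[1:]):
        x = np.zeros(V_size)
        x[vocab.index(prev)] = 1.0
        h = sigmoid(U @ x + W @ h)   # hidden state carries the full history
        y = softmax(V @ h)
        total += np.log(y[vocab.index(cur)])
    return total

print(rnn_lm_logprob(["<s>", "development", "and", "progress"]))
```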
Combined language and translation model (Auli et al., EMNLP 2013): condition the recurrent LM on a vector s representing the entire source sentence, h_t = σ(U x_t + W h_{t−1} + F s), and generate the target sequence <s> development and progress ... as before.

Variant: condition on a source word-window instead of the whole sentence, with the same recurrence h_t = σ(U x_t + W h_{t−1} + F s), where s now represents the words in the current source window. Related: Le (2012), Kalchbrenner (2013), Devlin (2014), Cho (2014), Sutskever (2014).

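A minimal sketch of the source-conditioned recurrence h_t = σ(U x_t + W h_{t−1} + F s). The summation used to build s and all sizes and weights are illustrative assumptions, not the exact model of Auli et al. (2013); the same step works whether s encodes the whole sentence or a word-window.

```python
import numpy as np

tgt_vocab = ["<s>", "development", "and", "progress"]
V_size, H, E = len(tgt_vocab), 16, 8
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(H, V_size))
W = rng.normal(scale=0.1, size=(H, H))
F = rng.normal(scale=0.1, size=(H, E))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def source_repr(source_vectors):
    # s: sum of source word vectors (whole sentence or a word-window slice);
    # the summation is one simple choice, assumed here for illustration.
    return np.sum(source_vectors, axis=0)

def step(prev_word, h_prev, s):
    """One recurrent step conditioned on the source representation s."""
    x = np.zeros(V_size)
    x[tgt_vocab.index(prev_word)] = 1.0
    return sigmoid(U @ x + W @ h_prev + F @ s)

src_vectors = rng.normal(size=(6, E))     # stand-in source word embeddings
s = source_repr(src_vectors)              # or source_repr(src_vectors[2:5]) for a window
h = np.zeros(H)
for w in ["<s>", "development", "and"]:
    h = step(w, h, s)
print(h[:4])
```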
Experimental setup: WMT 2012 French-English translation task; data: 100M words; baseline: phrase-based model similar to Moses; rescoring; mini-batch gradient descent; class-structured output layer (Goodman, 1996).

Does the neural model learn to translate? (− discrete translation model) [Bar chart: BLEU on WMT 2012 French-English, 100M words, phrase-based baseline, n-best rescoring; systems: Phrase-based (Koehn et al. 2003), RNNLM (Mikolov 2011), RNNTM full-sentence, RNNTM word-window; annotated gains of +0.8 and +1.5 BLEU; y-axis 22.0-24.0.]

Improving a phrase-based baseline (+ discrete translation model) [Bar chart: BLEU on WMT 2012 French-English, 100M words, phrase-based baseline, lattice rescoring; systems: Phrase-based (Koehn et al. 2003), RNNLM (Mikolov 2011), RNNTM word-window; annotated gain of +1.4 BLEU; y-axis 24.5-27.0.]

Combining recurrent nets with discrete models (Auli & Gao, ACL 2014) [Bar chart: BLEU 24.9 for the phrase-based baseline, 25.7 with n-best rescoring, 26.4 with lattice rescoring, 26.9 with decoder integration; +2.0 over the baseline.]

Neural nets vs discrete models: are neural models more robust where discrete models rely on sparse estimates? Language modeling: the n-gram LM and the neural net LM are two components of a log-linear model of translation,

ê = argmax_e Σ_i λ_i h_i(f, e), with h_1(f, e) = log p_lm(e) and h_2(f, e) = log p_rnn(e).

Split each model into five features, one for each n-gram order, such that log p_lm(e) = Σ_{i=1..5} h_i(f, e) and log p_rnn(e) = Σ_{i=6..10} h_i(f, e), and use a standard optimizer (MERT) to find a weight for each n-gram order.

[Charts: histogram of matched n-gram order (1-5), average 2.7 words; feature weight per n-gram order (0-0.05, from less to more important) for the n-gram LM and the RNN LM; annotations: "n-gram model relies on unigram counts", "n-gram model has higher-order statistics".]

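A minimal sketch of the per-order feature split described above, assuming the LM can report, for each word, the log probability and the n-gram order it actually used; the helper names and the toy weights are illustrative.

```python
from collections import defaultdict

def per_order_features(scored_words, max_order=5):
    """scored_words: list of (log_prob, ngram_order_used) pairs, one per word.
    Returns h_1..h_max_order, where h_k collects the log probabilities of
    words scored with an n-gram of order k; the features sum back to log p(e)."""
    h = defaultdict(float)
    for log_prob, order in scored_words:
        h[min(order, max_order)] += log_prob
    return [h[k] for k in range(1, max_order + 1)]

def loglinear_score(features, weights):
    """MERT-style log-linear combination: sum_i lambda_i * h_i(f, e)."""
    return sum(w * f for w, f in zip(weights, features))

# Usage: three words scored with a unigram, a trigram and a bigram.
feats = per_order_features([(-4.2, 1), (-1.3, 3), (-2.1, 2)])
print(feats, loglinear_score(feats, [0.03, 0.04, 0.05, 0.04, 0.01]))
```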
Recurrent Minimum Translation Unit (MTU) models: segment "development and progress of the region" and its source ("the region of ...") into minimal translation units M1-M6. n-gram models over MTUs: p(M1) p(M2 | M1) p(M3 | M1, M2) ... (Banchs et al. 2005; Quirk & Menezes 2006).

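A minimal sketch of an n-gram model over MTUs, treating each unit as an atomic symbol as in p(M1) p(M2 | M1) p(M3 | M1, M2); the toy corpus, the target|source unit notation and the add-one smoothing are illustrative assumptions.

```python
import math
from collections import defaultdict

def train_mtu_trigram(mtu_sequences):
    """Collect trigram counts over MTU symbols."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in mtu_sequences:
        padded = ["<s>", "<s>"] + seq
        for i in range(2, len(padded)):
            counts[tuple(padded[i - 2:i])][padded[i]] += 1
    return counts

def mtu_logprob(counts, seq, vocab_size):
    """log p(M1) + log p(M2 | M1) + log p(M3 | M1, M2) + ... with add-one smoothing."""
    padded = ["<s>", "<s>"] + seq
    logp = 0.0
    for i in range(2, len(padded)):
        ctx, m = tuple(padded[i - 2:i]), padded[i]
        c = counts[ctx]
        logp += math.log((c[m] + 1) / (sum(c.values()) + vocab_size))
    return logp

# Usage: MTUs written as target|source pairs (purely illustrative).
corpus = [["development|développement", "and|et", "progress|progrès",
           "of the region|de la région"]]
counts = train_mtu_trigram(corpus)
print(mtu_logprob(counts, corpus[0], vocab_size=4))
```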
Recurrent MTU models (Hu et al., EACL 2014): an RNN unrolled over the MTU sequence (<s> development and progress ...). Problem: many MTUs are singletons.

Recurrent MTU models: reduce sparsity with a bag-of-words representation of each unit, unrolled over <s> development and progress ...

Recurrent MTU model results [Bar chart: BLEU; Phrase-based (Koehn et al. 2003), n-gram MTU (Banchs et al. 2005; Quirk & Menezes 2006), RNN MTU; +1.5 BLEU over the baseline; y-axis 24.5-27.0.]

This talk Translation modeling Auli et al., EMNLP 2013; Hu et al., EACL 2014 Language modeling development and progress of the region. Optimization Auli & Gao, ACL 2014 Reordering Auli et al., EMNLP 2014

Backpropagation with cross-entropy error: given the true distribution t = [0, 0, 1, 0, 0] and the model distribution y = [0.2, 0.1, 0.4, 0.2, 0.1], the loss is J = −Σ_i t_i log y_i (goal: make correct outputs most likely) and the output-layer error is δ_i = y_i − t_i = [0.2, 0.1, −0.6, 0.2, 0.1].

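A quick numeric check of the slide's example, assuming nothing beyond the two distributions shown: cross-entropy J = −Σ_i t_i log y_i and output-layer error δ = y − t.

```python
import numpy as np

t = np.array([0.0, 0.0, 1.0, 0.0, 0.0])   # true distribution (one-hot)
y = np.array([0.2, 0.1, 0.4, 0.2, 0.1])   # model distribution

J = -np.sum(t * np.log(y))                # = -log 0.4, about 0.916
delta = y - t                             # = [0.2, 0.1, -0.6, 0.2, 0.1]
print(J, delta)
```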
Backpropagation through time: unroll the recurrent network over the sequence (<s> development and ...) and propagate the error backwards through the unrolled time steps.

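A minimal sketch of backpropagation through time for the simple RNN sketched earlier (sigmoid hidden layer, softmax output, cross-entropy loss); sizes, random weights and inputs are illustrative, and no truncation or gradient clipping is applied.

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, H, T = 5, 8, 4
U, W, Vo = (rng.normal(scale=0.1, size=s) for s in [(H, V_size), (H, H), (V_size, H)])
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

x = np.eye(V_size)[rng.integers(V_size, size=T)]   # one-hot inputs
targets = rng.integers(V_size, size=T)             # next-word indices

# Forward pass, storing hidden states for the backward pass.
h = [np.zeros(H)]
ys = []
for t in range(T):
    h.append(sigmoid(U @ x[t] + W @ h[-1]))
    ys.append(softmax(Vo @ h[-1]))

# Backward pass through time.
dU, dW, dVo = np.zeros_like(U), np.zeros_like(W), np.zeros_like(Vo)
dh_next = np.zeros(H)
for t in reversed(range(T)):
    delta = ys[t].copy()
    delta[targets[t]] -= 1.0                       # output error: y - t
    dVo += np.outer(delta, h[t + 1])
    dh = Vo.T @ delta + dh_next                    # error from output + future steps
    dz = dh * h[t + 1] * (1.0 - h[t + 1])          # back through the sigmoid
    dU += np.outer(dz, x[t])
    dW += np.outer(dz, h[t])
    dh_next = W.T @ dz                             # pass error back to step t-1
print(dW[0, :3])
```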
Optimization (Auli & Gao, ACL 2014): likelihood training is very common; optimizing for evaluation metrics is difficult but empirically successful (Och 2003; Smith 2006; Chiang 2009; Gimpel 2010; Hopkins 2011). Next: task-specific training of neural nets for translation.

BLEU metric (Bilingual Evaluation Understudy; Papineni 2002): n-gram precision scores combined with a brevity penalty. Example: Human: "development and progress of the region"; System: "advance and progress of region".

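A minimal sentence-level BLEU sketch: modified n-gram precisions combined with a brevity penalty. The +1 smoothing is an assumption so that short hypotheses do not score zero; it is not necessarily the exact sBLEU variant used in the experiments.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hyp, ref, max_n=4):
    hyp, ref = hyp.split(), ref.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        match = sum(min(c, r[g]) for g, c in h.items())   # clipped n-gram matches
        total = max(sum(h.values()), 1)
        log_prec += math.log((match + 1) / (total + 1))   # smoothed precision
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))  # brevity penalty
    return bp * math.exp(log_prec / max_n)

print(sentence_bleu("advance and progress of region",
                    "development and progress of the region"))
```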
Expected BLEU training (Smith 2006, He 2012, Gao 2014): likelihood, L: max p(ẽ | f; θ), maximizes the probability of the desired translation ẽ; expected BLEU, xBLEU: max Σ_{e ∈ E(f)} sBLEU(e, ẽ) p(e | f; θ), sums the gain function sBLEU weighted by the model probability over generated outputs E(f), e.g. an n-best list or lattice.

Expected BLEU training example. Human: "development and progress of the region".

Candidate                                   sBLEU   p_t(e|f;θ)   Δ_t    p_{t+1}(e|f;θ)
advance and progress of the region           0.8      0.2        0.3        0.5
development and progress of this province    0.5      0.3        0.0        0.3
progress of this region                      0.3      0.5       -0.2        0.2

xBLEU = Σ_{e ∈ E(f)} sBLEU(e, ẽ) p(e | f; θ); the update shifts probability mass toward candidates with higher sBLEU.

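A minimal sketch of the expected-BLEU objective over an n-best list, with p(e | f; θ) given by a softmax over per-candidate model scores; the scores, the plain gradient-ascent step and the learning rate are illustrative assumptions.

```python
import numpy as np

sbleu = np.array([0.8, 0.5, 0.3])      # gain of each candidate vs. the reference
scores = np.array([0.0, 0.4, 0.9])     # stand-in model scores for the candidates

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax(scores)                    # p_t(e | f; theta), roughly [0.2, 0.3, 0.5]
xbleu = np.sum(sbleu * p)              # expected BLEU over the n-best list

# Gradient of xBLEU w.r.t. the scores under the softmax parameterization:
#   d xBLEU / d s_e = p_e * (sBLEU_e - xBLEU)
grad = p * (sbleu - xbleu)
scores_next = scores + 1.0 * grad      # one gradient-ascent step
p_next = softmax(scores_next)          # mass shifts toward high-sBLEU candidates
print(xbleu, p, p_next)
```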
Results [Bar chart: BLEU; Phrase-based (Koehn et al. 2003), RNN cross-entropy, RNN + xBLEU; annotated gains of +0.8 and +1.4; y-axis 25.0-28.0.]

Scaling linear reordering models (Auli et al., EMNLP 2014): xBLEU training of millions of linear features. [Charts: BLEU vs. number of features (baseline, 4.5K, 9K, 900K, 3M; annotated gains of +0.7 and +2.0) and accuracy vs. training set size (2K-272K, news2011, roughly 25.6%-26.6%).]

My other neural network projects: social media response generation with RNNs (building neural net-based conversational agents from Twitter conversations); semi-supervised phrase table expansion with word embeddings (using distributional word and phrase representations and mapping between distributional source and target spaces with RBVs); CCG parsing & tagging with RNNs.

Semantic CCG Parsing Zettlemoyer (2005, 2007), Bos (2008), Kwiatkowski (2010, 2013) Krishnamurthy (2012), Lewis (2013a,b) Marcel proved completeness NP : marcel (S\NP)/NP : λy.λx.proved(x,y) NP : completeness S\NP : λx.proved(x, completeness) S : proved(marcel, completeness) Combinatory Categorial Grammar (CCG; Steedman 2000)

How is this useful? User query → (semantic) parsing → knowledge base → answer, e.g. Marcel proved completeness ⇒ S : proved(marcel, completeness). Parse failures and lexical ambiguity are a major source of errors in semantic parsing (Kwiatkowski 2013).

Integrated parsing & tagging with belief propagation, dual decomposition and softmax-margin training (Auli & Lopez, ACL 2011a,b; Auli & Lopez, EMNLP 2011): p_{i,j} = (1/Z) f_{i,j} e_{i,j} b_{i,j} o_{i,j}; parsing factor computed with inside-outside, supertagging factor with forward-backward. Example: Marcel proved completeness.

Integrated parsing & tagging results (Auli & Lopez, EMNLP 2011): F-measure loss for the parsing sub-model (+DecF1), Hamming loss for the supertagging sub-model (+Tagger), belief propagation for inference; best CCG parsing results to date. [Bar chart: labelled F-measure for C&C 07, Petrov-I5 (Fowler & Penn 2010), Xu 14, Belief Propagation, +F1 Loss; values range from 85.4 to 87.2, a gain of +1.5.]

Recurrent nets for CCG supertagging & parsing (with Wenduan Xu). Example: Marcel/NP proved/S\NP/NP.

Model                        Tagging   Parsing
CRF (Clark & Curran 07)       91.5      85.3
FFN (Lewis & Steedman, 14)    91.5      86.0
RNN                           92.3      86.5

Summary: two RNN translation models; neural nets help most where discrete models are sparse; a task-specific objective gives the best performance. Next: better modeling of the source side, e.g., bi-directional RNNs, different architectures.