Factored Neural Machine Translation Architectures


Factored Neural Machine Translation Architectures
Mercedes García-Martínez, Loïc Barrault and Fethi Bougares
LIUM - University of Le Mans
IWSLT 2016, December 8-9, 2016
1 / 19

Neural Machine Translation
Sequence-to-sequence translation implemented with an encoder-decoder RNN.
Problems:
- Softmax normalization has a high computational cost (illustrated below)
- Dealing with unknown words
2 / 19
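To make the first problem concrete, here is a minimal NumPy sketch (not from the talk) of the output layer: every decoding step projects the decoder state onto the full target vocabulary and normalises over all of it, so the cost grows linearly with |V|.

```python
import numpy as np

d_hidden, vocab_size = 1000, 30000                  # illustrative sizes, matching the experiments later
rng = np.random.default_rng(0)

W_out = rng.standard_normal((d_hidden, vocab_size)) * 0.01  # output projection matrix
s_j = rng.standard_normal(d_hidden)                         # decoder state at step j

logits = s_j @ W_out                    # O(d_hidden * |V|) multiply at every target step
probs = np.exp(logits - logits.max())   # softmax normalised over the whole vocabulary
probs /= probs.sum()
print(probs.shape)                      # (30000,)
```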

Neural Machine Translation / Solutions
- Short-list: the most frequent words are kept and the rest are mapped to unknown.
  Pros: simple. Cons: models only a few words, generates many UNK.
- Structured Output Layer LM / SOUL (short-list + subclasses) [Le, 2011].
  Pros: manages a larger vocabulary. Cons: more complex architecture.
- Select the batches so that the softmax is applied to only a subset of the output layer [Jean, 2015].
  Pros: the architecture remains the same. Cons: mismatch between train and test modes.
- Subword units extracted with BPE [Sennrich, 2015] or characters [Chung, 2016] (sketched below).
  Pros: simple, no other resources required, unseen words can be generated, no UNK. Cons: less control over the output, incorrect words can be generated.
BPE is the main method used in recent evaluation campaigns (WMT, IWSLT).
3 / 19
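For illustration only, here is a toy version of the BPE learning procedure from [Sennrich, 2015] (not the talk's code): start from characters and repeatedly merge the most frequent adjacent symbol pair in a word-frequency dictionary.

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent symbol pairs over a {'l o w </w>': freq} vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair by its concatenation."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):                     # number of merge operations is a hyper-parameter
    pairs = get_pair_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(best)                         # the learned merges, most frequent first
```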

Motivation
None of the methods presented before includes linguistics to solve this problem. We propose to use a factored word representation:
- Based on the way humans learn how to construct inflected word forms
- Handling a larger effective vocabulary using the same output size
- Generation of words that did not appear in the vocabulary
Ex. EN: be -> { am, is, are, was, were, been, being }
Ex. FR: être -> { suis, es, est, sommes, êtes, sont, serai, ..., étais, ..., été, ... }
Related work on factors:
- Factored NLM, factors in addition to words [Alexandrescu, 2006; Wu, 2012; Niehues, 2016]
- Additional information on the input side in NMT [Sennrich, 2016]
- Our method uses factors (not words) at the output side of the NN
4 / 19

Overview
1. Factored NMT architectures
2. Experiments on TED talks, IWSLT'15
3. Qualitative analysis
4. Conclusions and future work
5 / 19

Factored NMT approach
Standard NMT predicts surface words; Factored NMT predicts a lemma and its factors.
Morphological analyser: word -> lemma + factors (POS + tense + person + gender + number)
Example: devient -> devenir + verb + present + 3rd person + # + singular
The surface form is generated from the lemma and the factors.
Same output size for the lemmas plus a small output for the factors -> larger vocabulary.
6 / 19
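A minimal sketch of the factorisation and of the inverse surface-form generation step, using a hypothetical hand-written lookup table in place of the real morphological analyser and generator used in the talk:

```python
# Hypothetical toy analyser/generator; the actual system uses a real morphological toolkit.
ANALYSES = {
    "devient": ("devenir", "verb+present+3rd+#+singular"),
    "sommes":  ("être",    "verb+present+1st+#+plural"),
}
GENERATION = {(lemma, factors): word for word, (lemma, factors) in ANALYSES.items()}

def analyse(word):
    """Surface form -> (lemma, factors)."""
    return ANALYSES[word]

def generate(lemma, factors):
    """(lemma, factors) -> surface form."""
    return GENERATION[(lemma, factors)]

lemma, factors = analyse("devient")
print(lemma, factors)               # devenir verb+present+3rd+#+singular
print(generate(lemma, factors))     # devient
```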

Base NMT model
[Figure: encoder-decoder with attention translating "A long time ago in a galaxy far, far away" into "Il y a bien longtemps dans une galaxie lointaine, très lointaine"; the encoder produces annotations h_i, the attention mechanism scores them against the decoder state, and the decoder generates the target words.]
Attention mechanism: $e_{ij} = v_a^\top \tanh(W_a s_{j-1} + U_a h_i)$, $\alpha_{ij} = \exp(e_{ij}) / \sum_{i'} \exp(e_{i'j})$
NMT by jointly learning to align and translate [Bahdanau et al., 2014], with a conditional GRU decoder (2015 DL4MT Winter School).
Code: https://github.com/nyu-dl/dl4mt-tutorial
7 / 19
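A NumPy sketch of the additive attention scores above, assuming encoder annotations h_1..h_N and the previous decoder state s_{j-1}; the weight names W_a, U_a, v_a follow the slide, and the dimensions are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_enc, d_dec, d_att = 6, 2000, 1000, 1000     # source length and illustrative dimensions

H = rng.standard_normal((N, d_enc))              # encoder annotations h_1 .. h_N
s_prev = rng.standard_normal(d_dec)              # previous decoder state s_{j-1}
W_a = rng.standard_normal((d_dec, d_att)) * 0.01
U_a = rng.standard_normal((d_enc, d_att)) * 0.01
v_a = rng.standard_normal(d_att) * 0.01

e = np.tanh(s_prev @ W_a + H @ U_a) @ v_a          # e_ij = v_a^T tanh(W_a s_{j-1} + U_a h_i)
alpha = np.exp(e - e.max()); alpha /= alpha.sum()  # alpha_ij = softmax over source positions
c_j = alpha @ H                                    # context vector fed to the conditional GRU

print(alpha.round(3), c_j.shape)
```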

Our Factored NMT model
Base NMT decoder extended to produce 2 outputs.
[Figure: at each decoder step, the decoder state feeds two output layers, one for the lemma and one for the factors; their embeddings are fed back (fb) to the next step.]
- 2 symbols generated synchronously: (1) the lemma and (2) the factors
- Factors sequence length = lemma sequence length
- For comparison: multiway multilingual setup from [Firat, 2016]
8 / 19
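A sketch of the step that differs from the base decoder: the same decoder state feeds two softmax layers, one over the lemma vocabulary and one over the small factors vocabulary, and both symbols are emitted at the same timestep (NumPy, with sizes taken from the experiments; this is a simplification, not the exact model).

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(0)
d_dec, n_lemmas, n_factors = 1000, 30000, 142    # sizes used in the experiments

s_j = rng.standard_normal(d_dec)                 # decoder state at step j
W_lemma = rng.standard_normal((d_dec, n_lemmas)) * 0.01
W_factors = rng.standard_normal((d_dec, n_factors)) * 0.01

p_lemma = softmax(s_j @ W_lemma)                 # output 1: distribution over lemmas
p_factors = softmax(s_j @ W_factors)             # output 2: distribution over factors

lemma_id, factors_id = p_lemma.argmax(), p_factors.argmax()  # emitted synchronously
print(lemma_id, factors_id)
```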

Experiments
EN-FR, IWSLT'15 corpus (after data selection and filtering out long sentences):
  Sentences: 2M
  EN unique words: 147K
  FR unique words: 266K
  Word input size: 30K
  Lemma/word output size: 30K
  Factors output size: 142
  -> effective FNMT word vocabulary: 172K
Neural network settings:
  RNN dim: 1000
  Embedding dim: 620
  Minibatch size: 80
  Gradient clipping: 1
  Weight initialization: Xavier [Glorot, 2010]
  Learning rate scheme: Adadelta
  Beam size: 12
9 / 19
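For convenience, the same settings written out as a plain configuration dictionary (a sketch of how one might record them, not the talk's actual training script):

```python
config = {
    "rnn_dim": 1000,
    "embedding_dim": 620,
    "minibatch_size": 80,
    "gradient_clipping": 1,
    "weight_init": "xavier",          # [Glorot, 2010]
    "optimizer": "adadelta",
    "beam_size": 12,
    "word_input_vocab": 30_000,
    "lemma_output_vocab": 30_000,
    "factors_output_vocab": 142,      # covers an effective word vocabulary of 172K
}
```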


First Results
Model         Feedback        %MET.   %BLEU word   %BLEU lemma   %BLEU factors   #UNK
FNMT          Lemma           56.96   34.56        37.44         42.44            798
NMT           Words           55.87   34.69        35.10         40.79           1841
BPE           Subwords        55.52   34.34        34.64         40.25              0
Multilingual  Lemma/Factors   52.48   28.70        37.72         45.81            871
Chain NMT     Lemma/Factors   56.00   33.82        37.38         90.54            773
- FNMT reduces #UNK compared to NMT
- BPE does not generate UNK, but no impact on BLEU
- FNMT can apply UNK replacement to improve results (see the sketch below)
- Multilingual obtains higher BLEU on lemmas and factors but lower on words
- Chain model: low results can be explained by the separate training of the two systems
- Could we expect a better factor-level BLEU? There are only 142 factor tokens
10 / 19
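The UNK-replacement step mentioned above is not detailed in the talk; a common recipe (in the spirit of [Jean, 2015]) replaces each generated UNK with the source word that received the highest attention weight at that step, optionally translated through a bilingual lexicon. A hedged sketch:

```python
import numpy as np

def replace_unk(target_tokens, source_tokens, attention, lexicon=None, unk="UNK"):
    """attention[j, i] is the weight on source position i when target position j was produced."""
    out = []
    for j, tok in enumerate(target_tokens):
        if tok == unk:
            src = source_tokens[int(np.argmax(attention[j]))]
            tok = (lexicon or {}).get(src, src)      # translate via lexicon, else copy
        out.append(tok)
    return out

src = ["santa", "marta", "in", "north", "colombia"]
hyp = ["santa", "UNK", "dans", "le", "nord", "de", "la", "colombie"]
att = np.full((len(hyp), len(src)), 0.05)            # toy attention matrix
att[1, 1] = 0.8                                      # the UNK attended mostly to "marta"
print(" ".join(replace_unk(hyp, src, att)))          # santa marta dans le nord de la colombie
```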

Changing the feedback
[Figure: the feedback (fb) to the decoder can be built from the lemma embedding, the factors embedding, or a combination of both.]
Model   Feedback        %BLEU word   %BLEU lemma   %BLEU factors
NMT     Words           34.69        35.10         40.79
FNMT    Lemma           34.56        37.44         42.44
FNMT    Factors         31.49        34.05         44.73
FNMT    Sum             34.34        37.03         44.16
FNMT    Concatenation   34.58        37.32         44.33
- The lemma embedding contains more information about the generated target word
- No big differences in word-level BLEU
- Concatenation is good at the lemma and factors level, BUT no impact on word-level BLEU
11 / 19

Dependency model
[Figure: two variants: lemma dependency, where the factors output is conditioned on the lemma, and factors dependency, where the lemma output is conditioned on the factors.]
Model   Dependency     %BLEU word   %BLEU lemma   %BLEU factors
NMT     -              34.69        35.10         40.79
FNMT    -              34.56        37.44         42.44
FNMT    prev. lemma    34.34        37.39         42.33
FNMT    curr. lemma    34.62        37.30         43.36
FNMT    prev. factors  34.72        37.56         43.09
12 / 19
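A sketch of the "curr. lemma" dependency variant, assuming the factors scores are computed from the decoder state concatenated with the embedding of the lemma predicted at the same step (the names and the exact conditioning are illustrative, not the talk's equations):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(0)
d_dec, d_emb, n_lemmas, n_factors = 1000, 620, 30000, 142

s_j = rng.standard_normal(d_dec)                          # decoder state at step j
E_lemma = rng.standard_normal((n_lemmas, d_emb)) * 0.01   # lemma target embeddings
W_factors = rng.standard_normal((d_dec + d_emb, n_factors)) * 0.01

lemma_id = 42                                    # assume this lemma was just predicted
cond = np.concatenate([s_j, E_lemma[lemma_id]])  # condition the factors on the current lemma
p_factors = softmax(cond @ W_factors)
print(int(p_factors.argmax()), p_factors.shape)
```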

Results using different sentence lengths
[Plot: comparison of the systems (NMT vs. FNMT) according to the maximum source sentence length; x-axis: interval of source sentence length (10-100), y-axis: %BLEU (18-38).]
- FNMT helps when translating sentences longer than 80 words
- Might be due to less sparsity in the lemma and factors space
13 / 19

Qualitative analysis. FNMT vs. NMT
1. Src  ... set of adaptive choices that our lineage made ... (len=90)
   Ref  ... de choix adaptés établis par notre lignée ...
   NMT  ... de choix UNK que notre UNK a fait ...
   FNMT ... de choix adaptatifs que notre lignée a fait ...
2. Src  ... enzymes that repair them and put them together (len=23)
   Ref  ... enzymes qui les réparent et les assemblent.
   NMT  ... enzymes qui les UNK et les UNK.
   FNMT ... enzymes qui les réparent et les mettent ensemble.
3. Src  ... santa marta in north colombia (len=26)
   Ref  ... santa marta au nord de la colombie.
   NMT  ... santa UNK dans le nord de la colombie.
   FNMT ... santa marta dans le nord de la colombie.
- FNMT generates fewer UNK ("adaptés" and "lignée" are in the NMT target vocabulary)
- %BLEU penalizes some correct translations that are not the same as the reference
- "réparent" and "marta" are not included in the NMT vocabulary
14 / 19

Qualitative analysis. FNMT vs. FNMT with current lemma dependency
Src (W):  no one knows what the hell we do
Ref (W):  personne ne sait ce que nous faisons.
FNMT:
  W: personne ne sait ce qu être l enfer.
  L: personne ne savoir ce qu être l enfer.
  F: pro-s advn v-p-3-s prep prorel cln-3-s det nc-m-s poncts
FNMT with lemma dependency:
  W: personne ne sait ce que nous faisons.
  L: personne ne savoir ce que nous faire.
  F: nc-f-s advn v-p-3-s det prorel cln-1-p v-p-1-p poncts
Dependency improves factors prediction.
15 / 19

Qualitative analysis. FNMT vs. BPE
Subwords (BPE) versus factors:
Src:      we in medicine, I think, are baffled
Ref:      Je pense que en médecine nous sommes dépassés
BPE (w):  nous, en médecine, je pense, sommes bafés
BPE:      nous, en médecine, je pense, sommes b+af+és
FNMT (w): nous, en médecine, je pense, sont déconcertés
FNMT (l): lui, en médecine, je penser, être déconcerter
FNMT (f): pro-1-p-l pct-l prep-l nc-f-s-l pct-l cln-1-s-l v-p-1-s-l pct-l v-p-3-p-l vppart-k-m-p-l
- BPE translates "baffled" as "bafés", which does not exist in French
- FNMT translates "baffled" as "déconcertés"
16 / 19

Conclusions
Factored NMT based on a priori linguistic knowledge.
Cons:
- requires up-to-date linguistic resources (difficult for low-resource languages)
- slight increase in model complexity (architecture with 2 outputs)
Pros:
- handles a very large vocabulary (6 times bigger in our experiments)
- generates new words not included in the shortlist
- performs better than other state-of-the-art systems
- reduces the generation of #UNK tokens
- produces only correct words, compared to subwords
17 / 19

Future work
- Include factors on the input side: decrease the number of UNK in the source sequence, motivated by [Sennrich et al., 2016]
- Extend to N factors: currently the factors vocabulary is determined by the training set; increase the generalisation power of the system to unseen word forms
- Apply to highly inflected languages, e.g. Arabic
18 / 19

Thanks for your attention! Le Mans, France 19 / 19

FNMT model
[Architecture diagram: a bidirectional GRU encoder (forward and backward RNNs of dimension 1000 each) reads x_1 ... x_N and produces context vectors of dimension 2000; the attention mechanism (W_att, U_att) and a conditional GRU form the decoder; the decoder state goes through two softmax output layers (L_OL, L_OF) predicting the target lemma and the target factors, whose target embeddings provide the feedback to the decoder.]
1 / 4

Handling beam search with 2 outputs
Timestep generation:
- Sum both costs to get the 1-best translation
- 1 lemma - 1 factors: limit the length of the factors sequence to the length of the lemma sequence
- Cross product of the n-best of the 2 outputs for each produced word (sketched below)
- Limit the options to the beam size
2 / 4
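A sketch of the cross-product step described above: at one timestep, combine the n-best lemmas with the n-best factors by summing their costs (negative log-probabilities) and keep only beam_size combined hypotheses. The names and example data are made up for illustration.

```python
import heapq
from itertools import product

def combine_outputs(lemma_nbest, factors_nbest, beam_size):
    """lemma_nbest / factors_nbest: lists of (token, cost), cost = -log prob.
    Returns the beam_size best (lemma, factors, summed_cost) combinations."""
    combos = (
        (lem, fac, lem_cost + fac_cost)
        for (lem, lem_cost), (fac, fac_cost) in product(lemma_nbest, factors_nbest)
    )
    return heapq.nsmallest(beam_size, combos, key=lambda c: c[2])

lemmas = [("devenir", 0.3), ("être", 1.2), ("savoir", 2.0)]
factors = [("verb+present+3rd+#+singular", 0.1), ("verb+past+3rd+#+singular", 1.5)]
for lem, fac, cost in combine_outputs(lemmas, factors, beam_size=12):
    print(f"{cost:.2f}  {lem}  {fac}")
```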

Feedback equations
Conditional GRU (first block), with feedback fb(y_{j-1}):
$\mathrm{GRU}_1(y_{j-1}, s_{j-1}) = (1 - z_j) \odot \bar{s}_j + z_j \odot s_{j-1}$
$\bar{s}_j = \tanh\big(W\,\mathrm{fb}(y_{j-1}) + r_j \odot (U s_{j-1})\big)$
$r_j = \sigma\big(W_r\,\mathrm{fb}(y_{j-1}) + U_r s_{j-1}\big)$
$z_j = \sigma\big(W_z\,\mathrm{fb}(y_{j-1}) + U_z s_{j-1}\big)$
Feedback variants:
- Lemma: $\mathrm{fb}(y_{t-1}) = y^L_{t-1}$
- Factors: $\mathrm{fb}(y_{t-1}) = y^F_{t-1}$
- Sum: $\mathrm{fb}(y_{t-1}) = y^L_{t-1} + y^F_{t-1}$
- Linear sum: $\mathrm{fb}(y_{t-1}) = (y^L_{t-1} + y^F_{t-1})\,W_{fb}$
- Tanh sum: $\mathrm{fb}(y_{t-1}) = \tanh\big((y^L_{t-1} + y^F_{t-1})\,W_{fb}\big)$
- Linear concat: $\mathrm{fb}(y_{t-1}) = [y^L_{t-1}; y^F_{t-1}]\,W_{fb}$
- Tanh concat: $\mathrm{fb}(y_{t-1}) = \tanh\big([y^L_{t-1}; y^F_{t-1}]\,W_{fb}\big)$
$y^L_{t-1}$: lemma embedding at the previous timestep; $y^F_{t-1}$: factors embedding at the previous timestep.
3 / 4
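The feedback variants above as a small NumPy sketch (shapes are illustrative; W_fb maps the combined embedding back to the embedding dimension):

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb = 620
y_L = rng.standard_normal(d_emb)                 # lemma embedding at the previous timestep
y_F = rng.standard_normal(d_emb)                 # factors embedding at the previous timestep
W_fb_sum = rng.standard_normal((d_emb, d_emb)) * 0.01
W_fb_cat = rng.standard_normal((2 * d_emb, d_emb)) * 0.01

feedback = {
    "lemma":         y_L,
    "factors":       y_F,
    "sum":           y_L + y_F,
    "linear_sum":    (y_L + y_F) @ W_fb_sum,
    "tanh_sum":      np.tanh((y_L + y_F) @ W_fb_sum),
    "linear_concat": np.concatenate([y_L, y_F]) @ W_fb_cat,
    "tanh_concat":   np.tanh(np.concatenate([y_L, y_F]) @ W_fb_cat),
}
for name, fb in feedback.items():
    print(name, fb.shape)                        # each variant is what feeds the decoder GRU
```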

References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. CoRR, abs/1603.06147.
Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2014. On using very large target vocabulary for neural machine translation. CoRR, abs/1412.2007.
Philipp Koehn et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 177-180, Stroudsburg, PA, USA. Association for Computational Linguistics.
Hai-Son Le, Ilya Oparin, Abdel Messaoudi, Alexandre Allauzen, Jean-Luc Gauvain, and François Yvon. 2011. Large vocabulary SOUL neural network language models. In INTERSPEECH.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. CoRR, abs/1508.07909.
Rico Sennrich and Barry Haddow. 2016. Linguistic input features improve neural machine translation. arXiv preprint arXiv:1606.02892, June.
4 / 4