Machine Translation 10: Advanced Neural Machine Translation Architectures
Rico Sennrich, University of Edinburgh (MT 2018)

Today's Lecture

so far we discussed:
- RNNs as encoder and decoder
- some architecture variants: RNN vs. GRU vs. LSTM
- attention mechanisms

today:
- some important components of neural MT architectures: dropout, layer normalization, deep networks
- non-recurrent architectures: convolutional networks, self-attentional networks

1 General Architecture Variants
2 NMT with Convolutional Neural Networks
3 NMT with Self-Attention

Dropout [Srivastava et al., 2014]

(Figure 1 from Srivastava et al.: left, a standard neural net with 2 hidden layers; right, an example of a thinned net produced by applying dropout to the network on the left, with crossed units dropped.)

- wacky idea: randomly set hidden states to 0 during training
- motivation: prevent "co-adaptation" of hidden units
- better generalization, less overfitting

Dropout [Srivastava et al., 2014]

implementation:
- for training, multiply the layer with a "dropout mask"
- randomly sample a new mask for each layer and training example
- hyperparameter p: probability that a state is retained (some tools use p as the probability that a state is dropped)
- at test time, don't apply dropout, but re-scale the layer with p to ensure the expected output is the same (you can also re-scale by 1/p at training time instead)
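
A minimal NumPy sketch of this recipe (illustrative names and toy shapes, not code from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(layer, p=0.8):
    """Training: multiply the layer with a freshly sampled binary dropout mask (p = retention probability)."""
    mask = (rng.random(layer.shape) < p).astype(layer.dtype)
    return layer * mask

def dropout_test(layer, p=0.8):
    """Test time: no mask, but re-scale by p so the expected output matches training."""
    return layer * p

def dropout_train_inverted(layer, p=0.8):
    """Alternative ("inverted" dropout): re-scale by 1/p at training time, leave the layer untouched at test time."""
    mask = (rng.random(layer.shape) < p).astype(layer.dtype)
    return layer * mask / p

h = rng.standard_normal((4, 5))   # a toy hidden layer (4 examples, 5 units)
print(dropout_train(h))
print(dropout_test(h))
```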

Dropout and RNNs

(Figure 1 from Zaremba et al.: an LSTM memory cell with input, forget, output and input modulation gates.)

- for recurrent connections, applying dropout at every time step blocks information flow
- solution 1: only apply dropout to feedforward connections [Zaremba et al., 2014]

(Figure 2 from Zaremba et al.: regularized multilayer RNN; dashed arrows indicate connections where dropout is applied, solid lines connections where it is not.)

The following equation describes it more precisely, where D is the dropout operator that sets a random subset of its argument to zero:

[i; f; o; g] = [sigm; sigm; sigm; tanh]( T_{2n,4n} [D(h_t^{l-1}); h_{t-1}^l] )
c_t^l = f ⊙ c_{t-1}^l + i ⊙ g
h_t^l = o ⊙ tanh(c_t^l)

Dropout and RNNs

- for recurrent connections, applying dropout at every time step blocks information flow
- solution 2: variational dropout: use the same dropout mask at each time step [Gal, 2015]
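
A toy NumPy sketch contrasting the two solutions on a single unrolled RNN (illustrative only; the tanh cell, random weights and p = 0.8 are assumptions, not taken from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4                       # time steps, hidden size
W = rng.standard_normal((d, d))   # input-to-hidden weights
U = rng.standard_normal((d, d))   # recurrent weights
xs = rng.standard_normal((T, d))
p = 0.8                           # probability that a unit is retained

def run_rnn(xs, variational=False):
    h = np.zeros(d)
    # masks sampled once per sequence (only used in the variational case)
    in_mask = (rng.random(d) < p) / p
    rec_mask = (rng.random(d) < p) / p
    for x in xs:
        if variational:
            # solution 2: the same masks reused at every time step,
            # on both input and recurrent connections [Gal, 2015]
            h = np.tanh(W @ (x * in_mask) + U @ (h * rec_mask))
        else:
            # solution 1: a fresh mask at every time step,
            # feedforward (input) connection only [Zaremba et al., 2014]
            mask = (rng.random(d) < p) / p
            h = np.tanh(W @ (x * mask) + U @ h)
    return h

print(run_rnn(xs, variational=False))
print(run_rnn(xs, variational=True))
```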

Layer Normalization

- if the input distribution to a NN layer changes, its parameters need to adapt to this covariate shift
- especially bad: the RNN state grows/shrinks as we go through the sequence
- normalization of layers reduces this shift and improves training stability
- re-center and re-scale each layer a (with H units); two learned parameters, a gain g and a bias b, restore the original representation power:

μ = (1/H) Σ_{i=1}^H a_i
σ = sqrt( (1/H) Σ_{i=1}^H (a_i − μ)^2 )
h = (g / σ) ⊙ (a − μ) + b
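
A direct NumPy transcription of these formulas (a sketch; the small eps for numerical stability is an addition that is not on the slide):

```python
import numpy as np

def layer_norm(a, g, b, eps=1e-6):
    """Re-center and re-scale the activation vector a (H units), then apply gain g and bias b."""
    mu = a.mean(axis=-1, keepdims=True)                             # mu = (1/H) sum_i a_i
    sigma = np.sqrt(((a - mu) ** 2).mean(axis=-1, keepdims=True))   # sigma = sqrt((1/H) sum_i (a_i - mu)^2)
    return g / (sigma + eps) * (a - mu) + b                         # h = (g / sigma) * (a - mu) + b

H = 8
a = np.random.randn(2, H)         # two example activation vectors
g, b = np.ones(H), np.zeros(H)    # learned gain and bias, here at typical initial values
print(layer_norm(a, g, b))
```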

Deep Networks

- increasing model depth often increases model performance
- example: stacked RNN:

h_{i,1} = g(U_1 h_{i-1,1} + W_1 x_i)
h_{i,2} = g(U_2 h_{i-1,2} + W_2 h_{i,1})
h_{i,3} = g(U_3 h_{i-1,3} + W_3 h_{i,2})
...

Deep Networks

- often necessary to combat the vanishing gradient: residual connections between layers:

h_{i,1} = g(U_1 h_{i-1,1} + W_1 x_i)
h_{i,2} = g(U_2 h_{i-1,2} + W_2 h_{i,1}) + h_{i,1}
h_{i,3} = g(U_3 h_{i-1,3} + W_3 h_{i,2}) + h_{i,2}
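
A minimal sketch of the stacked RNN with residual connections (NumPy; the tanh nonlinearity for g and the random weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, L = 5, 4, 3                                      # time steps, hidden size, number of layers
Ws = [rng.standard_normal((d, d)) for _ in range(L)]   # W_k: weights on the input from the layer below
Us = [rng.standard_normal((d, d)) for _ in range(L)]   # U_k: recurrent weights of layer k
xs = rng.standard_normal((T, d))

def deep_rnn(xs, residual=True):
    h = np.zeros((L, d))               # h[k] holds h_{i-1, k+1}, the previous state of layer k+1
    for x in xs:
        below = x                      # input to the first layer is x_i
        for k in range(L):
            new = np.tanh(Us[k] @ h[k] + Ws[k] @ below)
            if residual and k > 0:
                new = new + below      # residual connection: add h_{i,k} from the layer below
            h[k] = new
            below = h[k]
    return h[-1]                       # top-layer state after the last time step

print(deep_rnn(xs, residual=True))
```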

Layer Normalization and Deep Models: Results from UEDIN@WMT17 (newstest2017)

system                 CS→EN  DE→EN  LV→EN  RU→EN  TR→EN  ZH→EN
baseline                27.5   32.0   16.4   31.3   19.7   21.7
+layer normalization    28.2   32.1   17.0   32.3   18.8   22.5
+deep model             28.9   33.5   16.6   32.7   20.6   22.9

- layer normalization and deep models generally improve quality
- layer normalization also speeds up convergence when training (fewer updates needed)
- dropout used for the low-resource system (TR→EN)

2 NMT with Convolutional Neural Networks

Convolutional Networks

- core idea: rather than using a fully connected matrix between two layers, repeatedly compute the dot product with a small filter (or kernel)
- illustration: 2d convolution with a 3x3 kernel

https://cambridgespark.com/content/tutorials/convolutional-neural-networks-with-keras/index.html

Convolutional Networks

- when working with sequences, we often use 1d convolutions
- illustration: 1d convolution with a width-3 kernel

https://www.slideshare.net/xavigiro/recurrent-neural-networks-2-d2l3-deep-learning-for-speech-and-language-upc-2017
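
A sketch of a 1d convolution with a width-3 kernel over a sequence of word vectors (NumPy; shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_out, k = 7, 4, 6, 3              # sequence length, input/output dims, kernel width
X = rng.standard_normal((T, d_in))          # one embedding per position
K = rng.standard_normal((k, d_in, d_out))   # the kernel, shared across all positions

def conv1d(X, K):
    k = K.shape[0]
    out = []
    for t in range(X.shape[0] - k + 1):     # slide the kernel over every width-k window
        window = X[t:t + k]                               # (k, d_in)
        out.append(np.einsum('kd,kdo->o', window, K))     # dot product with the kernel
    return np.stack(out)                    # (T - k + 1, d_out): shorter than the input unless we pad

print(conv1d(X, K).shape)                   # (5, 6)
```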

Convolutional Networks

(this is similar to how we obtained the hidden state for the n-gram LM) [Vaswani et al., 2013]

Convolutional Neural Machine Translation

- the convolutional encoder actually predates the RNN encoder
- architecture of [Kalchbrenner and Blunsom, 2013], as illustrated in P. Koehn, Neural Machine Translation

(Figure 13.42 from Koehn: refinement of the convolutional neural network model; input word embeddings pass through encoding layers, a transfer layer and decoding layers to the selected output word; convolutions do not result in a single sentence embedding but in a sequence.)

Convolutional Neural Machine Translation with Attention

- to keep the representation size constant, use padding
- similar variable-size representation as the RNN encoder
- the kernel can be applied to all windows in parallel
- architecture of [Gehring et al., 2017], as illustrated in P. Koehn, Neural Machine Translation

(Figure 13.43 from Koehn: encoder using stacked convolutional layers over padded input word embeddings; any number of layers may be used.)
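
A toy sketch of a padded, stacked convolutional encoder in this spirit (NumPy; not the Gehring et al. implementation, and it omits gating, residual connections and position embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, k, L = 7, 4, 3, 3                      # sequence length, hidden size, kernel width, layers
X = rng.standard_normal((T, d))              # input word embeddings
kernels = [rng.standard_normal((k, d, d)) for _ in range(L)]

def conv_layer(H, K):
    pad = (K.shape[0] - 1) // 2
    Hp = np.vstack([np.zeros((pad, H.shape[1])), H, np.zeros((pad, H.shape[1]))])  # zero-padding keeps length T
    # every window can be computed independently, hence in parallel
    return np.stack([np.tanh(np.einsum('kd,kdo->o', Hp[t:t + K.shape[0]], K))
                     for t in range(H.shape[0])])

H = X
for K in kernels:                            # any number of stacked layers
    H = conv_layer(H, K)
print(H.shape)                               # (7, 4): a variable-size representation, like an RNN encoder
```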

Convolutional Neural Machine Translation with Attention

- use your favourite attention mechanism to obtain the input context
- in the decoder, information from future tokens is masked during training
- the effective context window depends on network depth and kernel size
- architecture of [Gehring et al., 2017], as illustrated in P. Koehn, Neural Machine Translation

(Figure 13.44 from Koehn: decoder in a convolutional neural network with attention; the decoder state is computed as a sequence of convolutional layers over the already predicted output words, and each convolutional state is also informed by the input context computed from the input sentence and attention.)
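
The masking of future tokens can be implemented by padding only on the left, so that the state at position t only sees already predicted words up to t. A NumPy sketch (illustrative; it again omits the attention and gating of the real architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, k = 6, 4, 3
Y = rng.standard_normal((T, d))              # embeddings of the already predicted output words
K = rng.standard_normal((k, d, d))

def causal_conv(Y, K):
    k = K.shape[0]
    Yp = np.vstack([np.zeros((k - 1, Y.shape[1])), Y])   # left-padding only: the future is masked
    return np.stack([np.einsum('kd,kdo->o', Yp[t:t + k], K) for t in range(Y.shape[0])])

S = causal_conv(Y, K)                        # decoder states; S[t] depends on Y[:t+1] only
# stacking more layers (and/or using wider kernels) enlarges the effective context window
print(S.shape)                               # (6, 4)
```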

Convolutional Neural Machine Translation (ByteNet)

architecture of [Kalchbrenner et al., 2016]

3 NMT with Self-Attention

Attention Is All You Need [Vaswani et al., 2017]

- same criticism of the recurrent architecture: recurrent computations cannot be parallelized
- core idea: instead of a fixed-width convolutional filter, use attention

Self-Attention
- there are different flavours of self-attention
- here: attend over the previous layer of a deep network

(illustration: convolution vs. self-attention, from https://nlp.stanford.edu/seminar/details/lkaiser.pdf)

Attention Is All You Need [Vaswani et al., 2017]

Transformer architecture:
- stack of N self-attention layers
- self-attention in the decoder is masked
- the decoder also attends to encoder states
- Add & Norm: residual connection and layer normalization; the output of each sub-layer is LayerNorm(x + Sublayer(x))

(Figure 1 from Vaswani et al.: the Transformer model architecture.)
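
The Add & Norm pattern, LayerNorm(x + Sublayer(x)), in a few lines of NumPy (a sketch; the simplified layer norm without gain/bias and the stand-in feed-forward sub-layer are assumptions for illustration):

```python
import numpy as np

def layer_norm(a, eps=1e-6):
    # simplified layer normalization: gain and bias omitted
    mu = a.mean(axis=-1, keepdims=True)
    sigma = a.std(axis=-1, keepdims=True)
    return (a - mu) / (sigma + eps)

def add_and_norm(x, sublayer):
    # residual connection around the sub-layer, followed by layer normalization
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))                  # 5 positions, model dimension 8
W = rng.standard_normal((8, 8))
ffn = lambda h: np.maximum(0, h @ W)             # stand-in for a position-wise feed-forward sub-layer
print(add_and_norm(x, ffn).shape)                # (5, 8)
```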

Multi-Head Attention

basic attention mechanism in AIAYN: Scaled Dot-Product Attention

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

- query Q is the decoder/encoder state (for attention/self-attention)
- key K and value V are encoder hidden states
- multi-head attention: use h parallel attention mechanisms with low-dimensional, learned projections of Q, K and V

(figures from Vaswani et al.: Scaled Dot-Product Attention; Multi-Head Attention)
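
A NumPy sketch of scaled dot-product attention and a small multi-head wrapper with learned low-dimensional projections (dimensions and the output projection W_O are illustrative; masking is omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V   # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

def multi_head(Q, K, V, WQ, WK, WV, WO):
    # h parallel heads, each with its own low-dimensional learned projections of Q, K and V
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO   # concatenate heads, then project back

rng = np.random.default_rng(0)
T, d, h, d_head = 5, 16, 4, 4
Q = K = V = rng.standard_normal((T, d))          # self-attention: queries, keys, values from the same states
WQ = [rng.standard_normal((d, d_head)) for _ in range(h)]
WK = [rng.standard_normal((d, d_head)) for _ in range(h)]
WV = [rng.standard_normal((d, d_head)) for _ in range(h)]
WO = rng.standard_normal((h * d_head, d))
print(multi_head(Q, K, V, WQ, WK, WV, WO).shape) # (5, 16)
```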

Multi-Head Attention

motivation for multi-head attention: different heads can attend to different states

(Figure 3 from Vaswani et al.: an example of the attention mechanism following long-distance dependencies in the encoder self-attention in layer 5 of 6; many of the attention heads attend to a distant dependency of the verb "making", completing the phrase "making ... more difficult"; attention is shown only for the word "making", and different colors represent different heads.)

Comparison

empirical comparison is difficult:
- some components could be mix-and-matched: choice of attention mechanism, choice of positional encoding
- hyperparameters and training tricks
- different test sets and/or evaluation scripts

Comparison

SOCKEYE [Hieber et al., 2017] (EN-DE; newstest2017):

system          BLEU
deep LSTM       25.6
Convolutional   24.6
Transformer     27.5

Marian (EN-DE; newstest2016):

system          BLEU
deep LSTM       32.6
Transformer     33.4

https://github.com/marian-nmt/marian-dev/issues/116#issuecomment-340212787

Empiricism vs. Theory

- our theoretical understanding of neural networks lags behind empirical progress
- there are some theoretical arguments why these architectures work well (e.g. self-attention reduces the distance between words in the network)...
- ...but these are very speculative

Further Reading

- required reading: Koehn, 13.7
- consider the original literature cited on the relevant slides

Bibliography I

Gal, Y. (2015). A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. ArXiv e-prints.

Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. (2017). Convolutional Sequence to Sequence Learning. CoRR, abs/1705.03122.

Hieber, F., Domhan, T., Denkowski, M., Vilar, D., Sokolov, A., Clifton, A., and Post, M. (2017). Sockeye: A Toolkit for Neural Machine Translation. ArXiv e-prints.

Kalchbrenner, N. and Blunsom, P. (2013). Recurrent Continuous Translation Models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle. Association for Computational Linguistics.

Kalchbrenner, N., Espeholt, L., Simonyan, K., van den Oord, A., Graves, A., and Kavukcuoglu, K. (2016). Neural Machine Translation in Linear Time. ArXiv e-prints.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15:1929-1958.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention Is All You Need. CoRR, abs/1706.03762.

Bibliography II

Vaswani, A., Zhao, Y., Fossum, V., and Chiang, D. (2013). Decoding with Large-Scale Neural Language Models Improves Translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, pages 1387-1392, Seattle, Washington, USA.

Zaremba, W., Sutskever, I., and Vinyals, O. (2014). Recurrent Neural Network Regularization. CoRR, abs/1409.2329.