Analysis of techniques for coarse-to-fine decoding in neural machine translation


Analysis of techniques for coarse-to-fine decoding in neural machine translation

Soňa Galovičová

Master of Science by Research
School of Informatics
University of Edinburgh
2016

Abstract

In this report, I first briefly explain how neural machine translation models work. I introduce the issues associated with them, specifically slow decoding and the lack of recombination. I propose techniques to tackle these problems and analyze their efficiency using several metrics. First, I investigate coordinate-wise quantization. Afterwards, I describe the idea of a coarse-to-fine decoding algorithm, and I suggest and explore three different clustering techniques: locality sensitive hashing, K-means clustering and word history similarity. At the end I compare these methods against each other and evaluate their suitability for future work.

Acknowledgements

This work was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Soňa Galovičová)

Contents

1 Introduction
2 Background
  2.1 Neural models
    2.1.1 Language models
    2.1.2 Machine translation models
  2.2 Decoding
    2.2.1 Generating the probability distribution
  2.3 Byte pair encoding
  2.4 Metrics for experiment evaluation
    2.4.1 BLEU score
    2.4.2 Kullback-Leibler divergence
3 Coordinate-wise Quantization
  3.1 Motivation
  3.2 Related work
  3.3 Quantization
    3.3.1 Creating the quantization set Q
  3.4 Evaluation metrics
    3.4.1 Metrics propagating the modifications
    3.4.2 Metrics evaluating the modifications locally
  3.5 Experiments
    3.5.1 System description
    3.5.2 Data exploration
    3.5.3 Results
    3.5.4 Time experiments
  3.6 Summary
4 Clusterings Based on Neural State Similarity
  4.1 Motivation
  4.2 Related work
  4.3 Future work: Coarse-to-fine decoding
  4.4 Clusterings
    4.4.1 Locality sensitive hashing
    4.4.2 Online k-means clustering
  4.5 Evaluation metrics
  4.6 Experiments
    4.6.1 Implementation
    4.6.2 Results
  4.7 Summary
5 Clusterings Based on Word History
  5.1 Motivation
  5.2 Related work
  5.3 Clustering
  5.4 Evaluation metrics
  5.5 Experiments
    5.5.1 Results
  5.6 Comparison to the neural state clusterings
  5.7 Summary
6 Conclusion and Future Work
Bibliography

Chapter 1
Introduction

Machine translation (MT) models based on neural networks (NNs) [Ñeco and Forcada, 1996] were invented with the aim of building a single system which takes a sentence as input and outputs its translation directly, with all the subparts of the system tuned simultaneously [Bahdanau et al., 2014]. In this work, I will focus on infinite context models, which use networks such as recurrent NNs or LSTMs (Section 2.1). When searching for and scoring possible translations, these take the source sentence and the output words produced so far (which together form a history), and create a vector summarization of the parts most relevant to predicting the next word. As a result, they can capture long distance dependencies between words. This is not the case in traditional statistical MT models (e.g. phrase-based N-gram models), which use only local information to make predictions, so any effect of words further away is lost [Katz, 1987, Koehn et al., 2003].

Unfortunately, NNs are limited by their speed, both in training and in subsequent applications. For instance, they are too slow to translate the monolingual data for back translation (i.e. creating synthetic parallel data), which has been shown to be beneficial for training MT systems [Sennrich et al., 2015a]. There are 1.8 trillion words of English we want to exploit, but training the NN already takes weeks for billions of words in the training dataset.

We might also be interested in constructing more complicated systems, for example combined left-to-right and right-to-left translation models. Sennrich et al. [2016] improved their system by using a left-to-right model to produce an n-best list of possible translations, and then reranking the output with the right-to-left model (i.e. they scored the n sentences with the second model and sorted them accordingly).

However, reranking is of limited use if the n-best lists are already of low quality (e.g. because long distance dependencies were ignored). The fact that it works well suggests that we can achieve even better improvements by implementing a similar idea in decoding. If we found a decoding technique able to do a fast pass through the left-to-right model to approximately complete the sentence, we could use this to implement bidirectional neural decoding: to predict the next target word, we can use the completed sentence to directly obtain a score from the right-to-left model as well.

Another problem of pure neural models is that they can handle only a limited vocabulary, and so are not well suited for filling in rare words [Chen et al., 2015]. By using the neural scores to rerank n-best lists, it has been shown that the combination of NNs and statistical MT works better than either of them on its own [Sutskever et al., 2014, Neubig et al., 2015]. Alkhouli et al. [2015] investigated applying the RNN scores in decoding (Section 5.2), and reported that their method worked at least as well as reranking.

Unfortunately, integrating NNs into decoding is not trivial. As mentioned above, NNs are very slow to query, and so if we want to consider a large number of possible translations, scoring them all separately would slow down the decoding process greatly. Even though the resulting translation might be of higher quality, if the task requires a much longer running time, it is not very practical.

There are procedures for traditional statistical MT which allow us to consider a large number of translations without scoring every one of them. Let us define a hypothesis to be a possible partial translation, and a state to be the input to the scoring function. By definition, if multiple hypotheses have exactly the same state, it is ensured that the future predictions will be identical for them. As a result, we can use dynamic programming to merge these hypotheses, and only explore the currently highest scoring one from each such group; as the future costs are identical, we know the other ones in each group will be at most as good in the future. This is called recombination. Approximate search techniques such as beam search [Chiang, 2005] or cube pruning [Chiang, 2007] (Section 2.2) constrain us to consider a limited number of recombined hypotheses, which is linear in the sentence length. In the case of simple N-gram models, this means we actually consider exponentially many hypotheses, as all the hypotheses finishing with the same (N-1) words and covering the same source words can be merged.

On the contrary, in the infinite context NN models two hypotheses have the same state only if they represent the same word sequence so far (because the neural state is a real valued vector, often of a high dimension). Consequently, few hypotheses are actually recombined. Thus, using the same search techniques as mentioned above, we can only consider a linear number of hypotheses. Since considering more hypotheses usually increases our chances of finding the best one, this lowers the resulting translation quality.

The goal of my project is to find a way to restore the benefits of recombination in neural models. I will apply different coarse-to-fine approximations [Petrov et al., 2008] which would allow us to consider more hypotheses in the same amount of time, and measure how much approximation error they introduce. To do this, I will need to make the search more extensive by increasing the beam size (Section 2.2); this might cause the model to perform worse, as with the perplexity objective, scores actually go down with better search. This is a model problem that Andor et al. [2016] are trying to fix.

In this thesis, I first introduce the relevant background (Chapter 2). I analyze coordinate-wise quantization of the neural states (Chapter 3), which can make decoding faster by using cheaper operations within the NNs. I show that using only 5 bits (25 distinct values) is sufficient to restore the quality of the unmodified model. I also investigate several ways to bundle multiple hypotheses together and score them simultaneously: two clusterings based on the similarity of the neural state vectors, locality sensitive hashing and online K-means clustering (Chapter 4), and another based on the word history (Chapter 5). I present results showing that LSH is not suitable for future work, but both word history and K-means provide clusterings that look promising for use in coarse-to-fine decoding. I conclude in Chapter 6, comparing the methods to each other, evaluating their efficiency, and proposing possible future extensions.

Chapter 2
Background

2.1 Neural models

2.1.1 Language models

The simplest neural language model (LM) is based on feed-forward NNs [Bengio et al., 2003]. It is an N-gram model: it takes vector representations of the last (N-1) words as input, concatenates them, and propagates the resulting vector through a feed-forward NN. This produces a probability distribution over the next word.

This does not solve the problem of finite context models mentioned in Chapter 1, namely the inability to capture long distance dependencies between words. For this, we can use an LM based on recurrent neural networks (RNNLM) [Mikolov et al., 2010]. In contrast to the above, an RNNLM represents the full history in a continuous state vector. The state vector is updated with each word added to the output sentence, and the information about any preceding word can influence the predictions for an arbitrarily long time [Boden, 2002].

Unfortunately, RNNs are hard to train because of the vanishing gradient problem [Bengio et al., 1994, Hochreiter et al., 2001]. This can be partly solved using a special type of RNN, the Long Short-Term Memory network (LSTM) [Hochreiter and Schmidhuber, 1997]. This differs from the plain RNN, but it is also an infinite context model, carrying a state vector. For the scope of this project, I will consider these two models, RNN and LSTM, to be the same one. The same techniques (Chapters 3, 4 and 5) will apply to both, although the architecture can influence their effectiveness.
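To make the recurrence concrete, here is a minimal sketch in Python/NumPy of an RNN language model step. The weight matrices, dimensions and the tanh/softmax choices are illustrative assumptions rather than the architecture of the model used later; the point is only that the whole history is summarised in one state vector, which is updated word by word and alone determines the distribution over the next word.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class TinyRNNLM:
    """Minimal RNN LM: state_t = tanh(W_x * embed(w_t) + W_h * state_{t-1})."""
    def __init__(self, vocab_size, embed_dim, state_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.E = rng.normal(0, 0.1, (vocab_size, embed_dim))    # word embeddings
        self.W_x = rng.normal(0, 0.1, (state_dim, embed_dim))   # input weights
        self.W_h = rng.normal(0, 0.1, (state_dim, state_dim))   # recurrent weights
        self.W_o = rng.normal(0, 0.1, (vocab_size, state_dim))  # output weights
        self.state_dim = state_dim

    def initial_state(self):
        return np.zeros(self.state_dim)

    def step(self, state, word_id):
        # The new state summarises the whole history, not just the last N-1 words.
        new_state = np.tanh(self.W_x @ self.E[word_id] + self.W_h @ state)
        probs = softmax(self.W_o @ new_state)   # distribution over the next word
        return new_state, probs

# Scoring the continuation of a short word-id sequence left to right:
lm = TinyRNNLM(vocab_size=1000, embed_dim=32, state_dim=64)
state, logprob = lm.initial_state(), 0.0
for current, following in zip([1, 7, 42], [7, 42, 3]):
    state, probs = lm.step(state, current)
    logprob += np.log(probs[following])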

2.1.2 Machine translation models

I will work with a left-to-right recurrent neural machine translation model, which belongs to the family of encoder-decoders (Figure 2.1) [Sutskever et al., 2014].

The encoder is an infinite context network trained on the parallel data, and its purpose is to take the source sentence and turn it into a single continuous valued vector, called the source context. In this project, I do not alter the encoding.

The decoder is another network, trained jointly with the encoder. It can be used both for scoring and for producing potential translations of the encoded source sentence. We start by plugging in the source context and the beginning of sentence symbol <s>. After this, the RNN produces a hidden state vector, which now corresponds to the sentence "<s>", and a probability distribution over the following word. Then we take the next word (either we have it, because we just want to score a hypothesis, or we choose one according to the distribution produced), and plug it back in together with the new state vector and the source context. These steps are repeated until we reach the end of the sentence.

Figure 2.1: The machine translation recurrent neural model. (a) Encoder: encodes the source sentence "This is a dog" into a source context vector, by pushing the words through the RNN one by one. (b) Decoder: scores the translation "Das ist ein Hund" (the tags <s> and </s> mark the beginning and the end of the sentence). Each score is calculated using the source context, the hidden state s_i and the last word added.

Attention

The simplest encoder-decoder MT models work as described above. However, in this setup the decoder only sees the last state from the encoder. To make it more flexible, we can use a model with attention [Bahdanau et al., 2014].

This is another NN trained together with the encoder and decoder, and its purpose is to tell the decoder how important the different source words are when predicting the next word. (This can be understood as a neural version of alignment.) More specifically, we feed the sentence to the encoder from left to right, and store the intermediate states (called h_i in Figure 2.1a). Then we do the same, but this time from right to left, obtaining a second state for each word in the sentence. We concatenate these two vectors for each word; together they form the source context. When the decoder wants to predict the next word, the attention network assigns weights to all the source words; the corresponding linear combination of the context vectors is then passed to the decoder as the aligned source context (Figure 2.2).

Figure 2.2: The attentional encoder-decoder model. The encoder (red) creates context vectors for each word. The decoder (blue) queries the attention network to get weights w_{i,j} for the words, and uses the corresponding linear combination to predict the next word.
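The following sketch (Python/NumPy) illustrates how the aligned source context is formed in one decoder step. The small feed-forward scoring network (W_a, U_a, v_a) and the toy dimensions are assumptions made for the example; the actual attention network of Bahdanau et al. [2014] has further details, so this is only meant to show how the weights turn per-word context vectors into a single aligned source context.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_weights(hidden_state, source_contexts, W_a, U_a, v_a):
    # One score per source position j, from a small feed-forward net on (hidden, h_j).
    scores = np.array([v_a @ np.tanh(W_a @ hidden_state + U_a @ h_j)
                       for h_j in source_contexts])
    return softmax(scores)                       # weights for the source words, summing to 1

def aligned_source_context(hidden_state, source_contexts, W_a, U_a, v_a):
    w = attention_weights(hidden_state, source_contexts, W_a, U_a, v_a)
    context = sum(w_j * h_j for w_j, h_j in zip(w, source_contexts))
    return w, context                            # weighted sum of the context vectors

# Toy example: 5 source positions, 8-dimensional (forward+backward) context vectors.
rng = np.random.default_rng(0)
src = [rng.normal(size=8) for _ in range(5)]     # concatenated encoder states, one per word
hidden = rng.normal(size=6)                      # current decoder hidden state
W_a = rng.normal(size=(10, 6))
U_a = rng.normal(size=(10, 8))
v_a = rng.normal(size=10)
weights, context = aligned_source_context(hidden, src, W_a, U_a, v_a)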

2.2 Decoding

For the purpose of this project, it is necessary to understand how the neural decoder works. I will explain it using the example of the decoding algorithm used in AmuNMT (https://github.com/emjotde/amunmt) [Junczys-Dowmunt et al., 2016]. This is a C++ implementation of an attentional encoder-decoder [Bahdanau et al., 2014] which I will be using for the experiments in Chapters 3, 4 and 5.

This algorithm uses a search technique called beam search [Chiang, 2005], and we have to specify its parameter, the beam size, in advance. To translate a sentence, it is first put through the encoder to obtain the source context (Section 2.1.2). A beam B, an array of current hypotheses, is initialized to contain a single hypothesis marking the beginning of a sentence. Then, until all the hypotheses end (or the translations reach the maximum length allowed; for example, this can be a constant multiple of the source sentence length), we iteratively decode one word at a time.

To decode one word, we take the neural states of the hypotheses in B and use them to produce a probability distribution over the whole vocabulary (Section 2.2.1). These probabilities are then combined with the scores of the hypotheses in B, and scores for all the possible extended hypotheses, E, are obtained. The number of elements in E is |B| x |vocabulary|. The beam size specifies how many top elements from E we carry forward and extend in the next step. Out of these we store the ones which have just finished (i.e. end with the symbol </s>), and subtract their number from the beam size for the future steps. The beam B is then replaced by the remaining hypotheses, and the procedure repeats to translate the next word. After the loop is finished, we look at the complete hypotheses that were stored during the whole process, and output the highest scoring one as the final translation (or the highest scoring n if we are interested in an n-best list).

Cube pruning

Most statistical MT models do not use beam search anymore, abandoning it in favour of other search algorithms, most commonly cube pruning [Chiang, 2007]. Instead of scoring all the hypotheses in the current beam and then choosing the top ones, it prioritizes their exploration based on the scores of the hypotheses without the last word, and the individual scores of the words to be appended. The advantage is that since we are not scoring all the hypotheses separately, we can afford to increase the beam size, and so consider a larger number of hypotheses in the same amount of time.
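A compact sketch of this beam search loop is given below (Python). The score_step function is a hypothetical stand-in for the neural scoring described in Section 2.2.1 (it takes a hypothesis state and its last word, and returns the new state plus log-probabilities over the vocabulary); GPU batching and the maximum-length rule are simplified away.

import numpy as np

EOS = 2   # assumed word id of the </s> symbol

def beam_search(initial_state, score_step, vocab_size, beam_size, max_len):
    # A hypothesis is (score, word ids, state); the stored state belongs to the
    # prefix without the last word, as in Section 2.2.1.
    beam = [(0.0, [], initial_state)]
    finished = []
    for _ in range(max_len):
        if beam_size <= 0 or not beam:
            break
        extensions = []                      # the set E, of size |B| * vocab_size
        for score, words, state in beam:
            new_state, logprobs = score_step(state, words[-1] if words else None)
            for w in range(vocab_size):
                extensions.append((score + logprobs[w], words + [w], new_state))
        extensions.sort(key=lambda h: h[0], reverse=True)
        top = extensions[:beam_size]
        finished += [h for h in top if h[1][-1] == EOS]   # store completed hypotheses
        beam = [h for h in top if h[1][-1] != EOS]
        beam_size -= len(top) - len(beam)    # shrink the beam by the number finished
    return max(finished + beam, key=lambda h: h[0])

# A dummy scorer, just to make the sketch runnable end to end.
rng = np.random.default_rng(0)
def dummy_step(state, last_word):
    logits = rng.normal(size=50)
    return state, logits - np.log(np.exp(logits).sum())

best = beam_search(None, dummy_step, vocab_size=50, beam_size=4, max_len=20)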

2.2.1 Generating the probability distribution

The diagram in Figure 2.3 shows how the probability distribution over the next word is obtained. The input is a hypothesis: it consists of a previous state (the neural state of the hypothesis without its last word) and the embedding of the last word added. These are combined by a NN to form a hidden state. Together with the source context (which is the same for all decoded words), we put the hidden state into the attention NN, which outputs the aligned source context, the weighted sum of the context vectors for all the source words. Afterwards, the hidden state and the aligned source context are combined in another NN to produce a neural state. The final step is to feed the neural state, the aligned source context and again the embedding into the last NN, which has a softmax layer producing the probability distribution over the next word. In this project, I am amending (quantizing in Chapter 3 and clustering in Chapters 4 and 5) the neural state vectors.

Figure 2.3: Assembling the neural state, and calculating the probabilities for the next word.

2.3 Byte pair encoding

One of the limitations of neural MT models is that they can only handle a fixed vocabulary; it is usually limited to 30,000-50,000 words. This makes translation of rare words problematic. A possible way to introduce an open vocabulary in word-level neural MT models is to segment the words into smaller units, and then train the model on these instead. This can be helpful especially for languages forming words by agglutination and compounding, such as German (e.g. a language school: eine Sprachschule).

In my work, models operate on subword units created using byte pair encoding (BPE) [Sennrich et al., 2015b], which is an algorithm based on the BPE compression scheme [Gage, 1994]. It starts with each word split into characters (ended by a special end-of-word character), and then iteratively merges the most frequent pairs of adjacent characters or character sequences, until the wanted vocabulary size is reached. Very frequent character n-grams or whole words are eventually merged, and so end up included in the vocabulary; rare words will always be passed to the model in their segmented form.
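A small sketch of the BPE learning loop, following the description above rather than the reference implementation of Sennrich et al. [2015b] (which adds several practical refinements): words are split into characters plus an end-of-word marker, and the most frequent adjacent pair is merged repeatedly.

from collections import Counter

def learn_bpe(word_counts, num_merges, end="</w>"):
    # Each word is a tuple of symbols, initially its characters plus the end marker.
    vocab = {tuple(word) + (end,): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the most frequent pair by the merged symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges

# Frequent words end up as single symbols; rare words stay segmented.
merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=10)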

2.4 Metrics for experiment evaluation

2.4.1 BLEU score

Bilingual evaluation understudy (BLEU) [Papineni et al., 2002] is a commonly used metric to evaluate the quality of machine translation systems. It is completely automated, and it has been shown to correlate highly with human judgment (which is expensive and time-consuming). It compares the translation generated by the system, c (the candidate translation), to a human generated reference r, and evaluates how much their N-grams overlap. The exact formula is:

BLEU = BP * exp( Σ_{n=1}^{N} w_n log p_n )

BP is the brevity penalty, which penalizes translations shorter than the reference. It is crucial to ensure that models outputting only words (phrases) which they are really sure about do not score the highest. It is calculated as:

BP = 1, if c is longer than r
BP = e^(1 - r/c), if c is shorter than or equal in length to r

N is the maximal length of n-grams used; this is usually set to 4; w_n are positive weights summing to 1; uniform weights have w_n = 1/N for all n; p_n is the n-gram precision of the full candidate translation.

2.4.2 Kullback-Leibler divergence

KL divergence, or relative entropy, is a measure of how different two probability distributions are [Kullback and Leibler, 1951]. Given two distributions P and Q on the same set X, it is defined as:

KL(P || Q) = Σ_{x ∈ X} P(x) log( P(x) / Q(x) )

The inequality KL(P || Q) ≥ 0 is always satisfied, with equality if and only if P and Q are identical on the whole of X. The value is finite if and only if Q(x) = 0 implies P(x) = 0, i.e. the support of Q is a superset of the support of P. If two distributions are similar, their KL divergence will be small. In general KL(P || Q) ≠ KL(Q || P), i.e. the divergence is not symmetrical.

In practice we often take P to be the true distribution of a random variable X, and Q to be its approximation. In this case, KL(P || Q) is the expected value of log( P(X) / Q(X) ) under the true distribution of X.
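As a small worked example with made-up numbers, the snippet below evaluates the brevity penalty and the KL divergence exactly as defined in this section.

import math

def brevity_penalty(candidate_len, reference_len):
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)

def kl_divergence(p, q):
    # KL(P || Q) = sum over x of P(x) log(P(x)/Q(x));
    # finite only if the support of Q contains the support of P.
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0.0:
            if qx == 0.0:
                return float("inf")
            total += px * math.log(px / qx)
    return total

p = [0.7, 0.2, 0.1]            # "true" distribution
q = [0.6, 0.25, 0.15]          # its approximation
print(brevity_penalty(9, 10))  # ~0.895: candidate shorter than the reference
print(kl_divergence(p, q))     # small positive number; 0 only if p == q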

Chapter 3
Coordinate-wise Quantization

3.1 Motivation

The neural state is a high-dimensional vector (1024-dimensional for the model that was used; Section 3.5.1) with each coordinate being a 32-bit float, i.e. there are approximately 2^32 different values possible. If we were able to find a smaller set of values, representative enough that the model still performs well when using only these, we could adjust the neural networks to use cheaper fixed-point integer operations during forward propagation, and thus reduce the decoding time.

3.2 Related work

There are several results about neural networks being robust to the errors associated with quantization. This suggests that we can expect approximating state vectors with quantized ones to work well.

For instance, Courbariaux et al. [2014] reduced the precision of parameters and inputs/outputs of a neural network to simplify the multiplications. They found that a very low precision is sufficient for both training and querying the NNs.

Wu et al. [2015] quantized a convolutional neural network, and achieved a 4-6x speedup with only 1% loss of classification accuracy.

Kim and Smaragdis [2016] trained a network operating with binary values only, which allowed them to use very efficient bitwise operations. They proposed several training methods, resulting in a network performing almost as well as the corresponding network with real values.

3.3 Quantization

To quantize the coordinates of the neural state vector, we first choose a set Q of values of a fixed size |Q|. This set will not be changed during the decoding task. We then replace each value x in the neural state vector by the value x̂ ∈ Q minimizing the distance |x - x̂|. Effectively, we are mapping the vectors from R^D to the closest vector in the discrete vector space Q^D (Figure 3.1).

Figure 3.1: Example of quantization in two dimensions. The original vector is transformed to the closest vector in the discrete subspace given by the quantization grid (green).

A slightly more complex procedure allows Q to vary for different coordinates (i.e. we would have D different sets, one for each dimension of the neural state). This might be beneficial if it turns out that the distributions of numbers in different coordinates are very distinct.

3.3.1 Creating the quantization set Q

Uniform (Figure 3.2a). If the majority of the values in the neural state vectors fall into an interval [a, b] (Section 3.5.2), we can simply cut this interval into |Q| equal pieces and take their centres to form Q:

Q = { a + (b - a)/|Q| * (t + 1/2) : t ∈ {0, 1, ..., |Q| - 1} }

This approach has the advantage that for a given value x ∈ R, it is possible to find the closest value x̂ ∈ Q in constant time, just by rounding x.
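A minimal sketch of that constant-time lookup (plain Python, not the CUDA/Thrust code used in the experiments later): because the centres are equally spaced, the index of the nearest centre follows directly from rounding down.

def quantize_uniform(x, a, b, k):
    # Centres are a + (b - a)/k * (t + 1/2) for t = 0, ..., k-1, so x falls into the
    # piece with index floor((x - a) / ((b - a)/k)), whose centre is the closest one.
    step = (b - a) / k
    t = int((x - a) / step)
    t = min(max(t, 0), k - 1)          # clamp values that fall outside [a, b]
    return a + step * (t + 0.5)

print(quantize_uniform(0.31, -1.0, 1.0, 25))   # -> 0.32, the nearest of 25 centres in [-1, 1]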

Figure 3.2: Example of quantization types for |Q| = 4. (a) Uniform: only depends on the data range. (b) Simple bucket: equally sized clusters. (c) K-means: more clever clusters.

Simple buckets (Figure 3.2b). We sample a selection of values from the state vectors created while decoding the training data to get a training set (Section 3.5.1). We sort it and split it into |Q| buckets of the same size (if the number of values is not divisible by |Q|, the first few buckets will contain one more element than the latter ones). Then we take the average of each bucket to form Q. This is essentially the first step of the K-means clustering algorithm on the values.

K-means clustering (Figure 3.2c). Run the K-means algorithm [Lloyd, 2006] on the training set (obtained as described above) with K = |Q|; the output centers will form Q.
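The three constructions can be sketched in a few lines of Python/NumPy. This is a simplified CPU version assuming a 1-D sample of training values; the experiments themselves use a GPU implementation (Section 3.5.1).

import numpy as np

def uniform_centers(a, b, k):
    # Centres of k equal pieces of [a, b].
    t = np.arange(k)
    return a + (b - a) / k * (t + 0.5)

def simple_bucket_centers(train_values, k):
    # Sort, split into k roughly equal buckets, take the mean of each bucket.
    buckets = np.array_split(np.sort(train_values), k)
    return np.array([b.mean() for b in buckets])

def kmeans_centers(train_values, k, iters=20, seed=0):
    # Plain 1-D K-means (Lloyd's algorithm) on the sampled state values.
    rng = np.random.default_rng(seed)
    centers = rng.choice(train_values, size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(np.abs(train_values[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = train_values[labels == j].mean()
    return np.sort(centers)

def quantize(state, centers):
    # Map every coordinate to its closest centre (here via broadcasting;
    # the experiments use a binary search per coordinate instead).
    idx = np.argmin(np.abs(state[:, None] - centers[None, :]), axis=1)
    return centers[idx]

train = np.tanh(np.random.default_rng(1).normal(0, 0.7, 10_000))  # stand-in for sampled state values
Q = kmeans_centers(train, k=25)
state = np.tanh(np.random.default_rng(2).normal(0, 0.7, 1024))
state_hat = quantize(state, Q)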

3.4 Evaluation metrics

I will present my results for various sets of quantization centers in Section 3.5.3. To test a method, I will modify the decoding process (Section 2.2) and quantize the neural states at each step (i.e. at the prediction of each word). A probability distribution over the next word is then created for each hypothesis, and I will either:

Propagate the modifications: use this probability distribution to determine the scores of the current hypotheses, and carry on using these scores; or

Evaluate the modifications locally: compare the output to the true probability distribution, i.e. the one created using the original (non-quantized) state vectors; then throw the modifications away and carry on decoding using the true scores.

3.4.1 Metrics propagating the modifications

BLEU score (Section 2.4.1). In the BLEU test, I will decode the sentences using the modified states, which will produce an alternative translation. Comparing its BLEU score to the BLEU score of the unmodified model, we can see how much damage the modification causes to the translation task. We expect that using a smaller number of quantization centers will lead to a bigger drop in BLEU, and we are interested in the minimal number of centers sufficient to restore the original value of BLEU.

3.4.2 Metrics evaluating the modifications locally

Kullback-Leibler divergence (Section 2.4.2). We can look at the probability distributions over the whole target vocabulary, and measure how similar they are. For this, the KL divergence KL(P || Q) can be used, with P being the true distribution and Q the distribution given by the quantized states. We would say that an approximation is good (meaning it does not damage the resulting probability distribution too much) if the KL divergence is small.

One issue with using this metric for evaluation is that the interpretation is quite vague: what is a small KL divergence? It might be used as a measure for comparing different methods, but it is difficult to interpret the result on its own. At the same time, we might not need to care about the damage to the whole probability distribution. In the beam search algorithm (Section 2.2) that we use during decoding, we score a set of hypotheses and then keep only the top scoring few. Therefore it makes sense to ensure that our approximation gets the top scores right, and not to penalize it for hypotheses which do not exactly match the true distribution in its tail (i.e. the hypotheses which are very unlikely under the original model).

For this reason I will also measure other quantities, which should be easier to interpret and possibly align closer to our interest.

Precision on the top word. This is a very simple quantity, measuring how often a word which scores highest in the true distribution also scores highest after the modifications. For each sentence, the measured quantity is averaged over all the hypotheses that are scored during the process:

Top 1 = (# of hypotheses scored for which the top-1 word did not change) / (# of hypotheses scored)

This metric treats each hypothesis separately. We might also be interested in a slightly more global property, considering full beams instead.

Precision on the next beam. This looks at the whole beam and calculates the precision on the next beam, i.e. how many hypotheses selected for the next beam using the true distribution are still selected after the modifications. For a specific beam b_i, we calculate:

π(b_i) = |(true b_{i+1}) ∩ (b_{i+1} after modifications)| / |true b_{i+1}|

For each sentence s, the reported value is a weighted average over the beams, π(s) = Σ_i w(b_i) π(b_i), with weights w(b_i) = |true b_{i+1}| / Σ_j |true b_{j+1}|. Equivalently, this is

π(s) = Σ_i |(true b_{i+1}) ∩ (b_{i+1} after modifications)| / Σ_i |true b_{i+1}|

This gives us the proportion of the hypotheses which the original model would choose, and which are not lost when using quantization.
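Both quantities are easy to compute from the decoding traces; the snippet below (Python, with toy lists standing in for the real hypotheses) mirrors the formulas above.

def top1_precision(true_top_words, modified_top_words):
    # Fraction of scored hypotheses whose highest scoring next word is unchanged.
    agree = sum(t == m for t, m in zip(true_top_words, modified_top_words))
    return agree / len(true_top_words)

def next_beam_precision(true_beams, modified_beams):
    # Weighted average over beams of |true beam intersect modified beam| / |true beam|,
    # which simplifies to the sum of intersections over the sum of true beam sizes.
    kept = sum(len(set(t) & set(m)) for t, m in zip(true_beams, modified_beams))
    total = sum(len(t) for t in true_beams)
    return kept / total

# Toy example: two decoding steps of one sentence.
true_beams = [["he is", "she is"], ["he is a", "he is the"]]
modified_beams = [["he is", "it is"], ["he is a", "he is the"]]
print(next_beam_precision(true_beams, modified_beams))   # (1 + 2) / (2 + 2) = 0.75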

3.5 Experiments

3.5.1 System description

I will perform all my experiments on an English-Czech model designed and trained by Sennrich et al. [2016] for the WMT 2016 shared news translation task. This is a pure neural translation system based on an attentional encoder-decoder [Bahdanau et al., 2014] (Section 2.1.2). It is implemented according to the dl4mt-tutorial (https://github.com/nyu-dl/dl4mt-tutorial), but it uses the more efficient AmuNMT C++ decoder (Section 2.2) instead of the provided theano implementation. To achieve open vocabulary translation, byte pair encoding [Sennrich et al., 2015b] (Section 2.3) is used to segment the words.

The English-Czech model was first trained on the full WMT16 parallel training set, until it converged on the heldout validation set. Afterwards, the training continued with a synthetic parallel corpus created by back-translating monolingual Czech data [Sennrich et al., 2015a]. For the back-translation, a neural MT model from earlier experiments, trained on WMT15 data, was used. Sennrich et al. [2016] used several other enhancements for their submission, including combining left-to-right and right-to-left models through reranking of the n-best list. I will not use these in my experiments. The system I will be working with achieved a BLEU score of 0.207 on their dev set, and 0.237 on the official WMT16 test set.

I used a fixed beam size equal to 100 in the experiments. The data was tokenized, truecased and segmented using BPE. For data exploration and construction of the set Q, I used WMT newstest2013 (3000 sentences) as training data. I uniformly sampled 100,000 neural states out of the 6,625,568 produced during decoding, and collected 100,000 sample values for each coordinate, as well as for all the coordinates at once. For the experiments, the first 1000 sentences from WMT newstest2016 were used as test data. The unmodified model scored 0.25 BLEU points on my test set.

Implementation

The codebase uses the Thrust library (http://docs.nvidia.com/cuda/thrust/), which is a CUDA equivalent of the C++ standard template library (STL).

The loop over the coordinates of the neural state is parallelized on the GPU; for a time analysis see Section 3.5.4. For each coordinate of the neural state, the closest value in the quantization set is found using a binary search, i.e. the time to quantize one value is O(log |Q|). For uniform quantization (Section 3.3.1) this can be done in O(1), as the value of the closest center can be computed directly. I implemented this in a naive way (calculating the position in the array explicitly) in the first version of my program, but it turned out to be effectively as fast as the more general binary search version, so I did not use it in the final version. A speed-up is still possible by using a more clever way of rounding the values.

3.5.2 Data exploration

To use any of the quantization methods described in Section 3.3.1, we need to know more about the distribution of the data. I based my analysis on a training set (Section 3.5.1), which contains 100,000 representative values from each coordinate of the neural state vector, and also a sample of 100,000 from all of them (this is a sample for the case when we consider all the coordinates to be drawn from the same distribution).

The distribution of values from all coordinates can be seen in Figure 3.3. As a consequence of using a tanh activation function in the neural model, all the values are in [-1, 1]. There are more values clustered around 0, with the median slightly shifted from 0 towards the negative numbers (it is -0.0035 on the sample). The histogram of the distribution looks similar to a normal distribution around its mean (although it cannot be normal due to the finite support). Trying to fit a Gaussian curve to it (Figure 3.4), we see that indeed this is not the case. The histogram is narrower than the Gaussian equation allows, decreasing faster near the median value and then slowing down, giving rise to heavier tails. We can also try to fit a beta distribution to it, which is more suitable as it has finite support (we have to shift it from [0, 1] to [-1, 1]). However, one can see (Figure 3.4) that this fit is almost identical to the Gaussian one, and so does not explain the distribution better.

It is also interesting to see whether the distributions look the same for all coordinates. In order to find out, we can plot the distributions for selected values of i ∈ {1, 2, ..., 1024}. As the order of the coordinates is not important during training, we can just choose i ∈ {1, 2, ..., 5}. The results can be seen in Figure 3.5.

Figure 3.3: Distribution of values in the neural state vectors, treating all coordinates equally.

Figure 3.4: Fitting the curve over the histogram of values from the neural states.

Figure 3.5: Distribution of values in the neural state vectors for different coordinates.

The distributions differ slightly: they all have the bell shape, but the median and width of the peak differ. We note there are some irregularities even in the shape, for example the heavy tail near 1 for the 2nd coordinate. This suggests that using a different set Q for different coordinates i ∈ {1, 2, ..., 1024} might be beneficial. In this thesis, I will only explore global quantization (i.e. using the same Q for all coordinates), as the distributions are not too different (and per-coordinate quantization would also introduce additional memory usage, since we would store 1024 different arrays). However, if the results of my experiments suggest quantization works well, it might be sensible to implement this in the future to see whether even better overall improvements are possible.

Different types of the quantization set Q

I used the full training data (Section 3.5.1) to construct the sets of quantization centers as described in Section 3.3.1. The values chosen for |Q| = 63 can be seen in Figure 3.6. The uniform centers are evenly distributed in [-1, 1]. Simple bucket clustering creates centers which follow the same curve as the data themselves (the thick light blue line in Figure 3.6), giving a lot of centers to the more frequent values near 0. Centers defined using K-means clustering are somewhere in between, still giving more weight to the values around 0, but with the clusters distributed more evenly over the whole interval.

Figure 3.6: Quantization centers created by the three different approaches (Section 3.3.1) for |Q| = 63.

3.5.3 Results

The results can be seen in Figures 3.7, 3.8, 3.9 and 3.10.

In terms of BLEU, the simple bucket centers perform the worst: 100 centers were needed to achieve a BLEU score of 0.2495, restoring the score 0.25 of the unmodified model. For both the uniform and K-means approaches, 25 centers seem to be enough (scoring 0.2527 and 0.2496 BLEU points respectively), and 16 centers are only slightly worse (scoring 0.2413 and 0.2404 respectively).

The KL divergence test confirms that the simple bucket clusters are the least suitable. For all methods, there appears to be a linear relationship between the logarithms of the KL divergence and of the number of quantization centers, i.e. the KL divergence decreases as a power of |Q|. We note that the KL divergence is 0.005 and 0.003 for uniform and K-means with 25 centers (respectively), and 0.005 for simple buckets with 100 centers. Thus it seems that, to achieve the BLEU score of the original model, the KL divergence has to be at most 0.005.

The results for the precisions on the top word and on the next beam are very similar, suggesting that measuring the precision on the top word might be a good approximation of the precision on the next beam (the calculation of which takes more time).

Figure 3.7: BLEU scores as a function of the size of the quantization set |Q|.

Figure 3.8: KL divergence as a function of the size of the quantization set |Q|. Note that although for uniform and K-means the value at 10,000 clusters is 0 (not shown due to the log scale), this does not necessarily mean the KL divergence was always 0. The output precision was 6 decimal places, so the average KL being 0.0 just means that all the measured values were below 5 x 10^-7.

Figure 3.9: Precision on the top word as a function of the size of the quantization set |Q|.

Again, there seems to be a linear relationship between the logarithms of the axes, with K-means and uniform being about equally good, and simple buckets decreasing a bit more slowly. Achieving the same BLEU score as the unmodified model corresponds to roughly a 3% error rate (uniform with 25 centers has precision 0.969, K-means with 25 centers has 0.976, and simple buckets with 100 centers has 0.978).

3.5.4 Time experiments

A time comparison on the task of translating 1000 sentences (propagating the modifications) is in Figure 3.11. The original model without quantization took 0.381 seconds per sentence on average. The time is slightly higher for the models with quantization, ranging from 0.386 seconds per sentence for only 2 centres to 0.391 seconds per sentence for 10,000 centres.

The measurements were done with the implementation using the binary search to find the closest value for each coordinate in O(log |Q|). For the case with 2 centers, this lookup should be quite fast, and we see that the model is still approximately 5 seconds slower than the original one over the 1000 sentences. This shows that a naive O(1) implementation of the uniform quantization would not speed up the process very much, as the actual lookup does not take a lot of extra time.

Figure 3.10: Precision on the next beam as a function of the size of the quantization set |Q|.

The current implementation slows down the process by only 1.3-2.6%. Since we have not yet applied any of the potential speedups (Section 3.1), such as refactoring the network to use faster fixed-point integer operations, we believe that in the end this technique will make the decoding faster, as intended.

3.6 Summary

We analyzed the potential benefits of quantization applied to neural states. The intention is to use it to speed up machine translation models by reimplementing the networks to use a smaller number of bits in the matrix operations (Section 3.1); a similar approach has been shown to be successful in other domains (Section 3.2). The technique was described (Section 3.3) together with three different ways to choose the quantization centers (Section 3.3.1). The system (Section 3.5.1) and evaluation metrics (Section 3.4) were introduced. The structure of the data was explored (Section 3.5.2), and used to create the quantization centers. The quantization methods were compared in terms of the time taken and the approximation error introduced (Section 3.5.3).

Overall, the simple bucket method scored the worst.

Figure 3.11: Green: time taken by the original model. Orange: average time over 3 measurements with quantization, which uses the O(log |Q|) algorithm to quantize the values.

Both the uniform and K-means methods were able to produce translations with a BLEU score approximately equal to the BLEU score of the unmodified model with only 25 centers, i.e. encoding each value in only 5 bits (compared to the original 32). Looking at the structure of the centers generated (Figure 3.6), we might conclude that the values further away from 0 are important when predicting the next word, and should not be quantized too coarsely (as the simple bucket method does).

Chapter 4
Clusterings Based on Neural State Similarity

4.1 Motivation

As mentioned in Chapter 1, in statistical MT models one can use decoding algorithms such as beam search and cube pruning (Section 2.2) to consider exponentially many hypotheses for the translation. This is possible thanks to dynamic programming, which can often merge hypotheses because they have the same state, and in this way decrease the number of times we have to call the scoring function.

The issue with neural decoding is that the state space is too large and very sparse, and it is almost impossible to find two hypotheses with the same neural state. Even if they share most of the target words and only differ in one that was translated long ago (and so probably does not have much effect on the current prediction task), their state vectors will be different. However, we believe that although these states are not equal, they might be similar; for example, they might have high cosine similarity, or they might be close to each other in some metric space. The idea is to cluster the states in a beam based on how similar they are, and decode the next word under the assumption that each cluster contains hypotheses that are alike, and so can be scored jointly. See Section 4.3 for the details of the process.

4.2 Related work

Liu et al. [2014] investigated using an RNNLM for decoding in speech recognition. They proposed two ways to cluster the hypotheses based on their history contexts. The first is an N-gram style clustering, where two hypotheses are merged if they share their language model state. The second is a clustering using state vectors: they merge hypotheses if they end with the same word and if the distance between their state vectors is below a tuned beam parameter γ. They used the Euclidean distance normalized by the dimensionality of the state vector space. They defined the optimal clustering method as the one that minimizes the Kullback-Leibler divergence between the distributions P(· | h_1^{i-1}) for the hypotheses h_1^{i-1} within one cluster.

4.3 Future work: Coarse-to-fine decoding

The goal of my project is to construct a hierarchical clustering of the hypotheses in the current beam, and to reduce the computational costs by using the results of the simple problems to prune the search spaces of the more complex ones. More specifically, the idea is to bundle similar hypotheses together, and assign a shared state vector to each such bundle.

Consider partitioning the beam into c clusters of hypotheses {H_1, H_2, ..., H_c}. Each cluster H_i has a representative state vector t_i. This can be the average of the state vectors within the bundle, or the state vector of the previously top scoring hypothesis in H_i. We use t_i to score the whole bundle at once using our neural model, producing a probability distribution over the next word. We assign a score s_i to each bundle; for example, it can be the score of the previously top scoring hypothesis combined with the top score from the distribution. The bundles are inserted into a priority queue sorted by these scores.

To construct the next beam, we follow the simple procedure shown in Algorithm 1. This prioritizes the exploration of the hypotheses based on the score of their cluster. Note that the only time we query the NN is on line 8, where we break a bundle and need to assign scores to each of the new bundles. Thus we never score extended hypotheses separately, unless they appear in a high scoring sequence of bundles. This could potentially eliminate the need to score hypotheses of very low quality, by grouping them together and assigning a low score to the whole group.

Algorithm 1 Constructing the next beam
1: pq = InitializePriorityQueueFromHypBundles()
2: nextbeam = {}
3: while nextbeam.size() < beamsize do
4:     item = pq.pop()
5:     if item.isunbundled() then
6:         nextbeam.add(item)
7:     else
8:         bundles = item.break()
9:         pq.pushall(bundles)

On line 8, break() splits the bundle into multiple smaller ones, and assigns each a new representative state vector and a score. In case item represents a single hypothesis with W target words, this breaks it into W unbundled hypotheses.

A small example of this is in Figure 4.1. The order of exploration would be: 1. the bundle with score -1.3; 2. the bundle with score -1.9; 3. "They are"; 4. "Who are"; 5. the bundle with score -4.5; 6. "What is"; 7. "He is"; 8. "What are". If the beam size is smaller than or equal to 2, the bundle with score -4.5 would never be unbundled, so we would save time on scoring its hypotheses separately.

Figure 4.1: A toy example of the hypothesis bundles based on the Euclidean similarity of their state vectors. In this case, the neural states are two-dimensional, and for simplicity we assume that there is only one possible word to append to all the hypotheses (so that we do not have to care about unbundling the words at the end).

I am planning to implement this procedure at the beginning of my PhD. In this project, I am analyzing different ways to bundle the hypotheses. After obtaining the bundles, I will choose the representative vector t_i to be the average of the state vectors inside the bundle. Then I will measure how much approximation error is introduced by using t_i instead of the true state vector to predict the next word in a hypothesis.
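The listing below is a runnable Python sketch of Algorithm 1. The Bundle class, its scores and its split() behaviour are hypothetical stand-ins (in particular, split() here just halves the list of hypotheses and takes the best member score, whereas the real procedure would re-query the neural model); the numbers loosely follow the toy example of Figure 4.1.

import heapq

class Bundle:
    def __init__(self, score, hypotheses):
        self.score = score                 # e.g. parent score combined with best next-word score
        self.hypotheses = hypotheses       # list of (score, partial translation) pairs

    def is_unbundled(self):
        return len(self.hypotheses) == 1

    def split(self):
        # Break the bundle into two smaller ones and give each a score; this is the
        # only point at which the neural model would be queried (line 8 of Algorithm 1).
        mid = len(self.hypotheses) // 2
        halves = [self.hypotheses[:mid], self.hypotheses[mid:]]
        return [Bundle(max(s for s, _ in h), h) for h in halves if h]

def construct_next_beam(bundles, beam_size):
    pq = [(-b.score, i, b) for i, b in enumerate(bundles)]   # max-priority via negated scores
    heapq.heapify(pq)
    counter = len(pq)
    next_beam = []
    while pq and len(next_beam) < beam_size:
        _, _, item = heapq.heappop(pq)
        if item.is_unbundled():
            next_beam.append(item.hypotheses[0])
        else:
            for child in item.split():
                counter += 1
                heapq.heappush(pq, (-child.score, counter, child))
    return next_beam

# With beam size 2 the low scoring bundle is never unbundled, so its three
# hypotheses are never scored separately.
beam = construct_next_beam(
    [Bundle(-1.3, [(-1.7, "They are"), (-3.2, "Who are")]),
     Bundle(-4.5, [(-4.4, "What is"), (-4.8, "He is"), (-5.1, "What are")])],
    beam_size=2)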

4.4 Clusterings

4.4.1 Locality sensitive hashing

Locality sensitive hashing (LSH) is a type of dimensionality reduction suitable for finding similar items in high-dimensional spaces [Gionis et al., 1999]. It is often used in tasks such as nearest neighbour search. In contrast to conventional hashing functions, LSH is constructed in such a way that the probability of collision is large for vectors close to each other in the original metric space.

The random projection method of LSH will be used [Charikar, 2002]. If we want to reduce the dimension from D to M, we first generate M random D-dimensional vectors v_1, v_2, ..., v_M. These are drawn from a multivariate Gaussian distribution N(0, I_D), i.e. the coordinates are mutually independent with zero mean and unit variance. To hash a vector x ∈ R^D, we compute the dot products d_i := x · v_i for i ∈ {1, 2, ..., M}. The hashed value is then h(x) = (h_1, h_2, ..., h_M) ∈ {±1}^M, where h_i = sgn(d_i).

We can interpret this as follows: the random vector v_i defines a random hyperplane in R^D to which it is normal. Thus h_i = sgn(x · v_i) specifies on which side of the hyperplane x lies. Naturally, if two vectors are very close to each other, they are likely to have the same relative location with respect to the random hyperplanes, and so their hashes collide (Figure 4.2).

Figure 4.2: Example of LSH in two dimensions (D = M = 2). The vectors x_i are hashed according to their dot products with the hash vectors v_j. The vectors x_2 and x_3 have the same hash value, and so the algorithm puts them into the same cluster.

In relation to the coarse-to-fine algorithm (Section 4.3), we would hash the state vectors in the beam, and bundle them according to their hash value. The hierarchical structure can be obtained by adding more hash vectors for a finer clustering.
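A short sketch of this random-projection bundling (Python/NumPy; the dimensions and the dictionary-based bucketing are illustrative assumptions, not the experimental code):

import numpy as np

def make_hash_vectors(dim, num_hashes, seed=0):
    # M random vectors from N(0, I_D); each one defines a random hyperplane.
    return np.random.default_rng(seed).normal(size=(num_hashes, dim))

def lsh_hash(state, hash_vectors):
    # The sign of each dot product says on which side of the hyperplane the state lies.
    return tuple(np.sign(hash_vectors @ state).astype(int))

def bundle_by_hash(states, hash_vectors):
    buckets = {}
    for i, s in enumerate(states):
        buckets.setdefault(lsh_hash(s, hash_vectors), []).append(i)
    return buckets      # hypotheses whose hashes collide form one bundle

rng = np.random.default_rng(1)
states = rng.normal(size=(100, 1024))         # stand-in for the neural states in a beam
V = make_hash_vectors(dim=1024, num_hashes=15)
bundles = bundle_by_hash(states, V)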

4.4.2 Online k-means clustering

K-means clustering [Lloyd, 2006] can be used directly on the neural state vectors in a beam. This partitions the beam into K clusters (where K is given), trying to minimize the distance between each state vector and the mean of its cluster. To get the necessary hierarchical structure, one can for example fix K = 2 and, when a bundle is to be explored, break it down into two smaller bundles by running the 2-means clustering again.

There are several issues with this approach. First, this clustering might be too slow, reducing the benefits of coarse-to-fine decoding if we need to run it several times on the high-dimensional data. Second, it optimizes the squared Euclidean distance of the state vectors, and this might not be the most suitable metric for measuring the similarity of the hypotheses.
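A CPU sketch of the repeated 2-means bundling (Python/NumPy), including the empty-cluster reinitialisation described later in Section 4.6.1. The real experiments use a Thrust/GPU implementation, so this is only illustrative.

import numpy as np

def two_means(states, iters=10, seed=0):
    # Split one bundle of state vectors into two clusters with Lloyd's algorithm.
    rng = np.random.default_rng(seed)
    centers = states[rng.choice(len(states), size=2, replace=False)].copy()
    labels = np.zeros(len(states), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(states[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(2):
            if np.any(labels == j):
                centers[j] = states[labels == j].mean(axis=0)
            else:
                # Empty-cluster reinitialisation: move the centre onto the point
                # currently furthest from its own cluster centre.
                centers[j] = states[dists.min(axis=1).argmax()]
    return labels, centers

rng = np.random.default_rng(2)
beam_states = rng.normal(size=(100, 1024))          # stand-in for the states in one beam
labels, centers = two_means(beam_states)
# Representative vector t_i of each bundle: the average of its member states.
reps = [beam_states[labels == j].mean(axis=0) for j in range(2) if np.any(labels == j)]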

4.5 Evaluation metrics

I will use the same metrics as described in Section 3.4. In addition, there is a new point of interest in this method: how coarse the clustering is as a function of the parameters of the method. To measure this, we introduce a new metric, the average bucket size: the average number of items clustered into the same bundle (bucket) during the decoding process.

4.6 Experiments

I used the same system and data as in Chapter 3 (Section 3.5.1).

4.6.1 Implementation

For the online K-means clustering, I used a C++ implementation (https://github.com/bryancatanzaro/kmeans) using Thrust, thus performing the vector operations on the GPU in a way compatible with the AmuNMT system used. I made two amendments to the original code:

Initialization of the clusters: The default version assigns random labels to the datapoints, and then creates centers by averaging over each class. Instead, for each cluster I chose a random datapoint to be its initial center. This reduced the initial squared error.

Empty cluster reinitialization: More importantly, if at any point a cluster becomes empty, the default program would just keep its center as the zero vector (and thus most likely never start using it again, if the data is not near 0). As a result, specifying K for the algorithm only gives an upper bound on the number of clusters formed. In practice, this upper bound was seldom reached. The wanted behaviour is that the error tends to 0 as K approaches the size of the beam (as each vector can then have its own cluster). I achieved this by reinitializing the empty clusters to the datapoints currently furthest away from their cluster centers. This also helped to overcome the early stopping problem, i.e. reaching a local optimum after only one step. However, it is a bit slower, as one needs to keep track of the datapoints far away from their centers.

4.6.2 Results

The results can be seen in Figures 4.3 to 4.12.

To recover the BLEU score of the unmodified model, the LSH approach required 15 random hash vectors (achieving 0.2533 BLEU points). The corresponding KL divergence is 0.04, and the error rate on the top word (and also on the next beam) is 4%. This