Analysis of techniques for coarse-to-fine decoding in neural machine translation


Analysis of techniques for coarse-to-fine decoding in neural machine translation

Soňa Galovičová

Master of Science by Research
School of Informatics
University of Edinburgh
2016


Abstract

In this report, I first briefly explain how neural machine translation models work. I introduce the issues connected to them, specifically the slow decoding and the lack of recombination. I propose techniques to tackle these problems, and analyze their efficiency using several metrics. First, I investigate coordinate-wise quantization. Afterwards, I describe the idea of a coarse-to-fine decoding algorithm, and I suggest and explore three different clustering techniques: locality sensitive hashing, K-means clustering and word history similarity. At the end I compare these methods against each other, and evaluate their suitability for future work.

Acknowledgements

This work was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Soňa Galovičová)


Contents

1 Introduction

2 Background
  2.1 Neural models
    2.1.1 Language models
    2.1.2 Machine translation models
  2.2 Decoding
    2.2.1 Generating the probability distribution
  2.3 Byte pair encoding
  2.4 Metrics for experiment evaluation
    2.4.1 BLEU score
    2.4.2 Kullback-Leibler divergence

3 Coordinate-wise Quantization
  3.1 Motivation
  3.2 Related work
  3.3 Quantization
    3.3.1 Creating the quantization set Q
  3.4 Evaluation metrics
    3.4.1 Metrics propagating the modifications
    3.4.2 Metrics evaluating the modifications locally
  3.5 Experiments
    3.5.1 System description
    3.5.2 Data exploration
    3.5.3 Results
    3.5.4 Time experiments
  3.6 Summary

4 Clusterings Based on Neural State Similarity
  4.1 Motivation
  4.2 Related work
  4.3 Future work: Coarse-to-fine decoding
  4.4 Clusterings
    4.4.1 Locality sensitive hashing
    4.4.2 Online k-means clustering
  4.5 Evaluation metrics
  4.6 Experiments
    4.6.1 Implementation
    4.6.2 Results
  4.7 Summary

5 Clusterings Based on Word History
  5.1 Motivation
  5.2 Related work
  5.3 Clustering
  5.4 Evaluation metrics
  5.5 Experiments
    5.5.1 Results
    5.5.2 Comparison to the neural state clusterings
  5.6 Summary

6 Conclusion and Future Work

Bibliography

Chapter 1

Introduction

Machine translation (MT) models based on neural networks (NNs) [Ñeco and Forcada, 1996] were invented with the aim of building a single system which takes a sentence as input and outputs its translation directly, with all the subparts of the system tuned simultaneously [Bahdanau et al., 2014]. In this work, I will focus on infinite context models, which use networks such as recurrent NNs or LSTMs (Section 2.1). When searching for and scoring possible translations, these take the source sentence and the output words produced so far (which together form a history), and create a vector summarization of the parts most relevant to predicting the next word. As a result, they can capture long distance dependencies between words. This is not the case in traditional statistical MT models (e.g. phrase-based N-gram models), which use only local information to make predictions, so any effect of words further away is lost [Katz, 1987, Koehn et al., 2003].

Unfortunately, NNs are limited by their speed, both in training and in subsequent applications. For instance, they are too slow to translate the monolingual data for back-translation (i.e. creating synthetic parallel data), which has been shown to be beneficial for training MT systems [Sennrich et al., 2015a]. There are 1.8 trillion words of English we want to exploit, but training the NN already takes weeks for billions of words in the training dataset.

We might also be interested in constructing more complicated systems, for example combined left-to-right and right-to-left translation models. Sennrich et al. [2016] improved their system by using a left-to-right model to produce an n-best list of possible translations, and then reranking the output with the right-to-left model (i.e. they scored the n sentences with the second model, and sorted them accordingly).

However, reranking is of limited use if the n-best lists are already of low quality (e.g. because long distance dependencies were ignored). The fact that it works well suggests that we can achieve even better improvements by implementing a similar idea in decoding. If we found a decoding technique able to do a fast pass through the left-to-right model to approximately complete the sentence, we could use it to implement bidirectional neural decoding: to predict the next target word, we can use the completed sentence to directly obtain a score from the right-to-left model as well.

Another problem of pure neural models is that they can handle only a limited vocabulary, and so are not well suited for filling in rare words [Chen et al., 2015]. By using the neural scores to rerank n-best lists, it has been shown that the combination of NNs and statistical MT works better than either of them on its own [Sutskever et al., 2014, Neubig et al., 2015]. Alkhouli et al. [2015] investigated applying the RNN scores during decoding (Section 5.2), and reported that their method worked at least as well as reranking.

Unfortunately, implementing NNs in the decoding is not trivial. As mentioned above, NNs are very slow to query, and so if we want to consider a large number of possible translations, scoring them all separately would slow down the decoding process greatly. Even though the resulting translation might be of higher quality, if the task requires a much longer running time, it is not very practical.

There are procedures for traditional statistical MT which allow us to consider a large number of translations without scoring every one of them. Let us define a hypothesis to be a possible partial translation, and a state to be the inputs to the scoring function. By definition, if multiple hypotheses have exactly the same state, the future predictions are guaranteed to be identical for them. As a result, we can use dynamic programming to merge these hypotheses, and only explore the currently highest scoring one from each such group; as the future costs are identical, we know the other ones in each group will be at most as good in the future. This is called recombination.

Approximate search techniques such as beam search [Chiang, 2005] or cube pruning [Chiang, 2007] (Section 2.2) constrain us to consider a limited number of recombined hypotheses, which is linear in the sentence length. In the case of simple N-gram models, this means we actually consider exponentially many hypotheses, as all the hypotheses finishing with the same $(N-1)$ words and covering the same source words can be merged.

On the contrary, in the infinite context NN models two hypotheses have the same state only if they represent the same word sequence so far (because the neural state is a real valued vector, often of a high dimension). Consequently, few hypotheses are actually recombined. Thus, using the same search techniques as mentioned above, we can only consider a linear number of hypotheses. As considering more hypotheses usually increases our chances of finding the best one, this lowers the resulting translation quality.

The goal of my project is to find a way to restore the benefits of recombination in neural models. I will apply different coarse-to-fine approximations [Petrov et al., 2008] which would allow us to consider more hypotheses in the same amount of time, and measure how much approximation error they introduce. To do this, I will need to make the search more extensive by increasing the beam size (Section 2.2); this might cause the model to perform worse, as with the perplexity objective, scores actually go down with better search. This is a model problem that Andor et al. [2016] are trying to fix.

In this thesis, I first introduce the relevant background (Chapter 2). I analyze coordinate-wise quantization of the neural states (Chapter 3), which can make the decoding faster by using cheaper operations within the NNs. I show that using only 5 bits (25 distinct values) is sufficient to restore the quality of the unmodified model. I also investigate several ways to bundle multiple hypotheses together and score them simultaneously, namely two clusterings based on the similarity of the neural state vectors (Chapter 4), locality sensitive hashing and online K-means clustering, and another one based on the word history (Chapter 5). I present results showing that LSH is not suitable for future work, but both word history and K-means provide clusterings that look promising for use in coarse-to-fine decoding. I conclude in Chapter 6, comparing the methods to each other, evaluating their efficiency, and proposing possible future extensions.


Chapter 2

Background

2.1 Neural models

2.1.1 Language models

The simplest neural language model (LM) is based on feed-forward NNs [Bengio et al., 2003]. It is an N-gram model: it takes vector representations of the last $(N-1)$ words as input, concatenates them, and propagates the resulting vector through a feed-forward NN. This produces a probability distribution over the next word.

This does not solve the problem of finite context models mentioned in Chapter 1, namely the inability to capture long distance dependencies between words. For this, we can use an LM based on recurrent neural networks (RNNLM) [Mikolov et al., 2010]. In contrast to the above, an RNNLM represents the full history in a continuous state vector. The state vector is updated with each word added to the output sentence, and the information about any preceding word can influence the predictions for an arbitrarily long time [Boden, 2002].

Unfortunately, RNNs are hard to train because of the vanishing gradient problem [Bengio et al., 1994, Hochreiter et al., 2001]. This can be partly solved using a special type of RNN, the Long Short-Term Memory network (LSTM) [Hochreiter and Schmidhuber, 1997]. This differs from the plain RNN, but it is also an infinite context model, carrying a state vector. For the scope of this project, I will consider these two models, RNN and LSTM, to be the same one. The same techniques (Chapters 3, 4 and 5) will apply to both, although the architecture can influence their effectiveness.

2.1.2 Machine translation models

I will work with a left-to-right recurrent neural machine translation model, which belongs to the family of encoder-decoders (Figure 2.1) [Sutskever et al., 2014]. The encoder is an infinite context network trained on the parallel data, and its purpose is to take the source sentence and turn it into a single continuous valued vector, called the source context. In this project, I do not alter the encoding.

The decoder is another network, trained jointly with the encoder. It can be used both for scoring and for producing potential translations of the encoded source sentence. We start by plugging in the source context and the beginning-of-sentence symbol <s>. After this, the RNN produces a hidden state vector, which now corresponds to the sentence "<s>", and a probability distribution over the following word. Then we take the next word (either we have it because we just want to score a hypothesis, or we choose it according to the distribution produced), and plug it back in together with the new state vector and the source context. These steps are repeated until we reach the end of the sentence.

Figure 2.1: The machine translation recurrent neural model. (a) Encoder: encodes the source sentence "This is a dog" into a source context vector by pushing the words into the RNN one by one. (b) Decoder: scores the translation "Das ist ein Hund" (the tags <s> and </s> mark the beginning and the end of the sentence). Each score is calculated using the source context, the hidden state $s_i$ and the last word added.

Attention

The simplest encoder-decoder MT models work as described above. However, in this setup the decoder only sees the last state from the encoder. To make it more flexible, we can use a model with attention [Bahdanau et al., 2014]. This is another NN trained together with the encoder and decoder, and its purpose is to tell the decoder how important the different source words are when predicting the next word. (This can be understood as a neural version of alignment.)

Figure 2.2: The attentional encoder-decoder model. The encoder (red) creates context vectors for each word. The decoder (blue) queries the attention network to get weights $w_{i,j}$ for the words, and uses the corresponding linear combination to predict the next word.

More specifically, we feed the sentence to the encoder from left to right, and store the intermediate states (called $h_i$ in Figure 2.1a). Then we do the same, but this time from right to left, obtaining a second state for each word in the sentence. We concatenate these two vectors for each word; together they form the source context. When the decoder wants to predict the next word, the attention network assigns weights to all the source words; the corresponding linear combination of the context vectors is then passed to the decoder as the aligned source context (Figure 2.2).

2.2 Decoding

For the purpose of this project, it is necessary to understand how the neural decoder works. I will explain it using the example of the decoding algorithm used in AmuNMT [Junczys-Dowmunt et al., 2016]. This is a C++ implementation of an attentional encoder-decoder [Bahdanau et al., 2014] which I will be using for the experiments in Chapters 3, 4 and 5.

This algorithm uses a search technique called beam search [Chiang, 2005], and we have to specify its parameter, the beam size, in advance. To translate a sentence, it is first passed to the encoder to obtain the source context (Section 2.1.2). A beam B, an array of current hypotheses, is initialized to contain a single hypothesis marking the beginning of the sentence.

Then, until all the hypotheses end (or the translations reach the maximum length allowed; for example, this can be a constant multiple of the source sentence length), we iteratively decode one word at a time.

To decode one word, we take the neural states of the hypotheses in B and use them to produce a probability distribution over the whole vocabulary (Section 2.2.1). These probabilities are then combined with the scores of the hypotheses in B, and scores for all the possible extended hypotheses, E, are obtained. The number of elements in E is $|B| \times |\text{vocabulary}|$. The beam size specifies how many top elements from E we carry forward and extend in the next step. Out of these we store the ones which have just finished (i.e. end with the symbol </s>), and subtract their number from the beam size for the future steps. The beam B is then replaced by the remaining hypotheses, and the procedure repeats to translate the next word. After the loop is finished, we look at the complete hypotheses that were stored during the whole process, and output the highest scoring one as the final translation (or the highest scoring n if we are interested in an n-best list).

Cube pruning

Most statistical MT models do not use beam search anymore, abandoning it in favour of other search algorithms, most commonly cube pruning [Chiang, 2007]. Instead of scoring all the hypotheses in the current beam and then choosing the top ones, it prioritizes their exploration based on the scores of the hypotheses without the last word, and the individual scores of the words to be appended. The advantage is that since we are not scoring all the hypotheses separately, we can afford to increase the beam size, and so consider a larger number of hypotheses in the same amount of time.
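Returning to the beam search loop itself, the following is a minimal Python sketch of one decoding step. It is not the AmuNMT implementation: score_next_words is a hypothetical stand-in for the neural decoder (returning one row of log-probabilities over the vocabulary per hypothesis), and the neural state updates are omitted for brevity.

import numpy as np

def beam_search_step(beam, beam_size, score_next_words, eos_id, finished):
    # beam: list of (tokens, score) pairs; finished: list collecting complete hypotheses
    log_probs = score_next_words(beam)                            # shape (|B|, |vocabulary|)
    totals = np.array([s for _, s in beam])[:, None] + log_probs  # scores of all extensions E
    best = np.argsort(totals, axis=None)[::-1][:beam_size]        # top `beam_size` elements of E
    next_beam = []
    for flat in best:
        i, w = divmod(int(flat), totals.shape[1])
        hyp = (beam[i][0] + [w], float(totals[i, w]))
        if w == eos_id:
            finished.append(hyp)   # store the hypothesis that just finished
            beam_size -= 1         # and shrink the beam for future steps
        else:
            next_beam.append(hyp)
    return next_beam, beam_size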

2.2.1 Generating the probability distribution

The diagram in Figure 2.3 shows how the probability distribution over the next word is obtained. The input is a hypothesis: it consists of a previous state (the neural state of the hypothesis without the last word) and the embedding of the last word added. They are then combined by a NN to form a hidden state. Together with the source context (which is the same for all decoded words), we put the hidden state into the attention NN, which outputs the aligned source context; this is the weighted sum of the context vectors for all the source words. Afterwards, the hidden state and the aligned source context are combined in another NN to produce a neural state. The final step is to feed the neural state, the aligned source context and again the embedding into the last NN, which has a softmax layer producing the probability distribution over the next word. In this project, I am amending (quantizing in Chapter 3 and clustering in Chapters 4 and 5) the neural state vectors.

Figure 2.3: Assembling the neural state, and calculating the probabilities for the next word.

2.3 Byte pair encoding

One of the limitations of neural MT models is that they can only handle a fixed vocabulary; it is usually limited to 30,000 to 50,000 words. This makes translation of rare words problematic. A possible way to introduce an open vocabulary in word-level neural MT models is by segmenting the words into smaller units, and then training the model on them instead. This can be helpful especially for languages forming words by agglutination and compounding, such as German (e.g. a language school becomes eine Sprachschule, handled as the segments Sprach + schule).

In my work, models operate on subword units created using byte pair encoding (BPE) [Sennrich et al., 2015b], which is an algorithm based on the byte pair encoding compression technique [Gage, 1994]. It starts with each word split into characters (ended by a special end-of-word character), and then iteratively merges the most frequent pairs of adjacent characters or character sequences, until the wanted vocabulary size is reached. Very frequent character n-grams or whole words are eventually merged, and so end up included in the vocabulary; rare words will always be passed to the model in their segmented form.
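As an illustration of the merge loop just described, here is a small sketch of BPE training in Python. It follows the description above rather than the exact implementation of Sennrich et al. [2015b]; the word-frequency dictionary and the number of merges are illustrative.

from collections import Counter

def learn_bpe(word_freqs, num_merges):
    # start with every word split into characters, plus an end-of-word marker
    vocab = {tuple(word) + ('</w>',): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq                 # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)              # most frequent pair gets merged
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():           # apply the merge to every word
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# e.g. learn_bpe({'low': 7, 'lower': 5, 'newest': 3}, num_merges=10)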

2.4 Metrics for experiment evaluation

2.4.1 BLEU score

Bilingual evaluation understudy (BLEU) [Papineni et al., 2002] is a commonly used metric to evaluate the quality of machine translation systems. It is completely automated, and it has been shown to correlate highly with human judgment (which is expensive and time-consuming). It compares the translation generated by the system, c (the candidate translation), to a human generated reference r, and evaluates how much their n-grams overlap. The exact formula is:

$$\text{BLEU} = \text{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)$$

BP is the brevity penalty, which penalizes translations shorter than the reference. It is crucial to ensure that models outputting only words (phrases) which they are really sure about would not score the highest. It is calculated as:

$$\text{BP} = \begin{cases} 1 & \text{if } c \text{ is longer than } r \\ e^{(1 - r/c)} & \text{if } c \text{ is shorter than or equal in length to } r \end{cases}$$

N is the maximal length of n-grams used; this is usually set to 4. The $w_n$ are positive weights summing to 1; uniform weights have $w_n = 1/N$ for all n. Finally, $p_n$ is the n-gram precision of the full candidate translation.
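A sentence-level sketch of this computation follows, with uniform weights $w_n = 1/N$ and clipped n-gram counts. It is only an illustration of the formula; real BLEU evaluation is typically done at the corpus level.

import math
from collections import Counter

def sentence_bleu(candidate, reference, N=4):
    # candidate and reference are lists of tokens
    log_precisions = []
    for n in range(1, N + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        p_n = overlap / total
        if p_n == 0:
            return 0.0                      # geometric mean collapses if any precision is zero
        log_precisions.append(math.log(p_n))
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)   # brevity penalty
    return bp * math.exp(sum(log_precisions) / N)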

2.4.2 Kullback-Leibler divergence

The KL divergence, or relative entropy, is a measure of how different two probability distributions are [Kullback and Leibler, 1951]. Given two distributions P and Q on the same set X, it is defined as:

$$KL(P \| Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}$$

The inequality $KL(P \| Q) \geq 0$ is always satisfied, with equality if and only if P and Q are identical on the whole of X. The value is finite if and only if Q(x) = 0 implies P(x) = 0, i.e. the support of Q is a superset of the support of P. If two distributions are similar, their KL divergence will be small. In general $KL(P \| Q) \neq KL(Q \| P)$, i.e. it is not symmetrical.

In practice we often take P to be the true distribution of a random variable X, and Q to be its approximation. In this case, $KL(P \| Q)$ is the expected value of $\log \frac{P(X)}{Q(X)}$ under the true distribution of X.
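As a small illustration (not taken from the thesis experiments), the divergence between two distributions over the same vocabulary can be computed directly from the definition:

import numpy as np

def kl_divergence(p, q):
    # KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)); terms with P(x) = 0 contribute nothing,
    # and any x with P(x) > 0 but Q(x) = 0 makes the divergence infinite
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    with np.errstate(divide='ignore'):
        terms = p[mask] * (np.log(p[mask]) - np.log(q[mask]))
    return float(np.sum(terms))

# e.g. kl_divergence([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]) is roughly 0.027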


Chapter 3

Coordinate-wise Quantization

3.1 Motivation

The neural state is a high-dimensional vector (1024-dimensional for the model that was used; Section 3.5.1) with each coordinate being a 32-bit float, i.e. there are approximately $2^{32}$ different values possible. If we were able to find a smaller set of values, representative enough so that the model still performs well when using only these, we could adjust the neural networks to use cheaper fixed-point integer operations during forward propagation, and thus reduce the decoding time.

3.2 Related work

There are several results about neural networks being robust to errors associated with quantization. This suggests that we can expect approximating state vectors with quantized ones to work well.

For instance, Courbariaux et al. [2014] reduced the precision of parameters and inputs/outputs of a neural network to simplify the multiplications. They found that a very low precision is sufficient for both training and querying the NNs. Wu et al. [2015] quantized a convolutional neural network, and achieved a 4 to 6 times speedup with only 1% loss of classification accuracy. Kim and Smaragdis [2016] trained a network operating with binary values only, which allowed them to use very efficient bitwise operations. They proposed several training methods, resulting in a network performing almost as well as the corresponding network with real values.

3.3 Quantization

To quantize the coordinates of the neural state vector, we first choose a set Q of values of a fixed size $|Q|$. This set will not be changed during the decoding task. We then replace each value x in the neural state vector by the value $\hat{x} \in Q$ minimizing the distance $|x - \hat{x}|$. Effectively, we are mapping the vectors from $\mathbb{R}^D$ to the closest vector in the discrete vector space $Q^D$ (Figure 3.1).

Figure 3.1: Example of quantization in two dimensions. The original vector is transformed to the closest vector in the discrete subspace given by the quantization grid (green).

A slightly more complex procedure allows Q to vary for different coordinates (i.e. we would have D different sets, one for each dimension of the neural state). This might be beneficial if it turns out that the distributions of numbers in different coordinates are very distinct.

3.3.1 Creating the quantization set Q

Uniform (Figure 3.2a)

If the majority of the values in the neural state vectors fall into an interval [a, b] (Section 3.5.2), we can just cut this interval into $|Q|$ equal pieces and take their centres to form Q:

$$Q = \left\{ a + \left( t + \frac{1}{2} \right) \frac{b - a}{|Q|} \; : \; t \in \{0, 1, \ldots, |Q| - 1\} \right\}$$

This approach has the advantage that for a given value $x \in \mathbb{R}$, it is possible to find the closest value $\hat{x} \in Q$ in constant time just by rounding x.
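A minimal sketch of this uniform grid and the constant-time lookup follows; the interval and the number of centres are illustrative, not the settings used in the experiments.

import numpy as np

def uniform_centers(a, b, k):
    # the set Q from the formula above: k evenly spaced centres in [a, b]
    t = np.arange(k)
    return a + (t + 0.5) * (b - a) / k

def quantize_uniform(x, a, b, k):
    # map each coordinate to its nearest centre in O(1) by rounding,
    # instead of searching the centre array
    t = np.floor((x - a) / (b - a) * k).astype(int)
    t = np.clip(t, 0, k - 1)          # values outside [a, b] saturate to the edge centres
    return a + (t + 0.5) * (b - a) / k

# e.g. quantize_uniform(np.array([-0.93, 0.02, 0.41]), a=-1.0, b=1.0, k=25)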

Figure 3.2: Example of quantization types for $|Q| = 4$. (a) Uniform: only depends on the data range. (b) Simple bucket: equally sized clusters. (c) K-means: more clever clusters.

Simple buckets (Figure 3.2b)

We sample a selection of values from the state vectors created during decoding the training data to get a training set (Section 3.5.1). We sort it and split it into $|Q|$ buckets of the same size (if the number of values is not divisible by $|Q|$, the first few buckets will contain one more element than the latter ones). Then we take the average of each bucket to form Q. This is basically doing the first step of the K-means clustering algorithm on the values.

K-means clustering (Figure 3.2c)

We run the K-means algorithm [Lloyd, 2006] on the training set (obtained as described above) with $K = |Q|$; the output centers will form Q.
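The two data-driven constructions can be sketched as follows, given a one-dimensional sample of state-vector values; the tiny Lloyd loop below stands in for the GPU K-means implementation actually used.

import numpy as np

def simple_bucket_centers(sample, k):
    # sort the sample, split it into k (nearly) equal buckets, average each bucket
    buckets = np.array_split(np.sort(np.asarray(sample, dtype=float)), k)
    return np.array([b.mean() for b in buckets])

def kmeans_centers(sample, k, iters=50, seed=0):
    # plain 1-D Lloyd iterations: assign values to the nearest centre, recompute centres
    sample = np.asarray(sample, dtype=float)
    rng = np.random.default_rng(seed)
    centers = rng.choice(sample, size=k, replace=False)
    for _ in range(iters):
        assign = np.argmin(np.abs(sample[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            members = sample[assign == j]
            if len(members) > 0:          # leave a centre untouched if its cluster is empty
                centers[j] = members.mean()
    return np.sort(centers)

# sample = np.random.uniform(-1.0, 1.0, 100_000)   # stand-in for the sampled state values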

3.4 Evaluation metrics

I will present my results for various sets of quantization centers in Section 3.5.3. To test a method, I will modify the decoding process (Section 2.2) and quantize the neural states at each step (i.e. the prediction of one word). A probability distribution over the next word is then created for each hypothesis, and I will either:

- Propagate the modifications: use this probability distribution to determine the scores of the current hypotheses, and carry on using these scores; or
- Evaluate the modifications locally: compare the output to the true probability distribution, i.e. the one which is created using the original (non-quantized) state vectors; then throw the modifications away and carry on decoding using the true scores.

3.4.1 Metrics propagating the modifications

BLEU score (Section 2.4.1)

In the BLEU test, I will decode the sentences using the modified states, which will produce an alternative translation. Comparing its BLEU score to the BLEU score of an unmodified model, we can see how much damage the modification causes to the translation task. We expect that using a smaller number of quantization centers will lead to a bigger drop in BLEU, and we are interested in the minimal number of centers sufficient to restore the original value of BLEU.

3.4.2 Metrics evaluating the modifications locally

Kullback-Leibler divergence (Section 2.4.2)

We can look at the probability distributions over the whole target vocabulary, and measure how similar they are. For this, the KL divergence $KL(P \| Q)$ can be used, with P being the true distribution and Q the distribution given by the quantized states. We would say that an approximation is good (meaning it does not damage the resulting probability distribution too much) if the KL divergence is small.

One issue with using this metric for evaluation is that the interpretation is quite vague: what is a small KL divergence? It might be used as a measure for comparing different methods, but it is difficult to explain the result on its own. At the same time, we might not need to care about the damage to the whole probability distribution. In the beam search algorithm (Section 2.2) that we use during the decoding, we score a set of hypotheses and then keep only the top scoring few. Therefore it makes sense to ensure that our approximation gets the top scores right, and not to penalize the ones which do not exactly match the true distribution in its tail (i.e. the hypotheses which are very unlikely under the original model).

For this reason I will also measure other quantities, which should be easier to interpret and possibly align closer to our interest:

Precision on the top word

This is a very simple quantity, measuring how often a word which scores highest in the true distribution is also scoring highest after the modifications. For each sentence, the measured quantity is averaged over all the hypotheses that are scored during the process:

$$\text{Top 1} = \frac{\text{number of hypotheses scored for which the top 1 word did not change}}{\text{number of hypotheses scored}}$$

This metric treats each hypothesis separately. We might also be interested in slightly more global properties, considering the full beams instead:

Precision on the next beam

This looks at the whole beam and calculates the precision on the next beam, i.e. how many hypotheses selected for the next beam using the true distribution are still selected after the modifications. For a specific beam $b_i$, we calculate:

$$\pi(b_i) = \frac{|(\text{true } b_{i+1}) \cap (b_{i+1} \text{ after modifications})|}{|\text{true } b_{i+1}|}$$

For each sentence s, the reported value is a weighted average over the beams:

$$\pi(s) = \sum_i w(b_i)\, \pi(b_i), \quad \text{with weights } w(b_i) = \frac{|\text{true } b_{i+1}|}{\sum_i |\text{true } b_{i+1}|}.$$

Equivalently, this is

$$\pi(s) = \frac{\sum_i |(\text{true } b_{i+1}) \cap (b_{i+1} \text{ after modifications})|}{\sum_i |\text{true } b_{i+1}|}$$

This gives us the proportion of the hypotheses which the original model would choose, and which are not lost when using quantization.
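Both precisions are easy to compute once the true and modified quantities are logged during decoding; the following sketch assumes the distributions are stored as arrays and the beams as sets of hypothesis identifiers (names are illustrative).

import numpy as np

def top1_precision(true_dists, modified_dists):
    # fraction of scored hypotheses whose highest-scoring word did not change
    true_best = np.argmax(true_dists, axis=1)
    modified_best = np.argmax(modified_dists, axis=1)
    return float(np.mean(true_best == modified_best))

def next_beam_precision(true_beams, modified_beams):
    # the equivalent form above: summed intersections over summed true beam sizes
    kept = sum(len(t & m) for t, m in zip(true_beams, modified_beams))
    total = sum(len(t) for t in true_beams)
    return kept / total if total else 1.0

# e.g. next_beam_precision([{1, 2, 3}, {4, 5}], [{1, 2, 9}, {4, 5}]) == 0.8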

3.5 Experiments

3.5.1 System description

I will perform all my experiments on an English-Czech model designed and trained by Sennrich et al. [2016] for the WMT 2016 shared news translation task. This is a pure neural translation system based on an attentional encoder-decoder [Bahdanau et al., 2014] (Section 2.1.2). It is implemented according to the dl4mt-tutorial, but it uses the more efficient AmuNMT C++ decoder (Section 2.2) instead of the given theano implementation. To achieve open vocabulary translation, byte pair encoding [Sennrich et al., 2015b] (Section 2.3) is used to segment the words.

The English-Czech model was first trained on the full WMT16 parallel training set, until it converged on the heldout validation set. Afterwards, the training continued with a synthetic parallel corpus created by back-translating monolingual Czech data [Sennrich et al., 2015a]. For the back-translation a neural MT model from earlier experiments, trained on WMT15 data, was used. Sennrich et al. [2016] used several other enhancements for their submission, including combining left-to-right and right-to-left models through reranking of the n-best list. I will not use these in my experiments. The system I will be working with achieved the BLEU scores reported by Sennrich et al. [2016] on their dev set and on the official WMT16 test set.

I used a fixed beam size equal to 100 in the experiments. The data was tokenized, truecased and segmented using BPE. For data exploration and construction of the set Q, I used as training data the WMT newstest2013 set (3000 sentences). I uniformly sampled 100,000 neural states out of the 6,625,568 produced during decoding, and collected 100,000 sample values for each coordinate, as well as for all the coordinates at once. For the experiments, the first 1000 sentences from WMT newstest2016 were used as test data. The unmodified model scored 0.25 BLEU points on my test set.

Implementation

The codebase uses the Thrust library, which is a CUDA equivalent of the C++ standard template library (STL). The loop over the coordinates of the neural state is parallelized on the GPU; for the time analysis see Section 3.5.4.

For each coordinate of the neural state, the closest value in the quantization set is found using a binary search, i.e. the time to quantize one value is $O(\log |Q|)$. For uniform quantization (Section 3.3.1) this can be done in $O(1)$, as the value of the closest center can be computed directly. I implemented this in a naive way (calculating the position in the array explicitly) in the first version of my program, but it turned out to be effectively as fast as the more general binary search version, so I did not use it in the final version. A speed-up is still possible by using a more clever way of rounding the values.

3.5.2 Data exploration

To use any of the quantization methods described in Section 3.3.1, we need to know more about the distribution of the data. I made my analysis on a training set (Section 3.5.1), which contains 100,000 representative values from each coordinate of the neural state vector, and also a sample of 100,000 from all of them (this is a sample for the case when we consider all the coordinates to be drawn from the same distribution).

The distribution of values from all coordinates can be seen in Figure 3.3. As a consequence of using a tanh activation function in the neural model, all the values are in $[-1, 1]$. There are more values clustered around 0, with the median slightly shifted from 0 towards the negative numbers on the sample. The histogram of the distribution looks similar to a normal distribution around its mean (although it cannot be normal due to the finite support). Trying to fit a Gaussian curve to it (Figure 3.4), we see that indeed this is not the case. The histogram is narrower than the Gaussian equation allows, decreasing faster near the median value and then slowing down, giving rise to heavier tails. We can also try to fit a beta distribution to it, which is more suitable as it has finite support (we have to shift it from [0, 1] to $[-1, 1]$). However, one can see (Figure 3.4) that this fit is almost identical to the Gaussian one, and so does not explain the distribution better.

It is also interesting to see whether the distributions look the same for all coordinates. In order to find out, we can plot the distributions for selected values of $i \in \{1, 2, \ldots, 1024\}$. As the order of the coordinates is not important during the training, we can just choose $i \in \{1, 2, \ldots, 5\}$. The results can be seen in Figure 3.5. The distributions differ slightly: they all have the bell shape, but the median and width of the peak differ.

Figure 3.3: Distribution of values in the neural state vectors, treating all coordinates equally.

Figure 3.4: Fitting the curve over the histogram of values from the neural states.

Figure 3.5: Distribution of values in the neural state vectors for different coordinates.

We note there are some irregularities even in the shape, for example the heavy tail near 1 for the 2nd coordinate. This suggests that using a different set Q for different coordinates $i \in \{1, 2, \ldots, 1024\}$ might be beneficial. In this thesis, I will only explore global quantization (i.e. using the same Q for all coordinates), as the distributions are not too different (and it would also introduce additional memory usage if we stored 1024 different arrays for quantization). However, if the results of my experiments suggest quantization works well, it might be sensible to implement this in the future to see whether even better overall improvements are possible.

Different types of the quantization set Q

I used the full training data (Section 3.5.1) to construct the sets of quantization centers as described in Section 3.3.1. The values chosen for $|Q| = 63$ can be seen in Figure 3.6. The uniform centers are evenly distributed in $[-1, 1]$. Simple bucket clustering creates centers which follow the same curve as the data themselves (the thick light blue line), giving a lot of centers to the more frequent values near 0. Centers defined using the K-means clustering are somewhat in between, still giving more weight to the values around 0, but with the clusters distributed more evenly over the whole interval.

Figure 3.6: Quantization centers created by the three different approaches (Section 3.3.1) for $|Q| = 63$.

3.5.3 Results

The results can be seen in Figures 3.7, 3.8, 3.9 and 3.10.

In terms of BLEU, the simple bucket centers perform the worst: 100 centers were needed to restore the score 0.25 of the unmodified model. For both the uniform and K-means approaches, 25 centers seem to be enough, and 16 centers are only slightly worse.

The KL divergence test confirms that the simple bucket clusters are the least suitable. For all methods, it looks like there is a linear relationship between the logarithms of the KL divergence and the number of quantization centers, i.e. the KL divergence decreases as a power of $|Q|$. Comparing the KL divergence of the uniform and K-means settings with 25 centers and of the simple bucket setting with 100 centers, it seems that to achieve the BLEU score of the original model, the KL divergence has to stay below a small threshold.

The results for the precisions on the top word and on the next beam are very similar, suggesting that measuring the precision on the top word might be a good approximation for the precision on the next beam (the calculation of which takes more time). Again, there seems to be a linear relationship between the logarithms of the axes, with K-means and uniform being about equally good, and simple buckets decreasing a bit slower. Achieving the same BLEU score as the unmodified model corresponds to roughly a 3% error rate (uniform with 25 centers has precision 0.969, K-means with 25 centers has 0.976, and simple buckets with 100 centers has 0.978).

Figure 3.7: BLEU scores as a function of the quantization set Q.

Figure 3.8: KL divergence as a function of the quantization set Q. Note that although for uniform and K-means the value at 10,000 clusters is 0 (and so cannot be shown on the log scale), this does not necessarily mean the KL divergence was always 0. The output precision was 6 decimal places, so an average KL of 0.0 just means that all the measured values were below the displayed precision.

3.5.4 Time experiments

A time comparison on the task of translating 1000 sentences (propagating the modifications) is shown in Figure 3.11. The average time per sentence is slightly higher for the models with quantization than for the original model without it, across the whole range from only 2 centres up to 10,000 centres.

The measurements are done with the implementation using the binary search to find the closest value for each coordinate in $O(\log |Q|)$. For the case with 2 centers, this lookup should be quite fast, and we see that the model is still approximately 5 seconds slower than the original one. This shows that a naive $O(1)$ implementation of the uniform quantization would not speed up the process very much, as the actual lookup does not take a lot of extra time.

Figure 3.10: Precision on the next beam as a function of the quantization set Q.

The current implementation slows down the process by only a small percentage. Since we have not yet applied any of the potential speedups (Section 3.1), such as refactoring the network to use faster fixed-point integer operations, we believe that in the end this technique will make the decoding faster, as intended.

3.6 Summary

We analyzed the potential benefits of quantization applied to the neural states. The intention is to use it to speed up machine translation models by reimplementing the networks to use a smaller number of bits in the matrix operations (Section 3.1); a similar approach has been shown successful in other domains (Section 3.2). The technique was described (Section 3.3) together with three different ways to choose the quantization centers (Section 3.3.1). The system (Section 3.5.1) and evaluation metrics (Section 3.4) were introduced. The structure of the data was explored (Section 3.5.2), and used to create the quantization centers. The quantization methods were compared in terms of the time taken and the approximation error introduced (Section 3.5.3).

Overall, the simple bucket method scored the worst. Both the uniform and K-means methods were able to produce translations with a BLEU score approximately equal to the BLEU score of the unmodified model with only 25 centers, i.e. encoding each value in only 5 bits (compared to the original 32).

Figure 3.11: Green: time taken by the original model. Orange: an average over 3 measurements with quantization, which uses the $O(\log |Q|)$ algorithm to quantize the values.

Looking at the structure of the centers generated (Figure 3.6), we might conclude that the values further away from 0 are important when predicting the next word, and should not be quantized too coarsely (as the simple bucket method has done).

Chapter 4

Clusterings Based on Neural State Similarity

4.1 Motivation

As mentioned in Chapter 1, in statistical MT models one could use decoding algorithms such as beam search and cube pruning (Section 2.2) to consider exponentially many hypotheses for the translation. This was possible thanks to dynamic programming, which would often merge hypotheses because they have the same state, and in this way decrease the number of times we have to call the score function.

The issue with neural decoding is that the state space is too large and very sparse, and it is almost impossible to find two hypotheses with the same neural state. Even if they share most of the target words and only differ in one that was translated long ago (and so probably does not have much effect on the current prediction task), their state vectors will be different.

However, we believe that although these states are not equal, they might be similar; for example, they might have high cosine similarity, or they might be close to each other in some metric space. The idea is to cluster the states in a beam based on how similar they are, and decode the next word using the assumption that each cluster contains hypotheses that are alike, and so can be scored jointly. See Section 4.3 for the details of the process.

4.2 Related work

Liu et al. [2014] investigated using an RNNLM for decoding in speech recognition. They proposed two ways to cluster the hypotheses based on their history contexts. The first is an N-gram style clustering, where two hypotheses are merged if they share their language model state. The second is a clustering using state vectors; they merge hypotheses if they end with the same word, and if the distance between their state vectors is below a tuned beam parameter γ. They used the Euclidean distance normalized by the dimensionality of the state vector space. They defined the optimal clustering method as the one that minimizes the Kullback-Leibler divergence between the distributions $P(\cdot \mid h_1^{i-1})$ for the hypotheses $h_1^{i-1}$ within one cluster.

4.3 Future work: Coarse-to-fine decoding

The goal of my project is to construct a hierarchical clustering of the hypotheses in the current beam, and to reduce the computational costs by using the results of the simple problems to prune the search spaces for the more complex ones. More specifically, the idea is to bundle similar hypotheses together, and assign a shared state vector to each such bundle.

Consider partitioning the beam into c clusters of hypotheses $\{H_1, H_2, \ldots, H_c\}$. Each cluster $H_i$ has a representative state vector $t_i$. This can be an average of the state vectors within the bundle, or the state vector of the previously top scoring hypothesis in $H_i$. We use $t_i$ to score the whole bundle at once using our neural model, producing a probability distribution over the next word. We assign a score $s_i$ to each bundle; for example, it can be the score of the previously top scoring hypothesis combined with the top score from the distribution. The bundles are inserted into a priority queue sorted by these scores.

To construct the next beam, we follow the simple procedure shown in Algorithm 1. It prioritizes the exploration of the hypotheses based on the score of their cluster. Note that the only time we query the NN is on line 8, where we break a bundle and need to assign scores to each of the new bundles. Thus we never score extended hypotheses separately, unless they appear in a high scoring sequence of bundles. This could potentially eliminate the need to score the hypotheses of very low quality, by grouping them together and assigning a low score to the whole group.

Algorithm 1 Constructing the next beam
1: pq = InitializePriorityQueueFromHypBundles()
2: nextbeam = {}
3: while nextbeam.size() < beamsize do
4:     item = pq.pop()
5:     if item.isUnbundled() then
6:         nextbeam.add(item)
7:     else
8:         bundles = item.break()    ▷ Breaks the bundle into multiple smaller ones, and assigns each a new representative state vector and a score. In case item represents a single hypothesis with W target words, this breaks it into W unbundled hypotheses.
9:         pq.pushAll(bundles)

A small example of this is in Figure 4.1. The order of exploration would be: 1. the bundle with score -1.3; 2. the bundle with score -1.9; 3. "They are"; 4. "Who are"; 5. the bundle with score -4.5; 6. "What is"; 7. "He is"; 8. "What are". If the beam size is smaller than or equal to 2, the bundle with score -4.5 would never be unbundled, so we would save time on scoring its hypotheses separately.

I am planning to implement this procedure at the beginning of my PhD. In this project, I am analyzing different ways to bundle the hypotheses. After obtaining the bundles, I will choose the representative vector $t_i$ to be the average of the state vectors inside the bundle. Then I will measure how much approximation error is introduced by using $t_i$ instead of the true state vector to predict the next word in a hypothesis.
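Since this procedure is future work, the following is only a sketch of how Algorithm 1 could look in Python with a binary heap. The Bundle interface (score, is_unbundled(), break_()) is hypothetical and simply mirrors the pseudocode above.

import heapq

def construct_next_beam(initial_bundles, beam_size):
    # max-priority queue via negated scores; the counter breaks ties so
    # bundle objects themselves are never compared
    heap = [(-b.score, i, b) for i, b in enumerate(initial_bundles)]
    heapq.heapify(heap)
    counter = len(heap)
    next_beam = []
    while heap and len(next_beam) < beam_size:
        _, _, item = heapq.heappop(heap)
        if item.is_unbundled():
            next_beam.append(item)            # a single hypothesis goes straight to the beam
        else:
            for child in item.break_():       # the only place the NN is queried
                counter += 1
                heapq.heappush(heap, (-child.score, counter, child))
    return next_beam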

Figure 4.1: A toy example of the hypothesis bundles based on the Euclidean similarity of their state vectors. In this case, the neural states are two-dimensional, and for simplicity we assume that there is only one possible word to append to all the hypotheses (so that we do not have to care about unbundling the words at the end).

4.4 Clusterings

4.4.1 Locality sensitive hashing

Locality sensitive hashing (LSH) is a type of dimensionality reduction, suitable for finding similar items in high-dimensional spaces [Gionis et al., 1999]. It is often used in tasks such as nearest neighbour search. In contrast to conventional hashing functions, LSH is constructed in such a way that the probability of collision is large for vectors close to each other in the original metric space.

The random projection method of LSH will be used [Charikar, 2002]. If we want to reduce the dimensionality from D to M, we first generate M random D-dimensional vectors $v_1, v_2, \ldots, v_M$. These are taken from a multivariate Gaussian distribution $N(0, I_D)$, i.e. the coordinates are mutually independent with zero mean and unit variance. To hash a vector $x \in \mathbb{R}^D$, we compute the dot products $d_i := x \cdot v_i$ for $i \in \{1, 2, \ldots, M\}$. The hashed value is then $h(x) = (h_1, h_2, \ldots, h_M) \in \{\pm 1\}^M$, where $h_i = \mathrm{sgn}(d_i)$.

We can interpret this as follows: the random vector $v_i$ defines a random hyperplane in $\mathbb{R}^D$ to which it is normal. Thus $h_i = \mathrm{sgn}(x \cdot v_i)$ specifies on which side of the hyperplane x lies. Naturally, if two vectors are very close to each other, they are likely to have the same relative location with respect to the random hyperplanes, and so their hashes collide (Figure 4.2).

Figure 4.2: Example of LSH in two dimensions (D = M = 2). The vectors $x_i$ are hashed according to their dot products with the hash vectors $v_j$. The vectors $x_2$ and $x_3$ have the same hash value, and so the algorithm puts them into the same cluster.

In relation to the coarse-to-fine algorithm (Section 4.3), we would hash the state vectors in the beam, and bundle them according to their hash value. The hierarchical structure can be obtained by adding more hash vectors for a finer clustering.
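A short sketch of this random projection hashing follows; the number of hyperplanes m and the shapes of the inputs are illustrative.

import numpy as np

def lsh_signatures(states, m, seed=0):
    # draw m random Gaussian hyperplanes and hash every D-dimensional state
    # to the signs of its dot products with them
    states = np.asarray(states, dtype=float)                 # shape (n, D)
    rng = np.random.default_rng(seed)
    hyperplanes = rng.standard_normal((states.shape[1], m))  # v_1, ..., v_m as columns
    return states @ hyperplanes >= 0                         # one boolean signature per state

def bundle_by_signature(states, m, seed=0):
    # hypotheses whose signatures collide end up in the same bundle
    buckets = {}
    for idx, sig in enumerate(lsh_signatures(states, m, seed)):
        buckets.setdefault(tuple(sig), []).append(idx)
    return list(buckets.values())

# e.g. bundle_by_signature(np.random.randn(100, 1024), m=15)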

4.4.2 Online k-means clustering

The K-means clustering [Lloyd, 2006] can be used directly on the neural state vectors in a beam. This partitions the beam into K clusters (where K is given), trying to minimize the Euclidean distance between each state vector and the mean of its cluster. To get the necessary hierarchical structure, one can for example fix K = 2 and, whenever a bundle is to be explored, break it down into two smaller bundles by running the 2-means clustering again.

There are several issues with this approach. First, this clustering might be too slow, reducing the effects of coarse-to-fine decoding if we need to run it several times on the high dimensional data. Second, it optimizes the squared Euclidean distance of the state vectors, and this might not be the most suitable metric for measuring the similarity of the hypotheses.
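A minimal sketch of the repeated 2-means split described above follows (assuming a bundle holds at least two state vectors; the plain Lloyd loop stands in for the GPU implementation used later).

import numpy as np

def split_bundle_2means(vectors, iters=20, seed=0):
    # break one bundle into two by clustering its (n, D) state vectors with 2-means;
    # returns two lists of row indices
    pts = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    centers = pts[rng.choice(len(pts), size=2, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)            # nearer centre in Euclidean distance
        for j in (0, 1):
            if (assign == j).any():              # keep a centre in place if it loses all points
                centers[j] = pts[assign == j].mean(axis=0)
    first = [i for i in range(len(pts)) if assign[i] == 0]
    second = [i for i in range(len(pts)) if assign[i] == 1]
    if not first or not second:                  # degenerate case: split arbitrarily in half
        first, second = list(range(len(pts) // 2)), list(range(len(pts) // 2, len(pts)))
    return first, second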

4.5 Evaluation metrics

I will use the same metrics as described in Section 3.4. In addition, there is a new point of interest in this method: how coarse the clustering is as a function of the parameters of the method. To measure this, we introduce a new metric, the average bucket size: the average number of items clustered into the same bundle (bucket) during the decoding process.

4.6 Experiments

I used the same system and data as in Chapter 3 (Section 3.5.1).

4.6.1 Implementation

For the online K-means clustering, I used a C++ implementation using Thrust, thus performing the vector operations on the GPU in a way compatible with the AmuNMT system used. I made two amendments to the original code:

- Initialization of the clusters: The default version assigns random labels to the datapoints, and then creates centers by averaging over each class. Instead, for each cluster I chose a random datapoint to be its initial center. This reduced the initial squared error.
- Empty cluster reinitialization: More importantly, if at any point a cluster becomes empty, the default program would just keep its center as the zero vector (and thus most likely never start using it again, if the data is not near 0). As a result, specifying K for the algorithm only gives an upper bound on the number of clusters formed. In practice, this upper bound was seldom reached. The wanted behaviour was that the error tends to 0 as K approaches the size of the beam (as each vector can have its own cluster). I achieved this by reinitializing the empty clusters to the datapoints currently furthest away from their cluster centers. It also helped to overcome the early stopping problem, i.e. reaching a local optimum after only one step. However, it is a bit slower, as one needs to keep track of the datapoints far away from their centers.

4.6.2 Results

The results can be seen from Figure 4.3 onwards. To recover the BLEU score of the unmodified model, the LSH approach required 15 random hash vectors. The corresponding KL divergence is 0.04, and the error rate on the top word (and also on the next beam) is 4%.


More information

Translator

Translator Translator Marian Rejewski A few words about Marian Portable C++ code with minimal dependencies (CUDA or MKL and still Boost); Single engine for training and decoding on GPU and CPU; Custom auto-diff engine

More information

N-gram Language Modeling

N-gram Language Modeling N-gram Language Modeling Outline: Statistical Language Model (LM) Intro General N-gram models Basic (non-parametric) n-grams Class LMs Mixtures Part I: Statistical Language Model (LM) Intro What is a statistical

More information

CSC321 Lecture 10 Training RNNs

CSC321 Lecture 10 Training RNNs CSC321 Lecture 10 Training RNNs Roger Grosse and Nitish Srivastava February 23, 2015 Roger Grosse and Nitish Srivastava CSC321 Lecture 10 Training RNNs February 23, 2015 1 / 18 Overview Last time, we saw

More information

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann (Feed-Forward) Neural Networks 2016-12-06 Dr. Hajira Jabeen, Prof. Jens Lehmann Outline In the previous lectures we have learned about tensors and factorization methods. RESCAL is a bilinear model for

More information

Recurrent Neural Networks 2. CS 287 (Based on Yoav Goldberg s notes)

Recurrent Neural Networks 2. CS 287 (Based on Yoav Goldberg s notes) Recurrent Neural Networks 2 CS 287 (Based on Yoav Goldberg s notes) Review: Representation of Sequence Many tasks in NLP involve sequences w 1,..., w n Representations as matrix dense vectors X (Following

More information

Modelling Time Series with Neural Networks. Volker Tresp Summer 2017

Modelling Time Series with Neural Networks. Volker Tresp Summer 2017 Modelling Time Series with Neural Networks Volker Tresp Summer 2017 1 Modelling of Time Series The next figure shows a time series (DAX) Other interesting time-series: energy prize, energy consumption,

More information

A fast and simple algorithm for training neural probabilistic language models

A fast and simple algorithm for training neural probabilistic language models A fast and simple algorithm for training neural probabilistic language models Andriy Mnih Joint work with Yee Whye Teh Gatsby Computational Neuroscience Unit University College London 25 January 2013 1

More information

Design and Implementation of Speech Recognition Systems

Design and Implementation of Speech Recognition Systems Design and Implementation of Speech Recognition Systems Spring 2013 Class 7: Templates to HMMs 13 Feb 2013 1 Recap Thus far, we have looked at dynamic programming for string matching, And derived DTW from

More information

Lecture 1 : Data Compression and Entropy

Lecture 1 : Data Compression and Entropy CPS290: Algorithmic Foundations of Data Science January 8, 207 Lecture : Data Compression and Entropy Lecturer: Kamesh Munagala Scribe: Kamesh Munagala In this lecture, we will study a simple model for

More information

arxiv: v2 [cs.cl] 1 Jan 2019

arxiv: v2 [cs.cl] 1 Jan 2019 Variational Self-attention Model for Sentence Representation arxiv:1812.11559v2 [cs.cl] 1 Jan 2019 Qiang Zhang 1, Shangsong Liang 2, Emine Yilmaz 1 1 University College London, London, United Kingdom 2

More information

N-gram Language Modeling Tutorial

N-gram Language Modeling Tutorial N-gram Language Modeling Tutorial Dustin Hillard and Sarah Petersen Lecture notes courtesy of Prof. Mari Ostendorf Outline: Statistical Language Model (LM) Basics n-gram models Class LMs Cache LMs Mixtures

More information

18.6 Regression and Classification with Linear Models

18.6 Regression and Classification with Linear Models 18.6 Regression and Classification with Linear Models 352 The hypothesis space of linear functions of continuous-valued inputs has been used for hundreds of years A univariate linear function (a straight

More information

Multi-Source Neural Translation

Multi-Source Neural Translation Multi-Source Neural Translation Barret Zoph and Kevin Knight Information Sciences Institute Department of Computer Science University of Southern California {zoph,knight}@isi.edu In the neural encoder-decoder

More information

Sequence Modeling with Neural Networks

Sequence Modeling with Neural Networks Sequence Modeling with Neural Networks Harini Suresh y 0 y 1 y 2 s 0 s 1 s 2... x 0 x 1 x 2 hat is a sequence? This morning I took the dog for a walk. sentence medical signals speech waveform Successes

More information

Foundations of Natural Language Processing Lecture 5 More smoothing and the Noisy Channel Model

Foundations of Natural Language Processing Lecture 5 More smoothing and the Noisy Channel Model Foundations of Natural Language Processing Lecture 5 More smoothing and the Noisy Channel Model Alex Lascarides (Slides based on those from Alex Lascarides, Sharon Goldwater and Philipop Koehn) 30 January

More information

TYPES OF MODEL COMPRESSION. Soham Saha, MS by Research in CSE, CVIT, IIIT Hyderabad

TYPES OF MODEL COMPRESSION. Soham Saha, MS by Research in CSE, CVIT, IIIT Hyderabad TYPES OF MODEL COMPRESSION Soham Saha, MS by Research in CSE, CVIT, IIIT Hyderabad 1. Pruning 2. Quantization 3. Architectural Modifications PRUNING WHY PRUNING? Deep Neural Networks have redundant parameters.

More information

Massachusetts Institute of Technology

Massachusetts Institute of Technology Massachusetts Institute of Technology 6.867 Machine Learning, Fall 2006 Problem Set 5 Due Date: Thursday, Nov 30, 12:00 noon You may submit your solutions in class or in the box. 1. Wilhelm and Klaus are

More information

Recurrent Neural Networks. Jian Tang

Recurrent Neural Networks. Jian Tang Recurrent Neural Networks Jian Tang tangjianpku@gmail.com 1 RNN: Recurrent neural networks Neural networks for sequence modeling Summarize a sequence with fix-sized vector through recursively updating

More information

Neural Networks Language Models

Neural Networks Language Models Neural Networks Language Models Philipp Koehn 10 October 2017 N-Gram Backoff Language Model 1 Previously, we approximated... by applying the chain rule p(w ) = p(w 1, w 2,..., w n ) p(w ) = i p(w i w 1,...,

More information

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY 1 On-line Resources http://neuralnetworksanddeeplearning.com/index.html Online book by Michael Nielsen http://matlabtricks.com/post-5/3x3-convolution-kernelswith-online-demo

More information

13 : Variational Inference: Loopy Belief Propagation and Mean Field

13 : Variational Inference: Loopy Belief Propagation and Mean Field 10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction

More information

COMPARING FIXED AND ADAPTIVE COMPUTATION TIME FOR RE-

COMPARING FIXED AND ADAPTIVE COMPUTATION TIME FOR RE- Workshop track - ICLR COMPARING FIXED AND ADAPTIVE COMPUTATION TIME FOR RE- CURRENT NEURAL NETWORKS Daniel Fojo, Víctor Campos, Xavier Giró-i-Nieto Universitat Politècnica de Catalunya, Barcelona Supercomputing

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

Ways to make neural networks generalize better

Ways to make neural networks generalize better Ways to make neural networks generalize better Seminar in Deep Learning University of Tartu 04 / 10 / 2014 Pihel Saatmann Topics Overview of ways to improve generalization Limiting the size of the weights

More information

Motivation. Dictionaries. Direct Addressing. CSE 680 Prof. Roger Crawfis

Motivation. Dictionaries. Direct Addressing. CSE 680 Prof. Roger Crawfis Motivation Introduction to Algorithms Hash Tables CSE 680 Prof. Roger Crawfis Arrays provide an indirect way to access a set. Many times we need an association between two sets, or a set of keys and associated

More information

Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 18 Maximum Entropy Models I Welcome back for the 3rd module

More information

Expectation Maximization

Expectation Maximization Expectation Maximization Bishop PRML Ch. 9 Alireza Ghane c Ghane/Mori 4 6 8 4 6 8 4 6 8 4 6 8 5 5 5 5 5 5 4 6 8 4 4 6 8 4 5 5 5 5 5 5 µ, Σ) α f Learningscale is slightly Parameters is slightly larger larger

More information

The Geometry of Statistical Machine Translation

The Geometry of Statistical Machine Translation The Geometry of Statistical Machine Translation Presented by Rory Waite 16th of December 2015 ntroduction Linear Models Convex Geometry The Minkowski Sum Projected MERT Conclusions ntroduction We provide

More information

Multi-Source Neural Translation

Multi-Source Neural Translation Multi-Source Neural Translation Barret Zoph and Kevin Knight Information Sciences Institute Department of Computer Science University of Southern California {zoph,knight}@isi.edu Abstract We build a multi-source

More information

Lecture 6: Neural Networks for Representing Word Meaning

Lecture 6: Neural Networks for Representing Word Meaning Lecture 6: Neural Networks for Representing Word Meaning Mirella Lapata School of Informatics University of Edinburgh mlap@inf.ed.ac.uk February 7, 2017 1 / 28 Logistic Regression Input is a feature vector,

More information

Part A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 )

Part A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 ) Part A 1. A Markov chain is a discrete-time stochastic process, defined by a set of states, a set of transition probabilities (between states), and a set of initial state probabilities; the process proceeds

More information

Artificial Neural Networks. Introduction to Computational Neuroscience Tambet Matiisen

Artificial Neural Networks. Introduction to Computational Neuroscience Tambet Matiisen Artificial Neural Networks Introduction to Computational Neuroscience Tambet Matiisen 2.04.2018 Artificial neural network NB! Inspired by biology, not based on biology! Applications Automatic speech recognition

More information

CS230: Lecture 10 Sequence models II

CS230: Lecture 10 Sequence models II CS23: Lecture 1 Sequence models II Today s outline We will learn how to: - Automatically score an NLP model I. BLEU score - Improve Machine II. Beam Search Translation results with Beam search III. Speech

More information

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from http://www.mmds.org Distance Measures For finding similar documents, we consider the Jaccard

More information

Vectors and their uses

Vectors and their uses Vectors and their uses Sharon Goldwater Institute for Language, Cognition and Computation School of Informatics, University of Edinburgh DRAFT Version 0.95: 3 Sep 2015. Do not redistribute without permission.

More information

arxiv: v3 [cs.lg] 14 Jan 2018

arxiv: v3 [cs.lg] 14 Jan 2018 A Gentle Tutorial of Recurrent Neural Network with Error Backpropagation Gang Chen Department of Computer Science and Engineering, SUNY at Buffalo arxiv:1610.02583v3 [cs.lg] 14 Jan 2018 1 abstract We describe

More information

Long-Short Term Memory and Other Gated RNNs

Long-Short Term Memory and Other Gated RNNs Long-Short Term Memory and Other Gated RNNs Sargur Srihari srihari@buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Sequence Modeling

More information

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4 Neural Networks Learning the network: Backprop 11-785, Fall 2018 Lecture 4 1 Recap: The MLP can represent any function The MLP can be constructed to represent anything But how do we construct it? 2 Recap:

More information

Neural Architectures for Image, Language, and Speech Processing

Neural Architectures for Image, Language, and Speech Processing Neural Architectures for Image, Language, and Speech Processing Karl Stratos June 26, 2018 1 / 31 Overview Feedforward Networks Need for Specialized Architectures Convolutional Neural Networks (CNNs) Recurrent

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Language Model Rest Costs and Space-Efficient Storage

Language Model Rest Costs and Space-Efficient Storage Language Model Rest Costs and Space-Efficient Storage Kenneth Heafield Philipp Koehn Alon Lavie Carnegie Mellon, University of Edinburgh July 14, 2012 Complaint About Language Models Make Search Expensive

More information

Online Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions?

Online Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions? Online Videos FERPA Sign waiver or sit on the sides or in the back Off camera question time before and after lecture Questions? Lecture 1, Slide 1 CS224d Deep NLP Lecture 4: Word Window Classification

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Info 159/259 Lecture 7: Language models 2 (Sept 14, 2017) David Bamman, UC Berkeley Language Model Vocabulary V is a finite set of discrete symbols (e.g., words, characters);

More information

Automatic Differentiation and Neural Networks

Automatic Differentiation and Neural Networks Statistical Machine Learning Notes 7 Automatic Differentiation and Neural Networks Instructor: Justin Domke 1 Introduction The name neural network is sometimes used to refer to many things (e.g. Hopfield

More information

Feature selection. Micha Elsner. January 29, 2014

Feature selection. Micha Elsner. January 29, 2014 Feature selection Micha Elsner January 29, 2014 2 Using megam as max-ent learner Hal Daume III from UMD wrote a max-ent learner Pretty typical of many classifiers out there... Step one: create a text file

More information

Neural Hidden Markov Model for Machine Translation

Neural Hidden Markov Model for Machine Translation Neural Hidden Markov Model for Machine Translation Weiyue Wang, Derui Zhu, Tamer Alkhouli, Zixuan Gan and Hermann Ney {surname}@i6.informatik.rwth-aachen.de July 17th, 2018 Human Language Technology and

More information

6 Distances. 6.1 Metrics. 6.2 Distances L p Distances

6 Distances. 6.1 Metrics. 6.2 Distances L p Distances 6 Distances We have mainly been focusing on similarities so far, since it is easiest to explain locality sensitive hashing that way, and in particular the Jaccard similarity is easy to define in regards

More information

Caesar s Taxi Prediction Services

Caesar s Taxi Prediction Services 1 Caesar s Taxi Prediction Services Predicting NYC Taxi Fares, Trip Distance, and Activity Paul Jolly, Boxiao Pan, Varun Nambiar Abstract In this paper, we propose three models each predicting either taxi

More information

A Tutorial On Backward Propagation Through Time (BPTT) In The Gated Recurrent Unit (GRU) RNN

A Tutorial On Backward Propagation Through Time (BPTT) In The Gated Recurrent Unit (GRU) RNN A Tutorial On Backward Propagation Through Time (BPTT In The Gated Recurrent Unit (GRU RNN Minchen Li Department of Computer Science The University of British Columbia minchenl@cs.ubc.ca Abstract In this

More information

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing SIMILARITY AND DISTANCE Thanks to: Tan, Steinbach, and Kumar, Introduction to Data Mining Rajaraman and Ullman, Mining

More information

Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions

Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions 2018 IEEE International Workshop on Machine Learning for Signal Processing (MLSP 18) Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions Authors: S. Scardapane, S. Van Vaerenbergh,

More information

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?

More information

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD WHAT IS A NEURAL NETWORK? The simplest definition of a neural network, more properly referred to as an 'artificial' neural network (ANN), is provided

More information

Generating Sequences with Recurrent Neural Networks

Generating Sequences with Recurrent Neural Networks Generating Sequences with Recurrent Neural Networks Alex Graves University of Toronto & Google DeepMind Presented by Zhe Gan, Duke University May 15, 2015 1 / 23 Outline Deep recurrent neural network based

More information

text classification 3: neural networks

text classification 3: neural networks text classification 3: neural networks CS 585, Fall 2018 Introduction to Natural Language Processing http://people.cs.umass.edu/~miyyer/cs585/ Mohit Iyyer College of Information and Computer Sciences University

More information

Naïve Bayes, Maxent and Neural Models

Naïve Bayes, Maxent and Neural Models Naïve Bayes, Maxent and Neural Models CMSC 473/673 UMBC Some slides adapted from 3SLP Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words

More information

Introduction to RNNs!

Introduction to RNNs! Introduction to RNNs Arun Mallya Best viewed with Computer Modern fonts installed Outline Why Recurrent Neural Networks (RNNs)? The Vanilla RNN unit The RNN forward pass Backpropagation refresher The RNN

More information

Neural Turing Machine. Author: Alex Graves, Greg Wayne, Ivo Danihelka Presented By: Tinghui Wang (Steve)

Neural Turing Machine. Author: Alex Graves, Greg Wayne, Ivo Danihelka Presented By: Tinghui Wang (Steve) Neural Turing Machine Author: Alex Graves, Greg Wayne, Ivo Danihelka Presented By: Tinghui Wang (Steve) Introduction Neural Turning Machine: Couple a Neural Network with external memory resources The combined

More information

Asymptotic Notation. such that t(n) cf(n) for all n n 0. for some positive real constant c and integer threshold n 0

Asymptotic Notation. such that t(n) cf(n) for all n n 0. for some positive real constant c and integer threshold n 0 Asymptotic Notation Asymptotic notation deals with the behaviour of a function in the limit, that is, for sufficiently large values of its parameter. Often, when analysing the run time of an algorithm,

More information

Neural Network Language Modeling

Neural Network Language Modeling Neural Network Language Modeling Instructor: Wei Xu Ohio State University CSE 5525 Many slides from Marek Rei, Philipp Koehn and Noah Smith Course Project Sign up your course project In-class presentation

More information

High Dimensional Search Min- Hashing Locality Sensi6ve Hashing

High Dimensional Search Min- Hashing Locality Sensi6ve Hashing High Dimensional Search Min- Hashing Locality Sensi6ve Hashing Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata September 8 and 11, 2014 High Support Rules vs Correla6on of

More information

Composite Quantization for Approximate Nearest Neighbor Search

Composite Quantization for Approximate Nearest Neighbor Search Composite Quantization for Approximate Nearest Neighbor Search Jingdong Wang Lead Researcher Microsoft Research http://research.microsoft.com/~jingdw ICML 104, joint work with my interns Ting Zhang from

More information

Information Theory. Week 4 Compressing streams. Iain Murray,

Information Theory. Week 4 Compressing streams. Iain Murray, Information Theory http://www.inf.ed.ac.uk/teaching/courses/it/ Week 4 Compressing streams Iain Murray, 2014 School of Informatics, University of Edinburgh Jensen s inequality For convex functions: E[f(x)]

More information

1 What a Neural Network Computes

1 What a Neural Network Computes Neural Networks 1 What a Neural Network Computes To begin with, we will discuss fully connected feed-forward neural networks, also known as multilayer perceptrons. A feedforward neural network consists

More information

New Attacks on the Concatenation and XOR Hash Combiners

New Attacks on the Concatenation and XOR Hash Combiners New Attacks on the Concatenation and XOR Hash Combiners Itai Dinur Department of Computer Science, Ben-Gurion University, Israel Abstract. We study the security of the concatenation combiner H 1(M) H 2(M)

More information

Lecture 13: More uses of Language Models

Lecture 13: More uses of Language Models Lecture 13: More uses of Language Models William Webber (william@williamwebber.com) COMP90042, 2014, Semester 1, Lecture 13 What we ll learn in this lecture Comparing documents, corpora using LM approaches

More information

Stephen Scott.

Stephen Scott. 1 / 35 (Adapted from Vinod Variyam and Ian Goodfellow) sscott@cse.unl.edu 2 / 35 All our architectures so far work on fixed-sized inputs neural networks work on sequences of inputs E.g., text, biological

More information

Integer weight training by differential evolution algorithms

Integer weight training by differential evolution algorithms Integer weight training by differential evolution algorithms V.P. Plagianakos, D.G. Sotiropoulos, and M.N. Vrahatis University of Patras, Department of Mathematics, GR-265 00, Patras, Greece. e-mail: vpp

More information

Task-Oriented Dialogue System (Young, 2000)

Task-Oriented Dialogue System (Young, 2000) 2 Review Task-Oriented Dialogue System (Young, 2000) 3 http://rsta.royalsocietypublishing.org/content/358/1769/1389.short Speech Signal Speech Recognition Hypothesis are there any action movies to see

More information