Unüberwachtes Lernen mit Anwendungen bei der Verarbeitung natürlicher Sprache. Unsupervised Training with Applications in Natural Language Processing


Master's thesis in Computer Science, submitted to the Faculty of Mathematics, Computer Science and Natural Sciences of RWTH Aachen University, Chair of Computer Science 6 (Lehrstuhl für Informatik 6), Prof. Dr.-Ing. H. Ney

Unüberwachtes Lernen mit Anwendungen bei der Verarbeitung natürlicher Sprache
Unsupervised Training with Applications in Natural Language Processing

Submitted by Julian Schamper from Mechernich
Matriculation number:

Reviewers: Prof. Dr.-Ing. Hermann Ney, Prof. Dr. rer. nat. Thomas Seidl
Advisor: Dipl.-Phys., Dipl.-Inform. Malte Nuhn

Aachen, 30 September 2015


Master's thesis in Computer Science

Unüberwachtes Lernen mit Anwendungen bei der Verarbeitung natürlicher Sprache
Unsupervised Training with Applications in Natural Language Processing

Julian Schamper
September 30, 2015


Statutory Declaration

I, Julian Schamper, hereby declare in lieu of oath that I have written this Master's thesis independently and have used no sources or aids other than those stated.

Aachen,

Julian Schamper


Abstract

For most applications in natural language processing, the amount of available unlabeled data is tremendously higher than the amount of human-annotated data. The performance of current state-of-the-art systems highly depends on the amount of available training data. Therefore the training of statistical models without human-annotated data, called unsupervised training, is attractive. This thesis studies certain aspects of unsupervised training which are not covered in detail by previous work. The studied problems are spelling correction and the solving of noisy substitution ciphers, both performed on character level. These problems have a well-controllable parameter set and allow for a large number of systematic experiments. All experiments are done on the same data set. The noisy data follows a λ-model, which distributes noise uniformly over all characters.

The first set of experiments investigates the relation between the noise ratio, the language model quality and the resulting error rate of the spelling correction algorithm. Additionally, a typical approximation is studied. These first experiments are conducted in a supervised way in order to have results for later comparison. The next set of experiments shows that the noise parameter of the λ-model can be learned very precisely by unsupervised training, using only a noisy text and a language model. The corresponding error rates do not differ significantly from the error rates obtained in the supervised setup. Besides the successfully used maximum likelihood training, we investigate another training criterion and show that it is impractical for unsupervised training of the noise parameter.

In the final set of experiments we solve noisy substitution ciphers using the expectation maximization (EM) algorithm. This involves learning a full probabilistic substitution table in an unsupervised way. We evaluate the resulting error rate, convergence speed, likelihood, and the difference to the correct full table. We show that the available training data size, the language model quality and the noise ratio have a high impact on the training performance. Here we observe cases where the results depend on the initialization of the EM algorithm. Nevertheless, we obtain error rates close to the corresponding supervised error rates if the training data size is sufficiently large. In the very last experiments, the noisy data follows a structure different from the λ-model. In that case the problem appears to be simpler: the error rates are lower and the convergence is faster.


Contents

1 Introduction
  1.1 Related Work
  1.2 Outline
  1.3 Notation
2 Basic Theory of Probabilistic Spelling Correction
  2.1 Model
  2.2 Decision Rules
    2.2.1 Minimum Symbol Error Criterion (SUM-CRITERION)
    2.2.2 Minimum String Error Criterion (MAX-CRITERION)
3 Data Setup
  3.1 Corpus Choice
  3.2 Language Models
  3.3 Adding Noise
4 Supervised Spelling Correction with λ-model
  4.1 Research Questions
  4.2 Experiments
    4.2.1 General Observations
    4.2.2 Details on the Functional Dependency between Perplexity and Error Rate
    4.2.3 Details on the Functional Dependency between λ and Error Rate
    4.2.4 Examples from Experiments
5 Unsupervised Training of λ-model
  5.1 Research Questions
  5.2 Learning of λ via Maximum Likelihood Criterion
    5.2.1 Learning λ via EM Algorithm
  5.3 Experiments
    5.3.1 Scanning λ
    5.3.2 Learning λ via EM Algorithm
    5.3.3 Error Rates for all LMs
6 Expected Accuracy Criterion
  6.1 Research Question
7 Unsupervised Training of Full Table Model
  7.1 Research Questions
  7.2 New Aspects in Comparison to λ-model
  7.3 EM Algorithm for Full Table {p(x|c)}
  7.4 Evaluation Methods
  7.5 Training Data Partitioning
  7.6 Experiments
    7.6.1 Effects of Training Data Size, Language Model, λ and Initialization
    7.6.2 Error Rates for all LMs
8 Keyboard and Rival Model
  8.1 Research Question
  8.2 Keyboard Model
  8.3 Rival Model
  8.4 Experiments
9 Conclusion
A Basic Theory of Probabilistic Spelling Correction
  A.1 Derivation of Forward-Backward Algorithm
    A.1.1 Forward Recursion
    A.1.2 Backward Recursion
  A.2 Sequence Joint Probability Recursion
B Supervised Spelling Correction with λ-model
  B.1 Error Rate Tables
  B.2 Log-Log Plot for Word Perplexity vs. Error Rate
C Unsupervised Training of λ-model
  C.1 Derivation of EM Algorithm for λ
  C.2 Spelling Correction for Higher Order LMs, λ Trained by 4-gram LM
D Unsupervised Training of Full Table Model
  D.1 Derivation of EM Algorithm for Full Table
  D.2 Spelling Correction for Higher Order LMs, Full Table Learned with EM Algorithm Using a 4-gram LM
List of Figures
List of Tables
Bibliography


Chapter 1

Introduction

In natural language processing, many state-of-the-art systems require human-annotated data to train all or at least some of their statistical models. This kind of training is often called supervised training. For example, many statistical machine translation (SMT) systems use the word alignments obtained by the GIZA++ toolkit [Och & Ney 03]. These alignments are trained on a collection of sentence pairs in a source and a target language (a parallel corpus). Unfortunately, parallel data is a limited resource, and the performance of current systems depends on the amount available for training. On the other hand, so-called monolingual data exists in far larger quantities, since it is produced naturally on a daily basis; one can think of books, news articles, blog articles and other parts of the Internet. Modern translation systems already make use of monolingual data. For example, language models are trained on large monolingual data sets of the target language and then used inside the translation pipeline. However, other important models of the state-of-the-art systems still require parallel data.

In contrast to that, one can think of totally unsupervised approaches, which investigate the structure of a monolingual source and a monolingual target corpus at once during training time. The hope is that the structures within both languages give enough constraints to find a good translation model, which explains the existence of both corpora at once. Such an approach is often called machine translation decipherment or deciphering foreign language [Ravi & Knight 11]. It is an open research question whether such a monolingual approach can reach or outperform the current state-of-the-art systems.

This thesis takes one step back from machine translation decipherment and covers unsupervised spelling correction and the solving of noisy substitution ciphers instead. By studying research questions on these simpler problems in detail, we hope to provide a solid foundation for machine translation decipherment. The following section gives an overview of existing work regarding machine translation decipherment, the solving of substitution ciphers and spelling correction.

1.1 Related Work

[Ravi & Knight 11] train a simple word-based machine translation model in a totally unsupervised way. Due to the typically increased complexity of unsupervised training, the vocabulary size was limited to approximately five hundred words. [Nuhn & Mauser + 12] continue this approach and are able to tackle a more complex task with approximately five thousand words.

[Peleg & Rosenfeld 79] solve probabilistic substitution ciphers by learning a probabilistic substitution table via an iterative algorithm. Their algorithm is initialized by an interesting method which we will also use in our work. [Lee 02] follows this work with a more accurate model, which allows the use of the well-known EM algorithm for hidden Markov models. The performance is better, especially for probabilistic substitution ciphers with noise. [Knight & Nair + 06] follow the same approach, but they do not analyze noisy substitution ciphers. We will also follow the approach of using the EM algorithm to learn a probabilistic substitution table.

In contrast to probabilistic substitution ciphers, deterministic substitution ciphers follow a strict mapping from the ciphertext to the plaintext. These models typically do not allow noise. Therefore the model is more constrained and lower error rates are achievable. [Ravi & Knight 08] solve deterministic substitution ciphers with an integer linear programming (ILP) solver. This approach allows for an optimal solution with respect to a score which is based on a plaintext language model (LM). The algorithm from [Nuhn & Schamper + 14] solves similar and harder problems while needing less time by using an advanced heuristic.

Regarding spelling correction, we revisited the following papers. In [Kernighan & Church + 90] the spelling correction algorithm is fed a list of misspelled words, which is given by an external spell checking program. The algorithm learns a confusion model between words by means of a unigram word distribution. In comparison to that, [Mays & Damerau + 91] make use of word context by using a trigram language model. [Tong & Evans 96] perform spelling correction on the output of an optical character recognition (OCR) system. Their approach can be seen as an instance of unsupervised training: it uses a word bigram model and learns character confusion tables iteratively by treating the decoding output as ground truth for re-estimation. [Huang & Learned-Miller + 06] go one step further. They treat optical character recognition as a decipherment problem. There the image input is clustered into letter-like fragments, and the repetition scheme of the clusters is compared to the character repetition scheme within the words of a dictionary. These approaches have promising results, but they contain many crude approximations or no exact mathematical derivation to motivate the formulas. In contrast to that, a detailed analysis of unsupervised training of exactly specified models, whose parameters are well controllable, is the purpose of this thesis.

1.2 Outline

The remainder of this thesis is structured as follows. In Section 1.3 we introduce our notation and give a spelling correction example for illustration. Chapter 2 provides the mathematical basis which will be used by several subsequent chapters. We use the same data set for all experiments; Chapter 3 gives background information on this data.

In Chapter 4, we conduct supervised spelling correction experiments for later reference. Simultaneously, we compare the performance of two different criteria for spelling correction. In Chapter 5 we perform unsupervised spelling correction experiments. There we use the maximum likelihood criterion for training the noise parameter of our spelling correction model. We analyze an alternative training criterion in Chapter 6. In Chapter 7 we solve noisy substitution ciphers. Due to the data choice, the results are directly comparable to the spelling correction experiments. We also solve noisy substitution ciphers in Chapter 8, but there we vary the model structure for noise generation and analyze the effect of the different structures on the learning performance. In Chapter 9 we summarize the results by answering the research questions which motivated the experiments of the previous chapters.

1.3 Notation

We introduce a set C of class labels and a set X of observations. Then we define sequences of length N built from elements of these sets. If we apply this notation to spelling correction, the class label sequence corresponds to the correct sequence. Consider a typist who converts the correct sequence deterministically into another sequence. During this process some errors may occur. We call the sequence generated by the typist the noisy sequence.

  class label sequence / correct sequence:    c_1^N = c_1 ... c_n ... c_N,   c_n ∈ C
  observation sequence / noisy sequence:      x_1^N = x_1 ... x_n ... x_N,   x_n ∈ X

The task we define is to recover the original sequence containing the class labels, calling it the candidate class label sequence or the corrected sequence.

  candidate class label sequence / corrected sequence:   ĉ_1^N = ĉ_1 ... ĉ_n ... ĉ_N,   ĉ_n ∈ C

Figure 1.1 shows an example for spelling correction. It covers the case that the class label vocabulary and the observation vocabulary are the same and consist of the lower case English characters and an additional space symbol ␣:

  C = X = {a, b, c, ..., x, y, z, ␣}   (1.1)

While this is also the case for the data which is used within all experiments of this thesis, in many places the algorithms are not aware of the fact that, for example, the mapping of a in C to a in X means no error.

Correct sequence c_1^N:
  some even question the accuracy of figures showing a dramatic easing in price rises saying they are not only based on flawed data but that the government may be massaging the results

Noisy sequence x_1^N: 120 symbols correct (65.9%), 62 wrong (34.1%)

Corrected sequence ĉ_1^N:
  some even question the scourges of figures showing a dramatic easing in price rises saying they are not only based on flawed data but that the government may be resolving the results
  correct, was correct: 119 (65.4%); correct, was wrong: 53 (29.1%); wrong, was wrong (unchanged): 2 (1.1%); wrong, was wrong (changed): 7 (3.8%); wrong, was correct: 1 (0.5%)

Figure 1.1. Spelling correction example. Different possibilities which can occur during a correction process are marked by five different combinations of background and text color. A red background indicates in all sequences that at this position the noisy sequence contains an error. In all sequences green text color indicates a correct symbol. Red text color indicates a wrong symbol in the noisy sequence; if a wrong symbol is preserved in the corrected sequence, it also has red text color there. The following is only valid for the coloring of the corrected sequence: if a wrong symbol is replaced by another wrong symbol, or a correct symbol is replaced by a wrong symbol, this symbol is marked by orange text color.

Chapter 2

Basic Theory of Probabilistic Spelling Correction

2.1 Model

In this chapter we give the mathematical definitions of the spelling correction model, which describe the relations between the correct sequence c_1^N, the noisy sequence x_1^N and the corrected sequence ĉ_1^N. First we require

  |C| = |X|,   (2.1)

which allows us to introduce the following bijective function:

  x̂ : C → X,   c ↦ x̂_c   (2.2)

For some formalizations the inverse of x̂_c is more convenient; we denote the inverse as ĉ_x. The function x̂_c is the deterministic mapping the typist has in mind during the generation of the noisy sequence. Sometimes the typist makes an error and does not follow the mapping. We formalize this behavior by a so-called λ-model, where λ is the probability that the typist makes no error.

λ-model: p_λ(x|c). In the case that the typist makes an error, he uniformly chooses any of the other noisy symbols from the vocabulary X.

  p_λ(x|c) = λ                  if x = x̂_c
           = (1 − λ) / (|X| − 1)  else   (2.3)

Observation model: p_λ(x_1^N | c_1^N). We make the assumption that the typist makes an error independently of any previous context. With that, the probability of transforming a whole correct sequence c_1^N into a noisy sequence x_1^N follows:

  p_λ(x_1^N | c_1^N) = ∏_{n=1}^{N} p_λ(x_n | c_n)   (2.4)
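As a concrete illustration of Equations (2.3) and (2.4), the following Python sketch evaluates the λ-model and the resulting observation log-probability for character sequences. It is a minimal sketch, not the implementation used in this thesis; the vocabulary and the identity mapping x̂_c are assumptions matching the setup described above.

    import math

    # Vocabulary: lower case English letters plus a space symbol, matching
    # Equation (1.1); the typist mapping x_hat is assumed to be the identity.
    VOCAB = list("abcdefghijklmnopqrstuvwxyz ")

    def p_lambda(x, c, lam):
        """Lambda-model of Equation (2.3): probability of observing x given class c."""
        if x == c:                      # x equals the intended symbol x_hat_c
            return lam
        return (1.0 - lam) / (len(VOCAB) - 1)

    def observation_log_prob(noisy, correct, lam):
        """Log of Equation (2.4): position-wise independent noise model."""
        assert len(noisy) == len(correct)
        return sum(math.log(p_lambda(x, c, lam)) for x, c in zip(noisy, correct))

    if __name__ == "__main__":
        print(observation_log_prob("sume", "some", lam=0.9))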

Joint model: p_λ(c_1^N, x_1^N). We will later show that for obtaining the corrected sequence we need the probability the other way around, or the joint probability of c_1^N and x_1^N. We obtain it by multiplying the observation model with a language model p(c_1^N). As language model we use an m-gram language model p(c_1^N) = ∏_{n=1}^{N} p(c_n | c_{n−m+1}^{n−1}).

  p_λ(c_1^N, x_1^N) = p(c_1^N) · p_λ(x_1^N | c_1^N)   (2.5)
                    = ∏_{n=1}^{N} p(c_n | c_{n−m+1}^{n−1}) · p_λ(x_n | c_n)   (2.6)

2.2 Decision Rules

After the basic mathematical models are defined, we show how to search for the optimal corrected sequence if a noisy sequence is given. Optimality is with respect to the Bayes decision rule and the loss function we define for the respective criterion. For details on loss functions and the Bayes decision rule consider [Ney et al. 05] or [Duda & Hart + 00].

2.2.1 Minimum Symbol Error Criterion (SUM-CRITERION)

At places where space is restricted we will abbreviate the minimum symbol error criterion as SUM or SUM-CRITERION. For a correct sequence c_1^N and a corrected sequence candidate c̃_1^N this criterion is defined by the following loss function:

  L[c_1^N, c̃_1^N] = (1/N) ∑_{n=1}^{N} [1 − δ(c_n, c̃_n)]   (2.7)

At every position n where c̃_n is wrong, a penalty of 1 is added. The whole sum is divided by N, so the loss function equals the symbol error rate. To minimize the Bayes risk, one has to select at every position the class symbol which is maximal according to the posterior over the classes at position n, which we denote as p_{n,λ}(c_n | x_1^N). Since p(x_1^N) is constant with respect to the optimization over the classes, one can also use the joint probability instead of the posterior probability.

  x_1^N → ĉ_1^N(x_1^N) = [ argmax_{c_n} { p_{n,λ}(c_n | x_1^N) } ]_{n=1}^{N}   (2.8)
                       = [ argmax_{c_n} { p_{n,λ}(c_n, x_1^N) / p(x_1^N) } ]_{n=1}^{N}   (2.9)
                       = [ argmax_{c_n} { p_{n,λ}(c_n, x_1^N) } ]_{n=1}^{N}   (2.10)
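Before turning to the efficient computation of the position-dependent probabilities, the following sketch spells out Equation (2.6) for a 2-gram character LM. The toy add-one-smoothed bigram trainer is only a stand-in for the SRILM-trained models used later, and the names and data layout are illustrative assumptions.

    import math
    from collections import defaultdict

    def train_bigram_lm(text, vocab="abcdefghijklmnopqrstuvwxyz "):
        """Toy character bigram LM with add-one smoothing (stand-in for the
        thesis' SRILM-trained m-gram models)."""
        counts = defaultdict(lambda: defaultdict(int))
        for prev, cur in zip(text, text[1:]):
            counts[prev][cur] += 1
        lm = {}
        for prev in vocab:
            total = sum(counts[prev].values()) + len(vocab)
            lm[prev] = {c: (counts[prev][c] + 1) / total for c in vocab}
        return lm

    def joint_log_prob(correct, noisy, lm, lam, vocab_size=27):
        """Log of Equation (2.6): m-gram LM (here m = 2) times the lambda-model.
        Position 0 is skipped for brevity (no bigram history)."""
        logp = 0.0
        for n in range(1, len(correct)):
            c_prev, c, x = correct[n - 1], correct[n], noisy[n]
            p_noise = lam if x == c else (1.0 - lam) / (vocab_size - 1)
            logp += math.log(lm[c_prev][c]) + math.log(p_noise)
        return logp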

Position-dependent class probability: p_{n,λ}(c, x_1^N). The position-dependent class probability can be calculated from the probability for sequences by summing over all possible class sequences, with just the constraint that at position n a specific c is fixed.

  p_{n,λ}(c, x_1^N) = ∑_{c_1^N : c_n = c} p_λ(c_1^N, x_1^N)   (2.11)

At first sight, the calculation of this sum seems to be computationally hard: naively one would sum over |C|^{N−1} different class sequences. A forward-backward algorithm [Baum & Petrie + 70], which is an instance of dynamic programming, can calculate these probabilities efficiently. For a 2-gram LM this algorithm uses two auxiliary tables Q_n(c) and Q̃_n(c), which contain probability entries for each position n ∈ [1, N] and each class c ∈ C. The tables are calculated by the forward recursion

  Q_n(c) = p_λ(x_n | c) · ∑_{c'} p(c | c') · Q_{n−1}(c')   (2.12)

and the backward recursion

  Q̃_n(c) = ∑_{c̃} p_λ(x_{n+1} | c̃) · p(c̃ | c) · Q̃_{n+1}(c̃).   (2.13)

Note that, for the forward recursion, the sum is carried out over all predecessors c', while for the backward recursion, the sum is carried out over all successors c̃. The position-dependent class probability is obtained by multiplying the corresponding table entries.

  p_{n,λ}(c, x_1^N) = Q_n(c) · Q̃_n(c)   (2.14)

A derivation of this forward-backward algorithm is given in Appendix A.1. The recursions can easily be extended to LMs of higher orders. Without proof, we give the result for a 3-gram LM:

  Q_n(c', c) = p_λ(x_n | c) · ∑_{c''} p(c | c'', c') · Q_{n−1}(c'', c')   (2.15)
  Q̃_n(c', c) = ∑_{c̃} p_λ(x_{n+1} | c̃) · p(c̃ | c', c) · Q̃_{n+1}(c, c̃)   (2.16)
  p_{n,λ}(c, x_1^N) = ∑_{c'} Q_n(c', c) · Q̃_n(c', c)   (2.17)

The main difference is that in this case we have to keep track of a history consisting of two classes c' and c.
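A minimal sketch of the forward-backward computation for a 2-gram LM is given below, assuming the LM is the nested dict lm[prev][cur] from the toy trainer above. A uniform start distribution replaces a proper sentence-boundary model, no pruning is applied, and plain probabilities are used; a real implementation would work in log-space or rescale the tables to avoid underflow.

    def forward_backward_posteriors(noisy, lm, lam, vocab="abcdefghijklmnopqrstuvwxyz "):
        """Equations (2.12)-(2.14) for a 2-gram LM: per-position joint
        probabilities p_{n,lambda}(c, x_1^N) and the SUM-CRITERION decision."""
        V = len(vocab)
        def p_noise(x, c):
            return lam if x == c else (1.0 - lam) / (V - 1)

        N = len(noisy)
        # Forward pass: Q[n][c], Equation (2.12)
        Q = [dict() for _ in range(N)]
        for c in vocab:                      # uniform start in place of p(c | $)
            Q[0][c] = p_noise(noisy[0], c) / V
        for n in range(1, N):
            for c in vocab:
                Q[n][c] = p_noise(noisy[n], c) * sum(lm[cp][c] * Q[n - 1][cp] for cp in vocab)
        # Backward pass: Qt[n][c], Equation (2.13)
        Qt = [dict() for _ in range(N)]
        for c in vocab:
            Qt[N - 1][c] = 1.0
        for n in range(N - 2, -1, -1):
            for c in vocab:
                Qt[n][c] = sum(p_noise(noisy[n + 1], ct) * lm[c][ct] * Qt[n + 1][ct] for ct in vocab)
        # Equation (2.14) and the per-position argmax of Equation (2.10)
        posteriors = [{c: Q[n][c] * Qt[n][c] for c in vocab} for n in range(N)]
        corrected = "".join(max(p, key=p.get) for p in posteriors)
        return posteriors, corrected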

In general the tables Q and Q̃ have |C|^{m−1} entries at every position n. Therefore the total number of entries per table is N · |C|^{m−1}. For every entry, a sum over |C| predecessor entries is carried out. So in total the runtime complexity of filling the table Q or Q̃ is

  O(N · |C|^m)   (2.18)

for an m-gram language model and sequence length N. In practice the runtime can become too high. A typical approach is to apply pruning such that only the most promising entries are kept in the tables. We apply histogram pruning: at every position n we keep the H most promising entries. Figure 2.1 illustrates the iterative computation process for Q and also shows how pruning is incorporated in this process. Note that once these tables are computed, the position-dependent joint probability can be calculated for all needed combinations of positions n and class labels c.

[Figure 2.1. Iterative calculation of Q_n(c', c) with pruning (3-gram LM): after initializing Q_1(c) and Q_2(c', c) with the sentence delimiter $, the positions n = 3, ..., N are processed by alternating a pruning step, which sorts the entries of Q_{n−1}(c', c) and discards all entries whose rank exceeds H, and the forward recursion of Equation (2.15). H denotes the histogram size for pruning. Note that no pruning is applied to Q_1(c).]
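The pruning step itself is simple. The following sketch shows one way to realize it, assuming the entries of one position are stored in a dict keyed by the LM history; this data layout and the function name are illustrative assumptions, not the thesis implementation.

    def histogram_prune(table_column, H):
        """Keep only the H most promising entries of one position's table column.

        `table_column` is assumed to map LM histories (e.g. the (c', c) tuples of
        a 3-gram LM) to forward probabilities Q_n; everything below rank H is
        discarded before the next forward-recursion step."""
        if len(table_column) <= H:
            return table_column
        best = sorted(table_column.items(), key=lambda kv: kv[1], reverse=True)[:H]
        return dict(best)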

2.2.2 Minimum String Error Criterion (MAX-CRITERION)

At places where space is restricted we will abbreviate the minimum string error criterion as MAX or MAX-CRITERION. For a correct sequence c_1^N and a corrected sequence candidate c̃_1^N this criterion is defined by the following loss function:

  L[c_1^N, c̃_1^N] = 1 if c̃_1^N ≠ c_1^N,  0 else   (2.19)

This loss function only takes into account whether the whole sequence is correct or not. Therefore, selecting the sequence which maximizes the posterior probability of the whole sequence minimizes the Bayes risk. Again the joint probability is sufficient, since p(x_1^N) is constant with respect to the optimization over the class sequences.

  x_1^N → ĉ_1^N(x_1^N) = argmax_{c_1^N} { p_λ(c_1^N | x_1^N) }   (2.20)
                       = argmax_{c_1^N} { p_λ(c_1^N, x_1^N) / p(x_1^N) }   (2.21)
                       = argmax_{c_1^N} { p_λ(c_1^N, x_1^N) }   (2.22)

Similarly as for the calculation of the position-dependent joint probability, a naive implementation would enumerate all |C|^N sequences. Again this can be avoided by dynamic programming; here we use a Viterbi algorithm. To calculate the sequence joint probability, the following forward recursion for n ∈ [1, N] can be used:

  Q_n(c) = p_λ(x_n | c) · max_{c'} { p(c | c') · Q_{n−1}(c') }   (2.23)

The probability of the most likely sequence equals the highest probability in table Q at the last position N.

  max_{c_1^N} p_λ(c_1^N, x_1^N) = max_c Q_N(c)   (2.24)

To obtain the maximizing sequence we have to keep track of the maximization decisions in a separate table B_n(c).

  B_n(c) = argmax_{c'} { p(c | c') · Q_{n−1}(c') }   (2.25)

Then the optimal sequence ĉ_1^N can be backtracked in the following recursive way:

  ĉ_N = argmax_c { Q_N(c) }   (2.26)
  ĉ_n = B_{n+1}(ĉ_{n+1})   (2.27)

The forward recursion used here is similar to the forward recursion in Equation (2.12), and the required runtimes for both forward recursions are comparable; the runtime consumed by backtracking is negligible. Since we need no backward recursion, the runtime of deciding according to the MAX-CRITERION tends to be half the runtime of deciding according to the SUM-CRITERION. The exact relation might be different, since the SUM-CRITERION uses a summation over the predecessor classes while the MAX-CRITERION uses a maximization. Note that this difference is the motivation for the abbreviations SUM and MAX.

We apply pruning to the MAX-CRITERION in a way similar to the SUM-CRITERION. But during our experiments we became aware of a conceptual difference. Regardless of how smart the pruning method is, the value of the sum over the class sequences is always affected by pruning. In contrast to that, at least in theory, it is possible to find the maximizing class sequence even if very strict pruning is applied; the obtained probability would then not differ from the probability obtained without pruning. Judging from our experiments, this conceptual difference becomes relevant if the noise ratio is high and the model is uncertain. In these cases the sum over many sequences with small likelihood can outweigh a dominating sequence. The MAX-CRITERION might find the dominating sequence even if rather strict pruning is applied, and its decision would not differ for less strict pruning, whereas the decisions according to the SUM-CRITERION would differ for the two pruning settings.
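For completeness, a minimal sketch of Equations (2.23)-(2.27) for a 2-gram LM is given below, under the same assumptions as the forward-backward sketch above: the lm dict layout, a uniform start distribution instead of a sentence-boundary model, no pruning, and plain probabilities (a real implementation would use log-probabilities to avoid underflow on long sequences).

    def viterbi_correct(noisy, lm, lam, vocab="abcdefghijklmnopqrstuvwxyz "):
        """MAX-CRITERION decoding: Equations (2.23)-(2.27) for a 2-gram LM."""
        V = len(vocab)
        def p_noise(x, c):
            return lam if x == c else (1.0 - lam) / (V - 1)

        N = len(noisy)
        Q = [dict() for _ in range(N)]   # best partial sequence probability
        B = [dict() for _ in range(N)]   # backpointers, Equation (2.25)
        for c in vocab:                  # uniform start in place of p(c | $)
            Q[0][c] = p_noise(noisy[0], c) / V
        for n in range(1, N):
            for c in vocab:
                best_prev = max(vocab, key=lambda cp: lm[cp][c] * Q[n - 1][cp])
                B[n][c] = best_prev
                Q[n][c] = p_noise(noisy[n], c) * lm[best_prev][c] * Q[n - 1][best_prev]
        # Backtracking, Equations (2.26)-(2.27)
        c_hat = [max(Q[N - 1], key=Q[N - 1].get)]
        for n in range(N - 1, 0, -1):
            c_hat.append(B[n][c_hat[-1]])
        return "".join(reversed(c_hat))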

Chapter 3

Data Setup

In this chapter, we give a brief description of the data we use for the experiments in subsequent chapters.

3.1 Corpus Choice

For our experiments we are interested in language models of low perplexity, while still maintaining a real-world setup, e.g. following the typical guideline of a train/test split such that the test set only contains unseen sentences. All experiments are on character level, using only the lower case English characters and a special symbol ␣ for space.

  C = X = {a, b, c, ..., x, y, z, ␣}   (3.1)

This allows for experiments with rather small search spaces if the language model order is sufficiently low, which is important in order to perform a huge number of systematic experiments.

The restricted alphabet raises some problems, since text, even if it is in proper English, often contains special characters and punctuation. A straightforward approach would be to remove all symbols which are not contained in C. But this creates strange character patterns which are not typical for proper English and can therefore confuse language model training. After some tests we made the following decision. Apart from the common punctuation marks ".", "?", "!" and ",", we filtered out all sentences which contained symbols outside of C. Additionally we required that a sentence ends with either ".", "?" or "!", and we restricted the allowed ratio of commas to 5% (measured on word level). After this filtering on sentence level, all punctuation marks are removed in a final step. While too lenient filtering hurts the quality of the language model by allowing strange character sequences, too aggressive filtering reduces the available training data, which can also hurt the language model quality. The described filtering is rather strict. While a more lenient filtering (e.g. increasing the ratio of allowed commas) did not improve the language model quality significantly, it significantly increased the size of the language models.
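The following Python sketch illustrates this filtering; the helper names, the regular expression and the exact comparisons are illustrative assumptions, not the preprocessing script actually used for the thesis.

    import re

    ALLOWED = set("abcdefghijklmnopqrstuvwxyz .?!,")

    def keep_sentence(sentence, max_comma_ratio=0.05):
        """Sentence filter sketch: keep a (lower-cased) sentence only if it uses
        the restricted alphabet plus .?! and ",", ends with ".", "?" or "!", and
        does not contain too many commas (measured on word level)."""
        s = sentence.lower().strip()
        if not s or not set(s) <= ALLOWED:
            return False
        if s[-1] not in ".?!":
            return False
        words = s.split()
        if s.count(",") > max_comma_ratio * len(words):
            return False
        return True

    def strip_punctuation(sentence):
        """Final step: remove the remaining punctuation marks."""
        return re.sub(r"[.?!,]", "", sentence.lower()).strip()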

As corpus we chose the AFP part of the English Gigaword corpus [Parker & Graff + 11]. In comparison to other corpora it provided a good compromise between size and homogeneity. We also looked at books and collections of books like the Corpus of English Novels [Smet 15], since we expected them to be more homogeneous, but they were too small to train language models of high quality. For comparison, the number of words in a book is in the ballpark of 100 k, the Corpus of English Novels contains around 25 M words, and the Brown corpus consists of roughly 1 M words. The AFP part of the English Gigaword contains around 750 M words, and even after our strict filtering we had a corpus of around 51 M words.

Table 3.1 gives the number of sentences, words and characters for a split into two parts A and B. Part B makes up 90% and was used for language model training. Part A was used as a test set reservoir, from which for most experiments 500 sentences were used. The statistics for this main experiment data set are also given.

Table 3.1. Statistics: corpus for experiments. Part A contains the main experiment data set, which is used for all experiments; for unsupervised training it serves as training and test set simultaneously. Part B is used for LM training.

  PART                        #SENTENCES   #WORDS    #SYMBOLS
  part A                      220 k        5.1 M     30.7 M
  main experiment data set    500          -         64.0 k
  part B                      2.0 M        46.2 M    276 M

3.2 Language Models

We train language models of different orders by means of the SRI language modeling toolkit [Stolcke 02]. Within our spelling correction implementation, we use KenLM [Heafield 11] for language model queries. While both toolkits provide functionality for both language model training and querying, KenLM is more modern and tends to be faster. However, for training the SRI toolkit has more options. Especially for our clean experiments on character level, we wanted to distribute no probability mass to unknown symbols (<unk>), and we found no option to achieve this via KenLM. After some tests we chose the following call of the SRI toolkit to train an m-gram language model:

  ngram-count -order ${m} -wbdiscount -interpolate -text splitb.gz -lm splitb.lm.${m}gram.gz
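The exact input format fed to ngram-count is not shown here. A common choice for character-level LMs, and the assumption behind the following sketch, is to write each filtered sentence as space-separated character tokens with a dedicated token (here "_") standing in for the original blank; the helper names are hypothetical.

    def to_char_tokens(sentence):
        """Turn a filtered sentence into space-separated character tokens,
        e.g. "the cat" -> "t h e _ c a t", so ngram-count sees one character per token."""
        return " ".join("_" if ch == " " else ch for ch in sentence)

    def write_lm_training_file(sentences, path):
        """Write one character-tokenized sentence per line (compress afterwards as needed)."""
        with open(path, "w", encoding="ascii") as f:
            for s in sentences:
                f.write(to_char_tokens(s) + "\n")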

Table 3.2 gives the number of states and the perplexity measured on the main experiment data set for language models of m-gram order zero up to eleven. Because C contains 27 different symbols, the 0-gram perplexity is also 27. For the low orders the perplexity drops significantly with increasing order, starting from the 1-gram perplexity of 17.5. The 10-gram LM has the lowest perplexity, and therefore we will not use the 11-gram LM for our experiments. Note that in this high-order region the perplexity differences are small: e.g. the 9-gram LM has a perplexity of 2.61 and the 10-gram language model is only slightly better. While the perplexity does not change much for the higher orders, the number of states still grows fast with increasing order.

[Table 3.2. Number of states and perplexity PPL on the main experiment data set, for LMs of different character m-gram order. The word perplexity PPL_word is calculated from the character perplexity and the average word length (including blank symbols).]

Table 3.2 also gives the word perplexity PPL_word. These values are calculated from the character perplexity PPL and the average word length L, which is 5.95 for the main experiment data set:

  PPL_word = PPL^L   (3.2)
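Equation (3.2) is a one-liner in code; the following tiny sketch uses the 9-gram character perplexity of 2.61 quoted above purely as an example input.

    def word_perplexity(char_ppl, avg_word_len=5.95):
        """Equation (3.2): convert character perplexity to word perplexity,
        using the average word length (including blank symbols)."""
        return char_ppl ** avg_word_len

    # Example: the 9-gram character perplexity quoted in the text
    print(round(word_perplexity(2.61), 1))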

3.3 Adding Noise

We add noise according to the λ-model on part A. The straightforward way would be to just sample from the model at every character position. In this case it is not guaranteed that an exact fraction λ of the characters remains unchanged. For evaluation we would like to have this property, because then we do not need to differentiate between the λ used for adding noise and the λ_data of the resulting noisy data. Therefore we first randomly decide on the character positions where to add noise, making sure to pick exactly a fraction of 1 − λ. We ensure this constraint for chunks of 500 sentences, so it is also fulfilled for our main experiment data set. Note that we do not ensure this constraint on sentence level, since this seems to be rather unnatural.
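A minimal sketch of this noise procedure is given below, assuming the identity mapping and the 27-symbol vocabulary from Equation (3.1); the function and parameter names are illustrative, and the exact fraction is met up to rounding of (1 − λ) · N within a chunk.

    import random

    VOCAB = "abcdefghijklmnopqrstuvwxyz "

    def add_noise_exact(text, lam, seed=0):
        """Corrupt one character chunk so that exactly a fraction 1 - lambda of
        the positions is changed; at each chosen position a different symbol is
        drawn uniformly, as in the lambda-model."""
        rng = random.Random(seed)
        chars = list(text)
        n_noisy = round((1.0 - lam) * len(chars))
        for pos in rng.sample(range(len(chars)), n_noisy):
            alternatives = [s for s in VOCAB if s != chars[pos]]
            chars[pos] = rng.choice(alternatives)
        return "".join(chars)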

Chapter 4

Supervised Spelling Correction with λ-model

We first conduct experiments in a supervised way. That means that we pass the correct λ value, which was used during the generation of the noisy corpora, to the spelling correction algorithm. We conduct this kind of experiments in order to obtain error rates for comparison with later experiments. Later experiments will be conducted in an unsupervised way, meaning that there the λ or even the full table {p(x|c)} has to be learned from just the noisy data and a language model. During the experiments of this chapter we also investigate the following matters of interest:

4.1 Research Questions

1. What is the effect of the LM quality (perplexity / m-gram order) and λ on the error rate?

2. How do the spelling correction results for the MAX-CRITERION differ from the results for the SUM-CRITERION?

4.2 Experiments

We perform spelling correction for various noise ratios, using LMs of different orders and performing the search according to the SUM- and MAX-CRITERION from Section 2.2. Due to runtime and memory limitations we apply histogram pruning if the m-gram order of the LM exceeds 5. As histogram size we choose H = 200,000. With these settings our implementation needs around 30 CPU hours for spelling correction on our main experiment data set (N = 63,972 / 500 sentences) for a fixed choice of λ and LM.

This chapter presents the error rates as plots rather than tables. This gives a good intuition for the dependencies between the several relevant measurands. However, the exact values can be read off more easily from tables; therefore the results are also given in Table B.1.

4.2.1 General Observations

Figure 4.1 shows the error rates for the SUM- and MAX-CRITERION in one plot. The x-axis represents the language model perplexity in logarithmic scaling, while the y-axis represents the character error rate. Curves for several λ are shown. The legend label is 1 − λ, since this is the error rate of the noisy data without correction. Because spelling correction with a 0-gram language model passes the noisy text through unchanged, the error rate of the noisy text is also given by the rightmost data points.

As one would expect, language models of higher quality help to correct spelling errors. Or, put the other way around, the error rate increases with higher LM perplexity. The error rates of the SUM-CRITERION are always lower than the error rates of the MAX-CRITERION. This is as expected, since the SUM-CRITERION is designed to minimize the symbol error rate, while the MAX-CRITERION can be interpreted as an approximation. For data with a low noise ratio like 20% to 5% the differences are rather small but still observable. For higher noise ratios the differences are larger: e.g. for a noise ratio of 70% the absolute difference is nearly 10% if we use a 10-gram LM. For the very high noise ratios the error rates for the MAX-CRITERION fluctuate very strongly; there, decreasing perplexity does not guarantee a decreasing error rate. We give a reasoning in Section 4.2.4, where we examine examples from the data. In contrast, for the SUM-CRITERION decreasing perplexity results in a decreasing error rate also for high noise ratios. There are some fluctuations as well, but they are very small in comparison and only occur for the high language model orders. We expect these fluctuations to be due to pruning and the fact that the high-order language models are very close in terms of perplexity. But this is only a hypothesis, which has to be confirmed or falsified by future work.

4.2.2 Details on the Functional Dependency between Perplexity and Error Rate

Figure B.1 presents a subset of the data of the previous plots. There the y-axis (error rate) is logarithmic and the x-axis shows the word perplexity calculated from the character perplexity as described in Section 3.2. Using the word perplexity allows for a better comparison with previous work like [Ney 15], where the word error rate of speech recognition systems is shown as a function of the word perplexity. That data approximately follows a straight line. In log-log plots straight lines correspond to power laws like y = a · x^b. Our curves in Figure B.1 can also be interpreted as straight lines if the perplexity is restricted to a certain value and the noise ratio is not too high. Here b approximately equals 0.4, which agrees with [Ney 15]. However, we want to point out that the comparability is not clear, since the systems differ in their structure and we also have the mismatch of directly comparing word error rate with character error rate. For word perplexity values higher than roughly 10,000 a power law clearly does not apply. First attempts at explaining the higher regions came to no result.

4.2.3 Details on the Functional Dependency between λ and Error Rate

As one would expect, λ has a significant impact on the error rate after correction. Figure 4.2 shows the absolute error rate reduction as a function of λ. The curves for the LMs of higher order seem to have the shape of a quadratic function with the maximum around λ = 0.6. Our first explanation is as follows. For a low λ, only a small fraction of the characters in the noisy sequence is correct, and one cannot expect a very high absolute error reduction from such a small amount of valid context. Consequently a higher λ value, which means more valid context, should help to correct more errors. Simultaneously, a too high λ value reduces the error reduction potential, since the maximal absolute error rate reduction is naturally bounded by 1 − λ. One might combine both aspects by multiplication, which would yield the term λ · (1 − λ). The shape of such a quadratic term is very similar to the shapes we observe in the plots.

4.2.4 Examples from Experiments

In this section we have a detailed look at the resulting corrected sequences for the SUM- and MAX-CRITERION. Table 4.1 shows a fragment for a very high noise ratio (λ = 0.1). It is striking that the MAX-CRITERION produces sequences which are written in nearly proper English for many language models, but which are almost unrelated to the correct sequence. These sequences seem to be very likely according to the language models. Sometimes the resulting sequence changes drastically from one order to another. This behavior might be an explanation for the fluctuations of the MAX-CRITERION for very high noise ratios. In contrast, the sequences resulting from the SUM-CRITERION change only slightly with changing language model order.

It is striking that the SUM-CRITERION sequences contain many space symbols ␣, even multiple times in a row. Since ␣ is the most frequent symbol, it is a good bet if the noise ratio is very high. The MAX-CRITERION cannot choose multiple ␣ in a row: it maximizes over one sequence, and a sequence with multiple consecutive ␣ has a very low language model probability. The SUM-CRITERION has this choice because at any position each symbol is chosen independently from all other symbol choices, and the sum is carried out over all possible sequences to the left and to the right. The overall observation that the SUM-CRITERION is more likely to produce sequences which are linguistically not possible is also made in [Merialdo 94]. We want to point out that the number of ␣ decreases for the high-order LMs also for the SUM-CRITERION. In control experiments we applied even stronger pruning and observed the same behavior also for lower m-gram orders. We see similarities between SUM with strong pruning and MAX (independently of pruning), since strong pruning implies that the sum is approximated by only a few sequences. Pruning gets relatively stronger for higher orders because we fix the beam size for all language models to H = 200,000. This might be an explanation for the previously mentioned small error rate fluctuations of the SUM-CRITERION for high noise ratios and high-order LMs.

While neither criterion produces meaningful corrections at this very high noise ratio, the SUM-CRITERION tends to perform better. This is mostly due to correctly guessed characters and is in line with the previously discussed error rate plot.

Now we have a look at a medium noise ratio. Table 4.2 shows a sequence fragment for λ = 0.6. Here the LMs of the highest orders 9 and 10 produce for both criteria a corrected sequence without any error. Looking at the other LM orders, the SUM- and MAX-CRITERION perform very similarly, both in the resulting sequence and in the accuracy. In three cases the SUM-CRITERION performs slightly better, in one case the MAX-CRITERION is better. Note that this is just a small excerpt; on our complete main test set the accuracy of the SUM-CRITERION is always higher.

[Figure 4.1. Supervised spelling correction according to the SUM-CRITERION (solid) and MAX-CRITERION (dashed). Axes: LM perplexity PPL / m-gram order vs. error rate [%], with one curve per noise ratio 1 − λ. N = 63,972 / 500 sentences. The data is also presented in Table B.1. Note that also in theory there is no difference between SUM and MAX if a 0-gram or 1-gram LM is used. The histogram size for pruning is H = 200,000.]

[Figure 4.2. Absolute error rate reduction [%] depending on λ and LM m-gram order (one curve per order, 0-gram to 10-gram). Error rates are given for the SUM-CRITERION. Noisy data: N = 63,972 / 500 sentences. λ = 0.6 allows for the highest error rate reductions. Note that pruning with histogram size H = 200,000 was applied for LMs with m ≥ 6.]

[Table 4.1. Corrected sequences for an exemplary data fragment with λ = 0.1, according to the SUM- and MAX-CRITERION for LMs of different m-gram order (with pruning for the higher orders), together with the number of correct symbols (#CORR). Correct fragment c_1^N: "another makes clear that nobody is allowed to" (46 symbols). Each position is marked as "correct, was correct", "correct, was wrong", "wrong, was wrong (unchanged)", "wrong, was correct" or "wrong, was wrong (changed)".]

[Table 4.2. Corrected sequences for an exemplary data fragment with λ = 0.6, according to the SUM- and MAX-CRITERION for LMs of different m-gram order (with pruning for the higher orders), together with the number of correct symbols (#CORR). Correct fragment c_1^N: "the two judges hearing the appeal at the lahor" (46 symbols); the 9-gram and 10-gram LMs recover it without errors for both criteria. Each position is marked as "correct, was correct", "correct, was wrong", "wrong, was wrong (unchanged)", "wrong, was correct" or "wrong, was wrong (changed)".]

Chapter 5

Unsupervised Training of λ-model

In this chapter we will learn the parameter λ just from the noisy data and a language model. So, in contrast to the previous chapter, the algorithms will not know the λ which was used during the noise generation. The algorithms will still know that the noise model p(x|c) is the λ-model from Equation (2.3), and that the bijective mapping function x̂_c is the identity.

In the experiments λ will be learned by the maximum likelihood criterion. Since we have just the single parameter λ, we can evaluate the likelihood function for several different λ values and find the maximum via such a scan. We will use this parameter scan method to generate plots for the likelihood and the error rate. While this method allows for nice plots, the expectation maximization (EM) algorithm [Dempster & Laird + 77] can be used to learn λ more elegantly. Since learning is computationally more demanding than correction, we will use only language models up to m-gram order 4. We show that these language models are sufficient to learn λ precisely and provide a cross check for all other language models. We summarize the matters of interest of this chapter by the following questions.

5.1 Research Questions

1. Can λ be learned by the maximum likelihood criterion in an unsupervised way?

2. How do deviations from the correct λ affect the error rate?

5.2 Estimation of λ via Maximum Likelihood Criterion

The maximum likelihood method finds the parameter set (here just λ) which maximizes the probability of the training data. In our case of unsupervised training, the available training data is just the sequence x_1^N. Therefore the likelihood is p_λ(x_1^N) and the maximization criterion becomes

  λ̂ = argmax_λ { p_λ(x_1^N) }.   (5.1)
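A minimal sketch of such a parameter scan is given below. It computes the likelihood with the forward recursion of Equation (2.12) (the likelihood being the sum of the final forward probabilities), again assuming a 2-gram LM in the dict layout used earlier, a uniform start distribution, and no pruning; the grid and function names are illustrative.

    import math

    def log_likelihood(noisy, lm, lam, vocab="abcdefghijklmnopqrstuvwxyz "):
        """log p_lambda(x_1^N) via the forward recursion (2.12), with per-position
        rescaling to avoid numerical underflow."""
        V = len(vocab)
        def p_noise(x, c):
            return lam if x == c else (1.0 - lam) / (V - 1)
        Q = {c: p_noise(noisy[0], c) / V for c in vocab}   # uniform start assumption
        total_log = 0.0
        for x in noisy[1:]:
            Q = {c: p_noise(x, c) * sum(lm[cp][c] * Q[cp] for cp in vocab) for c in vocab}
            scale = sum(Q.values())                        # rescale to avoid underflow
            Q = {c: q / scale for c, q in Q.items()}
            total_log += math.log(scale)
        return total_log + math.log(sum(Q.values()))

    def scan_lambda(noisy, lm, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
        """Evaluate LL/N on a lambda grid and return the maximizing value, cf. Equation (5.1)."""
        scores = {lam: log_likelihood(noisy, lm, lam) / len(noisy) for lam in grid}
        return max(scores, key=scores.get), scores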

Note that p_λ(...) is a short notation for p(... | λ). We calculate p_λ(x_1^N) via the joint model from Equation (2.5) by summing over all sequences c_1^N:

  λ̂ = argmax_λ { ∑_{c_1^N} p_λ(c_1^N, x_1^N) }   (5.2)

In the case of supervised training, instead of only x_1^N, a pair (c_1^N, x_1^N) would be available; then the computationally hard calculation of the sum would not be necessary. However, we have already shown how to calculate the very similar position-dependent joint probability p_{n,λ}(c, x_1^N) (Equation (2.11)) by a forward-backward algorithm. Here the only difference is the missing constraint for position n. If the position-dependent joint probability is already available, we can get rid of this constraint and obtain the likelihood by summing over all classes c:

  p_λ(x_1^N) = ∑_c p_{n,λ}(c, x_1^N)   (5.3)

Note that the value of this term is independent of the position n. Therefore one can also choose n = N; then just the forward probabilities Q_N(c) are needed. However, in our experiments we are also interested in the error rates, and therefore we still need the full forward-backward algorithm.

5.2.1 Learning λ via EM Algorithm

Since meaningful values for λ are restricted to the interval [0, 1], the optimum of the likelihood function can be found by a parameter scan. However, we also want to see how the EM algorithm behaves for this rather simple problem of optimizing one parameter. The update equation is as follows:

  λ̂ = γ_id / N   (5.4)
  γ_id = ∑_{n=1}^{N} p_{n,λ}(c = ĉ_{x_n} | x_1^N)   (5.5)

Here λ is the estimate of the previous iteration or the initialization value, and λ̂ is the new estimate. The equation is rather intuitive: γ_id is the sum over all positions of the posterior probability of the class ĉ_{x_n} associated to the observation at position n. The division of γ_id by N ensures that λ̂ takes on at most the value 1. A detailed derivation is given in Appendix C.1. In tests we have seen that we can choose any value between 0 and 1 as initialization for λ, but we should avoid the border cases 0 and 1 themselves.
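A minimal sketch of this EM update is given below. It assumes the identity mapping (so ĉ_{x_n} is simply x_n) and reuses the forward_backward_posteriors sketch from Chapter 2 to obtain the per-position joint probabilities, which are normalized over c to yield the posteriors of Equation (5.5); the iteration count is an arbitrary illustrative choice.

    def em_update_lambda(noisy, posteriors):
        """One EM step for lambda, Equations (5.4)-(5.5).

        `posteriors[n]` is assumed to map each class c to p_{n,lambda}(c | x_1^N);
        with the identity mapping, c_hat_{x_n} is the observed symbol x_n itself."""
        gamma_id = sum(posteriors[n][x] for n, x in enumerate(noisy))
        return gamma_id / len(noisy)

    def train_lambda_em(noisy, lm, lam_init=0.5, iterations=20):
        """Iterate forward-backward and the EM update, starting inside (0, 1)."""
        lam = lam_init
        for _ in range(iterations):
            joint, _ = forward_backward_posteriors(noisy, lm, lam)   # Chapter 2 sketch
            posteriors = [{c: p / sum(col.values()) for c, p in col.items()} for col in joint]
            lam = em_update_lambda(noisy, posteriors)
        return lam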

5.3 Experiments

In the first experiment section we scan over several λ values. The second experiment section covers the learning of λ via the EM algorithm. In the experiment descriptions we denote the noise parameter of the noisy data by λ_data.

5.3.1 Scanning λ

We use the same noisy data as in the previous experiments. But since the experiments get more complicated and we want to look at more details, we restrict ourselves to two noisy data sets: the medium noise case λ_data = 0.7 and the high noise case λ_data = 0.3. We use normalized log-likelihood values LL/N, because these numbers are more handy and better suited for comparison than the plain likelihood p_λ(x_1^N):

  LL/N = log p_λ(x_1^N) / N   (5.6)

In the text we will still use the term likelihood, since it is significantly shorter.

Figure 5.1 shows the likelihood and the error rate as a function of λ for the medium noise case λ_data = 0.7. For the computations, language models from order 1 to order 4 are used. It is striking that the likelihood takes on its highest value at the correct λ_data of the data. This is true for all LMs, but the shape gets sharper with increasing m-gram order. The shape for the 1-gram LM is relatively flat, but it is nevertheless an interesting result that even the 1-gram LM suffices to learn λ. As expected, the error rate is lowest around the correct λ. It is striking, however, that in comparison to the likelihood function the shape is flat and the error rate is nearly constant within a certain interval around λ_data. Even for a huge deviation like λ = 0.5 the error rate increases only slightly. So we draw the conclusion that we can achieve the same error rates for unsupervised spelling correction as for supervised spelling correction, assuming in both cases the λ-model. In Section 5.3.3 we will perform a cross-check experiment for the other LMs up to order 10.

Figure 5.2 shows the likelihood and the error rate for the high noise case λ_data = 0.3. The observations are very similar to those for the medium noise case. As before, the likelihood function allows to find the correct λ precisely, while the error rate changes only slightly within a certain interval. So again we can obtain the same error rates as for the supervised experiments. But some details are different. Note the different y-axis resolutions of the plots: the resolution for the log-likelihood is approximately 45 times higher for the high noise case, and the error rate resolution is approximately 3.5 times higher. Plotting the likelihood for the high noise case in the diagram for the medium noise case would yield very flat curves, and the curves for the different orders would lie very close together. This might indicate that learning becomes harder with an increasing noise ratio, although, as indicated, for the presented experiments λ can be learned precisely enough. In this context we want to point out that we


More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Decipherment of Substitution Ciphers with Neural Language Models

Decipherment of Substitution Ciphers with Neural Language Models Decipherment of Substitution Ciphers with Neural Language Models Nishant Kambhatla, Anahita Mansouri Bigvand, Anoop Sarkar School of Computing Science Simon Fraser University Burnaby, BC, Canada {nkambhat,amansour,anoop}@sfu.ca

More information

Recap: Language models. Foundations of Natural Language Processing Lecture 4 Language Models: Evaluation and Smoothing. Two types of evaluation in NLP

Recap: Language models. Foundations of Natural Language Processing Lecture 4 Language Models: Evaluation and Smoothing. Two types of evaluation in NLP Recap: Language models Foundations of atural Language Processing Lecture 4 Language Models: Evaluation and Smoothing Alex Lascarides (Slides based on those from Alex Lascarides, Sharon Goldwater and Philipp

More information

Natural Language Processing (CSE 490U): Language Models

Natural Language Processing (CSE 490U): Language Models Natural Language Processing (CSE 490U): Language Models Noah Smith c 2017 University of Washington nasmith@cs.washington.edu January 6 9, 2017 1 / 67 Very Quick Review of Probability Event space (e.g.,

More information

Machine Learning, Midterm Exam: Spring 2009 SOLUTION

Machine Learning, Midterm Exam: Spring 2009 SOLUTION 10-601 Machine Learning, Midterm Exam: Spring 2009 SOLUTION March 4, 2009 Please put your name at the top of the table below. If you need more room to work out your answer to a question, use the back of

More information

Log-Linear Models, MEMMs, and CRFs

Log-Linear Models, MEMMs, and CRFs Log-Linear Models, MEMMs, and CRFs Michael Collins 1 Notation Throughout this note I ll use underline to denote vectors. For example, w R d will be a vector with components w 1, w 2,... w d. We use expx

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 18 Maximum Entropy Models I Welcome back for the 3rd module

More information

arxiv: v1 [cs.cl] 21 May 2017

arxiv: v1 [cs.cl] 21 May 2017 Spelling Correction as a Foreign Language Yingbo Zhou yingbzhou@ebay.com Utkarsh Porwal uporwal@ebay.com Roberto Konow rkonow@ebay.com arxiv:1705.07371v1 [cs.cl] 21 May 2017 Abstract In this paper, we

More information

Recap: HMM. ANLP Lecture 9: Algorithms for HMMs. More general notation. Recap: HMM. Elements of HMM: Sharon Goldwater 4 Oct 2018.

Recap: HMM. ANLP Lecture 9: Algorithms for HMMs. More general notation. Recap: HMM. Elements of HMM: Sharon Goldwater 4 Oct 2018. Recap: HMM ANLP Lecture 9: Algorithms for HMMs Sharon Goldwater 4 Oct 2018 Elements of HMM: Set of states (tags) Output alphabet (word types) Start state (beginning of sentence) State transition probabilities

More information

Hidden Markov Models in Language Processing

Hidden Markov Models in Language Processing Hidden Markov Models in Language Processing Dustin Hillard Lecture notes courtesy of Prof. Mari Ostendorf Outline Review of Markov models What is an HMM? Examples General idea of hidden variables: implications

More information

Math 350: An exploration of HMMs through doodles.

Math 350: An exploration of HMMs through doodles. Math 350: An exploration of HMMs through doodles. Joshua Little (407673) 19 December 2012 1 Background 1.1 Hidden Markov models. Markov chains (MCs) work well for modelling discrete-time processes, or

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Massachusetts Institute of Technology

Massachusetts Institute of Technology Massachusetts Institute of Technology 6.867 Machine Learning, Fall 2006 Problem Set 5 Due Date: Thursday, Nov 30, 12:00 noon You may submit your solutions in class or in the box. 1. Wilhelm and Klaus are

More information

CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models. The ischool University of Maryland. Wednesday, September 30, 2009

CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models. The ischool University of Maryland. Wednesday, September 30, 2009 CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models Jimmy Lin The ischool University of Maryland Wednesday, September 30, 2009 Today s Agenda The great leap forward in NLP Hidden Markov

More information

Gaussian and Linear Discriminant Analysis; Multiclass Classification

Gaussian and Linear Discriminant Analysis; Multiclass Classification Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015

More information

Part A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 )

Part A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 ) Part A 1. A Markov chain is a discrete-time stochastic process, defined by a set of states, a set of transition probabilities (between states), and a set of initial state probabilities; the process proceeds

More information

Natural Language Processing Prof. Pushpak Bhattacharyya Department of Computer Science & Engineering, Indian Institute of Technology, Bombay

Natural Language Processing Prof. Pushpak Bhattacharyya Department of Computer Science & Engineering, Indian Institute of Technology, Bombay Natural Language Processing Prof. Pushpak Bhattacharyya Department of Computer Science & Engineering, Indian Institute of Technology, Bombay Lecture - 21 HMM, Forward and Backward Algorithms, Baum Welch

More information

Design and Implementation of Speech Recognition Systems

Design and Implementation of Speech Recognition Systems Design and Implementation of Speech Recognition Systems Spring 2013 Class 7: Templates to HMMs 13 Feb 2013 1 Recap Thus far, we have looked at dynamic programming for string matching, And derived DTW from

More information

Lecture 3: ASR: HMMs, Forward, Viterbi

Lecture 3: ASR: HMMs, Forward, Viterbi Original slides by Dan Jurafsky CS 224S / LINGUIST 285 Spoken Language Processing Andrew Maas Stanford University Spring 2017 Lecture 3: ASR: HMMs, Forward, Viterbi Fun informative read on phonetics The

More information

A Syntax-based Statistical Machine Translation Model. Alexander Friedl, Georg Teichtmeister

A Syntax-based Statistical Machine Translation Model. Alexander Friedl, Georg Teichtmeister A Syntax-based Statistical Machine Translation Model Alexander Friedl, Georg Teichtmeister 4.12.2006 Introduction The model Experiment Conclusion Statistical Translation Model (STM): - mathematical model

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Sequence Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Methods for NLP 1(21) Introduction Structured

More information

Brief Introduction of Machine Learning Techniques for Content Analysis

Brief Introduction of Machine Learning Techniques for Content Analysis 1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview

More information

Parametric Models Part III: Hidden Markov Models

Parametric Models Part III: Hidden Markov Models Parametric Models Part III: Hidden Markov Models Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2014 CS 551, Spring 2014 c 2014, Selim Aksoy (Bilkent

More information

Augmented Statistical Models for Speech Recognition

Augmented Statistical Models for Speech Recognition Augmented Statistical Models for Speech Recognition Mark Gales & Martin Layton 31 August 2005 Trajectory Models For Speech Processing Workshop Overview Dependency Modelling in Speech Recognition: latent

More information

Conditional Random Field

Conditional Random Field Introduction Linear-Chain General Specific Implementations Conclusions Corso di Elaborazione del Linguaggio Naturale Pisa, May, 2011 Introduction Linear-Chain General Specific Implementations Conclusions

More information

DT2118 Speech and Speaker Recognition

DT2118 Speech and Speaker Recognition DT2118 Speech and Speaker Recognition Language Modelling Giampiero Salvi KTH/CSC/TMH giampi@kth.se VT 2015 1 / 56 Outline Introduction Formal Language Theory Stochastic Language Models (SLM) N-gram Language

More information

Natural Language Processing SoSe Words and Language Model

Natural Language Processing SoSe Words and Language Model Natural Language Processing SoSe 2016 Words and Language Model Dr. Mariana Neves May 2nd, 2016 Outline 2 Words Language Model Outline 3 Words Language Model Tokenization Separation of words in a sentence

More information

p(d θ ) l(θ ) 1.2 x x x

p(d θ ) l(θ ) 1.2 x x x p(d θ ).2 x 0-7 0.8 x 0-7 0.4 x 0-7 l(θ ) -20-40 -60-80 -00 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ θ x FIGURE 3.. The top graph shows several training points in one dimension, known or assumed to

More information

Natural Language Processing SoSe Language Modelling. (based on the slides of Dr. Saeedeh Momtazi)

Natural Language Processing SoSe Language Modelling. (based on the slides of Dr. Saeedeh Momtazi) Natural Language Processing SoSe 2015 Language Modelling Dr. Mariana Neves April 20th, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline 2 Motivation Estimation Evaluation Smoothing Outline 3 Motivation

More information

CS 136a Lecture 7 Speech Recognition Architecture: Training models with the Forward backward algorithm

CS 136a Lecture 7 Speech Recognition Architecture: Training models with the Forward backward algorithm + September13, 2016 Professor Meteer CS 136a Lecture 7 Speech Recognition Architecture: Training models with the Forward backward algorithm Thanks to Dan Jurafsky for these slides + ASR components n Feature

More information

Hidden Markov Modelling

Hidden Markov Modelling Hidden Markov Modelling Introduction Problem formulation Forward-Backward algorithm Viterbi search Baum-Welch parameter estimation Other considerations Multiple observation sequences Phone-based models

More information

Bayesian Networks BY: MOHAMAD ALSABBAGH

Bayesian Networks BY: MOHAMAD ALSABBAGH Bayesian Networks BY: MOHAMAD ALSABBAGH Outlines Introduction Bayes Rule Bayesian Networks (BN) Representation Size of a Bayesian Network Inference via BN BN Learning Dynamic BN Introduction Conditional

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg Temporal Reasoning Kai Arras, University of Freiburg 1 Temporal Reasoning Contents Introduction Temporal Reasoning Hidden Markov Models Linear Dynamical Systems (LDS) Kalman Filter 2 Temporal Reasoning

More information

Cross-Lingual Language Modeling for Automatic Speech Recogntion

Cross-Lingual Language Modeling for Automatic Speech Recogntion GBO Presentation Cross-Lingual Language Modeling for Automatic Speech Recogntion November 14, 2003 Woosung Kim woosung@cs.jhu.edu Center for Language and Speech Processing Dept. of Computer Science The

More information

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018 Deep Learning Sequence to Sequence models: Attention Models 17 March 2018 1 Sequence-to-sequence modelling Problem: E.g. A sequence X 1 X N goes in A different sequence Y 1 Y M comes out Speech recognition:

More information

Lecture 2: N-gram. Kai-Wei Chang University of Virginia Couse webpage:

Lecture 2: N-gram. Kai-Wei Chang University of Virginia Couse webpage: Lecture 2: N-gram Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/nlp16 CS 6501: Natural Language Processing 1 This lecture Language Models What are

More information

Seq2Seq Losses (CTC)

Seq2Seq Losses (CTC) Seq2Seq Losses (CTC) Jerry Ding & Ryan Brigden 11-785 Recitation 6 February 23, 2018 Outline Tasks suited for recurrent networks Losses when the output is a sequence Kinds of errors Losses to use CTC Loss

More information

Chapter 3: Basics of Language Modelling

Chapter 3: Basics of Language Modelling Chapter 3: Basics of Language Modelling Motivation Language Models are used in Speech Recognition Machine Translation Natural Language Generation Query completion For research and development: need a simple

More information

VL Algorithmen und Datenstrukturen für Bioinformatik ( ) WS15/2016 Woche 16

VL Algorithmen und Datenstrukturen für Bioinformatik ( ) WS15/2016 Woche 16 VL Algorithmen und Datenstrukturen für Bioinformatik (19400001) WS15/2016 Woche 16 Tim Conrad AG Medical Bioinformatics Institut für Mathematik & Informatik, Freie Universität Berlin Based on slides by

More information

MACHINE LEARNING INTRODUCTION: STRING CLASSIFICATION

MACHINE LEARNING INTRODUCTION: STRING CLASSIFICATION MACHINE LEARNING INTRODUCTION: STRING CLASSIFICATION THOMAS MAILUND Machine learning means different things to different people, and there is no general agreed upon core set of algorithms that must be

More information

1 Probabilities. 1.1 Basics 1 PROBABILITIES

1 Probabilities. 1.1 Basics 1 PROBABILITIES 1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability

More information

The Noisy Channel Model. CS 294-5: Statistical Natural Language Processing. Speech Recognition Architecture. Digitizing Speech

The Noisy Channel Model. CS 294-5: Statistical Natural Language Processing. Speech Recognition Architecture. Digitizing Speech CS 294-5: Statistical Natural Language Processing The Noisy Channel Model Speech Recognition II Lecture 21: 11/29/05 Search through space of all possible sentences. Pick the one that is most probable given

More information

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II)

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II) CISC 889 Bioinformatics (Spring 24) Hidden Markov Models (II) a. Likelihood: forward algorithm b. Decoding: Viterbi algorithm c. Model building: Baum-Welch algorithm Viterbi training Hidden Markov models

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Information Extraction, Hidden Markov Models Sameer Maskey Week 5, Oct 3, 2012 *many slides provided by Bhuvana Ramabhadran, Stanley Chen, Michael Picheny Speech Recognition

More information

Expectation Maximization (EM)

Expectation Maximization (EM) Expectation Maximization (EM) The Expectation Maximization (EM) algorithm is one approach to unsupervised, semi-supervised, or lightly supervised learning. In this kind of learning either no labels are

More information

Hidden Markov models

Hidden Markov models Hidden Markov models Charles Elkan November 26, 2012 Important: These lecture notes are based on notes written by Lawrence Saul. Also, these typeset notes lack illustrations. See the classroom lectures

More information

Theory of Alignment Generators and Applications to Statistical Machine Translation

Theory of Alignment Generators and Applications to Statistical Machine Translation Theory of Alignment Generators and Applications to Statistical Machine Translation Raghavendra Udupa U Hemanta K Mai IBM India Research Laboratory, New Delhi {uraghave, hemantkm}@inibmcom Abstract Viterbi

More information

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels? Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity

More information

Word Alignment for Statistical Machine Translation Using Hidden Markov Models

Word Alignment for Statistical Machine Translation Using Hidden Markov Models Word Alignment for Statistical Machine Translation Using Hidden Markov Models by Anahita Mansouri Bigvand A Depth Report Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of

More information

Ngram Review. CS 136 Lecture 10 Language Modeling. Thanks to Dan Jurafsky for these slides. October13, 2017 Professor Meteer

Ngram Review. CS 136 Lecture 10 Language Modeling. Thanks to Dan Jurafsky for these slides. October13, 2017 Professor Meteer + Ngram Review October13, 2017 Professor Meteer CS 136 Lecture 10 Language Modeling Thanks to Dan Jurafsky for these slides + ASR components n Feature Extraction, MFCCs, start of Acoustic n HMMs, the Forward

More information

Variational Decoding for Statistical Machine Translation

Variational Decoding for Statistical Machine Translation Variational Decoding for Statistical Machine Translation Zhifei Li, Jason Eisner, and Sanjeev Khudanpur Center for Language and Speech Processing Computer Science Department Johns Hopkins University 1

More information

Machine Translation. CL1: Jordan Boyd-Graber. University of Maryland. November 11, 2013

Machine Translation. CL1: Jordan Boyd-Graber. University of Maryland. November 11, 2013 Machine Translation CL1: Jordan Boyd-Graber University of Maryland November 11, 2013 Adapted from material by Philipp Koehn CL1: Jordan Boyd-Graber (UMD) Machine Translation November 11, 2013 1 / 48 Roadmap

More information

Hidden Markov Models and Gaussian Mixture Models

Hidden Markov Models and Gaussian Mixture Models Hidden Markov Models and Gaussian Mixture Models Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 4&5 23&27 January 2014 ASR Lectures 4&5 Hidden Markov Models and Gaussian

More information

Sequences and Information

Sequences and Information Sequences and Information Rahul Siddharthan The Institute of Mathematical Sciences, Chennai, India http://www.imsc.res.in/ rsidd/ Facets 16, 04/07/2016 This box says something By looking at the symbols

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Hidden Markov Models Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com Additional References: David

More information

Midterm 2 V1. Introduction to Artificial Intelligence. CS 188 Spring 2015

Midterm 2 V1. Introduction to Artificial Intelligence. CS 188 Spring 2015 S 88 Spring 205 Introduction to rtificial Intelligence Midterm 2 V ˆ You have approximately 2 hours and 50 minutes. ˆ The exam is closed book, closed calculator, and closed notes except your one-page crib

More information

Gaussian Models

Gaussian Models Gaussian Models ddebarr@uw.edu 2016-04-28 Agenda Introduction Gaussian Discriminant Analysis Inference Linear Gaussian Systems The Wishart Distribution Inferring Parameters Introduction Gaussian Density

More information

Hidden Markov Models. x 1 x 2 x 3 x N

Hidden Markov Models. x 1 x 2 x 3 x N Hidden Markov Models 1 1 1 1 K K K K x 1 x x 3 x N Example: The dishonest casino A casino has two dice: Fair die P(1) = P() = P(3) = P(4) = P(5) = P(6) = 1/6 Loaded die P(1) = P() = P(3) = P(4) = P(5)

More information

Decision Trees.

Decision Trees. . Machine Learning Decision Trees Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg riedmiller@informatik.uni-freiburg.de

More information

Ad Placement Strategies

Ad Placement Strategies Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January

More information

Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation Y. Wu, M. Schuster, Z. Chen, Q.V. Le, M. Norouzi, et al. Google arxiv:1609.08144v2 Reviewed by : Bill

More information

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute

More information

CHAPTER 8 Viterbi Decoding of Convolutional Codes

CHAPTER 8 Viterbi Decoding of Convolutional Codes MIT 6.02 DRAFT Lecture Notes Fall 2011 (Last update: October 9, 2011) Comments, questions or bug reports? Please contact hari at mit.edu CHAPTER 8 Viterbi Decoding of Convolutional Codes This chapter describes

More information

Training the linear classifier

Training the linear classifier 215, Training the linear classifier A natural way to train the classifier is to minimize the number of classification errors on the training data, i.e. choosing w so that the training error is minimized.

More information

Languages, regular languages, finite automata

Languages, regular languages, finite automata Notes on Computer Theory Last updated: January, 2018 Languages, regular languages, finite automata Content largely taken from Richards [1] and Sipser [2] 1 Languages An alphabet is a finite set of characters,

More information

TnT Part of Speech Tagger

TnT Part of Speech Tagger TnT Part of Speech Tagger By Thorsten Brants Presented By Arghya Roy Chaudhuri Kevin Patel Satyam July 29, 2014 1 / 31 Outline 1 Why Then? Why Now? 2 Underlying Model Other technicalities 3 Evaluation

More information