Unüberwachtes Lernen mit Anwendungen bei der Verarbeitung natürlicher Sprache. Unsupervised Training with Applications in Natural Language Processing


Master's thesis in Computer Science, submitted to the Faculty of Mathematics, Computer Science and Natural Sciences of RWTH Aachen University, Chair of Computer Science 6 (Lehrstuhl für Informatik 6), Prof. Dr.-Ing. H. Ney

Unüberwachtes Lernen mit Anwendungen bei der Verarbeitung natürlicher Sprache
Unsupervised Training with Applications in Natural Language Processing

Submitted by Julian Schamper from Mechernich
Matriculation number:

Reviewers: Prof. Dr.-Ing. Hermann Ney, Prof. Dr. rer. nat. Thomas Seidl
Advisor: Dipl.-Phys., Dipl.-Inform. Malte Nuhn

Aachen, 30 September 2015


Master's thesis in Computer Science

Unüberwachtes Lernen mit Anwendungen bei der Verarbeitung natürlicher Sprache
Unsupervised Training with Applications in Natural Language Processing

Julian Schamper
September 30, 2015


Statutory Declaration

I, Julian Schamper, hereby declare in lieu of oath that I have written this Master's thesis independently and have used no sources or aids other than those stated.

Aachen,

Julian Schamper


Abstract

For most applications in natural language processing, the amount of available unlabeled data is tremendously higher than the amount of human-annotated data. The performance of current state-of-the-art systems highly depends on the amount of available training data. Therefore the training of statistical models without human-annotated data, called unsupervised training, is attractive. This thesis studies certain aspects of unsupervised training which are not covered in detail by previous work. The studied problems are spelling correction and the solving of noisy substitution ciphers, both performed on character level. These problems have a well-controllable parameter set and allow for a large number of systematic experiments. All experiments are done on the same data set. The noisy data follows a λ-model, which distributes noise uniformly over all characters.

The first set of experiments investigates the relation between the noise ratio, the language model quality and the resulting error rate of the spelling correction algorithm. Additionally, a typical approximation is studied. These first experiments are conducted in a supervised way in order to have results for later comparison. The next set of experiments shows that the noise parameter of the λ-model can be learned very precisely by unsupervised training, using only a noisy text and a language model. The corresponding error rates do not differ significantly from the error rates obtained in the supervised setup. Besides the successfully used maximum likelihood training, we investigate another training criterion and show that it is impractical for unsupervised training of the noise parameter.

In the final set of experiments we solve noisy substitution ciphers using the expectation maximization (EM) algorithm. This involves learning a full probabilistic substitution table in an unsupervised way. We evaluate the resulting error rate, convergence speed, likelihood, and the difference to the correct full table. We show that the available training data size, the language model quality and the noise ratio have a high impact on the training performance. Here we observe cases where the results depend on the initialization of the EM algorithm. Nevertheless, we obtain error rates close to the corresponding supervised error rates if the training data size is sufficiently large. In the very last experiments, the noisy data follows a structure different from the λ-model. In that case the problem appears to be simpler: the error rates are lower and the convergence is faster.


Contents

1 Introduction
  1.1 Related Work
  1.2 Outline
  1.3 Notation
2 Basic Theory of Probabilistic Spelling Correction
  2.1 Model
  2.2 Decision Rules
    2.2.1 Minimum Symbol Error Criterion (SUM-CRITERION)
    2.2.2 Minimum String Error Criterion (MAX-CRITERION)
3 Data Setup
  3.1 Corpus Choice
  3.2 Language Models
  3.3 Adding Noise
4 Supervised Spelling Correction with λ-model
  4.1 Research Questions
  4.2 Experiments
    4.2.1 General Observations
    4.2.2 Details on the Functional Dependency between Perplexity and Error Rate
    4.2.3 Details on the Functional Dependency between λ and Error Rate
    4.2.4 Examples from Experiments
5 Unsupervised Training of λ-model
  5.1 Research Questions
  5.2 Learning of λ via Maximum Likelihood Criterion
    5.2.1 Learning λ via EM Algorithm
  5.3 Experiments
    5.3.1 Scanning λ
    5.3.2 Learning λ via EM Algorithm
    5.3.3 Error Rates for all LMs
6 Expected Accuracy Criterion
  6.1 Research Question
7 Unsupervised Training of Full Table Model
  7.1 Research Questions
  7.2 New Aspects in Comparison to λ-model
  7.3 EM Algorithm for Full Table {p(x|c)}
  7.4 Evaluation Methods
  7.5 Training Data Partitioning
  7.6 Experiments
    7.6.1 Effects of Training Data Size, Language Model, λ and Initialization
    7.6.2 Error Rates for all LMs
8 Keyboard and Rival Model
  8.1 Research Question
  8.2 Keyboard Model
  8.3 Rival Model
  8.4 Experiments
9 Conclusion
A Basic Theory of Probabilistic Spelling Correction
  A.1 Derivation of Forward-Backward Algorithm
    A.1.1 Forward Recursion
    A.1.2 Backward Recursion
  A.2 Sequence Joint Probability Recursion
B Supervised Spelling Correction with λ-model
  B.1 Error Rate Tables
  B.2 Log-Log Plot for Word Perplexity vs. Error Rate
C Unsupervised Training of λ-model
  C.1 Derivation of EM Algorithm for λ
  C.2 Spelling Correction for Higher Order LMs, λ Trained by 4-gram LM
D Unsupervised Training of Full Table Model
  D.1 Derivation of EM Algorithm for Full Table
  D.2 Spelling Correction for Higher Order LMs, Full Table Learned with EM Algorithm Using a 4-gram LM
List of Figures
List of Tables
Bibliography


Chapter 1

Introduction

In natural language processing, many state-of-the-art systems require human-annotated data to train all or at least some of their statistical models. This kind of training is often called supervised training. For example, many statistical machine translation (SMT) systems use the word alignments obtained by the GIZA++ toolkit [Och & Ney 03]. These alignments are trained on a collection of sentence pairs in a source and a target language (a parallel corpus). Unfortunately, parallel data is a limited resource, and the performance of current systems depends on the amount available for training. On the other hand, so-called monolingual data exists in far larger quantities, since it is produced naturally on a daily basis; one can think of books, news articles, blog articles and other parts of the Internet. Modern translation systems already make use of monolingual data. For example, language models are trained on large monolingual data sets of the target language and then used inside the translation pipeline. However, other important models of the state-of-the-art systems still require parallel data.

In contrast to that, one can think of totally unsupervised approaches, which investigate the structure of a monolingual source and a monolingual target corpus at once during training time. The hope is that the structures within both languages give enough constraints to find a good translation model, which explains the existence of both corpora at once. Such an approach is often called machine translation decipherment or deciphering foreign language [Ravi & Knight 11]. It is an open research question whether such a monolingual approach can reach or outperform the current state-of-the-art systems.

This thesis takes one step back from machine translation decipherment and covers unsupervised spelling correction and the solving of noisy substitution ciphers instead. By studying research questions on these simpler problems in detail, we hope to provide a solid foundation for machine translation decipherment. The following section gives an overview of existing work regarding machine translation decipherment, the solving of substitution ciphers and spelling correction.

1.1 Related Work

[Ravi & Knight 11] train a simple word-based machine translation model in a totally unsupervised way. Due to the typically increased complexity of unsupervised training, the vocabulary size was limited to approximately five hundred words. [Nuhn & Mauser + 12] continue this approach and are able to tackle a more complex task with approximately five thousand words.

[Peleg & Rosenfeld 79] solve probabilistic substitution ciphers by learning a probabilistic substitution table via an iterative algorithm. Their algorithm is initialized by an interesting method which we will also use in our work. [Lee 02] follows this work with a more accurate model, which allows the use of the well-known EM algorithm for hidden Markov models. The performance is better, especially for probabilistic substitution ciphers with noise. [Knight & Nair + 06] follow the same approach, but they do not analyze noisy substitution ciphers. We will also follow the approach of using the EM algorithm to learn a probabilistic substitution table.

In contrast to probabilistic substitution ciphers, deterministic substitution ciphers follow a strict mapping from the ciphertext to the plaintext. These models typically do not allow noise. Therefore the model is more constrained and lower error rates are achievable. [Ravi & Knight 08] solve deterministic substitution ciphers with an integer linear programming (ILP) solver. This approach allows for an optimal solution with respect to a score which is based on a plaintext language model (LM). The algorithm from [Nuhn & Schamper + 14] solves similar and harder problems while needing less time by using an advanced heuristic.

Regarding spelling correction, we revisited the following papers. In [Kernighan & Church + 90] the spelling correction algorithm is fed a list of misspelled words, which is given by an external spell checking program. The algorithm learns a confusion model between words by means of a unigram word distribution. In comparison to that, [Mays & Damerau + 91] make use of word context by using a trigram language model. [Tong & Evans 96] perform spelling correction on the output of an optical character recognition (OCR) system. Their approach can be seen as an instance of unsupervised training: it uses a word bigram model and learns character confusion tables iteratively by treating the decoding output as ground truth for re-estimation. [Huang & Learned-Miller + 06] go one step further. They treat optical character recognition as a decipherment problem. There the image input is clustered into letter-like fragments, and the repetition scheme of the clusters is compared to the character repetition scheme within the words of a dictionary. These approaches have promising results, but they contain many crude approximations or no exact mathematical derivation to motivate the formulas. In contrast to that, a detailed analysis of unsupervised training of exactly specified models, whose parameters are well controllable, is the purpose of this thesis.

1.2 Outline

The remainder of this thesis is structured as follows. In Section 1.3 we introduce our notation and give a spelling correction example for illustration. Chapter 2 provides the mathematical basis which will be used by several subsequent chapters. We use the same data set for all experiments; Chapter 3 gives background information on this data.

In Chapter 4, we conduct supervised spelling correction experiments for later reference. Simultaneously, we compare the performance of two different criteria for spelling correction. In Chapter 5 we perform unsupervised spelling correction experiments. There we use the maximum likelihood criterion for training the noise parameter of our spelling correction model. We analyze an alternative training criterion in Chapter 6. In Chapter 7 we solve noisy substitution ciphers. Due to the data choice, the results are directly comparable to the spelling correction experiments. We also solve noisy substitution ciphers in Chapter 8, but there we vary the model structure for noise generation and analyze the effect of the different structures on the learning performance. In Chapter 9 we summarize the results by answering the research questions which motivated the experiments of the previous chapters.

1.3 Notation

We introduce a set C of class labels and a set X of observations. Then we define sequences of length N built from elements of these sets. If we apply this notation to spelling correction, the class label sequence corresponds to the correct sequence. Consider a typist who converts the correct sequence deterministically into another sequence. During this process some errors may occur. We call the sequence generated by the typist the noisy sequence.

  class label sequence / correct sequence:    c_1^N = c_1 ... c_n ... c_N,   c_n ∈ C
  observation sequence / noisy sequence:      x_1^N = x_1 ... x_n ... x_N,   x_n ∈ X

The task we define is to recover the original sequence containing the class labels, calling it the candidate class label sequence or the corrected sequence.

  candidate class label sequence / corrected sequence:   ĉ_1^N = ĉ_1 ... ĉ_n ... ĉ_N,   ĉ_n ∈ C

Figure 1.1 shows an example for spelling correction. It covers the case that the class label vocabulary and the observation vocabulary are the same and consist of the lower case English characters and an additional space symbol ␣:

  C = X = {a, b, c, ..., x, y, z, ␣}   (1.1)

While this is also the case for the data which is used within all experiments of this thesis, in many places the algorithms are not aware of the fact that, for example, the mapping of a in C to a in X means no error.

Correct sequence c_1^N:
  some even question the accuracy of figures showing a dramatic easing in price rises saying they are not only based on flawed data but that the government may be massaging the results

Noisy sequence x_1^N: 120 symbols correct (65.9%), 62 wrong (34.1%)

Corrected sequence ĉ_1^N:
  some even question the scourges of figures showing a dramatic easing in price rises saying they are not only based on flawed data but that the government may be resolving the results
  correct, was correct: 119 (65.4%); correct, was wrong: 53 (29.1%); wrong, was wrong (unchanged): 2 (1.1%); wrong, was wrong (changed): 7 (3.8%); wrong, was correct: 1 (0.5%)

Figure 1.1. Spelling correction example. Different possibilities which can occur during a correction process are marked by five different combinations of background and text color. A red background indicates in all sequences that at this position the noisy sequence contains an error. In all sequences green text color indicates a correct symbol. Red text color indicates a wrong symbol in the noisy sequence; if a wrong symbol is preserved in the corrected sequence, it also has red text color there. The following is only valid for the coloring of the corrected sequence: if a wrong symbol is replaced by another wrong symbol, or a correct symbol is replaced by a wrong symbol, this symbol is marked by orange text color.

Chapter 2

Basic Theory of Probabilistic Spelling Correction

2.1 Model

In this chapter we give the mathematical definitions of the spelling correction model, which describe the relations between the correct sequence c_1^N, the noisy sequence x_1^N and the corrected sequence ĉ_1^N. First we require

  |C| = |X|,   (2.1)

which allows us to introduce the following bijective function:

  x̂ : C → X,   c ↦ x̂_c   (2.2)

For some formalizations the inverse of x̂_c is more convenient; we denote the inverse as ĉ_x. The function x̂_c is the deterministic mapping the typist has in mind during the generation of the noisy sequence. Sometimes the typist makes an error and does not follow the mapping. We formalize this behavior by a so-called λ-model, where λ is the probability that the typist makes no error.

λ-model: p_λ(x|c). In the case that the typist makes an error, he uniformly chooses any of the other noisy symbols from the vocabulary X.

  p_λ(x|c) = λ                  if x = x̂_c
           = (1 − λ) / (|X| − 1)  else   (2.3)

Observation model: p_λ(x_1^N | c_1^N). We make the assumption that the typist makes an error independently of any previous context. With that, the probability of transforming a whole correct sequence c_1^N into a noisy sequence x_1^N follows:

  p_λ(x_1^N | c_1^N) = ∏_{n=1}^{N} p_λ(x_n | c_n)   (2.4)
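As a concrete illustration of Equations (2.3) and (2.4), the following Python sketch evaluates the λ-model and the resulting observation log-probability for character sequences. It is a minimal sketch, not the implementation used in this thesis; the vocabulary and the identity mapping x̂_c are assumptions matching the setup described above.

    import math

    # Vocabulary: lower case English letters plus a space symbol, matching
    # Equation (1.1); the typist mapping x_hat is assumed to be the identity.
    VOCAB = list("abcdefghijklmnopqrstuvwxyz ")

    def p_lambda(x, c, lam):
        """Lambda-model of Equation (2.3): probability of observing x given class c."""
        if x == c:                      # x equals the intended symbol x_hat_c
            return lam
        return (1.0 - lam) / (len(VOCAB) - 1)

    def observation_log_prob(noisy, correct, lam):
        """Log of Equation (2.4): position-wise independent noise model."""
        assert len(noisy) == len(correct)
        return sum(math.log(p_lambda(x, c, lam)) for x, c in zip(noisy, correct))

    if __name__ == "__main__":
        print(observation_log_prob("sume", "some", lam=0.9))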

Joint model: p_λ(c_1^N, x_1^N). We will later show that for obtaining the corrected sequence we need the probability the other way around, or the joint probability of c_1^N and x_1^N. We obtain it by multiplying the observation model with a language model p(c_1^N). As language model we use an m-gram language model p(c_1^N) = ∏_{n=1}^{N} p(c_n | c_{n−m+1}^{n−1}).

  p_λ(c_1^N, x_1^N) = p(c_1^N) · p_λ(x_1^N | c_1^N)   (2.5)
                    = ∏_{n=1}^{N} p(c_n | c_{n−m+1}^{n−1}) · p_λ(x_n | c_n)   (2.6)

2.2 Decision Rules

After the basic mathematical models are defined, we show how to search for the optimal corrected sequence if a noisy sequence is given. Optimality is with respect to the Bayes decision rule and the loss function we define for the respective criterion. For details on loss functions and the Bayes decision rule consider [Ney et al. 05] or [Duda & Hart + 00].

2.2.1 Minimum Symbol Error Criterion (SUM-CRITERION)

At places where space is restricted we will abbreviate the minimum symbol error criterion as SUM or SUM-CRITERION. For a correct sequence c_1^N and a corrected sequence candidate c̃_1^N this criterion is defined by the following loss function:

  L[c_1^N, c̃_1^N] = (1/N) ∑_{n=1}^{N} [1 − δ(c_n, c̃_n)]   (2.7)

At every position n where c̃_n is wrong, a penalty of 1 is added. The whole sum is divided by N, so the loss function equals the symbol error rate. To minimize the Bayes risk, one has to select at every position the class symbol which is maximal according to the posterior over the classes at position n, which we denote as p_{n,λ}(c_n | x_1^N). Since p(x_1^N) is constant with respect to the optimization over the classes, one can also use the joint probability instead of the posterior probability.

  x_1^N → ĉ_1^N(x_1^N) = [ argmax_{c_n} { p_{n,λ}(c_n | x_1^N) } ]_{n=1}^{N}   (2.8)
                       = [ argmax_{c_n} { p_{n,λ}(c_n, x_1^N) / p(x_1^N) } ]_{n=1}^{N}   (2.9)
                       = [ argmax_{c_n} { p_{n,λ}(c_n, x_1^N) } ]_{n=1}^{N}   (2.10)
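Before turning to the efficient computation of the position-dependent probabilities, the following sketch spells out Equation (2.6) for a 2-gram character LM. The toy add-one-smoothed bigram trainer is only a stand-in for the SRILM-trained models used later, and the names and data layout are illustrative assumptions.

    import math
    from collections import defaultdict

    def train_bigram_lm(text, vocab="abcdefghijklmnopqrstuvwxyz "):
        """Toy character bigram LM with add-one smoothing (stand-in for the
        thesis' SRILM-trained m-gram models)."""
        counts = defaultdict(lambda: defaultdict(int))
        for prev, cur in zip(text, text[1:]):
            counts[prev][cur] += 1
        lm = {}
        for prev in vocab:
            total = sum(counts[prev].values()) + len(vocab)
            lm[prev] = {c: (counts[prev][c] + 1) / total for c in vocab}
        return lm

    def joint_log_prob(correct, noisy, lm, lam, vocab_size=27):
        """Log of Equation (2.6): m-gram LM (here m = 2) times the lambda-model.
        Position 0 is skipped for brevity (no bigram history)."""
        logp = 0.0
        for n in range(1, len(correct)):
            c_prev, c, x = correct[n - 1], correct[n], noisy[n]
            p_noise = lam if x == c else (1.0 - lam) / (vocab_size - 1)
            logp += math.log(lm[c_prev][c]) + math.log(p_noise)
        return logp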

Position-dependent class probability: p_{n,λ}(c, x_1^N). The position-dependent class probability can be calculated from the probability for sequences by summing over all possible class sequences, with just the constraint that at position n a specific c is fixed.

  p_{n,λ}(c, x_1^N) = ∑_{c_1^N : c_n = c} p_λ(c_1^N, x_1^N)   (2.11)

At first sight, the calculation of this sum seems to be computationally hard: naively one would sum over |C|^{N−1} different class sequences. A forward-backward algorithm [Baum & Petrie + 70], which is an instance of dynamic programming, can calculate these probabilities efficiently. For a 2-gram LM this algorithm uses two auxiliary tables Q_n(c) and Q̃_n(c), which contain probability entries for each position n ∈ [1, N] and each class c ∈ C. The tables are calculated by the forward recursion

  Q_n(c) = p_λ(x_n | c) · ∑_{c'} p(c | c') · Q_{n−1}(c')   (2.12)

and the backward recursion

  Q̃_n(c) = ∑_{c̃} p_λ(x_{n+1} | c̃) · p(c̃ | c) · Q̃_{n+1}(c̃).   (2.13)

Note that, for the forward recursion, the sum is carried out over all predecessors c', while for the backward recursion, the sum is carried out over all successors c̃. The position-dependent class probability is obtained by multiplying the corresponding table entries.

  p_{n,λ}(c, x_1^N) = Q_n(c) · Q̃_n(c)   (2.14)

A derivation of this forward-backward algorithm is given in Appendix A.1. The recursions can easily be extended to LMs of higher orders. Without proof, we give the result for a 3-gram LM:

  Q_n(c', c) = p_λ(x_n | c) · ∑_{c''} p(c | c'', c') · Q_{n−1}(c'', c')   (2.15)
  Q̃_n(c', c) = ∑_{c̃} p_λ(x_{n+1} | c̃) · p(c̃ | c', c) · Q̃_{n+1}(c, c̃)   (2.16)
  p_{n,λ}(c, x_1^N) = ∑_{c'} Q_n(c', c) · Q̃_n(c', c)   (2.17)

The main difference is that in this case we have to keep track of a history consisting of two classes c' and c.
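A minimal sketch of the forward-backward computation for a 2-gram LM is given below, assuming the LM is the nested dict lm[prev][cur] from the toy trainer above. A uniform start distribution replaces a proper sentence-boundary model, no pruning is applied, and plain probabilities are used; a real implementation would work in log-space or rescale the tables to avoid underflow.

    def forward_backward_posteriors(noisy, lm, lam, vocab="abcdefghijklmnopqrstuvwxyz "):
        """Equations (2.12)-(2.14) for a 2-gram LM: per-position joint
        probabilities p_{n,lambda}(c, x_1^N) and the SUM-CRITERION decision."""
        V = len(vocab)
        def p_noise(x, c):
            return lam if x == c else (1.0 - lam) / (V - 1)

        N = len(noisy)
        # Forward pass: Q[n][c], Equation (2.12)
        Q = [dict() for _ in range(N)]
        for c in vocab:                      # uniform start in place of p(c | $)
            Q[0][c] = p_noise(noisy[0], c) / V
        for n in range(1, N):
            for c in vocab:
                Q[n][c] = p_noise(noisy[n], c) * sum(lm[cp][c] * Q[n - 1][cp] for cp in vocab)
        # Backward pass: Qt[n][c], Equation (2.13)
        Qt = [dict() for _ in range(N)]
        for c in vocab:
            Qt[N - 1][c] = 1.0
        for n in range(N - 2, -1, -1):
            for c in vocab:
                Qt[n][c] = sum(p_noise(noisy[n + 1], ct) * lm[c][ct] * Qt[n + 1][ct] for ct in vocab)
        # Equation (2.14) and the per-position argmax of Equation (2.10)
        posteriors = [{c: Q[n][c] * Qt[n][c] for c in vocab} for n in range(N)]
        corrected = "".join(max(p, key=p.get) for p in posteriors)
        return posteriors, corrected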

In general the tables Q and Q̃ have |C|^{m−1} entries at every position n. Therefore the total number of entries per table is N · |C|^{m−1}. For every entry, a sum over |C| predecessor entries is carried out. So in total the runtime complexity of filling the table Q or Q̃ is

  O(N · |C|^m)   (2.18)

for an m-gram language model and sequence length N. In practice the runtime can become too high. A typical approach is to apply pruning such that only the most promising entries are kept in the tables. We apply histogram pruning: at every position n we keep the H most promising entries. Figure 2.1 illustrates the iterative computation process for Q and also shows how pruning is incorporated in this process. Note that once these tables are computed, the position-dependent joint probability can be calculated for all needed combinations of positions n and class labels c.

[Figure 2.1. Iterative calculation of Q_n(c', c) with pruning (3-gram LM): after initializing Q_1(c) and Q_2(c', c) with the sentence delimiter $, the positions n = 3, ..., N are processed by alternating a pruning step, which sorts the entries of Q_{n−1}(c', c) and discards all entries whose rank exceeds H, and the forward recursion of Equation (2.15). H denotes the histogram size for pruning. Note that no pruning is applied to Q_1(c).]
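The pruning step itself is simple. The following sketch shows one way to realize it, assuming the entries of one position are stored in a dict keyed by the LM history; this data layout and the function name are illustrative assumptions, not the thesis implementation.

    def histogram_prune(table_column, H):
        """Keep only the H most promising entries of one position's table column.

        `table_column` is assumed to map LM histories (e.g. the (c', c) tuples of
        a 3-gram LM) to forward probabilities Q_n; everything below rank H is
        discarded before the next forward-recursion step."""
        if len(table_column) <= H:
            return table_column
        best = sorted(table_column.items(), key=lambda kv: kv[1], reverse=True)[:H]
        return dict(best)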

2.2.2 Minimum String Error Criterion (MAX-CRITERION)

At places where space is restricted we will abbreviate the minimum string error criterion as MAX or MAX-CRITERION. For a correct sequence c_1^N and a corrected sequence candidate c̃_1^N this criterion is defined by the following loss function:

  L[c_1^N, c̃_1^N] = 1 if c̃_1^N ≠ c_1^N,  0 else   (2.19)

This loss function only takes into account whether the whole sequence is correct or not. Therefore, selecting the sequence which maximizes the posterior probability of the whole sequence minimizes the Bayes risk. Again the joint probability is sufficient, since p(x_1^N) is constant with respect to the optimization over the class sequences.

  x_1^N → ĉ_1^N(x_1^N) = argmax_{c_1^N} { p_λ(c_1^N | x_1^N) }   (2.20)
                       = argmax_{c_1^N} { p_λ(c_1^N, x_1^N) / p(x_1^N) }   (2.21)
                       = argmax_{c_1^N} { p_λ(c_1^N, x_1^N) }   (2.22)

Similarly as for the calculation of the position-dependent joint probability, a naive implementation would enumerate all |C|^N sequences. Again this can be avoided by dynamic programming; here we use a Viterbi algorithm. To calculate the sequence joint probability, the following forward recursion for n ∈ [1, N] can be used:

  Q_n(c) = p_λ(x_n | c) · max_{c'} { p(c | c') · Q_{n−1}(c') }   (2.23)

The probability of the most likely sequence equals the highest probability in table Q at the last position N.

  max_{c_1^N} p_λ(c_1^N, x_1^N) = max_c Q_N(c)   (2.24)

To obtain the maximizing sequence we have to keep track of the maximization decisions in a separate table B_n(c).

  B_n(c) = argmax_{c'} { p(c | c') · Q_{n−1}(c') }   (2.25)

Then the optimal sequence ĉ_1^N can be backtracked in the following recursive way:

  ĉ_N = argmax_c { Q_N(c) }   (2.26)
  ĉ_n = B_{n+1}(ĉ_{n+1})   (2.27)

The forward recursion used here is similar to the forward recursion in Equation (2.12), and the required runtimes for both forward recursions are comparable; the runtime consumed by backtracking is negligible. Since we need no backward recursion, the runtime of deciding according to the MAX-CRITERION tends to be half the runtime of deciding according to the SUM-CRITERION. The exact relation might be different, since the SUM-CRITERION uses a summation over the predecessor classes while the MAX-CRITERION uses a maximization. Note that this difference is the motivation for the abbreviations SUM and MAX.

We apply pruning to the MAX-CRITERION in a way similar to the SUM-CRITERION. But during our experiments we became aware of a conceptual difference. Regardless of how smart the pruning method is, the value of the sum over the class sequences is always affected by pruning. In contrast to that, at least in theory, it is possible to find the maximizing class sequence even if very strict pruning is applied; the obtained probability would then not differ from the probability obtained without pruning. Judging from our experiments, this conceptual difference becomes relevant if the noise ratio is high and the model is uncertain. In these cases the sum over many sequences with small likelihood can outweigh a dominating sequence. The MAX-CRITERION might find the dominating sequence even if rather strict pruning is applied, and its decision would not differ for less strict pruning, whereas the decisions according to the SUM-CRITERION would differ for the two pruning settings.
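For completeness, a minimal sketch of Equations (2.23)-(2.27) for a 2-gram LM is given below, under the same assumptions as the forward-backward sketch above: the lm dict layout, a uniform start distribution instead of a sentence-boundary model, no pruning, and plain probabilities (a real implementation would use log-probabilities to avoid underflow on long sequences).

    def viterbi_correct(noisy, lm, lam, vocab="abcdefghijklmnopqrstuvwxyz "):
        """MAX-CRITERION decoding: Equations (2.23)-(2.27) for a 2-gram LM."""
        V = len(vocab)
        def p_noise(x, c):
            return lam if x == c else (1.0 - lam) / (V - 1)

        N = len(noisy)
        Q = [dict() for _ in range(N)]   # best partial sequence probability
        B = [dict() for _ in range(N)]   # backpointers, Equation (2.25)
        for c in vocab:                  # uniform start in place of p(c | $)
            Q[0][c] = p_noise(noisy[0], c) / V
        for n in range(1, N):
            for c in vocab:
                best_prev = max(vocab, key=lambda cp: lm[cp][c] * Q[n - 1][cp])
                B[n][c] = best_prev
                Q[n][c] = p_noise(noisy[n], c) * lm[best_prev][c] * Q[n - 1][best_prev]
        # Backtracking, Equations (2.26)-(2.27)
        c_hat = [max(Q[N - 1], key=Q[N - 1].get)]
        for n in range(N - 1, 0, -1):
            c_hat.append(B[n][c_hat[-1]])
        return "".join(reversed(c_hat))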

Chapter 3

Data Setup

In this chapter, we give a brief description of the data we use for the experiments in subsequent chapters.

3.1 Corpus Choice

For our experiments we are interested in language models of low perplexity, while still maintaining a real-world setup, e.g. following the typical guideline of a train/test split such that the test set only contains unseen sentences. All experiments are on character level, using only the lower case English characters and a special symbol ␣ for space.

  C = X = {a, b, c, ..., x, y, z, ␣}   (3.1)

This allows for experiments with rather small search spaces if the language model order is sufficiently low, which is important in order to perform a huge number of systematic experiments.

The restricted alphabet raises some problems, since text, even if it is in proper English, often contains special characters and punctuation. A straightforward approach would be to remove all symbols which are not contained in C. But this creates strange character patterns which are not typical for proper English and can therefore confuse language model training. After some tests we made the following decision. Apart from the common punctuation marks ".", "?", "!" and ",", we filtered out all sentences which contained symbols outside of C. Additionally we required that a sentence ends with either ".", "?" or "!", and we restricted the allowed ratio of commas to 5% (measured on word level). After this filtering on sentence level, all punctuation marks are removed in a final step. While too lenient filtering hurts the quality of the language model by allowing strange character sequences, too aggressive filtering reduces the available training data, which can also hurt the language model quality. The described filtering is rather strict. While a more lenient filtering (e.g. increasing the ratio of allowed commas) did not improve the language model quality significantly, it significantly increased the size of the language models.
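The following Python sketch illustrates this filtering; the helper names, the regular expression and the exact comparisons are illustrative assumptions, not the preprocessing script actually used for the thesis.

    import re

    ALLOWED = set("abcdefghijklmnopqrstuvwxyz .?!,")

    def keep_sentence(sentence, max_comma_ratio=0.05):
        """Sentence filter sketch: keep a (lower-cased) sentence only if it uses
        the restricted alphabet plus .?! and ",", ends with ".", "?" or "!", and
        does not contain too many commas (measured on word level)."""
        s = sentence.lower().strip()
        if not s or not set(s) <= ALLOWED:
            return False
        if s[-1] not in ".?!":
            return False
        words = s.split()
        if s.count(",") > max_comma_ratio * len(words):
            return False
        return True

    def strip_punctuation(sentence):
        """Final step: remove the remaining punctuation marks."""
        return re.sub(r"[.?!,]", "", sentence.lower()).strip()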

As corpus we chose the AFP part of the English Gigaword corpus [Parker & Graff + 11]. In comparison to other corpora it provided a good compromise between size and homogeneity. We also looked at books and collections of books like the Corpus of English Novels [Smet 15], since we expected them to be more homogeneous, but they were too small to train language models of high quality. For comparison, the number of words in a book is in the ballpark of 100 k, the Corpus of English Novels contains around 25 M words, and the Brown corpus consists of roughly 1 M words. The AFP part of the English Gigaword contains around 750 M words, and even after our strict filtering we had a corpus of around 51 M words.

Table 3.1 gives the number of sentences, words and characters for a split into two parts A and B. Part B makes up 90% and was used for language model training. Part A was used as a test set reservoir, from which for most experiments 500 sentences were used. The statistics for this main experiment data set are also given.

Table 3.1. Statistics: corpus for experiments. Part A contains the main experiment data set, which is used for all experiments; for unsupervised training it serves as training and test set simultaneously. Part B is used for LM training.

  PART                        #SENTENCES   #WORDS    #SYMBOLS
  part A                      220 k        5.1 M     30.7 M
  main experiment data set    500          -         64.0 k
  part B                      2.0 M        46.2 M    276 M

3.2 Language Models

We train language models of different orders by means of the SRI language modeling toolkit [Stolcke 02]. Within our spelling correction implementation, we use KenLM [Heafield 11] for language model queries. While both toolkits provide functionality for both language model training and querying, KenLM is more modern and tends to be faster. However, for training the SRI toolkit has more options. Especially for our clean experiments on character level, we wanted to distribute no probability mass to unknown symbols (<unk>), and we found no option to achieve this via KenLM. After some tests we chose the following call of the SRI toolkit to train an m-gram language model:

  ngram-count -order ${m} -wbdiscount -interpolate -text splitb.gz -lm splitb.lm.${m}gram.gz
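The exact input format fed to ngram-count is not shown here. A common choice for character-level LMs, and the assumption behind the following sketch, is to write each filtered sentence as space-separated character tokens with a dedicated token (here "_") standing in for the original blank; the helper names are hypothetical.

    def to_char_tokens(sentence):
        """Turn a filtered sentence into space-separated character tokens,
        e.g. "the cat" -> "t h e _ c a t", so ngram-count sees one character per token."""
        return " ".join("_" if ch == " " else ch for ch in sentence)

    def write_lm_training_file(sentences, path):
        """Write one character-tokenized sentence per line (compress afterwards as needed)."""
        with open(path, "w", encoding="ascii") as f:
            for s in sentences:
                f.write(to_char_tokens(s) + "\n")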

Table 3.2 gives the number of states and the perplexity measured on the main experiment data set for language models of m-gram order zero up to eleven. Because C contains 27 different symbols, the 0-gram perplexity is also 27. For the low orders the perplexity drops significantly with increasing order, starting from the 1-gram perplexity of 17.5. The 10-gram LM has the lowest perplexity, and therefore we will not use the 11-gram LM for our experiments. Note that in this high-order region the perplexity differences are small: e.g. the 9-gram LM has a perplexity of 2.61 and the 10-gram language model is only slightly better. While the perplexity does not change much for the higher orders, the number of states still grows fast with increasing order.

[Table 3.2. Number of states and perplexity PPL on the main experiment data set, for LMs of different character m-gram order. The word perplexity PPL_word is calculated from the character perplexity and the average word length (including blank symbols).]

Table 3.2 also gives the word perplexity PPL_word. These values are calculated from the character perplexity PPL and the average word length L, which is 5.95 for the main experiment data set:

  PPL_word = PPL^L   (3.2)
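Equation (3.2) is a one-liner in code; the following tiny sketch uses the 9-gram character perplexity of 2.61 quoted above purely as an example input.

    def word_perplexity(char_ppl, avg_word_len=5.95):
        """Equation (3.2): convert character perplexity to word perplexity,
        using the average word length (including blank symbols)."""
        return char_ppl ** avg_word_len

    # Example: the 9-gram character perplexity quoted in the text
    print(round(word_perplexity(2.61), 1))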

3.3 Adding Noise

We add noise according to the λ-model on part A. The straightforward way would be to just sample from the model at every character position. In this case it is not guaranteed that an exact fraction λ of the characters remains unchanged. For evaluation we would like to have this property, because then we do not need to differentiate between the λ used for adding noise and the λ_data of the resulting noisy data. Therefore we first randomly decide on the character positions where to add noise, making sure to pick exactly a fraction of 1 − λ. We ensure this constraint for chunks of 500 sentences, so it is also fulfilled for our main experiment data set. Note that we do not ensure this constraint on sentence level, since this seems to be rather unnatural.
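A minimal sketch of this noise procedure is given below, assuming the identity mapping and the 27-symbol vocabulary from Equation (3.1); the function and parameter names are illustrative, and the exact fraction is met up to rounding of (1 − λ) · N within a chunk.

    import random

    VOCAB = "abcdefghijklmnopqrstuvwxyz "

    def add_noise_exact(text, lam, seed=0):
        """Corrupt one character chunk so that exactly a fraction 1 - lambda of
        the positions is changed; at each chosen position a different symbol is
        drawn uniformly, as in the lambda-model."""
        rng = random.Random(seed)
        chars = list(text)
        n_noisy = round((1.0 - lam) * len(chars))
        for pos in rng.sample(range(len(chars)), n_noisy):
            alternatives = [s for s in VOCAB if s != chars[pos]]
            chars[pos] = rng.choice(alternatives)
        return "".join(chars)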

Chapter 4

Supervised Spelling Correction with λ-model

We first conduct experiments in a supervised way. That means that we pass the correct λ value, which was used during the generation of the noisy corpora, to the spelling correction algorithm. We conduct this kind of experiments in order to obtain error rates for comparison with later experiments. Later experiments will be conducted in an unsupervised way, meaning that there the λ or even the full table {p(x|c)} has to be learned from just the noisy data and a language model. During the experiments of this chapter we also investigate the following matters of interest:

4.1 Research Questions

1. What is the effect of the LM quality (perplexity / m-gram order) and λ on the error rate?

2. How do the spelling correction results for the MAX-CRITERION differ from the results for the SUM-CRITERION?

4.2 Experiments

We perform spelling correction for various noise ratios, using LMs of different orders and performing the search according to the SUM- and MAX-CRITERION from Section 2.2. Due to runtime and memory limitations we apply histogram pruning if the m-gram order of the LM exceeds 5. As histogram size we choose H = 200,000. With these settings our implementation needs around 30 CPU hours for spelling correction on our main experiment data set (N = 63,972 / 500 sentences) for a fixed choice of λ and LM.

This chapter presents the error rates as plots rather than tables. This gives a good intuition for the dependencies between the several relevant measurands. However, the exact values can be read off more easily from tables; therefore the results are also given in Table B.1.

4.2.1 General Observations

Figure 4.1 shows the error rates for the SUM- and MAX-CRITERION in one plot. The x-axis represents the language model perplexity in logarithmic scaling, while the y-axis represents the character error rate. Curves for several λ are shown. The legend label is 1 − λ, since this is the error rate of the noisy data without correction. Because spelling correction with a 0-gram language model passes the noisy text through unchanged, the error rate of the noisy text is also given by the rightmost data points.

As one would expect, language models of higher quality help to correct spelling errors. Or, put the other way around, the error rate increases with higher LM perplexity. The error rates of the SUM-CRITERION are always lower than the error rates of the MAX-CRITERION. This is as expected, since the SUM-CRITERION is designed to minimize the symbol error rate, while the MAX-CRITERION can be interpreted as an approximation. For data with a low noise ratio like 20% to 5% the differences are rather small but still observable. For higher noise ratios the differences are larger: e.g. for a noise ratio of 70% the absolute difference is nearly 10% if we use a 10-gram LM. For the very high noise ratios the error rates for the MAX-CRITERION fluctuate very strongly; there, decreasing perplexity does not guarantee a decreasing error rate. We give a reasoning in Section 4.2.4, where we examine examples from the data. In contrast, for the SUM-CRITERION decreasing perplexity results in a decreasing error rate also for high noise ratios. There are some fluctuations as well, but they are very small in comparison and only occur for the high language model orders. We expect these fluctuations to be due to pruning and the fact that the high-order language models are very close in terms of perplexity. But this is only a hypothesis, which has to be confirmed or falsified by future work.

4.2.2 Details on the Functional Dependency between Perplexity and Error Rate

Figure B.1 presents a subset of the data of the previous plots. There the y-axis (error rate) is logarithmic and the x-axis shows the word perplexity calculated from the character perplexity as described in Section 3.2. Using the word perplexity allows for a better comparison with previous work like [Ney 15], where the word error rate of speech recognition systems is shown as a function of the word perplexity. That data approximately follows a straight line. In log-log plots straight lines correspond to power laws like y = a · x^b. Our curves in Figure B.1 can also be interpreted as straight lines if the perplexity is restricted to a certain value and the noise ratio is not too high. Here b approximately equals 0.4, which agrees with [Ney 15]. However, we want to point out that the comparability is not clear, since the systems differ in their structure and we also have the mismatch of directly comparing word error rate with character error rate. For word perplexity values higher than roughly 10,000 a power law clearly does not apply. First attempts at explaining the higher regions came to no result.

4.2.3 Details on the Functional Dependency between λ and Error Rate

As one would expect, λ has a significant impact on the error rate after correction. Figure 4.2 shows the absolute error rate reduction as a function of λ. The curves for the LMs of higher order seem to have the shape of a quadratic function with the maximum around λ = 0.6. Our first explanation is as follows. For a low λ, only a small fraction of the characters in the noisy sequence is correct, and one cannot expect a very high absolute error reduction from such a small amount of valid context. Consequently a higher λ value, which means more valid context, should help to correct more errors. Simultaneously, a too high λ value reduces the error reduction potential, since the maximal absolute error rate reduction is naturally bounded by 1 − λ. One might combine both aspects by multiplication, which would yield the term λ · (1 − λ). The shape of such a quadratic term is very similar to the shapes we observe in the plots.

4.2.4 Examples from Experiments

In this section we have a detailed look at the resulting corrected sequences for the SUM- and MAX-CRITERION. Table 4.1 shows a fragment for a very high noise ratio (λ = 0.1). It is striking that the MAX-CRITERION produces sequences which are written in nearly proper English for many language models, but which are almost unrelated to the correct sequence. These sequences seem to be very likely according to the language models. Sometimes the resulting sequence changes drastically from one order to another. This behavior might be an explanation for the fluctuations of the MAX-CRITERION for very high noise ratios. In contrast, the sequences resulting from the SUM-CRITERION change only slightly with changing language model order.

It is striking that the SUM-CRITERION sequences contain many space symbols ␣, even multiple times in a row. Since ␣ is the most frequent symbol, it is a good bet if the noise ratio is very high. The MAX-CRITERION cannot choose multiple ␣ in a row: it maximizes over one sequence, and a sequence with multiple consecutive ␣ has a very low language model probability. The SUM-CRITERION has this choice because at any position each symbol is chosen independently from all other symbol choices, and the sum is carried out over all possible sequences to the left and to the right. The overall observation that the SUM-CRITERION is more likely to produce sequences which are linguistically not possible is also made in [Merialdo 94]. We want to point out that the number of ␣ decreases for the high-order LMs also for the SUM-CRITERION. In control experiments we applied even stronger pruning and observed the same behavior also for lower m-gram orders. We see similarities between SUM with strong pruning and MAX (independently of pruning), since strong pruning implies that the sum is approximated by only a few sequences. Pruning gets relatively stronger for higher orders because we fix the beam size for all language models to H = 200,000. This might be an explanation for the previously mentioned small error rate fluctuations of the SUM-CRITERION for high noise ratios and high-order LMs.

While neither criterion produces meaningful corrections at this very high noise ratio, the SUM-CRITERION tends to perform better. This is mostly due to correctly guessed characters and is in line with the previously discussed error rate plot.

Now we have a look at a medium noise ratio. Table 4.2 shows a sequence fragment for λ = 0.6. Here the LMs of the highest orders 9 and 10 produce for both criteria a corrected sequence without any error. Looking at the other LM orders, the SUM- and MAX-CRITERION perform very similarly, both in the resulting sequence and in the accuracy. In three cases the SUM-CRITERION performs slightly better, in one case the MAX-CRITERION is better. Note that this is just a small excerpt; on our complete main test set the accuracy of the SUM-CRITERION is always higher.

[Figure 4.1. Supervised spelling correction according to the SUM-CRITERION (solid) and MAX-CRITERION (dashed). Axes: LM perplexity PPL / m-gram order vs. error rate [%], with one curve per noise ratio 1 − λ. N = 63,972 / 500 sentences. The data is also presented in Table B.1. Note that also in theory there is no difference between SUM and MAX if a 0-gram or 1-gram LM is used. The histogram size for pruning is H = 200,000.]

[Figure 4.2. Absolute error rate reduction [%] depending on λ and LM m-gram order (one curve per order, 0-gram to 10-gram). Error rates are given for the SUM-CRITERION. Noisy data: N = 63,972 / 500 sentences. λ = 0.6 allows for the highest error rate reductions. Note that pruning with histogram size H = 200,000 was applied for LMs with m ≥ 6.]

[Table 4.1. Corrected sequences for an exemplary data fragment with λ = 0.1, according to the SUM- and MAX-CRITERION for LMs of different m-gram order (with pruning for the higher orders), together with the number of correct symbols (#CORR). Correct fragment c_1^N: "another makes clear that nobody is allowed to" (46 symbols). Each position is marked as "correct, was correct", "correct, was wrong", "wrong, was wrong (unchanged)", "wrong, was correct" or "wrong, was wrong (changed)".]

[Table 4.2. Corrected sequences for an exemplary data fragment with λ = 0.6, according to the SUM- and MAX-CRITERION for LMs of different m-gram order (with pruning for the higher orders), together with the number of correct symbols (#CORR). Correct fragment c_1^N: "the two judges hearing the appeal at the lahor" (46 symbols); the 9-gram and 10-gram LMs recover it without errors for both criteria. Each position is marked as "correct, was correct", "correct, was wrong", "wrong, was wrong (unchanged)", "wrong, was correct" or "wrong, was wrong (changed)".]

Chapter 5

Unsupervised Training of λ-model

In this chapter we will learn the parameter λ just from the noisy data and a language model. So, in contrast to the previous chapter, the algorithms will not know the λ which was used during the noise generation. The algorithms will still know that the noise model p(x|c) is the λ-model from Equation (2.3), and that the bijective mapping function x̂_c is the identity.

In the experiments λ will be learned by the maximum likelihood criterion. Since we have just the single parameter λ, we can evaluate the likelihood function for several different λ values and find the maximum via such a scan. We will use this parameter scan method to generate plots for the likelihood and the error rate. While this method allows for nice plots, the expectation maximization (EM) algorithm [Dempster & Laird + 77] can be used to learn λ more elegantly. Since learning is computationally more demanding than correction, we will use only language models up to m-gram order 4. We show that these language models are sufficient to learn λ precisely and provide a cross check for all other language models. We summarize the matters of interest of this chapter by the following questions.

5.1 Research Questions

1. Can λ be learned by the maximum likelihood criterion in an unsupervised way?

2. How do deviations from the correct λ affect the error rate?

5.2 Estimation of λ via Maximum Likelihood Criterion

The maximum likelihood method finds the parameter set (here just λ) which maximizes the probability of the training data. In our case of unsupervised training, the available training data is just the sequence x_1^N. Therefore the likelihood is p_λ(x_1^N) and the maximization criterion becomes

  λ̂ = argmax_λ { p_λ(x_1^N) }.   (5.1)
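A minimal sketch of such a parameter scan is given below. It computes the likelihood with the forward recursion of Equation (2.12) (the likelihood being the sum of the final forward probabilities), again assuming a 2-gram LM in the dict layout used earlier, a uniform start distribution, and no pruning; the grid and function names are illustrative.

    import math

    def log_likelihood(noisy, lm, lam, vocab="abcdefghijklmnopqrstuvwxyz "):
        """log p_lambda(x_1^N) via the forward recursion (2.12), with per-position
        rescaling to avoid numerical underflow."""
        V = len(vocab)
        def p_noise(x, c):
            return lam if x == c else (1.0 - lam) / (V - 1)
        Q = {c: p_noise(noisy[0], c) / V for c in vocab}   # uniform start assumption
        total_log = 0.0
        for x in noisy[1:]:
            Q = {c: p_noise(x, c) * sum(lm[cp][c] * Q[cp] for cp in vocab) for c in vocab}
            scale = sum(Q.values())                        # rescale to avoid underflow
            Q = {c: q / scale for c, q in Q.items()}
            total_log += math.log(scale)
        return total_log + math.log(sum(Q.values()))

    def scan_lambda(noisy, lm, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
        """Evaluate LL/N on a lambda grid and return the maximizing value, cf. Equation (5.1)."""
        scores = {lam: log_likelihood(noisy, lm, lam) / len(noisy) for lam in grid}
        return max(scores, key=scores.get), scores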

Note that p_λ(...) is a short notation for p(... | λ). We calculate p_λ(x_1^N) via the joint model from Equation (2.5) by summing over all sequences c_1^N:

  λ̂ = argmax_λ { ∑_{c_1^N} p_λ(c_1^N, x_1^N) }   (5.2)

In the case of supervised training, instead of only x_1^N, a pair (c_1^N, x_1^N) would be available; then the computationally hard calculation of the sum would not be necessary. However, we have already shown how to calculate the very similar position-dependent joint probability p_{n,λ}(c, x_1^N) (Equation (2.11)) by a forward-backward algorithm. Here the only difference is the missing constraint for position n. If the position-dependent joint probability is already available, we can get rid of this constraint and obtain the likelihood by summing over all classes c:

  p_λ(x_1^N) = ∑_c p_{n,λ}(c, x_1^N)   (5.3)

Note that the value of this term is independent of the position n. Therefore one can also choose n = N; then just the forward probabilities Q_N(c) are needed. However, in our experiments we are also interested in the error rates, and therefore we still need the full forward-backward algorithm.

5.2.1 Learning λ via EM Algorithm

Since meaningful values for λ are restricted to the interval [0, 1], the optimum of the likelihood function can be found by a parameter scan. However, we also want to see how the EM algorithm behaves for this rather simple problem of optimizing one parameter. The update equation is as follows:

  λ̂ = γ_id / N   (5.4)
  γ_id = ∑_{n=1}^{N} p_{n,λ}(c = ĉ_{x_n} | x_1^N)   (5.5)

Here λ is the estimate of the previous iteration or the initialization value, and λ̂ is the new estimate. The equation is rather intuitive: γ_id is the sum over all positions of the posterior probability of the class ĉ_{x_n} associated to the observation at position n. The division of γ_id by N ensures that λ̂ takes on at most the value 1. A detailed derivation is given in Appendix C.1. In tests we have seen that we can choose any value between 0 and 1 as initialization for λ, but we should avoid the border cases 0 and 1 themselves.
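A minimal sketch of this EM update is given below. It assumes the identity mapping (so ĉ_{x_n} is simply x_n) and reuses the forward_backward_posteriors sketch from Chapter 2 to obtain the per-position joint probabilities, which are normalized over c to yield the posteriors of Equation (5.5); the iteration count is an arbitrary illustrative choice.

    def em_update_lambda(noisy, posteriors):
        """One EM step for lambda, Equations (5.4)-(5.5).

        `posteriors[n]` is assumed to map each class c to p_{n,lambda}(c | x_1^N);
        with the identity mapping, c_hat_{x_n} is the observed symbol x_n itself."""
        gamma_id = sum(posteriors[n][x] for n, x in enumerate(noisy))
        return gamma_id / len(noisy)

    def train_lambda_em(noisy, lm, lam_init=0.5, iterations=20):
        """Iterate forward-backward and the EM update, starting inside (0, 1)."""
        lam = lam_init
        for _ in range(iterations):
            joint, _ = forward_backward_posteriors(noisy, lm, lam)   # Chapter 2 sketch
            posteriors = [{c: p / sum(col.values()) for c, p in col.items()} for col in joint]
            lam = em_update_lambda(noisy, posteriors)
        return lam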

5.3 Experiments

In the first experiment section we scan over several λ values. The second experiment section covers the learning of λ via the EM algorithm. In the experiment descriptions we denote the noise parameter of the noisy data by λ_data.

5.3.1 Scanning λ

We use the same noisy data as in the previous experiments. But since the experiments get more complicated and we want to look at more details, we restrict ourselves to two noisy data sets: the medium noise case λ_data = 0.7 and the high noise case λ_data = 0.3. We use normalized log-likelihood values LL/N, because these numbers are more handy and better suited for comparison than the plain likelihood p_λ(x_1^N):

  LL/N = log p_λ(x_1^N) / N   (5.6)

In the text we will still use the term likelihood, since it is significantly shorter.

Figure 5.1 shows the likelihood and the error rate as a function of λ for the medium noise case λ_data = 0.7. For the computations, language models from order 1 to order 4 are used. It is striking that the likelihood takes on its highest value at the correct λ_data of the data. This is true for all LMs, but the shape gets sharper with increasing m-gram order. The shape for the 1-gram LM is relatively flat, but it is nevertheless an interesting result that even the 1-gram LM suffices to learn λ. As expected, the error rate is lowest around the correct λ. It is striking, however, that in comparison to the likelihood function the shape is flat and the error rate is nearly constant within a certain interval around λ_data. Even for a huge deviation like λ = 0.5 the error rate increases only slightly. So we draw the conclusion that we can achieve the same error rates for unsupervised spelling correction as for supervised spelling correction, assuming in both cases the λ-model. In Section 5.3.3 we will perform a cross-check experiment for the other LMs up to order 10.

Figure 5.2 shows the likelihood and the error rate for the high noise case λ_data = 0.3. The observations are very similar to those for the medium noise case. As before, the likelihood function allows to find the correct λ precisely, while the error rate changes only slightly within a certain interval. So again we can obtain the same error rates as for the supervised experiments. But some details are different. Note the different y-axis resolutions of the plots: the resolution for the log-likelihood is approximately 45 times higher for the high noise case, and the error rate resolution is approximately 3.5 times higher. Plotting the likelihood for the high noise case in the diagram for the medium noise case would yield very flat curves, and the curves for the different orders would lie very close together. This might indicate that learning becomes harder with an increasing noise ratio, although, as indicated, for the presented experiments λ can be learned precisely enough. In this context we want to point out that we


More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Decipherment of Substitution Ciphers with Neural Language Models

Decipherment of Substitution Ciphers with Neural Language Models Decipherment of Substitution Ciphers with Neural Language Models Nishant Kambhatla, Anahita Mansouri Bigvand, Anoop Sarkar School of Computing Science Simon Fraser University Burnaby, BC, Canada {nkambhat,amansour,anoop}@sfu.ca

More information

Recap: Language models. Foundations of Natural Language Processing Lecture 4 Language Models: Evaluation and Smoothing. Two types of evaluation in NLP

Recap: Language models. Foundations of Natural Language Processing Lecture 4 Language Models: Evaluation and Smoothing. Two types of evaluation in NLP Recap: Language models Foundations of atural Language Processing Lecture 4 Language Models: Evaluation and Smoothing Alex Lascarides (Slides based on those from Alex Lascarides, Sharon Goldwater and Philipp

More information

Natural Language Processing (CSE 490U): Language Models

Natural Language Processing (CSE 490U): Language Models Natural Language Processing (CSE 490U): Language Models Noah Smith c 2017 University of Washington nasmith@cs.washington.edu January 6 9, 2017 1 / 67 Very Quick Review of Probability Event space (e.g.,

More information

Machine Learning, Midterm Exam: Spring 2009 SOLUTION

Machine Learning, Midterm Exam: Spring 2009 SOLUTION 10-601 Machine Learning, Midterm Exam: Spring 2009 SOLUTION March 4, 2009 Please put your name at the top of the table below. If you need more room to work out your answer to a question, use the back of

More information

Log-Linear Models, MEMMs, and CRFs

Log-Linear Models, MEMMs, and CRFs Log-Linear Models, MEMMs, and CRFs Michael Collins 1 Notation Throughout this note I ll use underline to denote vectors. For example, w R d will be a vector with components w 1, w 2,... w d. We use expx

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 18 Maximum Entropy Models I Welcome back for the 3rd module

More information

arxiv: v1 [cs.cl] 21 May 2017

arxiv: v1 [cs.cl] 21 May 2017 Spelling Correction as a Foreign Language Yingbo Zhou yingbzhou@ebay.com Utkarsh Porwal uporwal@ebay.com Roberto Konow rkonow@ebay.com arxiv:1705.07371v1 [cs.cl] 21 May 2017 Abstract In this paper, we

More information

Recap: HMM. ANLP Lecture 9: Algorithms for HMMs. More general notation. Recap: HMM. Elements of HMM: Sharon Goldwater 4 Oct 2018.

Recap: HMM. ANLP Lecture 9: Algorithms for HMMs. More general notation. Recap: HMM. Elements of HMM: Sharon Goldwater 4 Oct 2018. Recap: HMM ANLP Lecture 9: Algorithms for HMMs Sharon Goldwater 4 Oct 2018 Elements of HMM: Set of states (tags) Output alphabet (word types) Start state (beginning of sentence) State transition probabilities

More information

Hidden Markov Models in Language Processing

Hidden Markov Models in Language Processing Hidden Markov Models in Language Processing Dustin Hillard Lecture notes courtesy of Prof. Mari Ostendorf Outline Review of Markov models What is an HMM? Examples General idea of hidden variables: implications

More information

Math 350: An exploration of HMMs through doodles.

Math 350: An exploration of HMMs through doodles. Math 350: An exploration of HMMs through doodles. Joshua Little (407673) 19 December 2012 1 Background 1.1 Hidden Markov models. Markov chains (MCs) work well for modelling discrete-time processes, or

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Massachusetts Institute of Technology

Massachusetts Institute of Technology Massachusetts Institute of Technology 6.867 Machine Learning, Fall 2006 Problem Set 5 Due Date: Thursday, Nov 30, 12:00 noon You may submit your solutions in class or in the box. 1. Wilhelm and Klaus are

More information

CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models. The ischool University of Maryland. Wednesday, September 30, 2009

CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models. The ischool University of Maryland. Wednesday, September 30, 2009 CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models Jimmy Lin The ischool University of Maryland Wednesday, September 30, 2009 Today s Agenda The great leap forward in NLP Hidden Markov

More information

Gaussian and Linear Discriminant Analysis; Multiclass Classification

Gaussian and Linear Discriminant Analysis; Multiclass Classification Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015

More information

Part A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 )

Part A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 ) Part A 1. A Markov chain is a discrete-time stochastic process, defined by a set of states, a set of transition probabilities (between states), and a set of initial state probabilities; the process proceeds

More information

Natural Language Processing Prof. Pushpak Bhattacharyya Department of Computer Science & Engineering, Indian Institute of Technology, Bombay

Natural Language Processing Prof. Pushpak Bhattacharyya Department of Computer Science & Engineering, Indian Institute of Technology, Bombay Natural Language Processing Prof. Pushpak Bhattacharyya Department of Computer Science & Engineering, Indian Institute of Technology, Bombay Lecture - 21 HMM, Forward and Backward Algorithms, Baum Welch

More information

Design and Implementation of Speech Recognition Systems

Design and Implementation of Speech Recognition Systems Design and Implementation of Speech Recognition Systems Spring 2013 Class 7: Templates to HMMs 13 Feb 2013 1 Recap Thus far, we have looked at dynamic programming for string matching, And derived DTW from

More information

Lecture 3: ASR: HMMs, Forward, Viterbi

Lecture 3: ASR: HMMs, Forward, Viterbi Original slides by Dan Jurafsky CS 224S / LINGUIST 285 Spoken Language Processing Andrew Maas Stanford University Spring 2017 Lecture 3: ASR: HMMs, Forward, Viterbi Fun informative read on phonetics The

More information

A Syntax-based Statistical Machine Translation Model. Alexander Friedl, Georg Teichtmeister

A Syntax-based Statistical Machine Translation Model. Alexander Friedl, Georg Teichtmeister A Syntax-based Statistical Machine Translation Model Alexander Friedl, Georg Teichtmeister 4.12.2006 Introduction The model Experiment Conclusion Statistical Translation Model (STM): - mathematical model

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Sequence Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Methods for NLP 1(21) Introduction Structured

More information

Brief Introduction of Machine Learning Techniques for Content Analysis

Brief Introduction of Machine Learning Techniques for Content Analysis 1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview

More information

Parametric Models Part III: Hidden Markov Models

Parametric Models Part III: Hidden Markov Models Parametric Models Part III: Hidden Markov Models Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2014 CS 551, Spring 2014 c 2014, Selim Aksoy (Bilkent

More information

Augmented Statistical Models for Speech Recognition

Augmented Statistical Models for Speech Recognition Augmented Statistical Models for Speech Recognition Mark Gales & Martin Layton 31 August 2005 Trajectory Models For Speech Processing Workshop Overview Dependency Modelling in Speech Recognition: latent

More information

Conditional Random Field

Conditional Random Field Introduction Linear-Chain General Specific Implementations Conclusions Corso di Elaborazione del Linguaggio Naturale Pisa, May, 2011 Introduction Linear-Chain General Specific Implementations Conclusions

More information

DT2118 Speech and Speaker Recognition

DT2118 Speech and Speaker Recognition DT2118 Speech and Speaker Recognition Language Modelling Giampiero Salvi KTH/CSC/TMH giampi@kth.se VT 2015 1 / 56 Outline Introduction Formal Language Theory Stochastic Language Models (SLM) N-gram Language

More information

Natural Language Processing SoSe Words and Language Model

Natural Language Processing SoSe Words and Language Model Natural Language Processing SoSe 2016 Words and Language Model Dr. Mariana Neves May 2nd, 2016 Outline 2 Words Language Model Outline 3 Words Language Model Tokenization Separation of words in a sentence

More information

p(d θ ) l(θ ) 1.2 x x x

p(d θ ) l(θ ) 1.2 x x x p(d θ ).2 x 0-7 0.8 x 0-7 0.4 x 0-7 l(θ ) -20-40 -60-80 -00 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ θ x FIGURE 3.. The top graph shows several training points in one dimension, known or assumed to

More information

Natural Language Processing SoSe Language Modelling. (based on the slides of Dr. Saeedeh Momtazi)

Natural Language Processing SoSe Language Modelling. (based on the slides of Dr. Saeedeh Momtazi) Natural Language Processing SoSe 2015 Language Modelling Dr. Mariana Neves April 20th, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline 2 Motivation Estimation Evaluation Smoothing Outline 3 Motivation

More information

CS 136a Lecture 7 Speech Recognition Architecture: Training models with the Forward backward algorithm

CS 136a Lecture 7 Speech Recognition Architecture: Training models with the Forward backward algorithm + September13, 2016 Professor Meteer CS 136a Lecture 7 Speech Recognition Architecture: Training models with the Forward backward algorithm Thanks to Dan Jurafsky for these slides + ASR components n Feature

More information

Hidden Markov Modelling

Hidden Markov Modelling Hidden Markov Modelling Introduction Problem formulation Forward-Backward algorithm Viterbi search Baum-Welch parameter estimation Other considerations Multiple observation sequences Phone-based models

More information

Bayesian Networks BY: MOHAMAD ALSABBAGH

Bayesian Networks BY: MOHAMAD ALSABBAGH Bayesian Networks BY: MOHAMAD ALSABBAGH Outlines Introduction Bayes Rule Bayesian Networks (BN) Representation Size of a Bayesian Network Inference via BN BN Learning Dynamic BN Introduction Conditional

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg Temporal Reasoning Kai Arras, University of Freiburg 1 Temporal Reasoning Contents Introduction Temporal Reasoning Hidden Markov Models Linear Dynamical Systems (LDS) Kalman Filter 2 Temporal Reasoning

More information

Cross-Lingual Language Modeling for Automatic Speech Recogntion

Cross-Lingual Language Modeling for Automatic Speech Recogntion GBO Presentation Cross-Lingual Language Modeling for Automatic Speech Recogntion November 14, 2003 Woosung Kim woosung@cs.jhu.edu Center for Language and Speech Processing Dept. of Computer Science The

More information

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018 Deep Learning Sequence to Sequence models: Attention Models 17 March 2018 1 Sequence-to-sequence modelling Problem: E.g. A sequence X 1 X N goes in A different sequence Y 1 Y M comes out Speech recognition:

More information

Lecture 2: N-gram. Kai-Wei Chang University of Virginia Couse webpage:

Lecture 2: N-gram. Kai-Wei Chang University of Virginia Couse webpage: Lecture 2: N-gram Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/nlp16 CS 6501: Natural Language Processing 1 This lecture Language Models What are

More information

Seq2Seq Losses (CTC)

Seq2Seq Losses (CTC) Seq2Seq Losses (CTC) Jerry Ding & Ryan Brigden 11-785 Recitation 6 February 23, 2018 Outline Tasks suited for recurrent networks Losses when the output is a sequence Kinds of errors Losses to use CTC Loss

More information

Chapter 3: Basics of Language Modelling

Chapter 3: Basics of Language Modelling Chapter 3: Basics of Language Modelling Motivation Language Models are used in Speech Recognition Machine Translation Natural Language Generation Query completion For research and development: need a simple

More information

VL Algorithmen und Datenstrukturen für Bioinformatik ( ) WS15/2016 Woche 16

VL Algorithmen und Datenstrukturen für Bioinformatik ( ) WS15/2016 Woche 16 VL Algorithmen und Datenstrukturen für Bioinformatik (19400001) WS15/2016 Woche 16 Tim Conrad AG Medical Bioinformatics Institut für Mathematik & Informatik, Freie Universität Berlin Based on slides by

More information

MACHINE LEARNING INTRODUCTION: STRING CLASSIFICATION

MACHINE LEARNING INTRODUCTION: STRING CLASSIFICATION MACHINE LEARNING INTRODUCTION: STRING CLASSIFICATION THOMAS MAILUND Machine learning means different things to different people, and there is no general agreed upon core set of algorithms that must be

More information

1 Probabilities. 1.1 Basics 1 PROBABILITIES

1 Probabilities. 1.1 Basics 1 PROBABILITIES 1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability

More information

The Noisy Channel Model. CS 294-5: Statistical Natural Language Processing. Speech Recognition Architecture. Digitizing Speech

The Noisy Channel Model. CS 294-5: Statistical Natural Language Processing. Speech Recognition Architecture. Digitizing Speech CS 294-5: Statistical Natural Language Processing The Noisy Channel Model Speech Recognition II Lecture 21: 11/29/05 Search through space of all possible sentences. Pick the one that is most probable given

More information

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II)

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II) CISC 889 Bioinformatics (Spring 24) Hidden Markov Models (II) a. Likelihood: forward algorithm b. Decoding: Viterbi algorithm c. Model building: Baum-Welch algorithm Viterbi training Hidden Markov models

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Information Extraction, Hidden Markov Models Sameer Maskey Week 5, Oct 3, 2012 *many slides provided by Bhuvana Ramabhadran, Stanley Chen, Michael Picheny Speech Recognition

More information

Expectation Maximization (EM)

Expectation Maximization (EM) Expectation Maximization (EM) The Expectation Maximization (EM) algorithm is one approach to unsupervised, semi-supervised, or lightly supervised learning. In this kind of learning either no labels are

More information

Hidden Markov models

Hidden Markov models Hidden Markov models Charles Elkan November 26, 2012 Important: These lecture notes are based on notes written by Lawrence Saul. Also, these typeset notes lack illustrations. See the classroom lectures

More information

Theory of Alignment Generators and Applications to Statistical Machine Translation

Theory of Alignment Generators and Applications to Statistical Machine Translation Theory of Alignment Generators and Applications to Statistical Machine Translation Raghavendra Udupa U Hemanta K Mai IBM India Research Laboratory, New Delhi {uraghave, hemantkm}@inibmcom Abstract Viterbi

More information

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels? Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity

More information

Word Alignment for Statistical Machine Translation Using Hidden Markov Models

Word Alignment for Statistical Machine Translation Using Hidden Markov Models Word Alignment for Statistical Machine Translation Using Hidden Markov Models by Anahita Mansouri Bigvand A Depth Report Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of

More information

Ngram Review. CS 136 Lecture 10 Language Modeling. Thanks to Dan Jurafsky for these slides. October13, 2017 Professor Meteer

Ngram Review. CS 136 Lecture 10 Language Modeling. Thanks to Dan Jurafsky for these slides. October13, 2017 Professor Meteer + Ngram Review October13, 2017 Professor Meteer CS 136 Lecture 10 Language Modeling Thanks to Dan Jurafsky for these slides + ASR components n Feature Extraction, MFCCs, start of Acoustic n HMMs, the Forward

More information

Variational Decoding for Statistical Machine Translation

Variational Decoding for Statistical Machine Translation Variational Decoding for Statistical Machine Translation Zhifei Li, Jason Eisner, and Sanjeev Khudanpur Center for Language and Speech Processing Computer Science Department Johns Hopkins University 1

More information

Machine Translation. CL1: Jordan Boyd-Graber. University of Maryland. November 11, 2013

Machine Translation. CL1: Jordan Boyd-Graber. University of Maryland. November 11, 2013 Machine Translation CL1: Jordan Boyd-Graber University of Maryland November 11, 2013 Adapted from material by Philipp Koehn CL1: Jordan Boyd-Graber (UMD) Machine Translation November 11, 2013 1 / 48 Roadmap

More information

Hidden Markov Models and Gaussian Mixture Models

Hidden Markov Models and Gaussian Mixture Models Hidden Markov Models and Gaussian Mixture Models Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 4&5 23&27 January 2014 ASR Lectures 4&5 Hidden Markov Models and Gaussian

More information

Sequences and Information

Sequences and Information Sequences and Information Rahul Siddharthan The Institute of Mathematical Sciences, Chennai, India http://www.imsc.res.in/ rsidd/ Facets 16, 04/07/2016 This box says something By looking at the symbols

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Hidden Markov Models Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com Additional References: David

More information

Midterm 2 V1. Introduction to Artificial Intelligence. CS 188 Spring 2015

Midterm 2 V1. Introduction to Artificial Intelligence. CS 188 Spring 2015 S 88 Spring 205 Introduction to rtificial Intelligence Midterm 2 V ˆ You have approximately 2 hours and 50 minutes. ˆ The exam is closed book, closed calculator, and closed notes except your one-page crib

More information

Gaussian Models

Gaussian Models Gaussian Models ddebarr@uw.edu 2016-04-28 Agenda Introduction Gaussian Discriminant Analysis Inference Linear Gaussian Systems The Wishart Distribution Inferring Parameters Introduction Gaussian Density

More information

Hidden Markov Models. x 1 x 2 x 3 x N

Hidden Markov Models. x 1 x 2 x 3 x N Hidden Markov Models 1 1 1 1 K K K K x 1 x x 3 x N Example: The dishonest casino A casino has two dice: Fair die P(1) = P() = P(3) = P(4) = P(5) = P(6) = 1/6 Loaded die P(1) = P() = P(3) = P(4) = P(5)

More information

Decision Trees.

Decision Trees. . Machine Learning Decision Trees Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg riedmiller@informatik.uni-freiburg.de

More information

Ad Placement Strategies

Ad Placement Strategies Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January

More information

Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation Y. Wu, M. Schuster, Z. Chen, Q.V. Le, M. Norouzi, et al. Google arxiv:1609.08144v2 Reviewed by : Bill

More information

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute

More information

CHAPTER 8 Viterbi Decoding of Convolutional Codes

CHAPTER 8 Viterbi Decoding of Convolutional Codes MIT 6.02 DRAFT Lecture Notes Fall 2011 (Last update: October 9, 2011) Comments, questions or bug reports? Please contact hari at mit.edu CHAPTER 8 Viterbi Decoding of Convolutional Codes This chapter describes

More information

Training the linear classifier

Training the linear classifier 215, Training the linear classifier A natural way to train the classifier is to minimize the number of classification errors on the training data, i.e. choosing w so that the training error is minimized.

More information

Languages, regular languages, finite automata

Languages, regular languages, finite automata Notes on Computer Theory Last updated: January, 2018 Languages, regular languages, finite automata Content largely taken from Richards [1] and Sipser [2] 1 Languages An alphabet is a finite set of characters,

More information

TnT Part of Speech Tagger

TnT Part of Speech Tagger TnT Part of Speech Tagger By Thorsten Brants Presented By Arghya Roy Chaudhuri Kevin Patel Satyam July 29, 2014 1 / 31 Outline 1 Why Then? Why Now? 2 Underlying Model Other technicalities 3 Evaluation

More information