Machine Translation: Word Alignment Problem


Machine Translation: Word Alignment Problem
Marcello Federico, FBK, Trento - Italy

Example of Parallel Corpus

German: Darum liegt die Verantwortung für das Erreichen des Effizienzzieles und der damit einhergehenden CO2-Reduzierung bei der Gemeinschaft, die nämlich dann tätig wird, wenn das Ziel besser durch gemeinschaftliche Massnahmen erreicht werden kann. Und genaugenommen steht hier die Glaubwürdigkeit der EU auf dem Spiel.

English: That is why the responsibility for achieving the efficiency target and at the same time reducing CO2 lies with the Community, which in fact takes action when an objective can be achieved more effectively by Community measures. Strictly speaking, it is the credibility of the EU that is at stake here.

Notice the different positions of the corresponding verb groups: MT has to take word re-ordering into account!

Outline

Word alignments
Word alignment models
Alignment search
Alignment estimation
EM algorithm
Model 2
Fertility alignment models
HMM alignment models

This part contains advanced material (marked with *) suited to students interested in the mathematical details of the presented models.

Word Alignments

Let us consider the possible alignments a between the words in f and the words in e.

Example:
serata di domani soffierà un freddo vento orientale
since tomorrow evening an eastern chilly wind will blow

Word Alignments

Let us consider the possible alignments a between the words in f and the words in e. Typically, alignments are restricted to maps between positions of f and positions of e. Some words might not be aligned at all: they are virtually aligned with the empty word NULL. These and even more general alignments are machine learnable.

Example:
serata di domani soffierà un freddo vento orientale
NULL since tomorrow evening an eastern chilly wind will blow

Word Alignments

Notice also that alignments induce word re-ordering between the two sentences.

Word Alignment: Matrix Representation

[Figure: alignment matrix for the sentence pair "serata di domani soffierà un freddo vento orientale" / "NULL since tomorrow evening an eastern chilly wind will blow"; the Italian words index the columns, the English words (including NULL) the rows, and each alignment link is a point in the matrix.]

Word Alignment: Direct Alignment

A direct alignment maps positions of f to positions of e:

A : {1,...,m} -> {1,...,l}

Example: il programma è stato messo in pratica -- the program has been implemented

We allow only one link (point) in each column. Some columns may be empty.

Word Alignment: Inverted Alignment

An inverted alignment maps positions of e to positions of f:

A : {1,...,l} -> {1,...,m}

Example: il territorio degli autoctoni -- the territory of the aboriginal people

You can get a direct alignment by swapping the two sentences.

Word Alignment Model

In SMT we model the translation probability Pr(f | e) by summing the probabilities of all possible (l+1)^m hidden alignments a between the two strings:

Pr(f | e) = Σ_a Pr(f, a | e) ≈ Σ_a p_θ(f, a | e)     (1)

Hence we consider statistical word alignment models p_θ(f, a | e), defined by specific sets of parameters θ. The art of statistical modelling consists in designing models which capture the relevant properties of the considered phenomenon, in our case the relationship between a string in one language and a string in the other. There are 5 models of increasing complexity (number of parameters).

Alignment Variable

Modelling the alignment as an arbitrary relation between the positions of the two strings is very general but computationally unfeasible: there are 2^(l·m) possible alignments! A generally applied restriction is to let each word of f be assigned to exactly one word of e. Hence, the alignment is a map

A : {1,...,m} -> {0,...,l}

The alignment variable a = a_1,...,a_m consists of associations j -> i = a_j, from position j of f to position i = a_j of e. We may include null-word alignments, that is a_j = 0, to account for words not aligned to any word of e. Hence, there are only (l+1)^m possible alignments.

Word Alignment Models

In order to find automatic methods to learn word alignments from data, we use mathematical models that explain how translations are generated. The way these models explain translations may appear very naive, if not silly; indeed they are very simplistic. However, simple explanations often work better than complex ones! We need to be a little bit formal here, just to give names to the ingredients we will use in our recipes to learn word alignments:

the English sentence e is a sequence of l words
the French sentence f is a sequence of m words
the word alignment a is a map from the m French positions to the l+1 English positions

We will have to relax a bit our conception of sentence: it is just a sequence of words, which might or might not make sense at all.
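To make the alignment variable concrete, here is a minimal sketch (not part of the original slides) that represents a as a plain list of positions into e, with index 0 reserved for the NULL word; the alignment values are illustrative only.

```python
# Minimal sketch: the alignment a: {1,...,m} -> {0,...,l}, with 0 = NULL word.
f = "serata di domani soffierà un freddo vento orientale".split()                   # m words
e = ["NULL"] + "since tomorrow evening an eastern chilly wind will blow".split()    # l words + NULL at index 0

# a[j] = index into e of the word aligned to f[j]  (illustrative values, not from a model)
a = [3, 0, 2, 9, 4, 6, 7, 5]   # serata->evening, di->NULL, domani->tomorrow, soffierà->blow, ...

for j, i in enumerate(a):
    print(f"{f[j]} -> {e[i]}")
```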

Word Alignment Models

There are five models, of increasing complexity, that explain how a translation and an alignment can be generated from a source sentence. Complexity refers to the number of parameters that define the model. We start from the simplest model, called Model 1.

On Probability Factorization

Chain rule: the probability of a sequence of events e = e_1, e_2, e_3,..., e_l can be factorized as:

Pr(e_1, e_2, e_3,..., e_l) = Pr(e_1) Pr(e_2 | e_1) Pr(e_3 | e_1, e_2) ... Pr(e_l | e_1,..., e_{l-1})

The joint probability is factorized over single-event probabilities. The factors, however, introduce dependencies of increasing complexity: the last factor has the same complexity as the complete joint probability! There are two basic approximations for sequential models which eliminate dependencies in the conditional part of the chain factors. Notice that for non-sequential events we may change the order of the factors, e.g.:

Pr(f, a | e) = Pr(a, f | e) = Pr(a | e) Pr(f | e, a)

Basic Sequential Models

Bag-of-words model: we assume that each event is independent of the others:

Pr(e_1, e_2, e_3,..., e_l) = Pr(e_1) Pr(e_2) Pr(e_3) ... Pr(e_l)

Markov chain model: we assume that each event only depends on the previous one:

Pr(e_1, e_2, e_3,..., e_l) = Pr(e_1) Pr(e_2 | e_1) Pr(e_3 | e_2) ... Pr(e_l | e_{l-1})

We reduce complexity by removing dependencies: the event space becomes smaller and the probabilities are easier to estimate. This simplification might reduce the accuracy of the model.

Model 1

Model 1 generates the translation and the alignment as follows (a toy sampling sketch follows below):

1. guess the length m of f on the basis of the length l of e
2. for each position j in f repeat the following two steps:
   (a) randomly pick a corresponding position i in e
   (b) generate word j of f by picking a translation of word i in e

Step 1 is executed by using a translation length predictor. Step 2(a) is performed by throwing a die with l+1 faces (we want to include the null word). Step 2(b) is carried out by using a word translation table.
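As a toy illustration of this generative story, the following sketch (not from the slides) samples a translation with Model 1; length_table and t_table are hypothetical toy tables standing in for the length predictor p(m|l) and the translation table p(f|e).

```python
import random

length_table = {2: [2, 3]}                          # hypothetical p(m|l): plausible lengths m given l=2
t_table = {"NULL": {"ja": 0.5, "doch": 0.5},        # hypothetical p(f|e) rows
           "the": {"das": 0.8, "ein": 0.2},
           "house": {"Haus": 0.9, "Buch": 0.1}}

def sample_translation(e_words):
    e = ["NULL"] + e_words                          # position 0 is the null word
    l = len(e) - 1
    m = random.choice(length_table[l])              # step 1: choose the length m given l
    f, a = [], []
    for _ in range(m):                              # step 2: for each position of f
        i = random.randint(0, l)                    #   (a) throw an (l+1)-faced die
        probs = t_table[e[i]]
        f.append(random.choices(list(probs), weights=list(probs.values()))[0])  # (b) translate e[i]
        a.append(i)
    return f, a

print(sample_translation(["the", "house"]))         # e.g. (['das', 'Haus'], [1, 2])
```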

Word Alignment Model Factorization

One of the many ways to exactly decompose Pr(f_1^m, a_1^m | e_1^l) is:

Pr(f_1^m, a_1^m | e_1^l) = Pr(m | e_1^l) Π_{j=1}^m Pr(a_j | f_1^{j-1}, a_1^{j-1}, m, e_1^l) Pr(f_j | f_1^{j-1}, a_1^j, m, e_1^l)

It looks dense, but it is just a plain application of the chain rule. Let's make it look simpler:

Pr(f_1^m, a_1^m | e_1^l) = Pr(m | e_1^l) Π_{j=1}^m Pr(a_j | ...) Pr(f_j | a_j, ...)

Generative stochastic process:
1. choose the length m of the French string, given knowledge of the English string e_1^l
2. cover one English position for each French position j, given ...
3. choose a French word for each position j, given the covered English position ...

Remark: the process works in the "wrong" direction: it generates f from e. In fact, it is used to calculate Pr(f, a | e) Pr(e) in the search problem. Though, it can work in both directions by exchanging f and e.

Model 1

Given the alignment factorization above, we simplify all interactions by means of pairwise dependencies:

Pr(m | e_1^l)      = p(m | l)           length probability
Pr(a_j | ...)      = 1 / (l + 1)        alignment probability
Pr(f_j | a_j, ...) = p(f_j | e_{a_j})   translation probability

Hence, we get the following translation model:

Pr(f_1^m | e_1^l) = Σ_{a_1^m} Pr(f_1^m, a_1^m | e_1^l)
                  = p(m | l) / (l+1)^m  Σ_{a_1^m} Π_{j=1}^m p(f_j | e_{a_j})
                  = p(m | l) / (l+1)^m  Π_{j=1}^m Σ_{i=0}^l p(f_j | e_i)      (nice complexity reduction!)

Model 1 has a very simple stochastic generative process:
1. Choose a length m for f according to p(m | l)
2. For each j = 1,...,m, choose a_j in {0, 1,...,l} at random
3. For each j = 1,...,m, choose the French word f_j according to p(f_j | e_{a_j})

Properties:
Model 1 is very naive but is a good starting point for better models
Parameters are the probabilities p(f_j | e_{a_j})
Computation of Pr(f | e) can be very efficient (see the sketch below)
Search of the most probable alignment is straightforward
Estimation is trivial given a parallel corpus with alignments
Estimation is efficient given a parallel corpus without alignments
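The product-of-sums rearrangement above is what makes Pr(f | e) cheap to compute. A minimal sketch (not from the slides), assuming hypothetical t_prob and len_prob lookup tables:

```python
import math

# Model 1 log-probability of f given e, computed in O(m*(l+1)) instead of (l+1)^m.
# t_prob[(f_word, e_word)] = p(f|e), len_prob[(m, l)] = p(m|l): hypothetical toy tables.
def model1_logprob(f, e, t_prob, len_prob):
    e = ["NULL"] + e
    l, m = len(e) - 1, len(f)
    logp = math.log(len_prob[(m, l)]) - m * math.log(l + 1)          # p(m|l) / (l+1)^m
    for fj in f:                                                     # product over positions of f
        logp += math.log(sum(t_prob.get((fj, ei), 1e-12) for ei in e))  # sum over positions of e
    return logp
```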

Model 1: Generative Process

[Figure: from the English sentence "the_1 program_2 has_3 been_4 implemented_5" (length l = 5), m = 7 positions are created, each one is randomly aligned to an English position, and the Italian words "e'_1 stato_2 messo_3 in_4 pratica_5 il_6 programma_7" are chosen through a translation probability table.]

Model 1 only relies on word-to-word translation probabilities!

Model 1: Translation Table

Assume very simple German and English languages of just a few words each. Model 1 needs a table with one row per English word and one column per German word: each row shows the translation probabilities of that English word (e.g. the row for "the" over das, ein, Haus, Buch), and the probabilities in each row sum up to one. Of course, the majority of cells should ideally be equal to zero. Learning Model 1 basically means filling the table with some good values. (Let's forget about the null word here.)

Model 1

Let us see how we can implement Model 1 and look at its complexity:
1. length predictor of the translation: this is not difficult to build; we look for instance at many English-French translations and study how sentence lengths are related (few parameters)
2. die with l+1 faces: very simple to simulate on a computer (no parameters)
3. translation table of words: this is the tricky part. We need a big table that tells us, for each French word f and English word e, whether e is a good or bad translation of f (fair amount of parameters)

Model 1: Learning

Let us assume that we have a parallel corpus with alignments:

serata di domani soffierà un freddo vento orientale -- since tomorrow evening an eastern chilly wind will blow
un vento freddo da est interessa le Alpi -- an eastern cool breeze affects the Alps

We can estimate translation probabilities by counting aligned word pairs. The maximum likelihood estimate for a discrete distribution is the relative frequency:

p(e | f) = count(e, f) / count(f) = count(e, f) / Σ_e count(e, f)

For the word pair chilly-freddo we count how often they are aligned together:

p(chilly | freddo) = count(chilly, freddo) / count(freddo)

We end up with reliable probabilities by using a very large parallel corpus!
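If the alignments are given, this counting step is only a few lines of code. A minimal sketch (not from the slides), estimating the table in the p(f|e) direction used by the model:

```python
from collections import Counter

# MLE of the translation table from a parallel corpus WITH word alignments:
# relative frequency of aligned word pairs.
def mle_translation_table(corpus):
    # corpus: list of (f_words, e_words, alignment) with alignment[j] = i (0 = NULL)
    pair_count, e_count = Counter(), Counter()
    for f, e, a in corpus:
        e = ["NULL"] + e
        for j, i in enumerate(a):
            pair_count[(f[j], e[i])] += 1
            e_count[e[i]] += 1
    return {(fw, ew): c / e_count[ew] for (fw, ew), c in pair_count.items()}   # p(f|e)
```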

Model 1: Aligning

Let us assume that we have probabilities p(f | e) for all word pairs. Given a parallel corpus without alignments:

serata di domani soffierà un freddo vento orientale -- since tomorrow evening an eastern chilly wind will blow
un vento freddo da est interessa le Alpi -- an eastern cool breeze affects the Alps

the most probable (or Viterbi) alignment of each sentence pair,

a* = argmax_a Pr(a | f, e) ∝ Π_{j=1}^m p(f_j | e_{a_j})

can be computed by finding, independently for each position j, the most probable translation source:

a*_j = argmax_{i=0,1,...,l} p(f_j | e_i)

The time complexity of the Viterbi search for Model 1 is just O(m · l).

MLE of a Discrete Distribution

Let x = x_1,...,x_S be a random sample of outcomes of a die X ~ p_θ(X), with parameters θ = {θ(ω) : ω ∈ Ω}, θ(ω) ≥ 0, Σ_ω θ(ω) = 1, where Ω = {1, 2,...,6}. Assume the outcomes in x are independent and identically distributed (iid). Maximum likelihood estimation looks for the θ that maximizes the sample likelihood:

L(θ) = Π_{i=1}^S p_θ(X = x_i) = Π_{ω ∈ Ω} p_θ(X = ω)^{c(ω)} = Π_{ω ∈ Ω} θ(ω)^{c(ω)}

where c(·) is the sample count. We apply a monotonic map to get something equivalent but easier to maximize:

log L(θ) = log Π_{ω ∈ Ω} θ(ω)^{c(ω)} = Σ_{ω ∈ Ω} c(ω) log θ(ω)

Then we can apply Lagrange multipliers to get the closed-form solution:

θ̂(ω) = c(ω) / S

which is the well-known relative frequency!

Model 1: Best Alignment Search

Let us assume that we have translation probabilities p(f | e). Given a parallel corpus without alignments (as above), we can compute the most probable alignment of each sentence pair as follows: for each word f in the target sentence we pick the most probable word e in the source sentence according to the available probabilities.

[Exercise: given translation probabilities of the word freddo against cold, chilly, cool and wind, what alignments will be generated for this word?]

Training of Word Alignment Models

Let p_θ(f | e) be a translation model with unknown parameters θ that we want to estimate from a sample of iid translations {(f_s, e_s) : s = 1,...,S} by maximizing:

L(θ) = Σ_{s=1}^S log p_θ(f_s | e_s) = Σ_{f,e} c(f, e) log p_θ(f | e)     (2)

where c(·) is the sample count and p_θ(f | e) is the marginal probability of an alignment model:

p_θ(f | e) = Σ_a p_θ(f, a | e)     (3)

where the hidden variable a is not observed in the training sample. Unfortunately, there is no closed-form solution for maximizing L(θ). There is, however, an iterative algorithm which is proven to converge at least to a local maximum of L(θ).
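Returning to the Viterbi search for Model 1 described above: because positions are independent, the search is a per-position argmax. A minimal sketch (not from the slides) with a hypothetical toy table t_prob:

```python
def model1_viterbi(f, e, t_prob):
    # t_prob[(f_word, e_word)] = p(f|e); unseen pairs default to 0, so they fall back to NULL.
    e = ["NULL"] + e
    return [max(range(len(e)), key=lambda i: t_prob.get((fj, e[i]), 0.0)) for fj in f]

# Example usage with a toy table:
t_prob = {("freddo", "chilly"): 0.4, ("freddo", "cold"): 0.3, ("vento", "wind"): 0.9}
print(model1_viterbi(["un", "vento", "freddo"], ["a", "chilly", "wind"], t_prob))
# -> [0, 3, 2]   ("un" falls back to NULL, "vento"->wind, "freddo"->chilly)
```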

Estimation of Word Alignment Models

How do we train alignment models from parallel data?

data + word alignments  => model parameters
data + model parameters => word alignments

Idea to solve this chicken-and-egg problem: start from some initial parameters, estimate word alignments over the bilingual corpus, use them to improve the parameters, and loop until convergence.

Model 1: Estimation

Let's go back to our simplified English and German languages. The ingredients of the Expectation Maximization algorithm are:

Initial parameters: a translation table with uniform probabilities over das, ein, Haus, Buch for each of the, a, house, book
Bilingual corpus: a collection of human translations:

the house - das Haus
the book - das Buch
a book - ein Buch

Let us now see how to improve our probabilities.

Model 1: Estimation

We start from the first sentence pair of the bilingual corpus, the house - das Haus, and apply the following two steps:

1. We weight each word co-occurrence with its probability in the table:
   co(the, das) = 1 x 0.25      co(the, Haus) = 1 x 0.25
   co(house, das) = 1 x 0.25    co(house, Haus) = 1 x 0.25

2. We transform the weighted co-occurrences into conditional probabilities:
   Pr(das | the) = 0.25 / (0.25 + 0.25)      Pr(Haus | the) = 0.25 / (0.25 + 0.25)
   Pr(das | house) = 0.25 / (0.25 + 0.25)    Pr(Haus | house) = 0.25 / (0.25 + 0.25)

Notice: in this sentence, "the" can only be linked either to das or to Haus. We apply the same two steps to the other sentence pairs of the bilingual corpus, the book - das Buch and a book - ein Buch.

Model 1: Estimation

We then sum up all the sentence-level probabilities in a single co-occurrence table over das, ein, Haus, Buch for each of the, a, house, book, and compute updated word translation probabilities from these counts. With the updated probability table we start a second iteration: again we accumulate all probabilities in a co-occurrence table and compute updated translation probabilities from the counts. We iterate this procedure several times, until the probabilities reach stable values.

This procedure is called the Expectation Maximization (EM) algorithm. Here, EM could only learn the translations of "the" and "book"! We need more data to learn more translations, and better models, too. A minimal implementation is sketched below.
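A minimal sketch (assumptions: no NULL word, uniform initialization, a fixed number of iterations) of the EM procedure just described, run on the same toy corpus:

```python
from collections import defaultdict

corpus = [("das Haus", "the house"), ("das Buch", "the book"), ("ein Buch", "a book")]
corpus = [(f.split(), e.split()) for f, e in corpus]

f_vocab = {w for f, _ in corpus for w in f}
t = defaultdict(lambda: 1.0 / len(f_vocab))         # p(f|e), uniform initialization

for _ in range(10):                                  # EM iterations
    count = defaultdict(float); total = defaultdict(float)
    for f, e in corpus:                              # E-step: expected counts
        for fw in f:
            norm = sum(t[(fw, ew)] for ew in e)      # posterior normalizer for fw in this pair
            for ew in e:
                c = t[(fw, ew)] / norm
                count[(fw, ew)] += c
                total[ew] += c
    for (fw, ew), c in count.items():                # M-step: relative frequencies
        t[(fw, ew)] = c / total[ew]

print(t[("das", "the")], t[("Buch", "book")])        # these grow towards 1 over the iterations
```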

Expectation Maximization Algorithm

We introduce an auxiliary function Q(θ, θ̄) which has two properties:

Q(θ̄, θ̄) = 0     (4)
Q(θ, θ̄) > 0  =>  L(θ) > L(θ̄)     (5)

Hence, with this iterative procedure we can find a (local) maximum point of L(θ):

0. initialize θ̂
1. do
2.   θ̄ := θ̂
3.   θ̂ := argmax_θ Q(θ, θ̄)
4. while L(θ̂) > L(θ̄)

Property: Q(θ̄, θ̄) = 0 implies max_θ Q(θ, θ̄) ≥ 0. The definition of Q is in the appendix.

For alignment models, step 3 amounts to:
3.1 generate all possible alignments for the training data
3.2 accumulate observed counts weighted by the alignment model
3.3 compute relative frequencies θ̂ from the expected counts c_θ̄(ω)

Indeed, for our alignment models the solution of max_θ Q(θ, θ̄) is:

θ̂(ω) = c_θ̄(ω) / Σ_{ω ∈ Ω_µ} c_θ̄(ω)     with     c_θ̄(ω) = Σ_{s=1}^S Σ_a p_θ̄(a | f_s, e_s) c(ω; a, f_s, e_s)

where θ(ω) is one element of θ, i.e. the probability of ω; θ̄ are the old parameter values and θ̂ the new ones; ω is any elementary event of the model, to be normalized over a sub-space Ω_µ. Thus θ̂(ω) is a relative-frequency estimator based on the expected count c_θ̄(ω), which is obtained by generating all possible alignments a with the old probabilities.

EM Algorithm of Model 1

The parameters of Model 1 are the probabilities p(f | e) that word f is aligned with word e:

p̂(f | e) = c_θ̄(f | e) / Σ_f c_θ̄(f | e)     with     c_θ̄(f | e) = Σ_{s=1}^S Σ_a p_θ̄(a | f_s, e_s) c(f | e; a, f_s, e_s)

where c(f | e; a, f_s, e_s) counts how many times f is aligned with e in the triple (a, f_s, e_s). With some manipulation (see the appendix) we get:

c_θ̄(f | e) = Σ_{s=1}^S Σ_{k=1}^m Σ_{i=0}^l [ p(f_k | e_i) / Σ_{a=0}^l p(f_k | e_a) ] δ(e, e_i) δ(f, f_k)

EM Algorithm of Model 1

EM-Model1(F, E, S)
  Init-Params(P)                          // P[f,e] = uniform
  do
    Reset-Expected-Counts                 // p[.] = 0; ptot[.] = 0
    for s := 1 to S                       // loop over training data
      do Expected-Counts(F[s], Length(F[s]), E[s], Length(E[s]))
    for f in F
      do for e in E                       // new parameters
        do P[f,e] := p[f,e] / ptot[e]
  until convergence

Expected-Counts(F, m, E, l)
  // update counters p[.], ptot[.] using the current parameters P[.]
  for j := 1 to m
    do t := 0
      for i := 0 to l
        do f := F[j]; e := E[i]
          t := t + P[f,e]
      for i := 0 to l
        do f := F[j]; e := E[i]
          p[f,e]  := p[f,e]  + P[f,e] / t
          ptot[e] := ptot[e] + P[f,e] / t

Model 2

Model 2 replaces the uniform alignment probability of Model 1 with:

Pr(a_j | ...) = p(a_j | j, l, m)

Properties:
Model 1 does not care where words appear in the two strings!
Model 2 introduces alignment probabilities, i.e. a table of size (L · M)^2
Training of Models 1-2 is easy from a bilingual corpus with given alignments: both models are products of discrete distributions, their likelihood function is a product of multinomial distributions, and MLE just needs relative counts of events (sufficient statistics)
Efficient computation of Pr(f | e), as for Model 1

Problems and limitations of Model 1 and Model 2:
they do not model the number of foreign words to be connected to each English word
the alignment probability of Model 2 is complex and shallow

Example of Alignment with Model 1/2

German: ah ja , dann geht das wohl nicht mehr .
English: oh well , then , I guess , that will not work anymore .

Problem: three words (the commas) are mapped to the same word! In fact, the words are aligned independently of each other.
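For completeness, a minimal sketch (not from the slides) of the Viterbi search under Model 2, where each candidate link is additionally weighted by the alignment probability; t_prob and a_prob are hypothetical tables:

```python
# t_prob[(f_word, e_word)] = p(f|e), a_prob[(i, j, l, m)] = p(i|j,l,m) with 1-based j.
def model2_viterbi(f, e, t_prob, a_prob):
    e = ["NULL"] + e
    l, m = len(e) - 1, len(f)
    return [max(range(l + 1),
                key=lambda i: t_prob.get((f[j], e[i]), 0.0) * a_prob.get((i, j + 1, l, m), 0.0))
            for j in range(m)]
```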

Example: Alignment with Fertility Models

German: ah ja , dann geht das wohl nicht mehr .
English: oh well , then , I guess , that will not work anymore .

Fertility models explicitly consider the number of words covered by each English word: e.g. if the comma has fertility 1, then only one word can be aligned to it.

Fertility Models

The number of French words covered by an English word e is a random variable φ_e, namely the fertility of e. Models 1-2 do not explicitly model fertilities; Models 3, 4, and 5 parameterize fertilities directly. Fertility models imply a different generative process of f and a given e:

1. For i = 1,...,l, 0, choose a fertility value φ_i ≥ 0 for word e_i
2. For i = 1,...,l, 0, choose a tablet τ_i of φ_i French words to translate e_i
3. Choose a permutation π over the tableau (τ_1,..., τ_l, τ_0) to generate f
4. IF any position was chosen more than once THEN return FAILURE
5. ELSE return the (a, f) corresponding to (τ, π)

Notice: for correct pairs (τ, π) there is a many-to-one mapping to (f, a); the notion of fertility is embedded into τ and π.

Model 3

Model 3 generates the translation and the alignment as follows:
1. for each word i of e it generates a fertility value φ_i
2. for each word i of e it applies the following steps:
   (a) generate φ_i translations of word i
   (b) pick one position for each of the φ_i words

Step 1 implicitly defines the length m of the translation. Steps 1-2 all rely on specific probability tables. This model is significantly more complex than Model 1! Estimation of Model 3 follows the principle used for Model 1; it is just more tricky.

Model 3: Generative Process

[Figure: null_0 the_1 program_2 has_3 been_4 implemented_5 -> fertility -> tablet -> permutation -> e'_1 stato_2 messo_3 in_4 pratica_5 il_6 programma_7]

Model 3: Generative Process

[Figure: the same example, null_0 the_1 program_2 has_3 been_4 implemented_5 generating e'_1 stato_2 messo_3 in_4 pratica_5 il_6 programma_7 via fertility, tablet and permutation. The permutation must be checked before generating (a, f)!]

Incremental Training Procedure

[Figure: the bilingual corpus feeds a chain of EM algorithms: the parameters estimated for Model 1 initialize Model 2, whose parameters initialize Model 3, and so on.]

Use the previous model to initialize some of the parameters of the next model!

Model 3: Training

Problem: there is no way to efficiently compute the summation over alignments. Trick: limit the summation to a neighborhood of the best alignment from Model 2:

c_θ̄(ω) ≈ Σ_{s=1}^S Σ_{a ∈ N(a*)} p_θ̄(a | f_s, e_s) c(ω; a, f_s, e_s)

Problem: there is no efficient way to compute Viterbi alignments with Model 3 either. Trick: do hill-climbing in alignment space, starting from the best alignment given by Model 2, a* = V(f | e; M2), with the hill-climbing operator

b(a*) = argmax_{a ∈ N(a*)} p(a | e, f; M3)     (6)

The neighborhood N(a) contains the alignments differing from a by one move or one swap:
move operator m_[j,i](a): set a_j := i
swap operator s_[j1,j2](a): exchange a_{j1} with a_{j2}

HMM Alignment Model

Another alignment model which follows from the general alignment factorization:

Pr(f_1^m, a_1^m | e_1^l) = Pr(m | e_1^l) Π_{j=1}^m Pr(a_j | f_1^{j-1}, a_1^{j-1}, m, e_1^l) Pr(f_j | f_1^{j-1}, a_1^j, m, e_1^l)

Let us define the following parameters:

Pr(m | e_1^l)      = p(m | l)             string length probabilities
Pr(a_j | ...)      = p(a_j | a_{j-1}, l)  alignment probabilities
Pr(f_j | a_j, ...) = p(f_j | e_{a_j})     translation probabilities

Hence, we get the following translation model:

Pr(f_1^m | e_1^l) = Σ_{a_1^m} Pr(f_1^m, a_1^m | e_1^l) = p(m | l) Σ_{a_1^m} Π_{j=1}^m p(a_j | a_{j-1}, l) p(f_j | e_{a_j})

EM training can be carried out efficiently through dynamic programming.
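A minimal sketch (not from the slides) of the forward recursion that computes Pr(f | e) for the HMM model by dynamic programming; the emission, transition and initial tables are assumed to be given as NumPy arrays (a backward pass, not shown, would additionally be needed for the EM expected counts):

```python
import numpy as np

# emit[j, i] = p(f_j | e_i), trans[i_prev, i] = p(i | i_prev, l),
# init[i] = probability of the first alignment position, len_p = p(m|l).
def hmm_alignment_prob(emit, trans, init, len_p):
    m, _ = emit.shape
    alpha = init * emit[0]                  # alpha_1(i) = p(a_1 = i) p(f_1 | e_i)
    for j in range(1, m):
        alpha = (alpha @ trans) * emit[j]   # alpha_j(i) = sum_i' alpha_{j-1}(i') p(i|i',l) p(f_j|e_i)
    return len_p * alpha.sum()
```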

Combinations of Word Alignments

Given parallel sentences we can train an alignment model and then align them. We have different options:

direct alignment: we learn alignments in one translation direction
inverted alignment: we learn alignments in the opposite direction

We can get better alignments by combining direct and inverted alignments:

union: greedy collection of alignment points, higher coverage
intersection: selective collection of alignment points, higher precision
grow-diagonal: take the best of the two

Properties:
direct/inverted alignments are maps between two sets of positions
the union alignment is a many-to-many partial alignment
the intersection is a 1-1 partial alignment

Union and Intersection Alignments

[Figure: alignment matrices showing a direct alignment, an inverted alignment, their intersection and their union.]

Grow Diagonal Word Alignment

[Figure: starting from the intersection of the direct and inverted alignments, the grow-diagonal heuristic progressively adds neighbouring points taken from the union.]

How to Measure the Quality of Word Alignments

Automatic alignments A are compared against manually annotated sure alignments S and possible alignments P. The Alignment Error Rate is:

AER = 1 - ( |A ∩ S| + |A ∩ P| ) / ( |A| + |S| )
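A minimal sketch (not from the slides) of combining two alignments represented as sets of (i, j) links and of scoring an automatic alignment with AER:

```python
def combine(direct_links, inverted_links):
    # intersection (higher precision) and union (higher coverage) of two link sets
    return direct_links & inverted_links, direct_links | inverted_links

def aer(A, S, P):
    # Alignment Error Rate: A = automatic links, S = sure links, P = possible links (S is a subset of P)
    return 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))

# Example usage with toy link sets:
direct = {(1, 1), (2, 2), (3, 2)}
inverted = {(1, 1), (2, 2), (2, 3)}
inter, union = combine(direct, inverted)    # {(1,1),(2,2)} and all four links
```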

Use of Word Alignments: Bilingual Concordance

[Figure: screenshot of a bilingual concordancer searching for the English string "rabbit" in an English-Chinese corpus of Alice in Wonderland; all sentences containing the search string are shown next to their Chinese translations.]

Last Words on Word Alignments

Given a parallel corpus we can automatically learn alignments to:
discover interesting lexical relationships
generate a probabilistic translation lexicon
extract phrase pairs

Alignments have limitations in terms of allowed word mappings. Better alignments can be obtained by:
estimating alignments in both translation directions
computing a suitable combination of the two alignments

Appendix

Often Used Symbols

l, m              lengths of the English and French sentences
f = f_1,...,f_m   French sentence
e = e_1,...,e_l   English sentence
i, j              English and French positions
e_i, f_j          English and French words
e_0               empty word (of the English sentence)
F                 French language dictionary
E                 English language dictionary
i ∈ {0, 1,...,l}  English positions, including the empty word
L, M              maximum lengths of the English and French sentences

Auxiliary Function and Theorem

Given parameters θ̄, we search for better values θ through the auxiliary function:

Q(θ, θ̄) = Σ_{f,e} c(f, e) Σ_a p_θ̄(a | f, e) log [ p_θ(f, a | e) / p_θ̄(f, a | e) ]     (7)

where

p_θ̄(a | f, e) = p_θ̄(f, a | e) / p_θ̄(f | e) = p_θ̄(f, a | e) / Σ_{a'} p_θ̄(f, a' | e)     (8)

and

Q(θ̄, θ̄) = 0     (9)

Q is only similar to an entropy formula and is explained by the following property.

EM Theorem

Given parameter values θ and θ̄ of an alignment model, it holds that:

if Q(θ, θ̄) > 0 then L(θ) > L(θ̄)     (10)

Proof. We can show that Q is related to the likelihood function L by:

L(θ) ≥ L(θ̄) + Q(θ, θ̄)     (11)

which is equivalent to the theorem's statement. The proof of the inequality is based on the simple geometric property log x ≤ x - 1 (with equality for x = 1).

EM Theorem (cont'd)

For any e and f we have that:

Σ_a p_θ̄(a | f, e) log [ p_θ(f, a | e) / p_θ̄(f, a | e) ]
  = Σ_a p_θ̄(a | f, e) log [ (p_θ(a | f, e) p_θ(f | e)) / (p_θ̄(a | f, e) p_θ̄(f | e)) ]     (12)
  = Σ_a p_θ̄(a | f, e) log [ p_θ(a | f, e) / p_θ̄(a | f, e) ] + log [ p_θ(f | e) / p_θ̄(f | e) ] Σ_a p_θ̄(a | f, e)     (13)
  ≤ Σ_a p_θ̄(a | f, e) [ p_θ(a | f, e) / p_θ̄(a | f, e) - 1 ] + log [ p_θ(f | e) / p_θ̄(f | e) ]     (14)
  = 0 + log [ p_θ(f | e) / p_θ̄(f | e) ]     (15)

In step (14) we applied the inequality log x ≤ x - 1 with x = p_θ(a | f, e) / p_θ̄(a | f, e).

EM Theorem (cont'd)

By summing up over all (f, e) we get the desired inequality:

Q(θ, θ̄) = Σ_{(f,e)} c(f, e) Σ_a p_θ̄(a | f, e) log [ p_θ(f, a | e) / p_θ̄(f, a | e) ]
        ≤ Σ_{(f,e)} c(f, e) log [ p_θ(f | e) / p_θ̄(f | e) ] = L(θ) - L(θ̄)

that is, L(θ) ≥ L(θ̄) + Q(θ, θ̄). End of the proof.

The role of Q(θ, θ̄) is now clear: θ̄ are the current parameters and θ the new, unknown parameters. If we find θ such that Q(θ, θ̄) > 0 then we have better parameters. We need some parameters to start with, but the good news is that we can start from any setting (uniform, random, ...).

EM with Alignment Models

All our word alignment models have the general exponential form:

p_θ(f, a | e) = Π_{ω ∈ Ω} θ(ω)^{c(ω; a, f, e)}     (16)

where the parameters satisfy multiple normalization constraints:

Σ_{ω ∈ Ω_µ} θ(ω) = 1,     µ = 1, 2,...     (17)

and the subsets Ω_µ, µ = 1, 2,..., form a partition of Ω. The partition corresponds to all the conditional probabilities in the model: each conditional probability to be estimated has indeed to sum up to 1. Models 1-5 are products of discrete distributions defined over different events within (a, f, e).

The constraints lead to the system of equations:

∂/∂θ(ω) [ Q(θ, θ̄) + Σ_µ λ_µ (1 - Σ_{ω ∈ Ω_µ} θ(ω)) ] = 0,     ω ∈ Ω_µ, µ = 1, 2,...
∂/∂λ_µ  [ Q(θ, θ̄) + λ_µ (1 - Σ_{ω ∈ Ω_µ} θ(ω)) ] = 0,     µ = 1, 2,...     (18)

For ω ∈ Ω_µ, by applying Lagrange multipliers we get the re-estimation formula:

θ̂(ω) = c_θ̄(ω) / Σ_{ω ∈ Ω_µ} c_θ̄(ω)     (19)
c_θ̄(ω) = Σ_{f,e} c(f, e) Σ_a p_θ̄(a | f, e) c(ω; a, f, e)     (20)

The parameter update is based on expected counts, which are collected by averaging over all possible alignments a with the posterior p_θ̄(a | f, e) of the current value θ̄. Problem: the above formula requires summing over (l+1)^m alignments!

Training Model 2

For Model 2, Ω = {(i, j, l, m)} ∪ {(e, f)}, which is partitioned as follows:

Ω_{j,l,m} = {(i | j, l, m) : 0 ≤ i ≤ l},   0 ≤ j, m ≤ M, 0 ≤ l ≤ L     (21)
Ω_e       = {(f | e) : f ∈ F},             e ∈ E

with counts

c(i | j, l, m; a, f, e) = δ(i, a_j)     (22)
c(f | e; a, f, e)       = Σ_{j=1}^m δ(e, e_{a_j}) δ(f, f_j)     (23)

We can now directly derive the iterative re-estimation formulae from:

c_θ̄(ω; f, e) = Σ_a p_θ̄(a | f, e) c(ω; a, f, e)     (24)

Training Model 2: Useful Formulas

Model 2 permits to efficiently calculate the sum over alignments:

p_θ(f | e) = Σ_a p_θ(f, a | e)
           = p(m | l) Σ_{a_1=0}^l ... Σ_{a_m=0}^l Π_{j=1}^m p(f_j | e_{a_j}) p(a_j | j, l, m)
           = p(m | l) Π_{j=1}^m Σ_{i=0}^l p(f_j | e_i) p(i | j, l, m)     (25)

Proof. Let m = 3 and l = 1, and let x_{j,a_j} = p(f_j | e_{a_j}) p(a_j | j, l, m). It is routine to verify that:

Σ_{a_1} Σ_{a_2} Σ_{a_3} x_{1,a_1} x_{2,a_2} x_{3,a_3} = (x_{10} + x_{11})(x_{20} + x_{21})(x_{30} + x_{31})

Hence we can write:

p_θ(a | f, e) = p_θ(f, a | e) / Σ_a p_θ(f, a | e)
             = Π_{j=1}^m [ p(f_j | e_{a_j}) p(a_j | j, l, m) / Σ_{i=0}^l p(f_j | e_i) p(i | j, l, m) ]
             = Π_{j=1}^m p_θ(a_j | j, f, e)     (26)

Important: we need only 2 · m · (l+1) operations!
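A minimal sketch (not from the slides) of equations (25)-(26) for one sentence pair, assuming the matrix T[j, i] = p(f_j | e_i) · p(i | j, l, m) has already been filled in:

```python
import numpy as np

def model2_sentence_quantities(T, len_p):
    # T has shape (m, l+1); len_p = p(m|l).
    row_sums = T.sum(axis=1, keepdims=True)    # sum_i p(f_j|e_i) p(i|j,l,m), one value per j
    prob_f_given_e = len_p * row_sums.prod()   # eq. (25): p(m|l) times the product over j
    posteriors = T / row_sums                  # eq. (26): p(a_j = i | j, f, e), row-normalized
    return prob_f_given_e, posteriors
```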

Training Model 2

For the translation probabilities, the expected counts become:

c_θ̄(f | e; f, e) = Σ_a p_θ̄(a | f, e) c(f | e; a, f, e)
                 = Σ_{a_1=0}^l ... Σ_{a_m=0}^l [ Π_{j=1}^m p_θ̄(a_j | j, f, e) ] Σ_{k=1}^m δ(e, e_{a_k}) δ(f, f_k)
                 = Σ_{k=1}^m Σ_{a_k=0}^l p_θ̄(a_k | k, f, e) δ(e, e_{a_k}) δ(f, f_k)     (27)
                 = Σ_{k=1}^m Σ_{i=0}^l [ p(f_k | e_i) p(i | k, l, m) / Σ_{a=0}^l p(f_k | e_a) p(a | k, l, m) ] δ(e, e_i) δ(f, f_k)     (28)

For the alignment probabilities:

c_θ̄(i | j, l, m; f, e) = Σ_a p_θ̄(a | f, e) c(i | j, l, m; a, f, e)
                       = Σ_{a_1=0}^l ... Σ_{a_m=0}^l [ Π_{k=1}^m p_θ̄(a_k | k, f, e) ] δ(i, a_j)
                       = Σ_{a_j=0}^l p_θ̄(a_j | j, f, e) δ(i, a_j)     (29)
                       = p_θ̄(i | j, f, e) = p(f_j | e_i) p(i | j, l, m) / Σ_{i'=0}^l p(f_j | e_{i'}) p(i' | j, l, m)     (30)

Model 2: Training Algorithm

EM-Model2(F, E, S)
  Init-Params(P, Q)                        // P[f,e] = uniform; Q[i,j,l,m] = uniform
  do
    Reset-Expected-Counts                  // p[.] = 0; ptot[.] = 0; q[.] = 0; qtot[.] = 0
    for s := 1 to S                        // loop over training data
      do Expected-Counts(F[s], Length(F[s]), E[s], Length(E[s]))
    for m := 1 to M                        // maximum target length
      do for l := 1 to L                   // maximum source length
        do for j := 1 to m
          do for i := 0 to l               // new alignment parameters
            do Q[i,j,l,m] := q[i,j,l,m] / qtot[j,l,m]
    for f in F
      do for e in E                        // new translation parameters
        do P[f,e] := p[f,e] / ptot[e]
  until convergence

Expected-Counts(F, m, E, l)
  // update counters p[.], q[.], ptot[.], qtot[.] using the current parameters P[.], Q[.]
  for j := 1 to m
    do t := 0
      for i := 0 to l
        do f := F[j]; e := E[i]
          t := t + P[f,e] * Q[i,j,l,m]
      for i := 0 to l
        do f := F[j]; e := E[i]
          c := P[f,e] * Q[i,j,l,m] / t
          q[i,j,l,m]  := q[i,j,l,m] + c
          qtot[j,l,m] := qtot[j,l,m] + c
          p[f,e]      := p[f,e] + c
          ptot[e]     := ptot[e] + c


Pattern Recognition and Machine Learning. Learning and Evaluation of Pattern Recognition Processes

Pattern Recognition and Machine Learning. Learning and Evaluation of Pattern Recognition Processes Pattern Recognition and Machine Learning James L. Crowley ENSIMAG 3 - MMIS Fall Semester 2016 Lesson 1 5 October 2016 Learning and Evaluation of Pattern Recognition Processes Outline Notation...2 1. The

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Hidden Markov Models Barnabás Póczos & Aarti Singh Slides courtesy: Eric Xing i.i.d to sequential data So far we assumed independent, identically distributed

More information

Hidden Markov Models. By Parisa Abedi. Slides courtesy: Eric Xing

Hidden Markov Models. By Parisa Abedi. Slides courtesy: Eric Xing Hidden Markov Models By Parisa Abedi Slides courtesy: Eric Xing i.i.d to sequential data So far we assumed independent, identically distributed data Sequential (non i.i.d.) data Time-series data E.g. Speech

More information

MATH MW Elementary Probability Course Notes Part I: Models and Counting

MATH MW Elementary Probability Course Notes Part I: Models and Counting MATH 2030 3.00MW Elementary Probability Course Notes Part I: Models and Counting Tom Salisbury salt@yorku.ca York University Winter 2010 Introduction [Jan 5] Probability: the mathematics used for Statistics

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline CG-islands The Fair Bet Casino Hidden Markov Model Decoding Algorithm Forward-Backward Algorithm Profile HMMs HMM Parameter Estimation Viterbi training Baum-Welch algorithm

More information

Basic Text Analysis. Hidden Markov Models. Joakim Nivre. Uppsala University Department of Linguistics and Philology

Basic Text Analysis. Hidden Markov Models. Joakim Nivre. Uppsala University Department of Linguistics and Philology Basic Text Analysis Hidden Markov Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakimnivre@lingfiluuse Basic Text Analysis 1(33) Hidden Markov Models Markov models are

More information

Basic Probability and Statistics

Basic Probability and Statistics Basic Probability and Statistics Yingyu Liang yliang@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison [based on slides from Jerry Zhu, Mark Craven] slide 1 Reasoning with Uncertainty

More information

N-gram Language Modeling

N-gram Language Modeling N-gram Language Modeling Outline: Statistical Language Model (LM) Intro General N-gram models Basic (non-parametric) n-grams Class LMs Mixtures Part I: Statistical Language Model (LM) Intro What is a statistical

More information

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf 1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Lior Wolf 2014-15 We know that X ~ B(n,p), but we do not know p. We get a random sample from X, a

More information

Probability Review Lecturer: Ji Liu Thank Jerry Zhu for sharing his slides

Probability Review Lecturer: Ji Liu Thank Jerry Zhu for sharing his slides Probability Review Lecturer: Ji Liu Thank Jerry Zhu for sharing his slides slide 1 Inference with Bayes rule: Example In a bag there are two envelopes one has a red ball (worth $100) and a black ball one

More information

Lecture 21: Spectral Learning for Graphical Models

Lecture 21: Spectral Learning for Graphical Models 10-708: Probabilistic Graphical Models 10-708, Spring 2016 Lecture 21: Spectral Learning for Graphical Models Lecturer: Eric P. Xing Scribes: Maruan Al-Shedivat, Wei-Cheng Chang, Frederick Liu 1 Motivation

More information

1. Markov models. 1.1 Markov-chain

1. Markov models. 1.1 Markov-chain 1. Markov models 1.1 Markov-chain Let X be a random variable X = (X 1,..., X t ) taking values in some set S = {s 1,..., s N }. The sequence is Markov chain if it has the following properties: 1. Limited

More information

Give students a few minutes to reflect on Exercise 1. Then ask students to share their initial reactions and thoughts in answering the questions.

Give students a few minutes to reflect on Exercise 1. Then ask students to share their initial reactions and thoughts in answering the questions. Student Outcomes Students understand that an equation is a statement of equality between two expressions. When values are substituted for the variables in an equation, the equation is either true or false.

More information

Recap: HMM. ANLP Lecture 9: Algorithms for HMMs. More general notation. Recap: HMM. Elements of HMM: Sharon Goldwater 4 Oct 2018.

Recap: HMM. ANLP Lecture 9: Algorithms for HMMs. More general notation. Recap: HMM. Elements of HMM: Sharon Goldwater 4 Oct 2018. Recap: HMM ANLP Lecture 9: Algorithms for HMMs Sharon Goldwater 4 Oct 2018 Elements of HMM: Set of states (tags) Output alphabet (word types) Start state (beginning of sentence) State transition probabilities

More information

ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging

ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging Stephen Clark Natural Language and Information Processing (NLIP) Group sc609@cam.ac.uk The POS Tagging Problem 2 England NNP s POS fencers

More information

Review of Basic Probability Theory

Review of Basic Probability Theory Review of Basic Probability Theory James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) 1 / 35 Review of Basic Probability Theory

More information

EM (cont.) November 26 th, Carlos Guestrin 1

EM (cont.) November 26 th, Carlos Guestrin 1 EM (cont.) Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University November 26 th, 2007 1 Silly Example Let events be grades in a class w 1 = Gets an A P(A) = ½ w 2 = Gets a B P(B) = µ

More information

Unit 1: Sequence Models

Unit 1: Sequence Models CS 562: Empirical Methods in Natural Language Processing Unit 1: Sequence Models Lecture 5: Probabilities and Estimations Lecture 6: Weighted Finite-State Machines Week 3 -- Sep 8 & 10, 2009 Liang Huang

More information

Formalizing Probability. Choosing the Sample Space. Probability Measures

Formalizing Probability. Choosing the Sample Space. Probability Measures Formalizing Probability Choosing the Sample Space What do we assign probability to? Intuitively, we assign them to possible events (things that might happen, outcomes of an experiment) Formally, we take

More information